ExamGecko
Question list
Search
Search

List of questions

Search

Related questions











Question 93 - MLS-C01 discussion

Report
Export

A gaming company has launched an online game where people can start playing for free but they need to pay if they choose to use certain features The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year The company has gathered a labeled dataset from 1 million users

The training dataset consists of 1.000 positive samples (from users who ended up paying within 1 year) and 999.000 negative samples (from users who did not use any paid features) Each data sample consists of 200 features including user age, device, location, and play patterns

Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set However, the prediction results on a test dataset were not satisfactory.

Which of the following approaches should the Data Science team take to mitigate this issue? (Select TWO.)

A.
Add more deep trees to the random forest to enable the model to learn more features.
Answers
A.
Add more deep trees to the random forest to enable the model to learn more features.
B.
indicate a copy of the samples in the test database in the training dataset
Answers
B.
indicate a copy of the samples in the test database in the training dataset
C.
Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.
Answers
C.
Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.
D.
Change the cost function so that false negatives have a higher impact on the cost value than false positives
Answers
D.
Change the cost function so that false negatives have a higher impact on the cost value than false positives
E.
Change the cost function so that false positives have a higher impact on the cost value than false negatives
Answers
E.
Change the cost function so that false positives have a higher impact on the cost value than false negatives
Suggested answer: C, D

Explanation:

The Data Science team is facing a problem of imbalanced data, where the positive class (paid users) is much less frequent than the negative class (non-paid users). This can cause the random forest model to be biased towards the majority class and have poor performance on the minority class. To mitigate this issue, the Data Science team can try the following approaches:

C) Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. This is a technique called data augmentation, which can help increase the size and diversity of the training data for the minority class. This can help the random forest model learn more features and patterns from the positive class and reduce the imbalance ratio.

D) Change the cost function so that false negatives have a higher impact on the cost value than false positives. This is a technique called cost-sensitive learning, which can assign different weights or costs to different classes or errors. By assigning a higher cost to false negatives (predicting non-paid when the user is actually paid), the random forest model can be more sensitive to the minority class and try to minimize the misclassification of the positive class.

References:

Bagging and Random Forest for Imbalanced Classification

Surviving in a Random Forest with Imbalanced Datasets

machine learning - random forest for imbalanced data? - Cross Validated

Biased Random Forest For Dealing With the Class Imbalance Problem

asked 16/09/2024
Gudni Hauksson
42 questions
User
Your answer:
0 comments
Sorted by

Leave a comment first