Question 69 - Professional Machine Learning Engineer discussion


You work for a bank and are building a random forest model for fraud detection. You have a dataset that includes transactions, of which 1% are identified as fraudulent. Which data transformation strategy would likely improve the performance of your classifier?

A. Write your data in TFRecords.
B. Z-normalize all the numeric features.
C. Oversample the fraudulent transactions 10 times.
D. Use one-hot encoding on all categorical features.
Suggested answer: C

Explanation:

Oversampling is a technique for dealing with imbalanced datasets, where the majority class dominates the minority class. It balances the distribution of classes by increasing the number of samples in the minority class. Oversampling can improve the performance of a classifier by reducing the bias towards the majority class and increasing the sensitivity to the minority class.

In this case, the dataset includes transactions, of which 1% are identified as fraudulent. This means that the fraudulent transactions are the minority class and the non-fraudulent transactions are the majority class. A random forest model trained on this dataset might have a low recall for the fraudulent transactions, meaning that it might miss many of them and fail to detect fraud. This could have a high cost for the bank and its customers.
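To make the recall problem concrete, here is a minimal sketch in plain Python. It assumes a hypothetical set of 1,000 transactions with 10 fraudulent ones (the 1% rate from the question) and a deliberately naive classifier that labels every transaction as non-fraudulent. The helper names `accuracy` and `recall` are illustrative, not part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical data: 10 fraudulent (1) and 990 legitimate (0) transactions.
y_true = [1] * 10 + [0] * 990

# Naive baseline: always predict "not fraud".
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)  # 990 / 1000
baseline_recall = recall(y_true, y_pred)      # 0 / 10

print(f"accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}")
```

The baseline scores 99% accuracy while catching no fraud at all, which is why recall on the minority class, rather than overall accuracy, is the metric to watch here.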

One way to address this problem is to oversample the fraudulent transactions 10 times, so that each fraudulent transaction appears 10 times in the training dataset. This raises the share of fraudulent transactions from 1% to about 9% (10 fraudulent rows for every 99 non-fraudulent ones), making the training data more balanced. Fraudulent examples then carry more weight when the trees are built, which helps the random forest pick up the patterns that distinguish fraudulent transactions from non-fraudulent ones and tends to improve recall for the minority class, possibly at some cost in precision.
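The duplication step can be sketched as follows. This is a rough illustration under stated assumptions, not a definitive implementation: `oversample_minority` is a hypothetical helper, the rows are placeholder feature vectors, and the label `1` is assumed to mark fraud.

```python
import random


def oversample_minority(rows, labels, minority_label=1, factor=10, seed=0):
    """Return shuffled rows/labels in which each minority row appears `factor` times."""
    pairs = []
    for row, label in zip(rows, labels):
        copies = factor if label == minority_label else 1
        pairs.extend([(row, label)] * copies)
    # Shuffle so the duplicated rows are not clustered together.
    random.Random(seed).shuffle(pairs)
    new_rows = [row for row, _ in pairs]
    new_labels = [label for _, label in pairs]
    return new_rows, new_labels


# Hypothetical training set: 10 fraudulent (1) and 990 legitimate (0) rows.
labels = [1] * 10 + [0] * 990
rows = [[float(i)] for i in range(len(labels))]  # placeholder features

new_rows, new_labels = oversample_minority(rows, labels)
fraud_share = sum(new_labels) / len(new_labels)  # 100 / 1090

print(f"rows: {len(new_labels)}, fraud share: {fraud_share:.1%}")
```

Oversampling should be applied to the training split only. Duplicating rows before splitting would let copies of the same transaction land in both the training and evaluation sets and inflate the measured recall.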

For more information about oversampling and other techniques for imbalanced data, see the following references:

Random Oversampling and Undersampling for Imbalanced Classification

Exploring Oversampling Techniques for Imbalanced Datasets

asked 18/09/2024
Yunus Emre Akay