Question 5 - MLS-C01 discussion

A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.

Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.

How should the Data Scientist correct this issue?

A. Drop all records from the dataset where age has been set to 0.
B. Replace the age field value for records with a value of 0 with the mean or median value from the dataset.
C. Drop the age feature from the dataset and train the model using the rest of the features.
D. Use k-means clustering to handle missing features.

Suggested answer: B

Explanation:

An age of 0 is effectively a missing value here: every patient in the study is over 65, so 0 cannot be a true age. Because the other features of these 450 records appear normal, the most appropriate option is to replace the invalid ages with the mean or median age, computed from the records that have valid ages. Imputation keeps all 4,000 observations available for training, although it does narrow the variance of the age feature somewhat. Dropping the records would discard more than 11% of an already small dataset (450 of 4,000), and dropping the age feature would remove a predictor that matters, since the disease is known to worsen with age. K-means clustering is an unsupervised method for grouping similar data points based on multiple features, not a technique for filling in missing values in a single feature.
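The imputation described in option B can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not part of the original question; the function name `impute_zero_ages` and the sample ages are hypothetical. It assumes that a value of 0 always marks a missing age, and it computes the fill value from the valid ages only, so that the zeros do not drag the mean or median down.

```python
from statistics import mean, median


def impute_zero_ages(ages, strategy="median"):
    """Return a new list in which every age of 0 is replaced by the
    mean or median of the non-zero ages. The input list is not modified."""
    # Assumption: 0 is a sentinel for "missing", never a real age.
    valid = [a for a in ages if a != 0]
    if not valid:
        raise ValueError("no valid ages to impute from")

    if strategy == "median":
        fill = median(valid)
    elif strategy == "mean":
        fill = mean(valid)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")

    return [fill if a == 0 else a for a in ages]


# Hypothetical sample: two of six patients have age recorded as 0.
sample_ages = [70, 0, 82, 68, 0, 75]
imputed = impute_zero_ages(sample_ages)
print(imputed)
```

The median is often the safer default when the age distribution is skewed, because it is less sensitive to outliers than the mean.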

References:

Effective Strategies to Handle Missing Values in Data Analysis

How To Handle Missing Values In Machine Learning Data With Weka

How to handle missing values in Python - Machine Learning Plus

asked 16/09/2024
Renats Fasulins