ExamGecko
Question 208 - MLS-C01 discussion


A data engineer at a bank is evaluating a new tabular dataset that includes customer data. The data engineer will use the customer data to create a new model to predict customer behavior. After creating a correlation matrix for the variables, the data engineer notices that many of the 100 features are highly correlated with each other.

Which steps should the data engineer take to address this issue? (Choose two.)

A. Use a linear-based algorithm to train the model.
B. Apply principal component analysis (PCA).
C. Remove a portion of highly correlated features from the dataset.
D. Apply min-max feature scaling to the dataset.
E. Apply one-hot encoding to category-based variables.

Suggested answer: B, C

Explanation:

B) Apply principal component analysis (PCA): PCA reduces the dimensionality of a dataset by transforming the original features into a smaller set of new features (principal components) that are uncorrelated with one another and capture most of the variance in the data. This addresses multicollinearity, which occurs when features are highly correlated with each other and can cause problems for some machine learning algorithms, such as unstable coefficient estimates in linear models. By applying PCA, the data engineer can reduce the number of features and remove the redundancy in the data.
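As a rough illustration of option B, the sketch below runs PCA with plain NumPy on synthetic data. The function name `pca_reduce`, the 95% variance target, and the synthetic 100-feature matrix are all assumptions made for this example; they do not come from the question or from any AWS service.

```python
import numpy as np


def pca_reduce(X, variance_to_keep=0.95):
    """Project X onto the fewest principal components that retain
    at least `variance_to_keep` of the total variance (hypothetical helper)."""
    # Standardize so that features on large scales do not dominate.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    Z = (X - mean) / std

    # The rows of Vt are the principal directions of the standardized data.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)

    # Smallest k whose cumulative explained variance reaches the target.
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    k = min(k, len(S))
    return Z @ Vt[:k].T, explained[:k]


# Synthetic stand-in for the bank dataset: 100 features driven by
# only 3 hidden factors, so most features are strongly correlated.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 100))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 100))

components, explained = pca_reduce(X)
print("original features:", X.shape[1])
print("components kept:  ", components.shape[1])
print("variance retained:", round(float(explained.sum()), 4))
```

The resulting components are mutually uncorrelated by construction, which is why PCA removes the redundancy seen in the correlation matrix. The trade-off is that each component is a blend of the original features and is harder to interpret.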

C) Remove a portion of highly correlated features from the dataset: Another way to deal with multicollinearity is to manually remove some of the features that are highly correlated with each other. This simplifies the model and can reduce the risk of overfitting. The data engineer can use the correlation matrix to identify pairs of features with a high correlation coefficient (e.g., above 0.8 or below -0.8) and remove one feature from each pair.

References:

Principal Component Analysis: This is a document from AWS that explains what PCA is, how it works, and how to use it with Amazon SageMaker.

Multicollinearity: This is a document from AWS that describes what multicollinearity is, how to detect it, and how to deal with it.
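The correlation-based removal described under option C can be sketched as follows. This is a minimal greedy approach under assumptions: the helper name `drop_correlated`, the feature names, and the toy data are hypothetical, and the 0.8 cutoff simply reuses the example threshold from the explanation.

```python
import numpy as np


def drop_correlated(X, names, threshold=0.8):
    """Greedily keep a feature only if its absolute correlation with
    every feature already kept is at or below `threshold` (hypothetical helper)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        if all(corr[j, i] <= threshold for i in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]


# Toy data with made-up feature names: two columns are near copies
# of "income" (one positively, one negatively correlated).
rng = np.random.default_rng(0)
income = rng.normal(size=200)
age = rng.normal(size=200)
X = np.column_stack([
    income,
    2.0 * income + 0.01 * rng.normal(size=200),  # correlation near +1
    age,
    -income + 0.01 * rng.normal(size=200),       # correlation near -1
])
names = ["income", "income_scaled", "age", "income_negated"]

reduced, kept = drop_correlated(X, names)
print(kept)
```

Using the absolute value covers both the "above 0.8" and "below -0.8" cases. Which member of a correlated pair survives depends on column order here; in practice that choice is usually guided by domain knowledge or data quality.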

asked 16/09/2024
Liusel Herrera Garcia