ExamGecko
Question 139 - Professional Machine Learning Engineer discussion


You are developing an ML model using a dataset with categorical input variables. You have randomly split half of the data into training and test sets. After applying one-hot encoding on the categorical variables in the training set, you discover that one categorical variable is missing from the test set. What should you do?

A. Randomly redistribute the data, with 70% for the training set and 30% for the test set
B. Use sparse representation in the test set
C. Apply one-hot encoding on the categorical variables in the test data
D. Collect more data representing all categories
Suggested answer: C

Explanation:

The best option is to apply one-hot encoding to the categorical variables in the test data, using the same category-to-column mapping that was used for the training set. The "missing categorical variable" is best read as a category that appears in the training set but in no test row, so the test set would otherwise lack that one-hot column. This option has the following advantages:

It keeps the data format consistent for the ML model. One-hot encoding transforms each categorical variable into a binary vector, and when the test data is encoded with the categories learned from the training set, the test features match the training features in number and order. The column for the category that is absent from the test set is still present and contains only zeros, so prediction does not fail on a feature mismatch.

It preserves the information in the data. One-hot encoding creates a separate feature for each possible value of the categorical variable, assigning 1 to the feature that corresponds to the observed value and 0 to the rest. Encoding the test data this way retains the original meaning of each category and avoids loss or distortion of the data.
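As a minimal sketch of this idea, the pure-Python example below learns the category list from the training column and reuses it for the test column. The helper names `fit_categories` and `one_hot` and the color data are hypothetical, chosen only for illustration; in practice a library encoder that is fitted on the training set and then applied to the test set plays the same role.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from the training column."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode values against a fixed category list.

    A value that is not in `categories` is encoded as an all-zero row.
    """
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows


# Hypothetical data: "blue" occurs in training but in no test row.
train_colors = ["red", "green", "blue", "green"]
test_colors = ["red", "green"]

categories = fit_categories(train_colors)    # learned from training data only
train_X = one_hot(train_colors, categories)
test_X = one_hot(test_colors, categories)    # same columns, same order
```

Because the category list comes from the training column, every test row has three columns even though "blue" never occurs in the test data; that column is simply zero in each test row.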

The other options are less optimal for the following reasons:

Option A: Randomly redistributing the data, with 70% for the training set and 30% for the test set, adds work and risk without fixing the problem. The data has to be reshuffled and split again, and because the split is still random, there is no guarantee that the missing category will appear in the new test set. Re-splitting also changes the training set, so the model would have to be retrained on different examples and its earlier results would no longer be comparable.
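To make the "no guarantee" point concrete, the sketch below computes the chance that a random test set contains no row of a given category, assuming rows are drawn without replacement and uniformly at random. The function name and the numbers (100 rows, a category that occurs once) are hypothetical and only illustrate the counting argument.

```python
from math import comb


def prob_category_absent(n_rows, n_with_category, test_size):
    """Probability that a random test set of `test_size` rows
    contains none of the `n_with_category` rows of a given category."""
    # Ways to pick the test set only from rows without the category,
    # divided by all ways to pick the test set.
    return comb(n_rows - n_with_category, test_size) / comb(n_rows, test_size)


# Hypothetical dataset: 100 rows, the category occurs in exactly one row.
p_half_split = prob_category_absent(100, 1, 50)   # 50/50 split
p_70_30_split = prob_category_absent(100, 1, 30)  # 70/30 split
```

Under these assumptions the category is absent from the test set about half the time with a 50/50 split, and more often with a 30% test set, since a smaller test set is less likely to pick up a rare category.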

Option B: Using sparse representation in the test set does not address the problem. A sparse representation is a storage format for vectors that are mostly zero: it stores only the indices and values of the non-zero elements. It does not change which features exist, so it cannot supply the column that is missing from the test set. The model still expects test input with the same features and dimensionality as the one-hot encoded training data, and converting only the test set adds a processing step without resolving the mismatch.
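A small sketch can show why a sparse format alone does not help. The helpers `to_sparse` and `to_dense` are hypothetical, written only for this illustration; they store a row as its length plus a mapping from index to non-zero value.

```python
def to_sparse(row):
    """Store a dense row as (length, {index: value}) for its non-zero entries."""
    return len(row), {i: v for i, v in enumerate(row) if v != 0}


def to_dense(length, nonzero):
    """Rebuild the dense row from its sparse form."""
    return [nonzero.get(i, 0) for i in range(length)]


# Hypothetical test row that was encoded with only the two categories
# seen in the test set, so it has two columns instead of three.
dense_row = [0, 1]
length, nonzero = to_sparse(dense_row)
round_trip = to_dense(length, nonzero)
```

The sparse form still describes a two-column row, and converting it back yields the same two columns. The number of features is decided by the encoding step, not by how the resulting vectors are stored.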

Option D: Collecting more data representing all categories adds cost and delay. Obtaining and labeling more data that contains the missing category can be expensive and time-consuming. It may also be infeasible or unnecessary: depending on the data source or the business problem, examples of that category may not be available or may not be relevant to the test data.

asked 18/09/2024
Jesus De Leon Luis