Question 96 - Professional Machine Learning Engineer discussion


You are creating a deep neural network classification model using a dataset with categorical input values. Certain columns have a cardinality greater than 10,000 unique values. How should you encode these categorical values as input into the model?

A. Convert each categorical value into an integer value.
B. Convert the categorical string data to one-hot hash buckets.
C. Map the categorical variables into a vector of boolean values.
D. Convert each categorical value into a run-length encoded string.

Suggested answer: B

Explanation:

Option A is incorrect because integer encoding is a poor fit for high-cardinality categorical values. Assigning each category an integer implies an ordinal relationship that usually does not exist. For example, mapping "red", "green", and "blue" to 1, 2, and 3 suggests that blue is greater than red, although the colors have no inherent order [1].
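
As a rough illustration of the ordering problem, the sketch below uses a hypothetical mapping and a hypothetical `distance` helper (neither comes from the question) to show that integer codes make some categories look closer together than others:

```python
# Hypothetical integer encoding of three unordered categories.
color_to_int = {"red": 1, "green": 2, "blue": 3}


def distance(a: str, b: str) -> int:
    """Absolute difference between the integer codes of two categories."""
    return abs(color_to_int[a] - color_to_int[b])


# The codes suggest that red is "nearer" to green than to blue,
# which is an artifact of the encoding and not a property of the colors.
red_green = distance("red", "green")
red_blue = distance("red", "blue")
```

A model that consumes these integers directly can pick up on that artificial distance.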

Option B is correct because hashing the categorical strings into one-hot buckets is a suitable encoding for high-cardinality values. A hash function maps each category to one of a fixed number of buckets, and the bucket index is then one-hot encoded as a binary vector in which exactly one element is 1 and the rest are 0. The input stays sparse, and its width is set by the number of buckets rather than by the number of distinct categories, at the cost of occasional collisions in which different categories share a bucket [2].
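
A minimal sketch of one-hot hash bucketing, using only the Python standard library. The bucket count, the choice of MD5 as the hash, and the function names are assumptions made for illustration; ML frameworks ship their own hashing layers, and this is not meant to reproduce any of them.

```python
import hashlib
from typing import List

NUM_BUCKETS = 1000  # assumed bucket count; far smaller than 10,000+ categories


def hash_bucket(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a category string to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_buckets


def one_hot_hash(value: str, num_buckets: int = NUM_BUCKETS) -> List[int]:
    """Return a one-hot vector with a single 1 at the hashed bucket index."""
    vector = [0] * num_buckets
    vector[hash_bucket(value, num_buckets)] = 1
    return vector


# "user_48213" is a made-up category value.
encoded = one_hot_hash("user_48213")
```

The vector length depends only on `NUM_BUCKETS`, so unseen categories need no vocabulary update. Two different categories can land in the same bucket, and a larger bucket count makes that less likely.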

Option C is incorrect because mapping the categorical variables into a vector of boolean values does not scale to high cardinality. With one boolean flag per category, a column with more than 10,000 unique values needs a vector of more than 10,000 elements for every example, which is impractical to store and process [3].

Option D is incorrect because run-length encoding is a string compression technique, not a feature encoding. It replaces each run of consecutive repeated characters with the character and the run length, so "AAAABBBCC" becomes "A4B3C2". The result is still a string, so it neither reduces the dimensionality of the input space nor preserves the semantic meaning of the categories [4].
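
For reference, run-length encoding as described above can be sketched in a few lines. The function name is hypothetical, and the sketch follows the character-then-count format of the example:

```python
from itertools import groupby


def run_length_encode(text: str) -> str:
    """Replace each run of repeated characters with the character and its count."""
    return "".join(f"{char}{len(list(run))}" for char, run in groupby(text))


encoded_text = run_length_encode("AAAABBBCC")  # "A4B3C2"
```

The output is another string of roughly the same kind as the input, which is why it gives a neural network nothing more usable than the original category label.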

[1] Encoding categorical features
[2] One-hot hash buckets
[3] Boolean vector
[4] Run-length encoding

asked 18/09/2024
Sasa Korlat
36 questions