Question 97 - Professional Machine Learning Engineer discussion


You need to train a natural language model to perform text classification on product descriptions that contain millions of examples and 100,000 unique words. You want to preprocess the words individually so that they can be fed into a recurrent neural network. What should you do?

A. Create a one-hot encoding of words, and feed the encodings into your model.

B. Identify word embeddings from a pre-trained model, and use the embeddings in your model.

C. Sort the words by frequency of occurrence, and use the frequencies as the encodings in your model.

D. Assign a numerical value to each word from 1 to 100,000 and feed the values as inputs in your model.
Suggested answer: B

Explanation:

Option A is incorrect because creating a one-hot encoding of words and feeding the encodings into your model is not an efficient way to preprocess the words individually for a natural language model. One-hot encoding is a method of representing categorical variables as binary vectors, where each element corresponds to a category and only one element is 1 while the rest are 0 [1]. However, this method is not suitable for high-dimensional and sparse data, such as words in a large vocabulary, because it requires a lot of memory and computation, and it does not capture the semantic similarity or relationships between words [2].
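
For a rough sense of the cost, here is a minimal sketch (NumPy, with a hypothetical three-word lookup table standing in for the real vocabulary) of what a single one-hot encoded token looks like with a 100,000-word vocabulary:

import numpy as np

vocab_size = 100_000  # unique words in the product descriptions
word_index = {"laptop": 0, "wireless": 1, "charger": 2}  # hypothetical tiny lookup table

def one_hot(word: str) -> np.ndarray:
    # One 100,000-dimensional vector per token: a single 1 and 99,999 zeros
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[word_index[word]] = 1.0
    return vec

vec = one_hot("wireless")
print(vec.shape)   # (100000,)
print(vec.nbytes)  # 400000 bytes for a single token

Each token occupies roughly 400 KB as a dense float32 vector, and none of that space encodes any relationship between words.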

Option B is correct because identifying word embeddings from a pre-trained model and using the embeddings in your model is a good way to preprocess the words individually for a natural language model. Word embeddings are low-dimensional, dense vectors that represent the meaning and usage of words in a continuous space [3]. Word embeddings can be learned from a large corpus of text using neural networks such as word2vec, GloVe, or BERT [4]. Using pre-trained word embeddings can save time and resources and improve the performance of the natural language model, especially when the training data is limited or noisy [5].
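
As a rough sketch of option B (assuming TensorFlow/Keras; the embedding dimension, sequence length, number of classes, and the randomly filled embedding matrix are illustrative stand-ins for real pre-trained vectors such as GloVe):

import numpy as np
import tensorflow as tf

vocab_size = 100_000       # unique words in the corpus
embedding_dim = 300        # e.g. GloVe 300-dimensional vectors
sequence_length = 200      # hypothetical maximum tokens per product description
num_classes = 20           # hypothetical number of product categories

# Pre-trained embedding matrix: row i is the vector for word index i.
# In practice this would be loaded from GloVe/word2vec; random values stand in here.
embedding_matrix = np.random.rand(vocab_size, embedding_dim).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(sequence_length,)),
    tf.keras.layers.Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,   # freeze the pre-trained embeddings (or fine-tune later)
    ),
    tf.keras.layers.LSTM(128),                                 # recurrent layer
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # classifier head
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

Keeping the embedding layer frozen (trainable=False) preserves the pre-trained vectors as-is; switching it to True fine-tunes them on the product descriptions at the cost of more training time.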

Option C is incorrect because sorting the words by frequency of occurrence and using the frequencies as the encodings in your model is not a meaningful way to preprocess the words individually for a natural language model. This method implies that the frequency of a word is a good indicator of its importance or relevance, which may not be true. For example, the word "the" is very frequent but not very informative, while the word "unicorn" is rare but more distinctive. Moreover, this method does not capture the semantic similarity or relationships between words, and may introduce noise or bias into the model.

Option D is incorrect because assigning a numerical value to each word from 1 to 100,000 and feeding the values as inputs in your model is not a valid way to preprocess the words individually for a natural language model. This method implies an ordinal relationship between the words, which may not be true. For example, assigning the values 1, 2, and 3 to the words "apple", "banana", and "orange" does not make sense, as there is no inherent order among these fruits. Moreover, this method does not capture the semantic similarity or relationships between words, and may confuse the model with irrelevant or misleading information.

References:

[1] One-hot encoding
[2] Word embeddings
[3] Word embedding
[4] Pre-trained word embeddings
[5] Using pre-trained word embeddings in a Keras model
[6] Term frequency
[7] Term frequency-inverse document frequency
[8] Ordinal variable
[9] Encoding categorical features
