Question 205 - MLS-C01 discussion


A data scientist is using the Amazon SageMaker Neural Topic Model (NTM) algorithm to build a model that recommends tags for blog posts. The raw blog post data is stored in an Amazon S3 bucket in JSON format. During model evaluation, the data scientist discovered that the model recommends certain stopwords, such as 'a,' 'an,' and 'the,' as tags for certain blog posts, along with a few rare words that are present only in certain blog entries. After a few iterations of tag review with the content team, the data scientist notices that the rare words are unusual but feasible. The data scientist also must ensure that the tag recommendations of the generated model do not include the stopwords.

What should the data scientist do to meet these requirements?

A. Use the Amazon Comprehend entity recognition API operations. Remove the detected words from the blog post data. Replace the blog post data source in the S3 bucket.

B. Run the SageMaker built-in principal component analysis (PCA) algorithm with the blog post data from the S3 bucket as the data source. Replace the blog post data in the S3 bucket with the results of the training job.

C. Use the SageMaker built-in Object Detection algorithm instead of the NTM algorithm for the training job to process the blog post data.

D. Remove the stop words from the blog post data by using the Count Vectorizer function in the scikit-learn library. Replace the blog post data in the S3 bucket with the results of the vectorizer.
Suggested answer: D

Explanation:

The data scientist should remove the stop words from the blog post data by using the CountVectorizer function in the scikit-learn library and replace the blog post data in the S3 bucket with the output of the vectorizer. This is because:

CountVectorizer is a scikit-learn tool that converts a collection of text documents into a matrix of token counts [1]. It also supports pre-processing the text before the vector representation is generated, such as stripping accents, lowercasing, and filtering out stop words [1]. By using it, the data scientist can remove stop words such as 'a,' 'an,' and 'the' from the blog post data and obtain a numerical representation of the text that can be used as input for the NTM algorithm.
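
As an illustration, here is a minimal sketch of this pre-processing step, using a hypothetical two-post sample in place of the real S3 data. stop_words='english' applies scikit-learn's built-in English stop word list, and rare words are kept by default (min_df=1), so the unusual but feasible words remain available as tag candidates.

```python
# Minimal sketch: removing English stop words with scikit-learn's
# CountVectorizer. The sample blog posts below are hypothetical
# stand-ins for the JSON data stored in S3.
from sklearn.feature_extraction.text import CountVectorizer

blog_posts = [
    "A data scientist builds a topic model for the blog.",
    "The model recommends tags for an article about rare topics.",
]

# stop_words='english' filters out common words such as
# 'a', 'an', and 'the' before the token counts are computed.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(blog_posts)

print(vectorizer.get_feature_names_out())  # vocabulary without stop words
print(counts.toarray())                    # bag-of-words count matrix
```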

The NTM algorithm is a neural network-based topic modeling technique that can learn latent topics from a corpus of documents [2]. It can be used to recommend tags from blog posts by finding the most probable topics for each document and ranking the words associated with each topic [3]. However, the NTM algorithm does not perform any text pre-processing itself, so it relies on the quality of the input data. The data scientist should therefore replace the blog post data in the S3 bucket with the results of the vectorizer, ensuring that the NTM algorithm never sees the stop words and cannot include them in the tag recommendations.
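
Continuing the sketch above, the vectorizer output can be serialized to the RecordIO-protobuf format that NTM accepts and uploaded back to S3. The bucket name and object key below are hypothetical placeholders, and the snippet assumes the SageMaker Python SDK and boto3 are installed.

```python
import io

import boto3
import sagemaker.amazon.common as smac

# 'counts' is the sparse bag-of-words matrix produced by
# CountVectorizer in the previous sketch.
buf = io.BytesIO()
smac.write_spmatrix_to_sparse_tensor(buf, counts)
buf.seek(0)

# Upload the serialized training data to a hypothetical S3 location
# that the NTM training job would read from.
boto3.resource("s3").Bucket("my-blog-bucket").Object(
    "ntm/train/blog_posts.pbr"
).upload_fileobj(buf)
```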

The other options are not suitable for the following reasons:

Option A is not relevant because the Amazon Comprehend entity recognition API operations detect and extract named entities from text, such as people, places, organizations, and dates [4]. This is not the same as removing stop words, which are common words that carry little meaning or information. Moreover, removing the detected entities from the blog post data may reduce the quality and diversity of the tag recommendations, as some entities may be relevant and useful as tags.

Option B is not optimal because the SageMaker built-in principal component analysis (PCA) algorithm reduces the dimensionality of a dataset by finding the features that capture the maximum amount of variance in the data [5]. This is not the same as removing stop words, which are words with low variance and high frequency in the data. Moreover, replacing the blog post data in the S3 bucket with the results of the PCA algorithm may not be compatible with the input format expected by the NTM algorithm, which requires a bag-of-words representation of the text [2].

Option C is not suitable because the SageMaker built-in Object Detection algorithm detects and localizes objects in images [6]. This is unrelated to the task of recommending tags from blog posts, which are text documents. Moreover, using the Object Detection algorithm instead of the NTM algorithm would require a different type of input data (images instead of text) and would produce a different type of output (bounding boxes and labels instead of topics and words).

References:

[1] sklearn.feature_extraction.text.CountVectorizer
[2] Neural Topic Model (NTM) Algorithm
[3] Introduction to the Amazon SageMaker Neural Topic Model
[4] Amazon Comprehend - Entity Recognition
[5] Principal Component Analysis (PCA) Algorithm
[6] Object Detection Algorithm
