Amazon MLS-C01 Practice Test - Questions Answers, Page 10

A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor, and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset.

Which tool should be used to improve the validation accuracy?

A. Amazon Comprehend syntax analysis and entity detection
B. Amazon SageMaker BlazingText cbow mode
C. Natural Language Toolkit (NLTK) stemming and stop word removal
D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizers
Suggested answer: D

Explanation:

Term frequency-inverse document frequency (TF-IDF) is a technique that assigns a weight to each word in a document based on how important it is to the meaning of the document. The term frequency (TF) measures how often a word appears in a document, while the inverse document frequency (IDF) measures how rare a word is across a collection of documents. The TF-IDF weight is the product of the TF and IDF values, and it is high for words that are frequent in a specific document but rare in the overall corpus. TF-IDF can help improve the validation accuracy of a sentiment analysis model by reducing the impact of common words that have little or no sentiment value, such as "the", "a", "and", etc. Scikit-learn is a popular Python library for machine learning that provides a TF-IDF vectorizer class that can transform a collection of text documents into a matrix of TF-IDF features. By using this tool, the Data Scientist can create a more informative and discriminative feature representation for the sentiment analysis task.
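
A minimal sketch of how a scikit-learn TF-IDF vectorizer could be wired into a sentiment classifier; the toy sentences, labels, and the LogisticRegression classifier are illustrative assumptions, not part of the question:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative toy data; a real sentiment dataset would be far larger.
texts = ["the movie was wonderful", "the plot was dull and slow",
         "a truly great performance", "boring and forgettable"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

# TF-IDF down-weights words that appear in many documents ("the", "was")
# and up-weights rare, discriminative words ("wonderful", "dull").
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["a wonderful and great movie"]))  # likely [1] on this toy data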

References:

TfidfVectorizer - scikit-learn

Text feature extraction - scikit-learn

TF-IDF for Beginners | by Jana Schmidt | Towards Data Science

Sentiment Analysis: Concept, Analysis and Applications | by Susan Li | Towards Data Science

A Machine Learning Specialist is developing a recommendation engine for a photography blog. Given a picture, the recommendation engine should show a picture that captures similar objects. The Specialist would like to create a numerical representation feature to perform nearest-neighbor searches.

What actions would allow the Specialist to get relevant numerical representations?

A. Reduce image resolution and use reduced resolution pixel values as features.
B. Use Amazon Mechanical Turk to label image content and create a one-hot representation indicating the presence of specific labels.
C. Run images through a neural network pre-trained on ImageNet, and collect the feature vectors from the penultimate layer.
D. Average colors by channel to obtain three-dimensional representations of images.
Suggested answer: C

Explanation:

A neural network pre-trained on ImageNet is a deep learning model that has been trained on a large dataset of images containing 1000 classes of objects. The model can learn to extract high-level features from the images that capture the semantic and visual information of the objects. The penultimate layer of the model is the layer before the final output layer, and it contains a feature vector that represents the input image in a lower-dimensional space. By running images through a pre-trained neural network and collecting the feature vectors from the penultimate layer, the Specialist can obtain relevant numerical representations that can be used for nearest-neighbor searches. The feature vectors can capture the similarity between images based on the presence and appearance of similar objects, and they can be compared using distance metrics such as Euclidean distance or cosine similarity. This approach can enable the recommendation engine to show a picture that captures similar objects to a given picture.
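
One possible sketch of this approach, assuming PyTorch and torchvision (0.13 or later for the weights API); the choice of ResNet-50 and the image path are illustrative, as the question does not name a specific network:

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a ResNet-50 pre-trained on ImageNet and replace its final
# classification layer, leaving the 2048-dim penultimate features.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(image_path):
    # One feature vector per image, L2-normalized so that nearest-neighbor
    # search with cosine similarity reduces to a dot product.
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        v = resnet(x).squeeze(0)
    return v / v.norm()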

References:

ImageNet - Wikipedia

How to use a pre-trained neural network to extract features from images | by Rishabh Anand | Analytics Vidhya | Medium

Image Similarity using Deep Ranking | by Aditya Oke | Towards Data Science

A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users.

The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns.

Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory.

Which of the following approaches should the Data Science team take to mitigate this issue? (Select TWO.)

A. Add more deep trees to the random forest to enable the model to learn more features.
B. Include a copy of the samples in the test dataset in the training dataset.
C. Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.
D. Change the cost function so that false negatives have a higher impact on the cost value than false positives.
E. Change the cost function so that false positives have a higher impact on the cost value than false negatives.
Suggested answer: C, D

Explanation:

The Data Science team is facing a problem of imbalanced data, where the positive class (paid users) is much less frequent than the negative class (non-paid users). This can cause the random forest model to be biased towards the majority class and have poor performance on the minority class. To mitigate this issue, the Data Science team can try the following approaches:

C) Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. This is a technique called data augmentation, which can help increase the size and diversity of the training data for the minority class. This can help the random forest model learn more features and patterns from the positive class and reduce the imbalance ratio.

D) Change the cost function so that false negatives have a higher impact on the cost value than false positives. This is a technique called cost-sensitive learning, which can assign different weights or costs to different classes or errors. By assigning a higher cost to false negatives (predicting non-paid when the user is actually paid), the random forest model can be more sensitive to the minority class and try to minimize the misclassification of the positive class.
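
A minimal sketch of both suggested approaches using NumPy and scikit-learn; the oversampling factor, noise scale, class-weight ratio, and synthetic data are illustrative assumptions that would need tuning on real data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def oversample_with_noise(X_pos, factor=10, scale=0.01):
    # Approach C: duplicate minority-class rows and add small Gaussian
    # noise so the copies are not exact repeats of the originals.
    copies = np.repeat(X_pos, factor, axis=0)
    return copies + rng.normal(0.0, scale, size=copies.shape)

# Tiny synthetic stand-in for the imbalanced 200-feature dataset.
X_neg = rng.normal(0.0, 1.0, size=(999, 5))
X_pos = rng.normal(1.0, 1.0, size=(10, 5))
X = np.vstack([X_neg, X_pos, oversample_with_noise(X_pos)])
y = np.array([0] * 999 + [1] * (10 + 100))

# Approach D expressed as class weights: a false negative (missing a
# paying user) is penalized far more heavily than a false positive.
clf = RandomForestClassifier(n_estimators=200, class_weight={0: 1, 1: 99})
clf.fit(X, y)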

References:

Bagging and Random Forest for Imbalanced Classification

Surviving in a Random Forest with Imbalanced Datasets

machine learning - random forest for imbalanced data? - Cross Validated

Biased Random Forest For Dealing With the Class Imbalance Problem

While reviewing the histogram for residuals on regression evaluation data, a Machine Learning Specialist notices that the residuals do not form a zero-centered bell shape, as shown. What does this mean?

A. The model might have prediction errors over a range of target values.
B. The dataset cannot be accurately represented using the regression model.
C. There are too many variables in the model.
D. The model is predicting its target values perfectly.
Suggested answer: A

Explanation:

Residuals are the differences between the actual and predicted values of the target variable in a regression model. A histogram of residuals is a graphical tool that can help evaluate the performance and assumptions of the model. Ideally, the histogram of residuals should have a zero-centered bell shape, which indicates that the residuals are normally distributed with a mean of zero and a constant variance. This means that the model has captured the true relationship between the input and output variables, and that the errors are random and unbiased. However, if the histogram of residuals does not have a zero-centered bell shape, as shown in the image, this means that the model might have prediction errors over a range of target values. This is because the residuals do not form a symmetrical and homogeneous distribution around zero, which implies that the model has some systematic bias or heteroscedasticity. This can affect the accuracy and validity of the model, and indicate that the model needs to be improved or modified.
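
A small synthetic illustration of how a skewed, off-center residual histogram can arise; the toy "model" below simply under-predicts large targets, which is one of many possible sources of systematic bias:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 10, 500)

# A biased predictor: it shrinks every target toward zero, so the
# residuals grow with the target value instead of centering on zero.
y_pred = 0.8 * y_true + 0.5
residuals = y_true - y_pred

plt.hist(residuals, bins=30)
plt.axvline(0, color="red")
plt.xlabel("residual (actual - predicted)")
plt.title("Non-zero-centered residuals indicate systematic error")
plt.show()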

References:

Residual Analysis in Regression - Statistics By Jim

How to Check Residual Plots for Regression Analysis - dummies

Histogram of Residuals - Statistics How To

During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates. What is the MOST likely cause of this issue?

A. The class distribution in the dataset is imbalanced.
B. Dataset shuffling is disabled.
C. The batch size is too big.
D. The learning rate is very high.
Suggested answer: D

Explanation:

Mini-batch gradient descent is a variant of gradient descent that updates the model parameters using a subset of the training data (called a mini-batch) at each iteration. The learning rate is a hyperparameter that controls how much the model parameters change in response to the gradient. If the learning rate is very high, the model parameters may overshoot the optimal values and oscillate around the minimum of the cost function. This can cause the training accuracy to fluctuate and prevent the model from converging to a stable solution. To avoid this issue, the learning rate should be chosen carefully, for example by using a learning rate decay schedule or an adaptive learning rate algorithm. Alternatively, the batch size can be increased to reduce the variance of the gradient estimates. However, the batch size should not be too big, as this can slow down the training process and reduce the generalization ability of the model. Dataset shuffling and class distribution are not likely to cause oscillations in training accuracy, as they do not affect the gradient updates directly. Dataset shuffling can help avoid getting stuck in local minima and improve the convergence speed of mini-batch gradient descent. Class distribution can affect the performance and fairness of the model, especially if the dataset is imbalanced, but it does not necessarily cause fluctuations in training accuracy.
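
A toy one-dimensional illustration of the effect: plain gradient descent on f(w) = w^2 (gradient 2w) overshoots the minimum and oscillates whenever the step size exceeds 1.0, but converges smoothly with a small step size:

# Gradient descent on f(w) = w**2; the update is w <- w - lr * 2 * w.
def descend(lr, steps=8, w=1.0):
    path = [w]
    for _ in range(steps):
        w = w - lr * 2 * w
        path.append(round(w, 4))
    return path

print("lr=1.05:", descend(1.05))  # sign flips every step: growing oscillation
print("lr=0.10:", descend(0.10))  # smooth, monotonic convergence toward 0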

A Machine Learning Specialist observes several performance problems with the training portion of a machine learning solution on Amazon SageMaker. The solution uses a large training dataset that is 2 TB in size and uses the SageMaker k-means algorithm. The observed issues include an unacceptable length of time before the training job launches and poor I/O throughput while training the model.

What should the Specialist do to address the performance issues with the current solution?

A. Use the SageMaker batch transform feature.
B. Compress the training data into Apache Parquet format.
C. Ensure that the input mode for the training job is set to Pipe.
D. Copy the training dataset to an Amazon EFS volume mounted on the SageMaker instance.
Suggested answer: C

Explanation:

The input mode for the training job determines how the training data is transferred from Amazon S3 to the SageMaker instance. There are two input modes: File and Pipe. File mode copies the entire training dataset from S3 to the local file system of the instance before starting the training job. This can cause a long delay before the training job launches, especially if the dataset is large. Pipe mode streams the data from S3 to the instance as the training job runs. This can reduce the startup time and improve the I/O throughput, as the data is read in smaller batches. Therefore, to address the performance issues with the current solution, the Specialist should ensure that the input mode for the training job is set to Pipe. This can be done by using the SageMaker Python SDK and setting the input_mode parameter to Pipe when creating the estimator or calling the fit method. Alternatively, this can be done by using the AWS CLI and setting the InputMode parameter to Pipe when creating the training job.
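
A minimal sketch with the SageMaker Python SDK; the IAM role ARN, instance type, hyperparameter values, and S3 URI below are placeholders, not values taken from the question:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("kmeans", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",  # stream from S3 instead of copying the full 2 TB first
    sagemaker_session=session,
)
estimator.set_hyperparameters(k=10, feature_dim=200)  # placeholder values
estimator.fit({"train": "s3://my-bucket/training-data/"})  # placeholder URI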

References:

Access Training Data - Amazon SageMaker

Choosing Data Input Mode Using the SageMaker Python SDK - Amazon SageMaker

CreateTrainingJob - Amazon SageMaker Service

A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes.

Which function will produce the desired output?

A. Dropout
B. Smooth L1 loss
C. Softmax
D. Rectified linear units (ReLU)
Suggested answer: C

Explanation:

The softmax function is a function that can transform a vector of arbitrary real values into a vector of real values in the range (0,1) that sum to 1. This means that the softmax function can produce a valid probability distribution over multiple classes. The softmax function is often used as the activation function of the output layer in a neural network, especially for multi-class classification problems. The softmax function can assign higher probabilities to the classes with higher scores, which allows the network to make predictions based on the most likely class. In this case, the Machine Learning Specialist wants to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes of animals. Therefore, the softmax function is the most suitable function to produce the desired output.
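
A minimal NumPy sketch of the softmax function; the ten example logits stand in for the raw scores produced by the 10-node dense layer:

import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick; it
    # does not change the result because softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Ten raw scores (logits), one per animal class.
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0, -0.5, 1.5, 0.2, -2.0])
probs = softmax(logits)
print(probs.sum())     # ~1.0: a valid probability distribution
print(probs.argmax())  # index of the most likely class (here 0)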

References:

Softmax Activation Function for Deep Learning: A Complete Guide

What is Softmax in Machine Learning? - reason.town

machine learning - Why is the softmax function often used as activation ...

Multi-Class Neural Networks: Softmax | Machine Learning | Google for ...

A Machine Learning Specialist is building a model that will perform time series forecasting using Amazon SageMaker. The Specialist has finished training the model and is now planning to perform load testing on the endpoint so they can configure Auto Scaling for the model variant.

Which approach will allow the Specialist to review the latency, memory utilization, and CPU utilization during the load test?

A. Review SageMaker logs that have been written to Amazon S3 by leveraging Amazon Athena and Amazon QuickSight to visualize logs as they are being produced.
B. Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker.
C. Build custom Amazon CloudWatch Logs and then leverage Amazon ES and Kibana to query and visualize the data as it is generated by Amazon SageMaker.
D. Send Amazon CloudWatch Logs that were generated by Amazon SageMaker to Amazon ES and use Kibana to query and visualize the log data.
Suggested answer: B

Explanation:

Amazon CloudWatch is a service that can monitor and collect various metrics and logs from AWS resources, such as Amazon SageMaker. Amazon CloudWatch can also generate dashboards to create a single view for the metrics and logs that are of interest. By using Amazon CloudWatch, the Machine Learning Specialist can review the latency, memory utilization, and CPU utilization during the load test, as these are some of the metrics that are outputted by Amazon SageMaker. The Specialist can create a custom dashboard that displays these metrics in different widgets, such as graphs, tables, or text. The dashboard can also be configured to refresh automatically and show the latest data as the load test is running. This approach will allow the Specialist to monitor the performance and resource utilization of the model variant and adjust the Auto Scaling configuration accordingly.
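
A hedged sketch of creating such a dashboard programmatically with boto3; the endpoint name, variant name, region, and widget layout are placeholders. Endpoint instance metrics such as CPUUtilization and MemoryUtilization are published under the /aws/sagemaker/Endpoints namespace, while ModelLatency is under AWS/SageMaker:

import json
import boto3

dims = ["EndpointName", "my-endpoint", "VariantName", "AllTraffic"]  # placeholders
body = {"widgets": [
    {"type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
     "properties": {"metrics": [["AWS/SageMaker", "ModelLatency", *dims]],
                    "stat": "Average", "region": "us-east-1", "title": "Latency"}},
    {"type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
     "properties": {"metrics": [
         ["/aws/sagemaker/Endpoints", "CPUUtilization", *dims],
         ["/aws/sagemaker/Endpoints", "MemoryUtilization", *dims]],
         "stat": "Average", "region": "us-east-1", "title": "CPU / Memory"}},
]}
boto3.client("cloudwatch").put_dashboard(
    DashboardName="sagemaker-load-test", DashboardBody=json.dumps(body))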

References:

[Monitoring Amazon SageMaker with Amazon CloudWatch - Amazon SageMaker]

[Using Amazon CloudWatch Dashboards - Amazon CloudWatch]

[Create a CloudWatch Dashboard - Amazon CloudWatch]

An Amazon SageMaker notebook instance is launched into an Amazon VPC. The SageMaker notebook references data contained in an Amazon S3 bucket in another account. The bucket is encrypted using SSE-KMS. The instance returns an access denied error when trying to access data in Amazon S3.

Which of the following are required to access the bucket and avoid the access denied error? (Select THREE.)

A. An AWS KMS key policy that allows access to the customer master key (CMK)
B. A SageMaker notebook security group that allows access to Amazon S3
C. An IAM role that allows access to the specific S3 bucket
D. A permissive S3 bucket policy
E. An S3 bucket owner that matches the notebook owner
F. A SageMaker notebook subnet ACL that allows traffic to Amazon S3
Suggested answer: A, B, C

Explanation:

To access an Amazon S3 bucket in another account that is encrypted using SSE-KMS, the following are required:

A) An AWS KMS key policy that allows access to the customer master key (CMK). The CMK is the encryption key that is used to encrypt and decrypt the data in the S3 bucket. The KMS key policy defines who can use and manage the CMK. To allow access to the CMK from another account, the key policy must include a statement that grants the necessary permissions (such as kms:Decrypt) to the principal from the other account (such as the SageMaker notebook IAM role).

B) A SageMaker notebook security group that allows access to Amazon S3. A security group is a virtual firewall that controls the inbound and outbound traffic for the SageMaker notebook instance. To allow the notebook instance to access the S3 bucket, the security group must have a rule that allows outbound traffic to the S3 endpoint on port 443 (HTTPS).

C) An IAM role that allows access to the specific S3 bucket. An IAM role is an identity that can be assumed by the SageMaker notebook instance to access AWS resources. The IAM role must have a policy that grants the necessary permissions (such as s3:GetObject) to access the specific S3 bucket. The policy must also include a condition that allows access to the CMK in the other account.

The following are not required or correct:

D) A permissive S3 bucket policy. A bucket policy is a resource-based policy that defines who can access the S3 bucket and what actions they can perform. A permissive bucket policy is not required and not recommended, as it can expose the bucket to unauthorized access. A bucket policy should follow the principle of least privilege and grant the minimum permissions necessary to the specific principals that need access.

E) An S3 bucket owner that matches the notebook owner. The S3 bucket owner and the notebook owner do not need to match, as long as the bucket owner grants cross-account access to the notebook owner through the KMS key policy and the bucket policy (if applicable).

F) A SageMaker notebook subnet ACL that allows traffic to Amazon S3. A subnet ACL is a network access control list that acts as an optional layer of security for the SageMaker notebook instance's subnet. A subnet ACL is not required to access the S3 bucket, as the security group is sufficient to control the traffic. However, if a subnet ACL is used, it must not block the traffic to the S3 endpoint.
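
A hedged sketch of the cross-account KMS key policy statement described in option A, written as a Python dict; the account ID and role name are placeholders for the account that owns the SageMaker notebook:

key_policy_statement = {
    "Sid": "AllowNotebookRoleToUseKey",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/SageMakerNotebookRole"},
    "Action": [
        "kms:Decrypt",          # read SSE-KMS-encrypted objects
        "kms:DescribeKey",
        "kms:GenerateDataKey",  # only needed if the notebook also writes objects
    ],
    "Resource": "*",  # in a key policy, "*" refers to this CMK itself
}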

A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance.

How should the records be stored in Amazon S3 to improve query performance?

A. CSV files
B. Parquet files
C. Compressed JSON
D. RecordIO
Suggested answer: B

Explanation:

Parquet is a columnar storage format that can store data in a compressed and efficient way. Parquet files can improve query performance by reducing the amount of data that needs to be scanned, as only the relevant columns are read from the files. Parquet files can also support predicate pushdown, which means that the filtering conditions are applied at the storage level, further reducing the data that needs to be processed. Parquet files are compatible with Amazon Athena, which can leverage the benefits of the columnar format and provide faster and cheaper queries. Therefore, the records should be stored in Parquet files in Amazon S3 to improve query performance.
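
A minimal sketch of converting row-oriented records to Parquet with pandas (this requires pyarrow or fastparquet to be installed); the file names are illustrative:

import pandas as pd

df = pd.read_csv("metrics-2024-01-01.csv")  # placeholder input file

# Columnar, compressed Parquet lets Athena scan only the columns a
# query touches instead of reading entire row-oriented files.
df.to_parquet("metrics-2024-01-01.parquet", compression="snappy")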

References:

Columnar Storage Formats - Amazon Athena

Parquet SerDe - Amazon Athena

Optimizing Amazon Athena Queries - Amazon Athena

Parquet - Apache Software Foundation
