Amazon MLS-C01 Practice Test - Questions & Answers, Page 10
List of questions
Question 91

A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor, and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset.
Which tool should be used to improve the validation accuracy?
Explanation:
Term frequency-inverse document frequency (TF-IDF) is a technique that assigns a weight to each word in a document based on how important it is to the meaning of the document. The term frequency (TF) measures how often a word appears in a document, while the inverse document frequency (IDF) measures how rare a word is across a collection of documents. The TF-IDF weight is the product of the TF and IDF values, and it is high for words that are frequent in a specific document but rare in the overall corpus. TF-IDF can help improve the validation accuracy of a sentiment analysis model by reducing the impact of common words that have little or no sentiment value, such as "the", "a", "and", etc. Scikit-learn is a popular Python library for machine learning that provides a TF-IDF vectorizer class that can transform a collection of text documents into a matrix of TF-IDF features. By using this tool, the Data Scientist can create a more informative and discriminative feature representation for the sentiment analysis task.
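The snippet below is a minimal sketch, not part of the original question, showing how scikit-learn's TfidfVectorizer can feed TF-IDF features into a simple classifier; the example documents and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus and sentiment labels (1 = positive, 0 = negative) for illustration only.
docs = ["the movie was wonderful", "the plot was dull and slow", "a truly great film"]
labels = [1, 0, 1]

# TF-IDF down-weights very common words and up-weights rare, informative ones.
model = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english"),
    LogisticRegression(),
)
model.fit(docs, labels)
print(model.predict(["what a great movie"]))
```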
References:
TfidfVectorizer - scikit-learn
Text feature extraction - scikit-learn
TF-IDF for Beginners | by Jana Schmidt | Towards Data Science
Sentiment Analysis: Concept, Analysis and Applications | by Susan Li | Towards Data Science
Question 92

A Machine Learning Specialist is developing a recommendation engine for a photography blog. Given a picture, the recommendation engine should show a picture that captures similar objects. The Specialist would like to create a numerical representation feature to perform nearest-neighbor searches.
What actions would allow the Specialist to get relevant numerical representations?
Explanation:
A neural network pre-trained on ImageNet is a deep learning model that has been trained on a large dataset of images containing 1000 classes of objects. The model can learn to extract high-level features from the images that capture the semantic and visual information of the objects. The penultimate layer of the model is the layer before the final output layer, and it contains a feature vector that represents the input image in a lower-dimensional space. By running images through a pre-trained neural network and collecting the feature vectors from the penultimate layer, the Specialist can obtain relevant numerical representations that can be used for nearest-neighbor searches. The feature vectors can capture the similarity between images based on the presence and appearance of similar objects, and they can be compared using distance metrics such as Euclidean distance or cosine similarity. This approach can enable the recommendation engine to show a picture that captures similar objects to a given picture.
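As an illustration of this approach, the sketch below uses a torchvision ResNet-50 pre-trained on ImageNet, replaces its final classification layer, and collects penultimate-layer feature vectors that can be compared with cosine similarity. torchvision is one of several libraries that could be used here, and the image paths are placeholders.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a ResNet-50 pre-trained on ImageNet and drop its final classification layer,
# keeping the penultimate 2048-dimensional feature representation.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized feature vector for one image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        vec = resnet(img).squeeze(0)
    return vec / vec.norm()

# Cosine similarity between embeddings approximates visual similarity, e.g.:
# similarity = torch.dot(embed("photo_a.jpg"), embed("photo_b.jpg"))
```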
References:
ImageNet - Wikipedia
How to use a pre-trained neural network to extract features from images | by Rishabh Anand | Analytics Vidhya | Medium
Image Similarity using Deep Ranking | by Aditya Oke | Towards Data Science
Question 93

A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users.
The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features, including user age, device, location, and play patterns.
Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory.
Which of the following approaches should the Data Science team take to mitigate this issue? (Select TWO.)
Explanation:
The Data Science team is facing a problem of imbalanced data, where the positive class (paid users) is much less frequent than the negative class (non-paid users). This can cause the random forest model to be biased towards the majority class and have poor performance on the minority class. To mitigate this issue, the Data Science team can try the following approaches:
C) Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. This is a technique called data augmentation, which can help increase the size and diversity of the training data for the minority class. This can help the random forest model learn more features and patterns from the positive class and reduce the imbalance ratio.
D) Change the cost function so that false negatives have a higher impact on the cost value than false positives. This is a technique called cost-sensitive learning, which can assign different weights or costs to different classes or errors. By assigning a higher cost to false negatives (predicting non-paid when the user actually paid), the random forest model can be more sensitive to the minority class and try to minimize the misclassification of the positive class. A short sketch of both techniques appears below.
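The following is a minimal sketch of how (C) noise-augmented oversampling and (D) class weighting could look with scikit-learn. The synthetic data is far smaller than the dataset in the question, and the noise scale and class weights are illustrative assumptions, not values from the source.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Downsized stand-in for the imbalanced dataset described above.
X_pos = rng.normal(size=(1_000, 200))   # paid users (minority class)
X_neg = rng.normal(size=(9_000, 200))   # non-paid users (majority class)

# (C) Augment the minority class: duplicate positives and add small Gaussian noise.
X_pos_aug = np.vstack([X_pos, X_pos + rng.normal(scale=0.01, size=X_pos.shape)])

X = np.vstack([X_pos_aug, X_neg])
y = np.concatenate([np.ones(len(X_pos_aug)), np.zeros(len(X_neg))]).astype(int)

# (D) Cost-sensitive learning: weight the positive class higher so misclassifying
# a paid user (false negative) costs more than misclassifying a non-paid user.
clf = RandomForestClassifier(n_estimators=100, class_weight={0: 1, 1: 50}, n_jobs=-1)
clf.fit(X, y)
```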
References:
Bagging and Random Forest for Imbalanced Classification
Surviving in a Random Forest with Imbalanced Datasets
machine learning - random forest for imbalanced data? - Cross Validated
Biased Random Forest For Dealing With the Class Imbalance Problem
Question 94

While reviewing the histogram of residuals on regression evaluation data, a Machine Learning Specialist notices that the residuals do not form a zero-centered bell shape, as shown. What does this mean?
Explanation:
Residuals are the differences between the actual and predicted values of the target variable in a regression model. A histogram of residuals is a graphical tool that can help evaluate the performance and assumptions of the model. Ideally, the histogram of residuals should have a zero-centered bell shape, which indicates that the residuals are normally distributed with a mean of zero and a constant variance. This means that the model has captured the true relationship between the input and output variables, and that the errors are random and unbiased. However, if the histogram of residuals does not have a zero-centered bell shape, as shown in the image, this means that the model might have prediction errors over a range of target values. This is because the residuals do not form a symmetrical and homogeneous distribution around zero, which implies that the model has some systematic bias or heteroscedasticity. This can affect the accuracy and validity of the model, and indicate that the model needs to be improved or modified.
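The sketch below, using synthetic data rather than the evaluation data from the question, shows the kind of residual histogram check described above: compute actual minus predicted values and confirm the distribution is roughly bell-shaped and centered at zero.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_true = rng.normal(loc=100, scale=10, size=1_000)
y_pred = y_true + rng.normal(loc=0, scale=3, size=1_000)  # a well-behaved model
residuals = y_true - y_pred

plt.hist(residuals, bins=30, edgecolor="black")
plt.axvline(residuals.mean(), color="red", linestyle="--",
            label=f"mean = {residuals.mean():.2f}")
plt.xlabel("Residual (actual - predicted)")
plt.ylabel("Count")
plt.title("Residual histogram: ideally bell-shaped and centered at zero")
plt.legend()
plt.show()
```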
References:
Residual Analysis in Regression - Statistics By Jim
How to Check Residual Plots for Regression Analysis - dummies
Histogram of Residuals - Statistics How To
Question 95

During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates. What is the MOST likely cause of this issue?
Explanation:
Mini-batch gradient descent is a variant of gradient descent that updates the model parameters using a subset of the training data (called a mini-batch) at each iteration. The learning rate is a hyperparameter that controls how much the model parameters change in response to the gradient. If the learning rate is very high, the model parameters may overshoot the optimal values and oscillate around the minimum of the cost function. This can cause the training accuracy to fluctuate and prevent the model from converging to a stable solution. To avoid this issue, the learning rate should be chosen carefully, such as by using a learning rate decay schedule or an adaptive learning rate algorithm [1]. Alternatively, the batch size can be increased to reduce the variance of the gradient estimates [2]. However, the batch size should not be too big, as this can slow down the training process and reduce the generalization ability of the model [3]. Dataset shuffling and class distribution are not likely to cause oscillations in training accuracy, as they do not affect the gradient updates directly. Dataset shuffling can help avoid getting stuck in local minima and improve the convergence speed of mini-batch gradient descent [4]. Class distribution can affect the performance and fairness of the model, especially if the dataset is imbalanced, but it does not necessarily cause fluctuations in training accuracy.
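The sketch below illustrates, with a toy PyTorch model and synthetic data (all sizes and values are assumptions, not from the question), two common remedies for oscillating training accuracy: a modest initial learning rate and a step-decay schedule.

```python
import torch

model = torch.nn.Linear(20, 2)                                 # stand-in network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)       # modest initial LR
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

X = torch.randn(256, 20)
y = torch.randint(0, 2, (256,))

for epoch in range(30):
    for i in range(0, len(X), 32):                             # mini-batches of 32
        optimizer.zero_grad()
        loss = loss_fn(model(X[i:i + 32]), y[i:i + 32])
        loss.backward()
        optimizer.step()
    scheduler.step()                                           # decay LR once per epoch
```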
Question 96

A Machine Learning Specialist observes several performance problems with the training portion of a machine learning solution on Amazon SageMaker. The solution uses a large training dataset that is 2 TB in size and uses the SageMaker k-means algorithm. The observed issues include an unacceptable length of time before the training job launches and poor I/O throughput while training the model.
What should the Specialist do to address the performance issues with the current solution?
Explanation:
The input mode for the training job determines how the training data is transferred from Amazon S3 to the SageMaker instance. There are two input modes: File and Pipe. File mode copies the entire training dataset from S3 to the local file system of the instance before starting the training job. This can cause a long delay before the training job launches, especially if the dataset is large. Pipe mode streams the data from S3 to the instance as the training job runs. This can reduce the startup time and improve the I/O throughput, as the data is read in smaller batches. Therefore, to address the performance issues with the current solution, the Specialist should ensure that the input mode for the training job is set to Pipe. This can be done by using the SageMaker Python SDK and setting the input_mode parameter to Pipe when creating the estimator or when calling the fit method [1, 2]. Alternatively, this can be done by using the AWS CLI and setting the InputMode parameter to Pipe when creating the training job [3].
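A minimal sketch of requesting Pipe mode with the SageMaker Python SDK follows. The image URI, IAM role ARN, S3 paths, and instance type are placeholders and would need real values before this could run.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<training-image-uri>",                       # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    input_mode="Pipe",               # stream data from S3 instead of copying it first
    output_path="s3://my-bucket/output/",                   # placeholder
    sagemaker_session=session,
)

train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",                        # placeholder
    content_type="application/x-recordio-protobuf",
    input_mode="Pipe",               # input mode can also be set per channel
)

estimator.fit({"train": train_input})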
References:
Access Training Data - Amazon SageMaker
Choosing Data Input Mode Using the SageMaker Python SDK - Amazon SageMaker
CreateTrainingJob - Amazon SageMaker Service
Question 97

A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense, fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes.
Which function will produce the desired output?
Explanation:
The softmax function is a function that can transform a vector of arbitrary real values into a vector of real values in the range (0,1) that sum to 1. This means that the softmax function can produce a valid probability distribution over multiple classes. The softmax function is often used as the activation function of the output layer in a neural network, especially for multi-class classification problems. The softmax function can assign higher probabilities to the classes with higher scores, which allows the network to make predictions based on the most likely class. In this case, the Machine Learning Specialist wants to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes of animals. Therefore, the softmax function is the most suitable function to produce the desired output.
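The sketch below shows softmax applied to 10 raw scores (logits) such as those produced by the final dense layer; the logit values are made up for illustration.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max()      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 0.5, -1.2, 0.1, 3.3, -0.4, 1.1, 0.0, -2.0, 0.7])
probs = softmax(logits)

print(probs)         # 10 values, each in (0, 1)
print(probs.sum())   # 1.0 -- a valid probability distribution over the 10 classes
```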
References:
Softmax Activation Function for Deep Learning: A Complete Guide
What is Softmax in Machine Learning? - reason.town
machine learning - Why is the softmax function often used as activation ...
Multi-Class Neural Networks: Softmax | Machine Learning | Google for ...
Question 98

A Machine Learning Specialist is building a model that will perform time series forecasting using Amazon SageMaker. The Specialist has finished training the model and is now planning to perform load testing on the endpoint so they can configure Auto Scaling for the model variant.
Which approach will allow the Specialist to review the latency, memory utilization, and CPU utilization during the load test?
Explanation:
Amazon CloudWatch is a service that can monitor and collect various metrics and logs from AWS resources, such as Amazon SageMaker. Amazon CloudWatch can also generate dashboards to create a single view for the metrics and logs that are of interest. By using Amazon CloudWatch, the Machine Learning Specialist can review the latency, memory utilization, and CPU utilization during the load test, as these are some of the metrics that are outputted by Amazon SageMaker. The Specialist can create a custom dashboard that displays these metrics in different widgets, such as graphs, tables, or text. The dashboard can also be configured to refresh automatically and show the latest data as the load test is running. This approach will allow the Specialist to monitor the performance and resource utilization of the model variant and adjust the Auto Scaling configuration accordingly.
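As a rough illustration, the sketch below pulls a few of the metrics SageMaker publishes to CloudWatch for a hosted endpoint; the endpoint and variant names are placeholders, and the exact metrics available depend on the endpoint configuration.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.utcnow()

def fetch(namespace, metric, stat="Average"):
    """Return one hour of 1-minute datapoints for a single endpoint metric."""
    return cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=[
            {"Name": "EndpointName", "Value": "my-forecast-endpoint"},  # placeholder
            {"Name": "VariantName", "Value": "AllTraffic"},             # placeholder
        ],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=60,
        Statistics=[stat],
    )["Datapoints"]

latency = fetch("AWS/SageMaker", "ModelLatency")               # reported in microseconds
cpu = fetch("/aws/sagemaker/Endpoints", "CPUUtilization")      # percent
memory = fetch("/aws/sagemaker/Endpoints", "MemoryUtilization")
print(len(latency), len(cpu), len(memory))
```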
References:
Monitoring Amazon SageMaker with Amazon CloudWatch - Amazon SageMaker
Using Amazon CloudWatch Dashboards - Amazon CloudWatch
Create a CloudWatch Dashboard - Amazon CloudWatch
Question 99

An Amazon SageMaker notebook instance is launched into Amazon VPC. The SageMaker notebook references data contained in an Amazon S3 bucket in another account. The bucket is encrypted using SSE-KMS. The instance returns an access denied error when trying to access data in Amazon S3.
Which of the following are required to access the bucket and avoid the access denied error? (Select THREE)
Explanation:
To access an Amazon S3 bucket in another account that is encrypted using SSE-KMS, the following are required:
A) An AWS KMS key policy that allows access to the customer master key (CMK). The CMK is the encryption key that is used to encrypt and decrypt the data in the S3 bucket. The KMS key policy defines who can use and manage the CMK. To allow access to the CMK from another account, the key policy must include a statement that grants the necessary permissions (such as kms:Decrypt) to the principal from the other account (such as the SageMaker notebook IAM role); a sketch of such a statement follows this list.
B) A SageMaker notebook security group that allows access to Amazon S3. A security group is a virtual firewall that controls the inbound and outbound traffic for the SageMaker notebook instance. To allow the notebook instance to access the S3 bucket, the security group must have a rule that allows outbound traffic to the S3 endpoint on port 443 (HTTPS).
C) An IAM role that allows access to the specific S3 bucket. An IAM role is an identity that can be assumed by the SageMaker notebook instance to access AWS resources. The IAM role must have a policy that grants the necessary permissions (such as s3:GetObject) to access the specific S3 bucket. The policy must also grant permission to use the CMK in the other account (for example, kms:Decrypt on the key's ARN).
The following are not required or correct:
D) A permissive S3 bucket policy. A bucket policy is a resource-based policy that defines who can access the S3 bucket and what actions they can perform. A permissive bucket policy is not required and not recommended, as it can expose the bucket to unauthorized access. A bucket policy should follow the principle of least privilege and grant the minimum permissions necessary to the specific principals that need access.
E) An S3 bucket owner that matches the notebook owner. The S3 bucket owner and the notebook owner do not need to match, as long as the bucket owner grants cross-account access to the notebook owner through the KMS key policy and the bucket policy (if applicable).
F) A SageMaker notebook subnet ACL that allows traffic to Amazon S3. A subnet ACL is a network access control list that acts as an optional layer of security for the SageMaker notebook instance's subnet. A subnet ACL is not required to access the S3 bucket, as the security group is sufficient to control the traffic. However, if a subnet ACL is used, it must not block the traffic to the S3 endpoint.
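The sketch below shows, as Python dictionaries, the general shape of the cross-account KMS key policy statement from (A) and the notebook role's S3/KMS permissions from (C). All account IDs, ARNs, bucket names, and the key ID are placeholders.

```python
import json

# (A) Statement added to the KMS key policy in the bucket owner's account,
# granting decrypt access to the notebook's IAM role in the other account.
kms_key_policy_statement = {
    "Sid": "AllowNotebookRoleToDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/SageMakerNotebookRole"},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",
}

# (C) Policy attached to the SageMaker notebook's IAM role, allowing it to read
# the specific bucket and use the CMK that encrypts it.
notebook_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-data-bucket",
                "arn:aws:s3:::example-data-bucket/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": "arn:aws:kms:us-east-1:444455556666:key/<key-id>",
        },
    ],
}

print(json.dumps(notebook_role_policy, indent=2))
```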
Question 100

A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance.
How should the records be stored in Amazon S3 to improve query performance?
Explanation:
Parquet is a columnar storage format that can store data in a compressed and efficient way. Parquet files can improve query performance by reducing the amount of data that needs to be scanned, as only the relevant columns are read from the files. Parquet files can also support predicate pushdown, which means that the filtering conditions are applied at the storage level, further reducing the data that needs to be processed. Parquet files are compatible with Amazon Athena, which can leverage the benefits of the columnar format and provide faster and cheaper queries. Therefore, the records should be stored in Parquet files in Amazon S3 to improve query performance.
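A minimal sketch of writing the metric records to S3 as partitioned, compressed Parquet is shown below; the bucket, prefix, and column names are placeholders, and pyarrow plus s3fs are assumed to be installed.

```python
import pandas as pd

# Toy stand-in for the metric records described in the question.
df = pd.DataFrame({
    "metric_name": ["cpu", "cpu", "memory"],
    "value": [72.1, 65.4, 48.0],
    "dt": ["2024-01-01", "2024-01-01", "2024-01-02"],   # partition column
})

df.to_parquet(
    "s3://example-metrics-bucket/parquet/",   # placeholder S3 location
    engine="pyarrow",
    compression="snappy",
    partition_cols=["dt"],   # date partitions let Athena prune whole S3 prefixes
)
```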
References:
Columnar Storage Formats - Amazon Athena
Parquet SerDe - Amazon Athena
Optimizing Amazon Athena Queries - Amazon Athena
Parquet - Apache Software Foundation