Amazon MLS-C01 Practice Test - Questions Answers, Page 11

A Machine Learning Specialist needs to create a data repository to hold a large amount of time-based training data for a new model. In the source system, new files are added every hour. Throughout a single 24-hour period, the volume of hourly updates will change significantly. The Specialist always wants to train on the last 24 hours of the data.

Which type of data repository is the MOST cost-effective solution?

A. An Amazon EBS-backed Amazon EC2 instance with hourly directories
B. An Amazon RDS database with hourly table partitions
C. An Amazon S3 data lake with hourly object prefixes
D. An Amazon EMR cluster with hourly Hive partitions on Amazon EBS volumes
Suggested answer: C

Explanation:

An Amazon S3 data lake is a cost-effective solution for storing and analyzing large amounts of time-based training data for a new model. Amazon S3 is a highly scalable, durable, and secure object storage service that can store any amount of data in any format. Amazon S3 also offers low-cost storage classes, such as S3 Standard-IA and S3 One Zone-IA, that can reduce the storage costs for infrequently accessed data. By using hourly object prefixes, the Machine Learning Specialist can organize the data into logical partitions based on the time of ingestion. This can enable efficient data access and management, as well as support incremental updates and deletes. The Specialist can also use Amazon S3 lifecycle policies to automatically transition the data to lower-cost storage classes or delete the data after a certain period of time. This way, the Specialist can always train on the last 24 hours of the data and optimize the storage costs.
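
As a quick illustration (not part of the original question), the hourly-prefix layout makes the 24-hour training window easy to enumerate. The sketch below assumes a hypothetical bucket and prefix pattern such as s3://training-data-lake/ingest/YYYY/MM/DD/HH/ and simply lists the objects for the trailing 24 hours with boto3.

```python
# Minimal sketch: collect object keys for the last 24 hourly prefixes so a
# training job can read only that window. Bucket and prefix names are
# hypothetical; listings over 1,000 keys would need pagination.
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")
bucket = "training-data-lake"  # hypothetical bucket name

now = datetime.now(timezone.utc)
keys = []
for hours_back in range(24):
    ts = now - timedelta(hours=hours_back)
    prefix = ts.strftime("ingest/%Y/%m/%d/%H/")  # one prefix per ingestion hour
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys.extend(obj["Key"] for obj in resp.get("Contents", []))

print(f"{len(keys)} objects found in the last 24 hourly prefixes")
```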

References:

What is a data lake? - Amazon Web Services

Amazon S3 Storage Classes - Amazon Simple Storage Service

Managing your storage lifecycle - Amazon Simple Storage Service

Best Practices Design Patterns: Optimizing Amazon S3 Performance

A retail chain has been ingesting purchasing records from its network of 20,000 stores to Amazon S3 using Amazon Kinesis Data Firehose. To support training an improved machine learning model, training records will require new but simple transformations, and some attributes will be combined. The model needs to be retrained daily.

Given the large number of stores and the legacy data ingestion, which change will require the LEAST amount of development effort?

A. Require the stores to switch to capturing their data locally on AWS Storage Gateway for loading into Amazon S3, then use AWS Glue to do the transformation.
B. Deploy an Amazon EMR cluster running Apache Spark with the transformation logic, and have the cluster run each day on the accumulating records in Amazon S3, outputting new/transformed records to Amazon S3.
C. Spin up a fleet of Amazon EC2 instances with the transformation logic, have them transform the data records accumulating on Amazon S3, and output the transformed records to Amazon S3.
D. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.
Suggested answer: D

Explanation:

Amazon Kinesis Data Analytics is a service that can analyze streaming data in real time using SQL or Apache Flink applications. It can also use machine learning algorithms, such as Random Cut Forest (RCF), to perform anomaly detection on streaming data. By inserting a Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream, the retail chain can transform the raw record attributes into simple transformed values using SQL queries. This can be done without changing the existing data ingestion process or deploying additional resources. The transformed records can then be outputted to another Kinesis Data Firehose stream that delivers them to Amazon S3 for training the machine learning model. This approach will require the least amount of development effort, as it leverages the existing Kinesis Data Firehose stream and the built-in SQL capabilities of Kinesis Data Analytics.
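
For illustration, here is a minimal sketch of the idea behind option D using the boto3 kinesisanalytics client (the SQL flavor of Kinesis Data Analytics). The application name, in-application stream names, and column list are hypothetical, and the mappings to the existing Firehose delivery streams (input schema, output destination) are omitted; they would be attached separately.

```python
# Hedged sketch: create a SQL-based Kinesis Data Analytics application whose
# pump applies a simple transformation (combining two attributes) to records
# arriving on the default in-application input stream.
import boto3

TRANSFORM_SQL = """
CREATE OR REPLACE STREAM "TRANSFORMED_STREAM" (
    store_id     VARCHAR(16),
    sale_ts      TIMESTAMP,
    basket_value DOUBLE           -- combined attribute: unit_price * quantity
);

CREATE OR REPLACE PUMP "TRANSFORM_PUMP" AS
INSERT INTO "TRANSFORMED_STREAM"
SELECT STREAM
    "store_id",
    "sale_ts",
    "unit_price" * "quantity" AS basket_value
FROM "SOURCE_SQL_STREAM_001";
"""

kda = boto3.client("kinesisanalytics")
kda.create_application(
    ApplicationName="purchase-record-transform",  # hypothetical name
    ApplicationDescription="Simple SQL transform downstream of Firehose",
    ApplicationCode=TRANSFORM_SQL,
)
```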

References:

Amazon Kinesis Data Analytics - Amazon Web Services

Anomaly Detection with Amazon Kinesis Data Analytics - Amazon Web Services

Amazon Kinesis Data Firehose - Amazon Web Services

Amazon S3 - Amazon Web Services

A city wants to monitor its air quality to address the consequences of air pollution. A Machine Learning Specialist needs to forecast the air quality, in parts per million of contaminants, for the next 2 days in the city. As this is a prototype, only daily data from the last year is available.

Which model is MOST likely to provide the best results in Amazon SageMaker?

A. Use the Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
B. Use Amazon SageMaker Random Cut Forest (RCF) on the single time series consisting of the full year of data.
C. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
D. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of classifier.
Suggested answer: A

Explanation:

The Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm is a supervised learning algorithm that can perform both classification and regression tasks. It can also handle time series data, such as the air quality data in this case. The kNN algorithm works by finding the k most similar instances in the training data to a given query instance, and then predicting the output based on the average or majority of the outputs of the k nearest neighbors. The kNN algorithm can be configured to use different distance metrics, such as Euclidean or cosine, to measure the similarity between instances. To use the kNN algorithm on the single time series consisting of the full year of data, the Machine Learning Specialist needs to set the predictor_type parameter to regressor, as the output variable (air quality in parts per million of contaminants) is a continuous value. The kNN algorithm can then forecast the air quality for the next 2 days by finding the k most similar days in the past year and averaging their air quality values.
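
A minimal training sketch with the SageMaker Python SDK follows, assuming the year of daily readings has already been staged in S3 in a format the built-in k-NN algorithm accepts; the role ARN, bucket names, and hyperparameter values are placeholders.

```python
# Hedged sketch: train the built-in SageMaker k-NN algorithm as a regressor.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

knn = Estimator(
    image_uri=image_uris.retrieve("knn", region),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://air-quality-models/output/",  # placeholder bucket
    sagemaker_session=session,
)
knn.set_hyperparameters(
    k=7,                          # e.g. average the 7 most similar days
    sample_size=365,              # one year of daily observations
    predictor_type="regressor",   # continuous ppm target
    feature_dim=4,                # placeholder feature count
)
knn.fit({"train": "s3://air-quality-data/train/"})  # placeholder channel
```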

References:

Amazon SageMaker k-Nearest-Neighbors (kNN) Algorithm - Amazon SageMaker

Time Series Forecasting using k-Nearest Neighbors (kNN) in Python | by ...

Time Series Forecasting with k-Nearest Neighbors | by Nishant Malik ...

For the given confusion matrix, what are the recall and precision of the model?

A. Recall = 0.92, Precision = 0.84
B. Recall = 0.84, Precision = 0.8
C. Recall = 0.92, Precision = 0.8
D. Recall = 0.8, Precision = 0.92
Suggested answer: C

Explanation:

Recall and precision are two metrics that can be used to evaluate the performance of a classification model. Recall is the ratio of true positives to the total number of actual positives, which measures how well the model can identify all the relevant cases. Precision is the ratio of true positives to the total number of predicted positives, which measures how accurate the model is when it makes a positive prediction. Based on the confusion matrix in the image, we can calculate the recall and precision as follows:

Recall = TP / (TP + FN) = 12 / (12 + 1) = 0.92

Precision = TP / (TP + FP) = 12 / (12 + 3) = 0.8

Where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives. Therefore, the recall and precision of the model are 0.92 and 0.8, respectively.
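
The arithmetic can be checked in a couple of lines (TP, FN, and FP values taken from the explanation above):

```python
# Quick check of the recall/precision arithmetic from the confusion matrix.
tp, fn, fp = 12, 1, 3

recall = tp / (tp + fn)      # 12 / 13 ≈ 0.92
precision = tp / (tp + fp)   # 12 / 15 = 0.80

print(f"recall={recall:.2f}, precision={precision:.2f}")
```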

A Machine Learning Specialist is working with a media company to perform classification on popular articles from the company's website. The company is using random forests to classify how popular an article will be before it is published. A sample of the data being used is below.

Given the dataset, the Specialist wants to convert the Day-Of_Week column to binary values.

What technique should be used to convert this column to binary values?

A. Binarization
B. One-hot encoding
C. Tokenization
D. Normalization transformation
Suggested answer: B

Explanation:

One-hot encoding is a technique that can be used to convert a categorical variable, such as the Day-Of_Week column, to binary values. One-hot encoding creates a new binary column for each unique value in the original column, and assigns a value of 1 to the column that corresponds to the value in the original column, and 0 to the rest. For example, if the original column has values Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday, one-hot encoding will create seven new columns, each representing one day of the week. If the value in the original column is Tuesday, then the column for Tuesday will have a value of 1, and the other columns will have a value of 0. One-hot encoding can help improve the performance of machine learning models, as it eliminates the ordinal relationship between the values and creates a more informative and sparse representation of the data.
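
A minimal pandas sketch of the technique; the tiny frame below is illustrative, using the Day-Of_Week column from the question.

```python
# Hedged sketch: one-hot encode the Day-Of_Week column with pandas.
import pandas as pd

df = pd.DataFrame({"Day-Of_Week": ["Monday", "Tuesday", "Tuesday", "Sunday"]})

encoded = pd.get_dummies(df, columns=["Day-Of_Week"], dtype=int)
print(encoded)
# One binary column per unique day; e.g. Day-Of_Week_Tuesday is 1 only on
# the rows whose original value was Tuesday.
```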

References:

One-Hot Encoding - Amazon SageMaker

One-Hot Encoding: A Simple Guide for Beginners | by Jana Schmidt ...

One-Hot Encoding in Machine Learning | by Nishant Malik | Towards ...

A company has raw user and transaction data stored in Amazon S3, a MySQL database, and Amazon Redshift. A Data Scientist needs to perform an analysis by joining the three datasets from Amazon S3, MySQL, and Amazon Redshift, and then calculating the average of a few selected columns from the joined data.

Which AWS service should the Data Scientist use?

A. Amazon Athena
B. Amazon Redshift Spectrum
C. AWS Glue
D. Amazon QuickSight
Suggested answer: A

Explanation:

Amazon Athena is a serverless interactive query service that can analyze data in Amazon S3 using standard SQL. Amazon Athena can also query data from other sources, such as MySQL and Amazon Redshift, by using federated queries. Federated queries allow Amazon Athena to run SQL queries across data sources, such as relational and non-relational databases, data warehouses, and data lakes. By using Amazon Athena, the Data Scientist can perform an analysis by joining the three datasets from Amazon S3, MySQL, and Amazon Redshift, and then calculating the average of a few selected columns from the joined data. Amazon Athena can also integrate with other AWS services, such as AWS Glue and Amazon QuickSight, to provide additional features, such as data cataloging and visualization.
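
For illustration, here is a hedged sketch of running such a federated join from Python with boto3. The catalog names (mysql_catalog, redshift_catalog), table names, and the S3 result location are hypothetical, and the MySQL and Redshift connectors would need to be registered as Athena data sources beforehand.

```python
# Hedged sketch: submit a federated join/aggregation query to Athena.
import boto3

athena = boto3.client("athena")

query = """
SELECT AVG(t.amount)         AS avg_amount,
       AVG(u.lifetime_value) AS avg_lifetime_value
FROM   s3_datalake.transactions     t
JOIN   mysql_catalog.app.users      u ON u.user_id = t.user_id
JOIN   redshift_catalog.dw.accounts a ON a.user_id = u.user_id
"""

resp = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
)
print(resp["QueryExecutionId"])
```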

References:

What is Amazon Athena? - Amazon Athena

Federated Query Overview - Amazon Athena

Querying Data from Amazon S3 - Amazon Athena

Querying Data from MySQL - Amazon Athena

Querying Data from Amazon Redshift - Amazon Athena

A Mobile Network Operator is building an analytics platform to analyze and optimize a company's operations using Amazon Athena and Amazon S3.

The source systems send data in CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3.

Which solution takes the LEAST effort to implement?

A. Ingest .CSV data using Apache Kafka Streams on Amazon EC2 instances and use Kafka Connect S3 to serialize data as Parquet.
B. Ingest .CSV data from Amazon Kinesis Data Streams and use AWS Glue to convert data into Parquet.
C. Ingest .CSV data using Apache Spark Structured Streaming in an Amazon EMR cluster and use Apache Spark to convert data into Parquet.
D. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert data into Parquet.
Suggested answer: D

Explanation:

Amazon Kinesis Data Streams is a service that can capture, store, and process streaming data in real time. Amazon Kinesis Data Firehose is a service that can deliver streaming data to various destinations, such as Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service. Amazon Kinesis Data Firehose can also transform the data before delivering it, such as converting the data format, compressing the data, or encrypting the data. One of the supported data formats that Amazon Kinesis Data Firehose can convert to is Apache Parquet, which is a columnar storage format that can improve the performance and cost-efficiency of analytics queries. By using Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose, the Mobile Network Operator can ingest the .CSV data from the source systems and use Amazon Kinesis Data Firehose to convert the data into Parquet before storing it on Amazon S3. This solution takes the least effort to implement, as it does not require any additional resources, such as Amazon EC2 instances, Amazon EMR clusters, or AWS Glue jobs. The solution can also leverage the built-in features of Amazon Kinesis Data Firehose, such as data buffering, batching, retry, and error handling.
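
A minimal sketch of option D's delivery stream with boto3 follows; all names, ARNs, and the Glue schema reference are placeholders. (Firehose's built-in format converter reads JSON records, so a Lambda processor, omitted here, is typically attached to map each CSV line to JSON before the Parquet conversion.)

```python
# Hedged sketch: a Firehose delivery stream sourced from a Kinesis data stream,
# with record-format conversion enabled so records land in S3 as Parquet.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="csv-to-parquet",  # placeholder
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/purchases",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
    },
    ExtendedS3DestinationConfiguration={
        "BucketARN": "arn:aws:s3:::analytics-parquet-bucket",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {  # table definition kept in AWS Glue
                "DatabaseName": "telecom",
                "TableName": "call_records",
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            },
        },
    },
)
```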

References:

Amazon Kinesis Data Streams - Amazon Web Services

Amazon Kinesis Data Firehose - Amazon Web Services

Data Transformation - Amazon Kinesis Data Firehose

Apache Parquet - Amazon Athena

An e-commerce company needs a customized training model to classify images of its shirts and pants products. The company needs a proof of concept in 2 to 3 days with good accuracy. Which compute choice should the Machine Learning Specialist select to train and achieve good accuracy on the model quickly?

A. m5.4xlarge (general purpose)
B. r5.2xlarge (memory optimized)
C. p3.2xlarge (GPU accelerated computing)
D. p3.8xlarge (GPU accelerated computing)
Suggested answer: C

Explanation:

Image classification is a machine learning task that involves assigning labels to images based on their content. Image classification can be performed using various algorithms, such as convolutional neural networks (CNNs), which are a type of deep learning model that can learn to extract high-level features from images. To train a customized image classification model, the e-commerce company needs a compute choice that can support the high computational demands of deep learning and provide good accuracy on the model quickly. A GPU accelerated computing instance, such as p3.2xlarge, is a suitable choice for this task, as it can leverage the parallel processing power of GPUs to speed up the training process and reduce the training time. A p3.2xlarge instance has one NVIDIA Tesla V100 GPU, which can provide up to 125 teraflops of mixed-precision performance and 16 GB of GPU memory. A p3.2xlarge instance can also use various deep learning frameworks, such as TensorFlow, PyTorch, MXNet, etc., to build and train the image classification model. A p3.2xlarge instance is also more cost-effective than a p3.8xlarge instance, which has four NVIDIA Tesla V100 GPUs, as the latter may not be necessary for a proof of concept with a small dataset. Therefore, the Machine Learning Specialist should select p3.2xlarge as the compute choice to train and achieve good accuracy on the model quickly.
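
A minimal sketch of such a training job with the SageMaker Python SDK follows, using the built-in image-classification algorithm on a single ml.p3.2xlarge instance; the role ARN, buckets, and hyperparameter values are placeholders for a proof of concept.

```python
# Hedged sketch: built-in SageMaker image classification on one ml.p3.2xlarge.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

ic = Estimator(
    image_uri=image_uris.retrieve("image-classification", region),
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",              # single V100 GPU
    output_path="s3://apparel-models/output/",  # placeholder bucket
    sagemaker_session=session,
)
ic.set_hyperparameters(
    num_classes=2,               # shirts vs. pants
    num_training_samples=5000,   # placeholder dataset size
    epochs=10,
    use_pretrained_model=1,      # transfer learning speeds up a short POC
)
ic.fit({
    "train": "s3://apparel-images/train/",            # placeholder channels
    "validation": "s3://apparel-images/validation/",
})
```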

References:

Amazon EC2 P3 Instances - Amazon Web Services

Image Classification - Amazon SageMaker

Convolutional Neural Networks - Amazon SageMaker

Deep Learning AMIs - Amazon Web Services

A Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers. Currently, the company has the following data in Amazon Aurora:

* Profiles for all past and existing customers

* Profiles for all past and existing insured pets

* Policy-level information

* Premiums received

* Claims paid

What steps should be taken to implement a machine learning model to identify potential new customers on social media?

A. Use regression on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
B. Use clustering on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
C. Use a recommendation engine on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
D. Use a decision tree classifier engine on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
Suggested answer: B

Explanation:

Clustering is a machine learning technique that can group data points into clusters based on their similarity or proximity. Clustering can help discover the underlying structure and patterns in the data, as well as identify outliers or anomalies. Clustering can also be used for customer segmentation, which is the process of dividing customers into groups based on their characteristics, behaviors, preferences, or needs. Customer segmentation can help understand the key features and needs of different customer segments, as well as design and implement targeted marketing campaigns for each segment. In this case, the Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers. To do this, the Manager can use clustering on customer profile data to understand the key characteristics of consumer segments, such as their demographics, pet types, policy preferences, premiums paid, claims made, etc. The Manager can then find similar profiles on social media, such as Facebook, Twitter, Instagram, etc., by using the cluster features as filters or keywords. The Manager can then target these potential new customers with personalized and relevant ads or offers that match their segment's needs and interests. This way, the Manager can implement a machine learning model to identify potential new customers on social media.
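
As an illustration of the clustering step, here is a minimal scikit-learn sketch, assuming the Aurora profile and policy data has been exported to a flat file; the file name and column names are hypothetical.

```python
# Hedged sketch: segment customer profiles with k-means.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

profiles = pd.read_csv("customer_profiles.csv")  # placeholder export from Aurora
features = pd.get_dummies(
    profiles[["age", "pet_type", "annual_premium", "claims_paid"]]
)
X = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
profiles["segment"] = kmeans.fit_predict(X)

# Per-segment summaries become the "key characteristics" used to find
# lookalike profiles on social media.
print(profiles.groupby("segment")[["annual_premium", "claims_paid"]].mean())
```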

A company is running an Amazon SageMaker training job that will access data stored in its Amazon S3 bucket. A compliance policy requires that the data never be transmitted across the internet. How should the company set up the job?

A. Launch the notebook instances in a public subnet and access the data through the public S3 endpoint.
B. Launch the notebook instances in a private subnet and access the data through a NAT gateway.
C. Launch the notebook instances in a public subnet and access the data through a NAT gateway.
D. Launch the notebook instances in a private subnet and access the data through an S3 VPC endpoint.
Suggested answer: D

Explanation:

A private subnet is a subnet that does not have a route to the internet gateway, which means that the resources in the private subnet cannot access the internet or be accessed from the internet. An S3 VPC endpoint is a gateway endpoint that allows the resources in the VPC to access the S3 service without going through the internet. By launching the notebook instances in a private subnet and accessing the data through an S3 VPC endpoint, the company can set up the job in a secure and compliant way, as the data never leaves the AWS network and is not exposed to the internet. This can also improve the performance and reliability of the data transfer, as the traffic does not depend on the internet bandwidth or availability.
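
A minimal sketch of the networking piece of option D with boto3: a gateway VPC endpoint for S3 attached to the private subnet's route table. The VPC, route table, and region values are placeholders.

```python
# Hedged sketch: create a gateway VPC endpoint so traffic to S3 stays on the
# AWS network instead of traversing the internet.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],  # private subnet's route table
)
```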

References:

Amazon VPC Endpoints - Amazon Virtual Private Cloud

Endpoints for Amazon S3 - Amazon Virtual Private Cloud

Connect to SageMaker Within your VPC - Amazon SageMaker

Working with VPCs and Subnets - Amazon Virtual Private Cloud
