ExamGecko

Amazon MLS-C01 Practice Test - Questions Answers

Question 1


A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked.

Which services are integrated with Amazon SageMaker to track this information? (Select TWO.)

A. AWS CloudTrail
B. AWS Health
C. AWS Trusted Advisor
D. Amazon CloudWatch
E. AWS Config

Suggested answer: A, D

Explanation:

AWS CloudTrail and Amazon CloudWatch are the two services integrated with Amazon SageMaker that cover what the Machine Learning Specialist needs to track.

AWS CloudTrail records API calls and events for AWS services, including Amazon SageMaker. It captures the actions the Data Scientists perform, such as creating notebooks, starting training jobs, and deploying endpoints, along with the identity of the caller, the time of the action, the request parameters, and the response elements returned. This lets the Specialist see how often models are deployed and audit or troubleshoot activity in SageMaker.

Amazon CloudWatch collects metrics and logs for AWS services, including Amazon SageMaker. For deployed endpoints it reports CPU and GPU utilization, inference latency, and the number of invocations, as well as the errors generated when an endpoint is invoked, such as 4XX and 5XX responses. The Specialist can also set up alarms and notifications on these metrics and logs to keep the endpoints performing reliably.
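As a rough illustration of the CloudWatch side, the sketch below builds the arguments for a GetMetricStatistics request against a SageMaker endpoint. The endpoint name "demo-endpoint" and variant name "AllTraffic" are placeholders, and the helper function is hypothetical; the block only constructs the request and does not call AWS.

```python
from datetime import datetime, timedelta, timezone

# Namespaces SageMaker publishes endpoint metrics under.
INSTANCE_NAMESPACE = "/aws/sagemaker/Endpoints"  # CPUUtilization, GPUUtilization, ...
INVOCATION_NAMESPACE = "AWS/SageMaker"           # Invocations, Invocation5XXErrors, ...


def endpoint_metric_request(endpoint_name, variant_name, metric_name,
                            namespace, statistic="Average", hours=1):
    """Build keyword arguments for CloudWatch GetMetricStatistics (hypothetical helper)."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # five-minute buckets
        "Statistics": [statistic],
    }


# Placeholder endpoint and variant names, chosen only for illustration.
gpu_request = endpoint_metric_request(
    "demo-endpoint", "AllTraffic", "GPUUtilization", INSTANCE_NAMESPACE)
error_request = endpoint_metric_request(
    "demo-endpoint", "AllTraffic", "Invocation5XXErrors", INVOCATION_NAMESPACE,
    statistic="Sum")

# With AWS credentials configured, the request could be sent with boto3:
#   boto3.client("cloudwatch").get_metric_statistics(**gpu_request)
```

Deployment frequency would come from CloudTrail rather than CloudWatch, for example by looking up CreateEndpoint events.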

asked 16/09/2024
Stefano Humphries
40 questions

Question 2


A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target.

What option can the Specialist use to determine whether it is overestimating or underestimating the target value?

A. Root Mean Square Error (RMSE)
B. Residual plots
C. Area under the curve
D. Confusion matrix

Suggested answer: B

Explanation:

Residual plots are a model evaluation technique that shows whether a regression model more frequently overestimates or underestimates the target. A residual plot graphs the residuals (the differences between the actual and predicted values) against the predicted values or another variable. It lets the Machine Learning Specialist see patterns in the residuals, such as their direction, shape, and distribution. Residual plots can also reveal outliers, heteroscedasticity, non-linearity, and other problems in the model.

To determine whether the model is overestimating or underestimating the target, the Machine Learning Specialist can plot the residuals against the predicted values. This view, sometimes called a prediction error plot, shows both the magnitude and the direction of the model's errors. With residuals defined as actual minus predicted, an overestimate gives a negative residual and a point below the zero line, while an underestimate gives a positive residual and a point above the zero line. If the model is accurate, the residuals are close to zero and the points scatter randomly around the zero line.

The same plot also hints at variance and bias. If the model has high variance, the residuals have a large spread and many points lie far from the zero line. If the model has high bias, the residuals follow a systematic pattern, such as a curve or a slope, instead of being randomly distributed around the zero line. These patterns help the Machine Learning Specialist decide how to adjust the complexity, features, or parameters of the model.
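The sign convention above can be checked numerically before drawing any plot. The sketch below uses made-up values and hypothetical helper names; it counts negative and positive residuals and shows that RMSE stays the same when every error changes direction.

```python
import math


def residual_summary(actual, predicted):
    """Count over- and underestimates using residual = actual - predicted."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return {
        "residuals": residuals,
        "overestimated": sum(1 for r in residuals if r < 0),   # prediction too high
        "underestimated": sum(1 for r in residuals if r > 0),  # prediction too low
        "mean_residual": sum(residuals) / len(residuals),
    }


def rmse(actual, predicted):
    """Root mean square error: magnitude only, the sign is squared away."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up data for illustration only.
actual = [10.0, 12.0, 15.0, 20.0, 25.0]
too_high = [11.0, 13.5, 14.0, 22.0, 27.0]  # mostly overestimates
too_low = [9.0, 10.5, 16.0, 18.0, 23.0]    # same error sizes, opposite signs

summary_high = residual_summary(actual, too_high)
summary_low = residual_summary(actual, too_low)
```

Both prediction sets have the same RMSE, yet the residual counts point in opposite directions, which is why the residual view answers the question and RMSE does not.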

The other options are not suitable for determining whether the model is overestimating or underestimating the target.

Root Mean Square Error (RMSE) is an evaluation metric that measures the average magnitude of the model's errors. It is the square root of the mean of the squared residuals. RMSE indicates the overall accuracy of the model, but because the residuals are squared it cannot show the direction or distribution of the errors. RMSE is also sensitive to outliers and may not be comparable across different models or datasets.

Area under the curve (AUC) is an evaluation metric that measures how well a model distinguishes between the positive and negative classes. It is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various classification thresholds. AUC applies to binary classification models, not regression models, and cannot show the magnitude or direction of regression errors.

A confusion matrix is a table that summarizes the correct and incorrect predictions of a model for each class, showing the counts of true positives, false positives, true negatives, and false negatives. It supports metrics such as accuracy, precision, recall, and F1-score, but it applies only to classification models, not regression models, and cannot show the magnitude or direction of regression errors.

asked 16/09/2024
Heng Tan
30 questions

Question 3


A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided.

Based on this information, which model would have the HIGHEST recall with respect to the fraudulent class?

A. Decision tree
B. Linear support vector machine (SVM)
C. Naive Bayesian classifier
D. Single Perceptron with sigmoidal activation function

Suggested answer: A

Explanation:

Based on the figure provided, a decision tree would have the highest recall with respect to the fraudulent class. Recall is a model evaluation metric that measures the proportion of actual positive instances that are correctly classified by the model. Recall is calculated as follows:

Recall = True Positives / (True Positives + False Negatives)

A decision tree is a machine learning model that performs classification by splitting the data into smaller and purer subsets based on a series of rules or conditions. A decision tree can handle both linear and non-linear data and can capture complex patterns and interactions among the features. It can also be easily visualized and interpreted.

In this case, the data is not linearly separable and shows a clear seasonal pattern. The fraudulent class forms a large circle in the center of the plot, while the normal class is scattered around the edges. A decision tree can split repeatedly on transaction month and age of account, approximating the circular boundary with axis-aligned rectangular regions that separate the fraudulent class from the normal class. This lets it identify most of the black dots as positive instances and keep the number of false negatives low, which gives a high recall for the fraudulent class. The depth and complexity of the tree can also be adjusted to balance the trade-off between recall and precision.

The other options are not suitable for achieving a high recall for the fraudulent class.

A linear support vector machine (SVM) performs classification by finding a linear hyperplane that maximizes the margin between the classes. It can handle linearly separable data, but not non-linear data. A linear SVM cannot capture the circular pattern of the fraudulent class and may misclassify many of the black dots as negative instances, resulting in a low recall.

A naive Bayesian classifier performs classification by applying Bayes' theorem and assuming conditional independence among the features. It can handle both linear and non-linear data and can incorporate prior knowledge and probabilities into the model. However, it may not perform well when the features are correlated or dependent, as in this case. A naive Bayesian classifier may not capture the circular pattern of the fraudulent class and may misclassify many of the black dots as negative instances, resulting in a low recall.

A single perceptron with sigmoidal activation function performs classification by applying a non-linear activation function to a weighted linear combination of the features. Its decision boundary is still linear, so it can handle linearly separable data, but not non-linear data. It cannot capture the circular pattern of the fraudulent class and may misclassify many of the black dots as negative instances, resulting in a low recall.

asked 16/09/2024
Luis Campoy
41 questions

Question 4


When submitting Amazon SageMaker training jobs using one of the built-in algorithms, which common parameters MUST be specified? (Select THREE.)

A. The training channel identifying the location of training data on an Amazon S3 bucket.
B. The validation channel identifying the location of validation data on an Amazon S3 bucket.
C. The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users.
D. Hyperparameters in a JSON array as documented for the algorithm used.
E. The Amazon EC2 instance class specifying whether training will be run using CPU or GPU.
F. The output path specifying where on an Amazon S3 bucket the trained model will persist.

Suggested answer: A, C, F

Explanation:

When submitting Amazon SageMaker training jobs using one of the built-in algorithms, the common parameters that must be specified are:

The training channel identifying the location of training data on an Amazon S3 bucket. This parameter tells SageMaker where to find the input data for the algorithm and, optionally, its content type. The related input mode controls how the data reaches the algorithm. For example, TrainingInputMode: File means that SageMaker copies the data from S3 to the training instance before training starts.

The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users. This parameter grants SageMaker the permissions it needs to access the S3 buckets, ECR repositories, and other AWS resources used by the training job. For example, RoleArn: arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-20200303T150948 means that SageMaker will use the specified role to run the training job.

The output path specifying where on an Amazon S3 bucket the trained model will persist. This parameter tells SageMaker where to save the model artifacts, such as the model weights and parameters, after the training job is completed. For example, OutputDataConfig: {S3OutputPath: s3://my-bucket/my-training-job} means that SageMaker will store the model artifacts in the specified S3 location.

The validation channel identifying the location of validation data on an Amazon S3 bucket is an optional parameter that provides a separate dataset for evaluating model performance during training. It is not required for all algorithms and can be omitted if validation data is not available or not needed.

The hyperparameters in a JSON array as documented for the algorithm used are also treated as optional. They customize the behavior and performance of the algorithm and are specific to each algorithm. For example, HyperParameters: {num_round: '10', objective: 'binary:logistic'} means that the XGBoost algorithm will use 10 boosting rounds and the logistic loss function for binary classification.

The Amazon EC2 instance class specifying whether training will be run using CPU or GPU is not among the suggested answers. Note, however, that a training job request does carry a resource configuration that names the instance type and count, and the instance type determines whether training runs on CPU or GPU. For example, ResourceConfig: {InstanceType: ml.m5.xlarge, InstanceCount: 1, VolumeSizeInGB: 10} means that SageMaker will use one ml.m5.xlarge instance with 10 GB of storage for the training job.
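The fragments quoted above fit together into one CreateTrainingJob request. The sketch below assembles such a request as a plain Python dictionary. The helper function, the job name, the training data path, and the image URI placeholder are assumptions made for illustration; the role ARN, output path, hyperparameters, and resource configuration are the example values from the explanation. Nothing is sent to AWS.

```python
def build_training_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri):
    """Assemble a CreateTrainingJob request body (hypothetical helper)."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # container image of the built-in algorithm
            "TrainingInputMode": "File",  # copy data from S3 before training starts
        },
        "RoleArn": role_arn,              # role SageMaker assumes for the job
        "InputDataConfig": [              # channels; only "train" is given here
            {
                "ChannelName": "train",
                "ContentType": "text/csv",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # Hyperparameter values are passed as strings.
        "HyperParameters": {"num_round": "10", "objective": "binary:logistic"},
    }


request = build_training_request(
    job_name="demo-xgboost-job",                              # hypothetical name
    image_uri="<built-in-algorithm-image-uri>",               # placeholder, not a real URI
    role_arn="arn:aws:iam::123456789012:role/service-role/"
             "AmazonSageMaker-ExecutionRole-20200303T150948",
    train_s3_uri="s3://my-bucket/train/",                     # hypothetical path
    output_s3_uri="s3://my-bucket/my-training-job",
)

# With credentials and a real image URI, this could be submitted with boto3:
#   boto3.client("sagemaker").create_training_job(**request)
```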

References:

Train a Model with Amazon SageMaker

Use Amazon SageMaker Built-in Algorithms or Pre-trained Models

CreateTrainingJob - Amazon SageMaker Service

asked 16/09/2024
Angela Cappa
34 questions

Question 5


A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.

Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.

How should the Data Scientist correct this issue?

A. Drop all records from the dataset where age has been set to 0.
B. Replace the age field value for records with a value of 0 with the mean or median value from the dataset.
C. Drop the age feature from the dataset and train the model using the rest of the features.
D. Use k-means clustering to handle missing features.

Suggested answer: B

Explanation:

Because every patient in the study is over 65, an age of 0 cannot be a real value; it marks a missing entry. The best way to handle these missing values is to replace them with the mean or median of the valid ages in the dataset. This common imputation technique keeps all 4,000 records, including the otherwise normal features of the 450 affected patients, and gives the age feature a plausible value. Dropping the records would discard more than 11% of an already small dataset, and dropping the feature would remove a variable known to be related to the disease, so both would lose valuable information and reduce the accuracy of the model. Using k-means clustering would not be appropriate for handling missing values in a single feature, as it is a method for grouping similar data points based on multiple features.
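A minimal sketch of median imputation follows, using a short made-up list of ages and a hypothetical helper name. The median is computed only from the valid ages, so the zeros do not pull it down.

```python
from statistics import median


def impute_zero_ages(ages):
    """Replace ages recorded as 0 with the median of the valid (non-zero) ages."""
    valid = [age for age in ages if age != 0]
    fill_value = median(valid)
    return [fill_value if age == 0 else age for age in ages]


# Made-up sample; the real dataset has 4,000 patients, 450 of them with age 0.
ages = [72, 0, 68, 81, 0, 77, 69]
imputed = impute_zero_ages(ages)
```

The median is often preferred over the mean here because it is less affected by a few very old patients.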

References:

Effective Strategies to Handle Missing Values in Data Analysis

How To Handle Missing Values In Machine Learning Data With Weka

How to handle missing values in Python - Machine Learning Plus

asked 16/09/2024
Renats Fasulins
37 questions

Question 6


A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL.

Which storage scheme is MOST adapted to this scenario?

A. Store datasets as files in Amazon S3.
B. Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.
C. Store datasets as tables in a multi-node Amazon Redshift cluster.
D. Store datasets as global tables in Amazon DynamoDB.

Suggested answer: A

Explanation:

The best storage scheme for this scenario is to store datasets as files in Amazon S3. Amazon S3 is a scalable, cost-effective, and durable object storage service that can store any amount and type of data. Amazon S3 also supports querying data using SQL with Amazon Athena, a serverless interactive query service that can analyze data directly in S3. This way, the Data Science team can easily explore and analyze their datasets without having to load them into a database or a compute instance.

The other options are not as suitable for this scenario because:

Storing datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance would limit the scalability and availability of the data. An EBS volume lives in a single Availability Zone, has to be provisioned with a fixed size (up to 16 TiB for most volume types), and does not grow automatically as new datasets arrive. EBS storage also costs more per gigabyte than S3 and requires provisioning and managing EC2 instances.

Storing datasets as tables in a multi-node Amazon Redshift cluster would incur higher costs and complexity than using S3 and Athena. Amazon Redshift is a data warehouse service that is optimized for analytical queries over structured or semi-structured data. However, it requires setting up and maintaining a cluster of nodes, loading data into tables, and choosing the right distribution and sort keys for optimal performance. Moreover, Amazon Redshift charges for both storage and compute, while S3 and Athena only charge for the amount of data stored and scanned, respectively.

Storing datasets as global tables in Amazon DynamoDB would not suit large training datasets. DynamoDB is a key-value and document database designed for fast, consistent lookups at any scale, not for bulk file storage: each item is limited to 400 KB, so datasets would have to be broken into many small items. DynamoDB also does not run ad hoc analytical SQL natively (its PartiQL support covers only a limited SQL-compatible subset), so exploring the data with SQL would require another service such as Amazon EMR or AWS Glue.

References:

Amazon S3 - Cloud Object Storage

Amazon Athena -- Interactive SQL Queries for Data in Amazon S3

Amazon EBS - Amazon Elastic Block Store (EBS)

Amazon Redshift -- Data Warehouse Solution - AWS

Amazon DynamoDB -- NoSQL Cloud Database Service

asked 16/09/2024
janet phillips
36 questions

Question 7


A Machine Learning Specialist is configuring automatic model tuning in Amazon SageMaker.

When using the hyperparameter optimization feature, which of the following guidelines should be followed to improve optimization?

A. Choose the maximum number of hyperparameters supported by Amazon SageMaker to search the largest number of combinations possible.
B. Specify a very large hyperparameter range to allow Amazon SageMaker to cover every possible value.
C. Use log-scaled hyperparameters to allow the hyperparameter space to be searched as quickly as possible.
D. Execute only one hyperparameter tuning job at a time and improve tuning through successive rounds of experiments.

Suggested answer: C

Explanation:

Using log-scaled hyperparameters is a guideline that can improve automatic model tuning in Amazon SageMaker. Log scaling suits hyperparameters whose values span several orders of magnitude, such as the learning rate, a regularization parameter, or the number of hidden units. Such hyperparameters can be sampled from a log-uniform distribution, which assigns equal probability to each order of magnitude within a range. For example, a log-uniform distribution between 0.001 and 1000 is as likely to sample a value between 0.001 and 0.01 as a value between 100 and 1000. With linear scaling over the same range, almost all samples would fall in the top decade and small values would rarely be tried. Log scaling therefore lets the hyperparameter optimization feature explore every scale of values and search the space more efficiently. It is enabled by setting the ScalingType parameter to Logarithmic when defining the hyperparameter ranges in Amazon SageMaker.
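The difference between linear and logarithmic scaling is easy to see by sampling. The sketch below is a plain-Python illustration of log-uniform sampling, not SageMaker's own sampler. The hyperparameter name learning_rate is a hypothetical example, and the final dictionary shows how a logarithmic range might be declared in a tuning job's parameter ranges.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def fraction_below(samples, threshold):
    """Share of samples that fall below the threshold."""
    return sum(1 for s in samples if s < threshold) / len(samples)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
LOW, HIGH = 0.001, 1000.0

log_samples = [sample_log_uniform(LOW, HIGH, rng) for _ in range(6000)]
linear_samples = [rng.uniform(LOW, HIGH) for _ in range(6000)]

# 1.0 is the midpoint of the range on a log scale (three decades on each side).
log_share_small = fraction_below(log_samples, 1.0)        # about one half
linear_share_small = fraction_below(linear_samples, 1.0)  # about one in a thousand

# How such a range might be declared for a tuning job (values are strings).
parameter_ranges = {
    "ContinuousParameterRanges": [
        {
            "Name": "learning_rate",  # hypothetical hyperparameter name
            "MinValue": "0.001",
            "MaxValue": "1000",
            "ScalingType": "Logarithmic",
        }
    ]
}
```

Roughly half of the log-scaled samples land below 1.0, while linear sampling almost never produces a value that small.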

The other options are not valid guidelines for improving automatic model tuning in Amazon SageMaker.

Choosing the maximum number of hyperparameters supported by Amazon SageMaker to search the largest number of combinations possible is not a good practice, because it increases the time and cost of the tuning job and makes it harder to find the optimal values. Amazon SageMaker limits how many hyperparameters a tuning job can search (20 at the time this explanation was written), but it is recommended to tune only the most influential hyperparameters for the model and algorithm and to use default or fixed values for the rest.

Specifying a very large hyperparameter range to allow Amazon SageMaker to cover every possible value is not a good practice, because it leads to sampling values that are irrelevant or impractical for the model and algorithm and wastes the tuning budget. It is recommended to specify a realistic range based on prior knowledge of the model and algorithm and to refine the range using the results of the tuning job if needed.

Executing only one hyperparameter tuning job at a time and improving tuning through successive rounds of experiments is not the guideline the question is looking for, because it makes the tuning process slower and less efficient. A single tuning job can run several training jobs in parallel while the Bayesian optimization algorithm that Amazon SageMaker uses guides the search. Because Bayesian search learns from completed jobs, very high concurrency trades some search quality for speed.

asked 16/09/2024
ABHIJIT GHOSH
30 questions

Question 8


A large mobile network operating company is building a machine learning model to predict customers who are likely to unsubscribe from the service. The company plans to offer an incentive for these customers as the cost of churn is far greater than the cost of the incentive.

The model produces the following confusion matrix after evaluating on a test dataset of 100 customers:

Based on the model evaluation results, why is this a viable model for production?

A. The model is 86% accurate and the cost incurred by the company as a result of false negatives is less than the false positives.
B. The precision of the model is 86%, which is less than the accuracy of the model.
C. The model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives.
D. The precision of the model is 86%, which is greater than the accuracy of the model.

Suggested answer: C

Explanation:

Based on the model evaluation results, this is a viable model for production because the model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives.

The accuracy of the model is the proportion of correct predictions out of the total predictions, calculated by adding the true positives and true negatives and dividing by the total number of observations. In this case, the accuracy is (10 + 76) / 100 = 0.86, which means that the model correctly predicted the churn status of 86% of the customers.

The cost incurred by the company as a result of false positives and false negatives is the loss the company suffers when the model makes incorrect predictions. A false positive is when the model predicts that a customer will churn, but the customer does not churn. A false negative is when the model predicts that a customer will not churn, but the customer churns. The cost of a false positive is the incentive offered to a customer who was predicted to churn, which is relatively low. The cost of a false negative is the revenue lost when the customer churns, which is relatively high. The company would therefore prefer more false positives to more false negatives. The model has 10 false positives and 4 false negatives, so the company's cost is lower than if the model had more false negatives and fewer false positives.
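The arithmetic can be reproduced in a few lines. The sketch below assumes the counts implied by the explanation (10 true positives, 76 true negatives, 10 false positives, 4 false negatives). The two cost figures are hypothetical values chosen only to show the asymmetry, since the question gives no actual amounts.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


def error_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of the model's mistakes."""
    return fp * cost_fp + fn * cost_fn


metrics = confusion_metrics(tp=10, tn=76, fp=10, fn=4)

# Hypothetical costs: an incentive is cheap, a lost customer is expensive.
INCENTIVE_COST = 10
CHURN_COST = 100

current_cost = error_cost(fp=10, fn=4, cost_fp=INCENTIVE_COST, cost_fn=CHURN_COST)
swapped_cost = error_cost(fp=4, fn=10, cost_fp=INCENTIVE_COST, cost_fn=CHURN_COST)
```

With these assumed costs, the model's mix of errors costs 500, while the same number of errors with false positives and false negatives swapped would cost 1040. Precision works out to 0.5, not 86%, which is why the options that quote an 86% precision do not hold.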

asked 16/09/2024
Aleh Patskevich
48 questions

Question 9


A data scientist must build a custom recommendation model in Amazon SageMaker for an online retail company. Due to the nature of the company's products, customers buy only 4-5 products every 5-10 years. So, the company relies on a steady stream of new customers. When a new customer signs up, the company collects data on the customer's preferences. Below is a sample of the data available to the data scientist.

How should the data scientist split the dataset into a training and test set for this use case?

A. Shuffle all interaction data. Split off the last 10% of the interaction data for the test set.
B. Identify the most recent 10% of interactions for each user. Split off these interactions for the test set.
C. Identify the 10% of users with the least interaction data. Split off all interaction data from these users for the test set.
D. Randomly select 10% of the users. Split off all interaction data from these users for the test set.

Suggested answer: D

Explanation:

The best way to split the dataset into a training and test set for this use case is to randomly select 10% of the users and split off all interaction data from these users for the test set. Because the company relies on a steady stream of new customers, the test set should reflect the behavior of customers the model has never seen. The other options are not suitable because they either place interactions from the same users in both the training and test sets (A and B) or bias the test set towards users with little interaction data (C).

References:

Amazon SageMaker Developer Guide: Train and Test Datasets

Amazon Personalize Developer Guide: Preparing and Importing Data
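A user-level split like the one described in the explanation can be sketched as follows. The record layout (dictionaries with a user_id key) and the function name are assumptions for illustration, since the sample data from the question is not available here.

```python
import random


def split_by_user(interactions, test_fraction=0.1, seed=0):
    """Hold out every interaction of a random subset of users for the test set."""
    users = sorted({row["user_id"] for row in interactions})
    rng = random.Random(seed)
    rng.shuffle(users)
    n_test = max(1, round(len(users) * test_fraction))
    test_users = set(users[:n_test])
    train = [row for row in interactions if row["user_id"] not in test_users]
    test = [row for row in interactions if row["user_id"] in test_users]
    return train, test


# Made-up data: 20 users with 3 interactions each.
interactions = [
    {"user_id": user, "item_id": item}
    for user in range(20)
    for item in range(3)
]
train_rows, test_rows = split_by_user(interactions)
train_users = {row["user_id"] for row in train_rows}
test_users = {row["user_id"] for row in test_rows}
```

No user appears on both sides of the split, so evaluating on the test set mimics scoring customers the model has not seen before.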

asked 16/09/2024
adir tamam
32 questions

Question 10


A financial services company wants to adopt Amazon SageMaker as its default data science environment. The company's data scientists run machine learning (ML) models on confidential financial data. The company is worried about data egress and wants an ML engineer to secure the environment.

Which mechanisms can the ML engineer use to control data egress from SageMaker? (Choose three.)

A. Connect to SageMaker by using a VPC interface endpoint powered by AWS PrivateLink.
B. Use SCPs to restrict access to SageMaker.
C. Disable root access on the SageMaker notebook instances.
D. Enable network isolation for training jobs and models.
E. Restrict notebook presigned URLs to specific IPs used by the company.
F. Protect data with encryption at rest and in transit.

Suggested answer: A, D, F

Explanation:

To control data egress from SageMaker, the ML engineer can use the following mechanisms:

Connect to SageMaker by using a VPC interface endpoint powered by AWS PrivateLink. This lets the ML engineer reach SageMaker services and resources without sending traffic over the public internet, which reduces the risk of data leakage and unauthorized access [1].

Enable network isolation for training jobs and models. This prevents the training jobs and models from accessing the internet or other AWS services, so the data used for training and inference is not exposed to external destinations [2].

Protect data with encryption at rest and in transit. Use AWS Key Management Service (AWS KMS) to manage encryption keys. This enables the ML engineer to encrypt the data stored in Amazon S3 buckets, SageMaker notebook instances, and SageMaker endpoints, and to encrypt the data in transit between SageMaker and other AWS services. This helps protect the data from unauthorized access and tampering [3].

The other options are not effective in controlling data egress from SageMaker:

Use SCPs to restrict access to SageMaker. SCPs define the maximum permissions for an organization or organizational unit (OU) in AWS Organizations. They control access to SageMaker itself, not the data egress from SageMaker [4].

Disable root access on the SageMaker notebook instances. This prevents users from installing additional packages or libraries on the notebook instances. It does not prevent data from being transferred out of the notebook instances.

Restrict notebook presigned URLs to specific IPs used by the company. This limits access to the notebook instances to certain IP addresses. It does not prevent data from being transferred out of the notebook instances.

References:

[1] Amazon SageMaker Interface VPC Endpoints (AWS PrivateLink) - Amazon SageMaker

[2] Network Isolation - Amazon SageMaker

[3] Encrypt Data at Rest and in Transit - Amazon SageMaker

[4] Using Service Control Policies - AWS Organizations

Disable Root Access - Amazon SageMaker

Create a Presigned Notebook Instance URL - Amazon SageMaker
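For training jobs, network isolation and encryption are switched on through fields of the training job request. The sketch below overlays those fields on an existing request dictionary. The helper name, subnet ID, security group ID, and KMS key alias are placeholders invented for illustration, and nothing is sent to AWS.

```python
import copy


def apply_egress_controls(request, subnet_ids, security_group_ids, kms_key_id):
    """Return a copy of a CreateTrainingJob request with egress controls enabled."""
    secured = copy.deepcopy(request)  # leave the caller's request untouched
    # Block outbound network calls from the training container.
    secured["EnableNetworkIsolation"] = True
    # Encrypt traffic between containers in distributed training.
    secured["EnableInterContainerTrafficEncryption"] = True
    # Run the job inside the company's VPC.
    secured["VpcConfig"] = {
        "Subnets": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }
    # Encrypt model artifacts in S3 and the attached storage volume with KMS.
    secured.setdefault("OutputDataConfig", {})["KmsKeyId"] = kms_key_id
    secured.setdefault("ResourceConfig", {})["VolumeKmsKeyId"] = kms_key_id
    return secured


base_request = {
    "TrainingJobName": "demo-job",  # hypothetical name
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},  # hypothetical path
}
secured_request = apply_egress_controls(
    base_request,
    subnet_ids=["subnet-EXAMPLE"],       # placeholder IDs
    security_group_ids=["sg-EXAMPLE"],
    kms_key_id="alias/example-key",      # placeholder key alias
)
```

The VPC interface endpoint itself is created on the networking side and is not part of the training job request.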

asked 16/09/2024
Katherine Messick
37 questions
Total 308 questions