Amazon MLS-C01 Practice Test - Questions Answers, Page 25

A data scientist is building a linear regression model. The scientist inspects the dataset and notices that the mode of the distribution is lower than the median, and the median is lower than the mean.

Which data transformation will give the data scientist the ability to apply a linear regression model?

A. Exponential transformation
B. Logarithmic transformation
C. Polynomial transformation
D. Sinusoidal transformation

Suggested answer: B

Explanation:

A logarithmic transformation is a suitable data transformation for a linear regression model when the data has a skewed distribution, such as when the mode is lower than the median and the median is lower than the mean. A logarithmic transformation can reduce the skewness and make the data more symmetric and normally distributed, which are desirable properties for linear regression. A logarithmic transformation can also reduce the effect of outliers and heteroscedasticity (unequal variance) in the data. An exponential transformation would have the opposite effect of increasing the skewness and making the data more asymmetric. A polynomial transformation may not be able to capture the nonlinearity in the data and may introduce multicollinearity among the transformed variables. A sinusoidal transformation is not appropriate for data that does not have a periodic pattern.
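To make this concrete, here is a minimal sketch (using synthetic, right-skewed data rather than the scientist's dataset) showing how a logarithmic transformation reduces skewness:

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(42)
    y = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed: mode < median < mean

    print(f"skewness before: {skew(y):.2f}")      # strongly positive
    y_log = np.log1p(y)                           # log(1 + y) also handles zero values safely
    print(f"skewness after:  {skew(y_log):.2f}")  # close to 0 (roughly symmetric)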

References:

Data Transformation - Scaler Topics

Linear Regression - GeeksforGeeks

Linear Regression - Scribbr

A company is planning a marketing campaign to promote a new product to existing customers. The company has data from past promotions that are similar. The company decides to try an experiment to send a more expensive marketing package to a smaller number of customers. The company wants to target the marketing campaign to customers who are most likely to buy the new product. The experiment requires that at least 90% of the customers who are likely to purchase the new product receive the marketing materials.

The company trains a model by using the linear learner algorithm in Amazon SageMaker. The model has a recall score of 80% and a precision of 75%.

How should the company retrain the model to meet these requirements?

A. Set the target_recall hyperparameter to 90%. Set the binary_classifier_model_selection_criteria hyperparameter to recall_at_target_precision.
B. Set the target_precision hyperparameter to 90%. Set the binary_classifier_model_selection_criteria hyperparameter to precision_at_target_recall.
C. Use 90% of the historical data for training. Set the number of epochs to 20.
D. Set the normalize_label hyperparameter to true. Set the number of classes to 2.

Suggested answer: A

Explanation:

The best way to retrain the model to meet the requirements is to set the target_recall hyperparameter to 90% and set the binary_classifier_model_selection_criteria hyperparameter to recall_at_target_precision. This will instruct the linear learner algorithm to optimize the model for a high recall score, while maintaining a reasonable precision score. Recall is the proportion of actual positives that were identified correctly, which is important for the company's goal of reaching at least 90% of the customers who are likely to buy the new product [1]. Precision is the proportion of positive identifications that were actually correct, which is also relevant for the company's budget and efficiency [2]. By setting the target_recall to 90%, the algorithm will try to achieve a recall score of at least 90%, and by setting the binary_classifier_model_selection_criteria to recall_at_target_precision, the algorithm will select the model that has the highest recall score among those that have a precision score equal to or higher than the target precision [3]. The target precision is automatically set to the median of the precision scores of all the models trained in parallel [4].
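As a reminder of the two metrics these hyperparameters trade off, here is a minimal sketch (with made-up labels, not the company's data) that computes them with scikit-learn:

    from sklearn.metrics import precision_score, recall_score

    y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 1 = customer actually bought the product
    y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]   # model predictions

    print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 4/5 = 0.8
    print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 4/5 = 0.8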

The other options are not correct or optimal, because they have the following drawbacks:

B: Setting the target_precision hyperparameter to 90% and setting the binary_classifier_model_selection_criteria hyperparameter to precision_at_target_recall will optimize the model for a high precision score, while maintaining a reasonable recall score. However, this is not aligned with the company's goal of reaching at least 90% of the customers who are likely to buy the new product, as precision does not reflect how well the model identifies the actual positives [1]. Moreover, setting the target_precision to 90% might be too high and unrealistic for the dataset, as the current precision score is only 75% [4].

C: Using 90% of the historical data for training and setting the number of epochs to 20 will not necessarily improve the recall score of the model, as it does not change the optimization objective or the model selection criteria. Moreover, using more data for training might reduce the amount of data available for validation, which is needed for selecting the best model among the ones trained in parallel [3]. The number of epochs is also not a decisive factor for the recall score, as it depends on the learning rate, the optimizer, and the convergence of the algorithm [5].

D: Setting the normalize_label hyperparameter to true and setting the number of classes to 2 will not affect the recall score of the model, as these are irrelevant hyperparameters for binary classification problems. The normalize_label hyperparameter is only applicable for regression problems, as it controls whether the label is normalized to have zero mean and unit variance [3]. The number of classes hyperparameter is only applicable for multiclass classification problems, as it specifies the number of output classes [3].

References:

[1] Classification: Precision and Recall | Machine Learning | Google for Developers

[2] Precision and recall - Wikipedia

[3] Linear Learner Algorithm - Amazon SageMaker

[4] How linear learner works - Amazon SageMaker

[5] Getting hands-on with Amazon SageMaker Linear Learner - Pluralsight

A data scientist receives a collection of insurance claim records. Each record includes a claim ID, the final outcome of the insurance claim, and the date of the final outcome.

The final outcome of each claim is a selection from among 200 outcome categories. Some claim records include only partial information. However, incomplete claim records include only 3 or 4 outcome categories from among the 200 available outcome categories. The collection includes hundreds of records for each outcome category. The records are from the previous 3 years.

The data scientist must create a solution to predict the number of claims that will be in each outcome category every month, several months in advance.

Which solution will meet these requirements?

A. Perform classification every month by using supervised learning of the 200 outcome categories based on claim contents.
B. Perform reinforcement learning by using claim IDs and dates. Instruct the insurance agents who submit the claim records to estimate the expected number of claims in each outcome category every month.
C. Perform forecasting by using claim IDs and dates to identify the expected number of claims in each outcome category every month.
D. Perform classification by using supervised learning of the outcome categories for which partial information on claim contents is provided. Perform forecasting by using claim IDs and dates for all other outcome categories.

Suggested answer: C

Explanation:

The best solution for this scenario is to perform forecasting by using claim IDs and dates to identify the expected number of claims in each outcome category every month. This solution has the following advantages:

It leverages the historical data of claim outcomes and dates to capture the temporal patterns and trends of the claims in each category [1].

It does not require the claim contents or any other features to make predictions, which simplifies the data preparation and reduces the impact of missing or incomplete data [2].

It can handle the high cardinality of the outcome categories, as forecasting models can output multiple values for each time point [3].

It can provide predictions for several months in advance, which is useful for planning and budgeting purposes [4].
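As an illustration only (the column names below are hypothetical, not from the question), the raw claim records could be aggregated into the monthly per-category count series that a forecasting model would consume:

    import pandas as pd

    claims = pd.DataFrame({
        "claim_id": [101, 102, 103, 104],
        "outcome_category": ["denied", "paid", "paid", "denied"],
        "outcome_date": pd.to_datetime(
            ["2023-01-10", "2023-01-22", "2023-02-03", "2023-02-15"]),
    })

    monthly_counts = (
        claims
        .groupby([pd.Grouper(key="outcome_date", freq="MS"), "outcome_category"])
        .size()
        .rename("claim_count")
        .reset_index()
    )
    print(monthly_counts)  # one time series per outcome category, one row per month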

The other solutions have the following drawbacks:

A: Performing classification every month by using supervised learning of the 200 outcome categories based on claim contents is not suitable, because it assumes that the claim contents are available and complete for all the records, which is not the case in this scenario [2]. Moreover, classification models usually output a single label for each input, which is not adequate for predicting the number of claims in each category [3]. Additionally, classification models do not account for the temporal aspect of the data, which is important for forecasting [1].

B: Performing reinforcement learning by using claim IDs and dates and instructing the insurance agents who submit the claim records to estimate the expected number of claims in each outcome category every month is not feasible, because it requires a feedback loop between the model and the agents, which might not be available or reliable in this scenario [5]. Furthermore, reinforcement learning is more suitable for sequential decision making problems, where the model learns from its actions and rewards, rather than forecasting problems, where the model learns from historical data and outputs future values [6].

D: Performing classification by using supervised learning of the outcome categories for which partial information on claim contents is provided and performing forecasting by using claim IDs and dates for all other outcome categories is not optimal, because it combines two different methods that might not be consistent or compatible with each other [7]. Also, this solution suffers from the same limitations as solution A, such as the dependency on claim contents, the inability to handle multiple outputs, and the ignorance of temporal patterns [1][2][3].

References:

[1] Time Series Forecasting - Amazon SageMaker

[2] Handling Missing Data for Machine Learning | AWS Machine Learning Blog

[3] Forecasting vs Classification: What's the Difference? | DataRobot

[4] Amazon Forecast -- Time Series Forecasting Made Easy | AWS News Blog

[5] Reinforcement Learning - Amazon SageMaker

[6] What is Reinforcement Learning? The Complete Guide | Edureka

[7] Combining Machine Learning Models | by Will Koehrsen | Towards Data Science

A retail company stores 100 GB of daily transactional data in Amazon S3 at periodic intervals. The company wants to identify the schema of the transactional data. The company also wants to perform transformations on the transactional data that is in Amazon S3.

The company wants to use a machine learning (ML) approach to detect fraud in the transformed data.

Which combination of solutions will meet these requirements with the LEAST operational overhead? (Select THREE.)

A. Use Amazon Athena to scan the data and identify the schema.
B. Use AWS Glue crawlers to scan the data and identify the schema.
C. Use Amazon Redshift stored procedures to perform data transformations.
D. Use AWS Glue workflows and AWS Glue jobs to perform data transformations.
E. Use Amazon Redshift ML to train a model to detect fraud.
F. Use Amazon Fraud Detector to train a model to detect fraud.

Suggested answer: B, D, F

Explanation:

To meet the requirements with the least operational overhead, the company should use AWS Glue crawlers, AWS Glue workflows and jobs, and Amazon Fraud Detector. AWS Glue crawlers can scan the data in Amazon S3 and identify the schema, which is then stored in the AWS Glue Data Catalog. AWS Glue workflows and jobs can perform data transformations on the data in Amazon S3 using serverless Spark or Python scripts. Amazon Fraud Detector can train a model to detect fraud using the transformed data and the company's historical fraud labels, and then generate fraud predictions using a simple API call.
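For example, the schema-discovery step could be set up with a few boto3 calls; the role ARN, database name, and S3 path below are placeholders, and the Glue database is assumed to already exist:

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="daily-transactions-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder IAM role
        DatabaseName="transactions_db",                         # placeholder Data Catalog database
        Targets={"S3Targets": [{"Path": "s3://example-bucket/transactions/"}]},
    )
    glue.start_crawler(Name="daily-transactions-crawler")  # infers the schema into the Data Catalog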

Option A is incorrect because Amazon Athena is a serverless query service that can analyze data in Amazon S3 using standard SQL, but it does not perform data transformations or fraud detection.

Option C is incorrect because Amazon Redshift is a cloud data warehouse that can store and query data using SQL, but it requires provisioning and managing clusters, which adds operational overhead. Moreover, Amazon Redshift does not provide a built-in fraud detection capability.

Option E is incorrect because Amazon Redshift ML is a feature that allows users to create, train, and deploy machine learning models using SQL commands in Amazon Redshift. However, using Amazon Redshift ML would require loading the data from Amazon S3 to Amazon Redshift, which adds complexity and cost. Also, Amazon Redshift ML does not support fraud detection as a use case.

References:

AWS Glue Crawlers

AWS Glue Workflows and Jobs

Amazon Fraud Detector

A company's machine learning (ML) specialist is building a computer vision model to classify 10 different traffic signs. The company has stored 100 images of each class in Amazon S3, and the company has another 10,000 unlabeled images. All the images come from dash cameras and are a size of 224 pixels x 224 pixels. After several training runs, the model is overfitting on the training data.

Which actions should the ML specialist take to address this problem? (Select TWO.)

A. Use Amazon SageMaker Ground Truth to label the unlabeled images.
B. Use image preprocessing to transform the images into grayscale images.
C. Use data augmentation to rotate and translate the labeled images.
D. Replace the activation of the last layer with a sigmoid.
E. Use the Amazon SageMaker k-nearest neighbors (k-NN) algorithm to label the unlabeled images.

Suggested answer: C, E

Explanation:

Data augmentation is a technique to increase the size and diversity of the training data by applying random transformations such as rotation, translation, scaling, flipping, etc. This can help reduce overfitting and improve the generalization of the model. Data augmentation can be done using the Amazon SageMaker image classification algorithm, which supports various augmentation options such as horizontal_flip, vertical_flip, rotate, brightness, contrast, etc. [1]

The Amazon SageMaker k-nearest neighbors (k-NN) algorithm is a supervised learning algorithm that can be used to label unlabeled data based on the similarity to the labeled data. The k-NN algorithm assigns a label to an unlabeled instance by finding the k closest labeled instances in the feature space and taking a majority vote among their labels. This can help increase the size and diversity of the training data and reduce overfitting. The k-NN algorithm can be used with the Amazon SageMaker image classification algorithm by extracting features from the images using a pre-trained model and then applying the k-NN algorithm on the feature vectors. [2]
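As a framework-level illustration of option C (this uses torchvision rather than the SageMaker built-in algorithm's hyperparameters, and the exact transform values are illustrative), a minimal augmentation pipeline might look like this:

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomRotation(degrees=15),                     # small random rotations
        transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # shift up to 10% in each direction
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ToTensor(),
    ])
    # Applying `augment` to each 224 x 224 labeled image at load time yields a new random
    # variant every epoch, which helps reduce overfitting on the small labeled dataset.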

Using Amazon SageMaker Ground Truth to label the unlabeled images is not a good option because it is a manual and costly process that requires human annotators. Moreover, it does not address the issue of overfitting on the existing labeled data.

Using image preprocessing to transform the images into grayscale images is not a good option because it reduces the amount of information and variation in the images, which can degrade the performance of the model. Moreover, it does not address the issue of overfitting on the existing labeled data.

Replacing the activation of the last layer with a sigmoid is not a good option because it is not suitable for a multi-class classification problem. A sigmoid activation function outputs a value between 0 and 1, which can be interpreted as a probability of belonging to a single class. However, for a multi-class classification problem, the output should be a vector of probabilities that sum up to 1, which can be achieved by using a softmax activation function.

References:

[1] Image classification algorithm - Amazon SageMaker

[2] k-nearest neighbors (k-NN) algorithm - Amazon SageMaker

A machine learning (ML) specialist is using the Amazon SageMaker DeepAR forecasting algorithm to train a model on CPU-based Amazon EC2 On-Demand instances. The model currently takes multiple hours to train. The ML specialist wants to decrease the training time of the model.

Which approaches will meet this requirement? (Select TWO.)

A. Replace On-Demand Instances with Spot Instances.
B. Configure model auto scaling dynamically to adjust the number of instances automatically.
C. Replace CPU-based EC2 instances with GPU-based EC2 instances.
D. Use multiple training instances.
E. Use a pre-trained version of the model. Run incremental training.

Suggested answer: C, D

Explanation:

The best approaches to decrease the training time of the model are C and D, because they can improve the computational efficiency and parallelization of the training process. These approaches have the following benefits:

C: Replacing CPU-based EC2 instances with GPU-based EC2 instances can speed up the training of the DeepAR algorithm, as it can leverage the parallel processing power of GPUs to perform matrix operations and gradient computations faster than CPUs [1][2]. The DeepAR algorithm supports GPU-based EC2 instances such as ml.p2 and ml.p3 [3].

D: Using multiple training instances can also reduce the training time of the DeepAR algorithm, as it can distribute the workload across multiple nodes and perform data parallelism [4]. The DeepAR algorithm supports distributed training with multiple CPU-based or GPU-based EC2 instances [3].
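A minimal sketch of both changes with the SageMaker Python SDK; the execution role, output bucket, and training data channels are assumed to exist already and the names below are placeholders:

    import boto3
    import sagemaker
    from sagemaker.estimator import Estimator

    region = boto3.Session().region_name
    image_uri = sagemaker.image_uris.retrieve("forecasting-deepar", region)

    estimator = Estimator(
        image_uri=image_uri,
        role=sagemaker.get_execution_role(),  # assumes a SageMaker-managed execution role
        instance_count=2,                     # D: distribute training across multiple instances
        instance_type="ml.p3.2xlarge",        # C: GPU-based instances instead of CPU-based
        output_path="s3://example-bucket/deepar-output/",  # placeholder bucket
    )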

The other options are not effective or relevant, because they have the following drawbacks:

A: Replacing On-Demand Instances with Spot Instances can reduce the cost of the training, but not necessarily the time, as Spot Instances are subject to interruption and availability [5]. Moreover, the DeepAR algorithm does not support checkpointing, which means that the training cannot resume from the last saved state if the Spot Instance is terminated [3].

B: Configuring model auto scaling dynamically to adjust the number of instances automatically is not applicable, as this feature is only available for inference endpoints, not for training jobs [6].

E: Using a pre-trained version of the model and running incremental training is not possible, as the DeepAR algorithm does not support incremental training or transfer learning [3]. The DeepAR algorithm requires a full retraining of the model whenever new data is added or the hyperparameters are changed [7].

References:

[1] GPU vs CPU: What Matters Most for Machine Learning? | by Louis (What's AI) Bouchard | Towards Data Science

[2] How GPUs Accelerate Machine Learning Training | NVIDIA Developer Blog

[3] DeepAR Forecasting Algorithm - Amazon SageMaker

[4] Distributed Training - Amazon SageMaker

[5] Managed Spot Training - Amazon SageMaker

[6] Automatic Scaling - Amazon SageMaker

[7] How the DeepAR Algorithm Works - Amazon SageMaker

A data scientist obtains a tabular dataset that contains 150 correlated features with different ranges to build a regression model. The data scientist needs to achieve more efficient model training by implementing a solution that minimizes impact on the model's performance. The data scientist decides to perform a principal component analysis (PCA) preprocessing step to reduce the number of features to a smaller set of independent features before the data scientist uses the new features in the regression model.

Which preprocessing step will meet these requirements?

A. Use the Amazon SageMaker built-in algorithm for PCA on the dataset to transform the data.
B. Load the data into Amazon SageMaker Data Wrangler. Scale the data with a Min Max Scaler transformation step. Use the SageMaker built-in algorithm for PCA on the scaled dataset to transform the data.
C. Reduce the dimensionality of the dataset by removing the features that have the highest correlation. Load the data into Amazon SageMaker Data Wrangler. Perform a Standard Scaler transformation step to scale the data. Use the SageMaker built-in algorithm for PCA on the scaled dataset to transform the data.
D. Reduce the dimensionality of the dataset by removing the features that have the lowest correlation. Load the data into Amazon SageMaker Data Wrangler. Perform a Min Max Scaler transformation step to scale the data. Use the SageMaker built-in algorithm for PCA on the scaled dataset to transform the data.

Suggested answer: B

Explanation:

Principal component analysis (PCA) is a technique for reducing the dimensionality of datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance. PCA is useful when dealing with datasets that have a large number of correlated features. However, PCA is sensitive to the scale of the features, so it is important to standardize or normalize the data before applying PCA. Amazon SageMaker provides a built-in algorithm for PCA that can be used to transform the data into a lower-dimensional representation. Amazon SageMaker Data Wrangler is a tool that allows data scientists to visually explore, clean, and prepare data for machine learning. Data Wrangler provides various transformation steps that can be applied to the data, such as scaling, encoding, imputing, etc. Data Wrangler also integrates with SageMaker built-in algorithms, such as PCA, to enable feature engineering and dimensionality reduction.

Therefore, option B is the correct answer, as it involves scaling the data with a Min Max Scaler transformation step, which rescales the data to a range of [0, 1], and then using the SageMaker built-in algorithm for PCA on the scaled dataset to transform the data. Option A is incorrect, as it does not involve scaling the data before applying PCA, which can affect the results of the dimensionality reduction. Option C is incorrect, as it involves removing the features that have the highest correlation, which can lead to information loss and reduce the performance of the regression model. Option D is incorrect, as it involves removing the features that have the lowest correlation, which can also lead to information loss and reduce the performance of the regression model.

References:

Principal Component Analysis (PCA) - Amazon SageMaker

Scale data with a Min Max Scaler - Amazon SageMaker Data Wrangler

Use Amazon SageMaker built-in algorithms - Amazon SageMaker Data Wrangler
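To make the order of operations in option B above concrete, here is a minimal sketch with synthetic data, using scikit-learn in place of the SageMaker built-in PCA; the component count is an illustrative choice:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import MinMaxScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 150)) * rng.uniform(1, 1000, size=150)  # 150 features with very different ranges

    X_scaled = MinMaxScaler().fit_transform(X)        # rescale every feature to [0, 1] first
    X_reduced = PCA(n_components=20).fit_transform(X_scaled)  # then project onto uncorrelated components

    print(X_reduced.shape)  # (1000, 20): a smaller set of independent features for the regression model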

A financial services company wants to automate its loan approval process by building a machine learning (ML) model. Each loan data point contains credit history from a third-party data source and demographic information about the customer. Each loan approval prediction must come with a report that contains an explanation for why the customer was approved for a loan or was denied for a loan. The company will use Amazon SageMaker to build the model.

Which solution will meet these requirements with the LEAST development effort?

A. Use SageMaker Model Debugger to automatically debug the predictions, generate the explanation, and attach the explanation report.
B. Use AWS Lambda to provide feature importance and partial dependence plots. Use the plots to generate and attach the explanation report.
C. Use SageMaker Clarify to generate the explanation report. Attach the report to the predicted results.
D. Use custom Amazon CloudWatch metrics to generate the explanation report. Attach the report to the predicted results.

Suggested answer: C

Explanation:

The best solution for this scenario is to use SageMaker Clarify to generate the explanation report and attach it to the predicted results. SageMaker Clarify provides tools to help explain how machine learning (ML) models make predictions using a model-agnostic feature attribution approach based on SHAP values. It can also detect and measure potential bias in the data and the model. SageMaker Clarify can generate explanation reports during data preparation, model training, and model deployment. The reports include metrics, graphs, and examples that help understand the model behavior and predictions. The reports can be attached to the predicted results using the SageMaker SDK or the SageMaker API.
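A minimal sketch of generating such a report with the SageMaker Python SDK; the bucket, model name, column names, and baseline values below are placeholders, not part of the question:

    import sagemaker
    from sagemaker import clarify

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()  # assumes a SageMaker-managed execution role

    processor = clarify.SageMakerClarifyProcessor(
        role=role, instance_count=1, instance_type="ml.m5.xlarge", sagemaker_session=session
    )

    data_config = clarify.DataConfig(
        s3_data_input_path="s3://example-bucket/loan-data/train.csv",   # placeholder dataset
        s3_output_path="s3://example-bucket/clarify-output/",
        label="approved",
        headers=["approved", "credit_score", "income", "age"],          # placeholder columns
        dataset_type="text/csv",
    )
    model_config = clarify.ModelConfig(
        model_name="loan-approval-model", instance_type="ml.m5.xlarge", instance_count=1
    )
    shap_config = clarify.SHAPConfig(
        baseline=[[650, 50000, 35]],   # placeholder baseline feature values for SHAP
        num_samples=100,
        agg_method="mean_abs",
    )

    # Writes a SHAP-based explainability report to the output path.
    processor.run_explainability(data_config, model_config, shap_config)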

The other solutions are less optimal because they require more development effort and additional services. Using SageMaker Model Debugger would require modifying the training script to save the model output tensors and writing custom rules to debug and explain the predictions. Using AWS Lambda would require writing code to invoke the ML model, compute the feature importance and partial dependence plots, and generate and attach the explanation report. Using custom Amazon CloudWatch metrics would require writing code to publish the metrics, create dashboards, and generate and attach the explanation report.

References:

Bias Detection and Model Explainability - Amazon SageMaker Clarify - AWS

Amazon SageMaker Clarify Model Explainability

Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability

GitHub - aws/amazon-sagemaker-clarify: Fairness Aware Machine Learning

An online retailer collects the following data on customer orders: demographics, behaviors, location, shipment progress, and delivery time. A data scientist joins all the collected datasets. The result is a single dataset that includes 980 variables.

The data scientist must develop a machine learning (ML) model to identify groups of customers who are likely to respond to a marketing campaign.

Which combination of algorithms should the data scientist use to meet this requirement? (Select TWO.)

A. Latent Dirichlet Allocation (LDA)
B. K-means
C. Semantic segmentation
D. Principal component analysis (PCA)
E. Factorization machines (FM)

Suggested answer: B, D

Explanation:

The data scientist should use K-means and principal component analysis (PCA) to meet this requirement. K-means is a clustering algorithm that can group customers based on their similarity in the feature space. PCA is a dimensionality reduction technique that can transform the original 980 variables into a smaller set of uncorrelated variables that capture most of the variance in the data. This can help reduce the computational cost and noise in the data, and improve the performance of the clustering algorithm.
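A minimal sketch of this two-step approach on synthetic data (scikit-learn is used here for brevity; the component and cluster counts are illustrative choices):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.default_rng(7).normal(size=(2000, 980))  # stand-in for the joined customer dataset

    X_std = StandardScaler().fit_transform(X)                     # PCA expects comparable scales
    components = PCA(n_components=50).fit_transform(X_std)        # compress 980 variables to 50 components
    segments = KMeans(n_clusters=8, n_init=10, random_state=7).fit_predict(components)

    print(np.bincount(segments))  # number of customers per segment for campaign targeting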

References:

Clustering - Amazon SageMaker

Dimensionality Reduction - Amazon SageMaker

A machine learning (ML) developer for an online retailer recently uploaded a sales dataset into Amazon SageMaker Studio. The ML developer wants to obtain importance scores for each feature of the dataset. The ML developer will use the importance scores to feature engineer the dataset.

Which solution will meet this requirement with the LEAST development effort?

A. Use SageMaker Data Wrangler to perform a Gini importance score analysis.
B. Use a SageMaker notebook instance to perform principal component analysis (PCA).
C. Use a SageMaker notebook instance to perform a singular value decomposition analysis.
D. Use the multicollinearity feature to perform a lasso feature selection to perform an importance scores analysis.

Suggested answer: A

Explanation:

SageMaker Data Wrangler is a feature of SageMaker Studio that provides an end-to-end solution for importing, preparing, transforming, featurizing, and analyzing data. Data Wrangler includes built-in analyses that help generate visualizations and data insights in a few clicks. One of the built-in analyses is the Quick Model visualization, which can be used to quickly evaluate the data and produce importance scores for each feature. A feature importance score indicates how useful a feature is at predicting a target label. The feature importance score is between [0, 1] and a higher number indicates that the feature is more important to the whole dataset. The Quick Model visualization uses a random forest model to calculate the feature importance for each feature using the Gini importance method. This method measures the total reduction in node impurity (a measure of how well a node separates the classes) that is attributed to splitting on a particular feature. The ML developer can use the Quick Model visualization to obtain the importance scores for each feature of the dataset and use them to feature engineer the dataset. This solution requires the least development effort compared to the other options.
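For intuition only, the impurity-based (Gini) importance that the Quick Model analysis reports can be approximated outside Data Wrangler with a random forest on synthetic data:

    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
    features = [f"feature_{i}" for i in range(X.shape[1])]

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    importance = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
    print(importance)  # impurity-based (Gini) importance scores that sum to 1 across features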

References:

Analyze and Visualize

Detect multicollinearity, target leakage, and feature correlation with Amazon SageMaker Data Wrangler
