Google Professional Machine Learning Engineer Practice Test - Questions Answers, Page 9
Question 81

One of your models is trained using data provided by a third-party data broker. The data broker does not reliably notify you of formatting changes in the data. You want to make your model training pipeline more robust to issues like this. What should you do?
Explanation:
TensorFlow Data Validation (TFDV) is a library that helps you understand, validate, and monitor your data for machine learning. It can automatically detect and report schema anomalies in your data, such as missing features, new features, or changed data types. It can also generate descriptive statistics and data visualizations to help you explore and debug your data. TFDV can be integrated with your model training pipeline to ensure data quality and consistency throughout the machine learning lifecycle.
Reference:
TensorFlow Data Validation
Data Validation | TensorFlow
Data Validation | Machine Learning Crash Course | Google Developers
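For illustration, here is a minimal sketch of how TFDV could be wired into such a pipeline to catch a broker-side formatting change before training; the file paths are hypothetical:

```python
# A minimal sketch: validate a new data drop from the broker against a schema
# inferred from a known-good baseline. File paths are hypothetical.
import tensorflow_data_validation as tfdv

# 1. Infer a schema from a trusted baseline dataset (done once).
baseline_stats = tfdv.generate_statistics_from_csv(
    data_location="gs://my-bucket/baseline/*.csv")
schema = tfdv.infer_schema(statistics=baseline_stats)

# 2. Each time the broker delivers new data, compute statistics and
#    validate them against the frozen schema.
new_stats = tfdv.generate_statistics_from_csv(
    data_location="gs://my-bucket/new_drop/*.csv")
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)

# 3. Fail the pipeline (or alert) if anomalies such as missing features,
#    new features, or type changes are reported.
if anomalies.anomaly_info:
    raise ValueError(f"Data anomalies detected: {dict(anomalies.anomaly_info)}")
```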
Question 82

You work for a company that is developing a new video streaming platform. You have been asked to create a recommendation system that will suggest the next video for a user to watch. After a review by an AI Ethics team, you are approved to start development. Each video asset in your company's catalog has useful metadata (e.g., content type, release date, country), but you do not have any historical user event data. How should you build the recommendation system for the first version of the product?
Explanation:
The best option for building a recommendation system without any user event data is to use simple heuristics based on content metadata. This is a type of content-based filtering, which recommends items that are similar to the ones that the user has interacted with or selected, based on their attributes. For example, if a user selects a comedy movie from the US released in 2020, the system can recommend other comedy movies from the US released in 2020 or nearby years. This approach does not require any machine learning, but it can leverage the existing metadata of the videos to provide relevant recommendations. It also allows the system to start collecting user event data, such as views, likes, and ratings, which can later be used to train a more sophisticated machine learning model, such as a collaborative filtering model or a hybrid model that combines content and collaborative information.
Reference:
Recommendation Systems
Content-Based Filtering
Collaborative Filtering
Hybrid Recommender Systems: A Systematic Literature Review
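As a rough illustration, a metadata-only heuristic can be implemented in a few lines; the field names and weights below are assumptions, not part of the question:

```python
# A minimal sketch of a metadata-only heuristic recommender, assuming each
# video is a dict of metadata fields; no user event data is required.
def recommend_next(current_video, catalog, k=5):
    """Rank catalog items by simple metadata similarity to the current video."""
    def score(candidate):
        s = 0.0
        if candidate["content_type"] == current_video["content_type"]:
            s += 2.0                                  # same content type weighs most
        if candidate["country"] == current_video["country"]:
            s += 1.0
        # Prefer releases close in time to the current video.
        s -= abs(candidate["release_year"] - current_video["release_year"]) * 0.1
        return s

    candidates = [v for v in catalog if v["id"] != current_video["id"]]
    return sorted(candidates, key=score, reverse=True)[:k]
```

Every impression and click served by this heuristic can be logged as user event data for a future collaborative-filtering model.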
Question 83

You recently built the first version of an image segmentation model for a self-driving car. After deploying the model, you observe a decrease in the area under the curve (AUC) metric. When analyzing the video recordings, you also discover that the model fails in highly congested traffic but works as expected when there is less traffic. What is the most likely reason for this result?
Explanation:
The most likely reason for the observed result is that the model is overfitting in areas with less traffic and underfitting in areas with more traffic. Overfitting means that the model learns the specific patterns and noise in the training data but fails to generalize to new, unseen data. Underfitting means that the model cannot capture the complexity and variability of the data and performs poorly on both training and test data. In this case, the model may have learned to segment images well when traffic is light, but it does not have enough data or features to handle the more challenging, highly congested scenes, which leads to the drop in AUC. AUC measures the model's ability to distinguish between classes, and it is a suitable metric here because it is not affected by class imbalance or threshold selection.

The other options are unlikely to explain the result, because they are unrelated to traffic density. Having too much data representing congested areas would not cause the model to fail in those areas; it would help the model learn them better. Vanishing or exploding gradients is a problem that occurs during training, not after deployment, and it affects the whole model rather than specific scenarios.
Reference:
Image Segmentation: U-Net For Self Driving Cars
Intelligent Semantic Segmentation for Self-Driving Vehicles Using Deep Learning
Sharing Pixelopolis, a self-driving car demo from Google I/O built with TensorFlow Lite
Google Cloud launches machine learning engineer certification
Google Professional Machine Learning Engineer Certification
Professional ML Engineer Exam Guide
Preparing for Google Cloud Certification: Machine Learning Engineer Professional Certificate
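Since the answer hinges on AUC being a threshold-free ranking metric, here is a tiny illustration with made-up labels and scores:

```python
# A small illustration (with made-up numbers) of how AUC compares predicted
# scores against labels without fixing a decision threshold.
from sklearn.metrics import roc_auc_score

y_true   = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth class per example
y_scores = [0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2]   # model scores

# Prints 1.0 here: every positive is ranked above every negative,
# regardless of where a classification threshold would be placed.
print(roc_auc_score(y_true, y_scores))
```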
Question 84

You are developing an ML model to predict house prices. While preparing the data, you discover that an important predictor variable, distance from the closest school, is often missing and does not have high variance. Every instance (row) in your data is important. How should you handle the missing data?
Explanation:
The best option for handling missing data in this case is to predict the missing values using linear regression. Linear regression is a supervised learning technique that can be used to estimate the relationship between a continuous target variable and one or more predictor variables. In this case, the target variable is the distance from the closest school, and the predictor variables are the other features in the dataset, such as house size, location, number of rooms, etc. By fitting a linear regression model on the data that has no missing values, we can then use the model to predict the missing values for the distance from the closest school feature. This way, we can preserve all the instances in the dataset and avoid introducing bias or reducing variance. The other options are not suitable for handling missing data in this case, because:
Deleting the rows that have missing values would reduce the size of the dataset and potentially lose important information. Since every instance is important, we want to keep as much data as possible.
Applying feature crossing with another column that does not have missing values would create a new feature that combines the values of two existing features. This might increase the complexity of the model and introduce noise or multicollinearity. It would not solve the problem of missing values, as the new feature would still have missing values whenever the distance from the closest school feature is missing.
Replacing the missing values with zeros would distort the distribution of the feature and introduce bias. It would also imply that every house with a missing value is located at the same distance from the closest school, which is unlikely to be true. Zero is also an implausible value for this feature, since no house is likely to be at exactly zero distance from a school.
Reference:
Linear Regression
Imputation of missing values
Google Cloud launches machine learning engineer certification
Google Professional Machine Learning Engineer Certification
Professional ML Engineer Exam Guide
Preparing for Google Cloud Certification: Machine Learning Engineer Professional Certificate
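A minimal sketch of this imputation approach with scikit-learn, using a toy DataFrame and hypothetical column names:

```python
# Regression-based imputation sketch: fit on rows where the value is present,
# then predict the missing ones. Column names and values are made up.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "house_size":         [120, 85, 200, 150, 95],
    "num_rooms":          [4, 3, 6, 5, 3],
    "distance_to_school": [1.2, 2.5, np.nan, 1.8, np.nan],  # NaN marks missing values
})

predictors, target = ["house_size", "num_rooms"], "distance_to_school"
known = df[df[target].notna()]

model = LinearRegression().fit(known[predictors], known[target])
mask = df[target].isna()
df.loc[mask, target] = model.predict(df.loc[mask, predictors])
print(df)  # every row is kept; missing distances are now estimated
```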
Question 85

You are an ML engineer responsible for designing and implementing training pipelines for ML models. You need to create an end-to-end training pipeline for a TensorFlow model. The TensorFlow model will be trained on several terabytes of structured data. You need the pipeline to include data quality checks before training and model quality checks after training but prior to deployment. You want to minimize development time and the need for infrastructure maintenance. How should you build and orchestrate your training pipeline?
Explanation:
The best option for creating and orchestrating an end-to-end training pipeline for a TensorFlow model is to use TensorFlow Extended (TFX) and standard TFX components, and deploy the pipeline to Vertex AI Pipelines. TFX is an end-to-end platform for deploying production ML pipelines, which consists of several built-in components that cover the entire ML lifecycle, from data ingestion and validation, to model training and evaluation, to model deployment and monitoring. TFX also supports custom components and integrations with other Google Cloud services, such as BigQuery, Dataflow, and Cloud Storage. Vertex AI Pipelines is a fully managed service that allows you to run TFX pipelines on Google Cloud, without having to worry about infrastructure provisioning, scaling, or maintenance. Vertex AI Pipelines also provides a user-friendly interface to monitor and manage your pipelines, as well as tools to track and compare experiments. The other options are not as suitable for creating and orchestrating an end-to-end training pipeline for a TensorFlow model, because:
Creating the pipeline using the Kubeflow Pipelines domain-specific language (DSL) and predefined Google Cloud components would require more development time and effort, because the Kubeflow Pipelines DSL is lower-level and less tightly integrated with TensorFlow than TFX. Predefined Google Cloud components might not cover all stages of the ML lifecycle and might not be optimized for TensorFlow models.
Orchestrating the pipeline using Kubeflow Pipelines deployed on Google Kubernetes Engine would require more infrastructure maintenance, as Kubeflow Pipelines is not a fully managed service and you would have to provision and manage your own Kubernetes cluster. This would also incur more costs, as you would have to pay for the cluster resources regardless of pipeline usage.
Reference:
TFX | ML Production Pipelines | TensorFlow
Vertex AI Pipelines | Google Cloud
Kubeflow Pipelines | Google Cloud
Google Cloud launches machine learning engineer certification
Google Professional Machine Learning Engineer Certification
Professional ML Engineer Exam Guide
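A condensed sketch of such a pipeline using standard TFX components, with ExampleValidator providing the pre-training data quality check and Evaluator providing the post-training model quality check, compiled for Vertex AI Pipelines. Paths, the trainer module file, and the metric threshold are placeholders:

```python
# A condensed sketch, not a complete pipeline. Paths, module file, label key,
# and the accuracy threshold are placeholders.
import tensorflow_model_analysis as tfma
from tfx import v1 as tfx

DATA_ROOT = "gs://my-bucket/data/"
PIPELINE_ROOT = "gs://my-bucket/pipeline_root/"
SERVING_DIR = "gs://my-bucket/serving_model/"

# For structured data already in BigQuery, BigQueryExampleGen could be used instead.
example_gen = tfx.components.CsvExampleGen(input_base=DATA_ROOT)
statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(statistics=statistics_gen.outputs["statistics"])

# Data quality check before training: flags anomalies against the schema.
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs["statistics"],
    schema=schema_gen.outputs["schema"],
)

trainer = tfx.components.Trainer(
    module_file="trainer.py",                       # user-provided TensorFlow training code
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    train_args=tfx.proto.TrainArgs(num_steps=10000),
    eval_args=tfx.proto.EvalArgs(num_steps=1000),
)

# Model quality check after training: the model is only "blessed" if it meets
# the threshold, and Pusher deploys only blessed models.
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="label")],
    metrics_specs=[tfma.MetricsSpec(metrics=[tfma.MetricConfig(
        class_name="BinaryAccuracy",
        threshold=tfma.MetricThreshold(
            value_threshold=tfma.GenericValueThreshold(lower_bound={"value": 0.8})),
    )])],
)
evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],
    eval_config=eval_config,
)
pusher = tfx.components.Pusher(
    model=trainer.outputs["model"],
    model_blessing=evaluator.outputs["blessing"],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(base_directory=SERVING_DIR)),
)

pipeline = tfx.dsl.Pipeline(
    pipeline_name="tf-training-pipeline",
    pipeline_root=PIPELINE_ROOT,
    components=[example_gen, statistics_gen, schema_gen, example_validator,
                trainer, evaluator, pusher],
)

# Compile to a Vertex AI Pipelines definition; the resulting "pipeline.json"
# can then be submitted with google.cloud.aiplatform.PipelineJob.
tfx.orchestration.experimental.KubeflowV2DagRunner(
    config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(),
    output_filename="pipeline.json",
).run(pipeline)
```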
Question 86

Your data science team has requested a system that supports scheduled model retraining, Docker containers, and a service that supports autoscaling and monitoring for online prediction requests. Which platform components should you choose for this system?
Explanation:
Vertex AI Pipelines and AI Platform Prediction are the platform components that best suit the requirements of the data science team. Vertex AI Pipelines is a service that allows you to orchestrate and automate your machine learning workflows using pipelines. Pipelines are portable and scalable ML workflows that are based on containers. You can use Vertex AI Pipelines to schedule model retraining, use custom containers, and integrate with other Google Cloud services. AI Platform Prediction is a service that allows you to host your trained models and serve online predictions. You can use AI Platform Prediction to deploy models trained on Vertex AI or elsewhere, and benefit from features such as autoscaling, monitoring, logging, and explainability.
Reference:
Vertex AI Pipelines
AI Platform Prediction
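As a hedged illustration of the serving side, the same autoscaling online-prediction setup can be expressed with the Vertex AI SDK (the successor surface to AI Platform Prediction); the project, bucket, and container URIs below are placeholders:

```python
# Sketch only: register a trained model and deploy it for online prediction
# with autoscaling limits. Names and URIs are hypothetical.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Register a trained model artifact (e.g., a TensorFlow SavedModel in Cloud Storage).
model = aiplatform.Model.upload(
    display_name="recommender",
    artifact_uri="gs://my-bucket/models/recommender/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-11:latest"
    ),
)

# Deploy to an endpoint that autoscales between 1 and 5 replicas; request and
# latency monitoring is available on the endpoint once it is deployed.
endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=5,
)
print(endpoint.resource_name)
```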
Question 87

While monitoring your model training's GPU utilization, you discover that you have a naive synchronous implementation. The training data is split into multiple files. You want to reduce the execution time of your input pipeline. What should you do?
Explanation:
Parallel interleave is a technique that improves input-pipeline performance by reading and processing data from multiple files in parallel. This reduces GPU idle time and speeds up training. In TensorFlow this was originally exposed as tf.data.experimental.parallel_interleave(), which takes a map function that returns a dataset for each input element and a cycle length that determines how many input elements are processed concurrently; in current releases the same behavior is available through tf.data.Dataset.interleave() with the num_parallel_calls argument. Parallel interleave also handles different file sizes and processing times through a block length argument that controls how many consecutive elements are produced from each input element before switching to another. For more information about parallel interleave and how to use it, see the following references:
How to use parallel_interleave in TensorFlow
Better performance with the tf.data API
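A minimal sketch of this pattern in current TensorFlow, with a hypothetical file pattern and a placeholder record parser:

```python
# Read multiple TFRecord shards in parallel instead of sequentially,
# so the input pipeline keeps the GPU fed. File names are hypothetical.
import tensorflow as tf

def parse_fn(record):
    # Placeholder parser; a real pipeline would decode features here.
    return record

filenames = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")

dataset = filenames.interleave(
    lambda f: tf.data.TFRecordDataset(f),
    cycle_length=8,                       # read from 8 files concurrently
    num_parallel_calls=tf.data.AUTOTUNE,  # parallel, not synchronous
    deterministic=False,                  # ordering is not required for training
)
dataset = dataset.map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(128).prefetch(tf.data.AUTOTUNE)
```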
Question 88

Your data science team is training a PyTorch model for image classification based on a pre-trained ResNet model. You need to perform hyperparameter tuning to optimize for several parameters. What should you do?
Explanation:
AI Platform supports hyperparameter tuning for PyTorch models using custom containers. This lets you use Python dependencies and libraries that are not included in the pre-built AI Platform Training runtime versions, and you can start from a pre-trained model such as ResNet as the base for your custom model. To run a hyperparameter tuning job on AI Platform with custom containers, complete the following steps:
Create a Dockerfile that defines the container image for your training application. The Dockerfile should install PyTorch and any other dependencies, copy your training code and configuration files, and set the entrypoint for the container.
Build the container image and push it to Container Registry or another accessible registry.
Create a YAML file that defines the configuration for your hyperparameter tuning job. The YAML file should specify the container image URI, the training input and output paths, the hyperparameters to tune, the metric to optimize, and the tuning algorithm and budget.
Submit the hyperparameter tuning job to AI Platform using the gcloud command-line tool or the AI Platform Training API.
Reference:
Hyperparameter tuning overview
Using custom containers
PyTorch on AI Platform Training
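As a hedged sketch of the training code packaged by the Dockerfile in the first step: the entrypoint typically accepts the tuned values as command-line flags and reports the optimization metric with the cloudml-hypertune helper. The flag names, model head, and training loop below are illustrative:

```python
# Illustrative training entrypoint for a custom-container hyperparameter tuning
# job. Tuned values arrive as CLI flags; the metric is reported via cloudml-hypertune.
import argparse

import hypertune  # pip install cloudml-hypertune
import torch
import torchvision


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=5)
    args = parser.parse_args()

    # Start from a pre-trained ResNet and replace the classification head.
    model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Linear(model.fc.in_features, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=args.learning_rate)

    # ... training / validation loop omitted; suppose it produces val_accuracy ...
    val_accuracy = 0.0  # placeholder value

    # Report the metric that the tuning job is configured to maximize.
    hypertune.HyperTune().report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="val_accuracy",
        metric_value=val_accuracy,
        global_step=args.epochs,
    )


if __name__ == "__main__":
    main()
```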
Question 89

You have a large corpus of written support cases that can be classified into 3 separate categories: Technical Support, Billing Support, or Other Issues. You need to quickly build, test, and deploy a service that will automatically classify future written requests into one of the categories. How should you configure the pipeline?
Explanation:
AutoML Natural Language is a service that allows you to quickly build, test and deploy natural language processing (NLP) models without needing to have expertise in NLP or machine learning. You can use it to train a classifier on your corpus of written support cases, and then use the AutoML API to perform classification on new requests. Once the model is trained, it can be deployed as a REST API. This allows the classifier to be integrated into your pipeline and be easily consumed by other systems.
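For illustration, classifying a new request against a deployed AutoML Natural Language model might look like the following; the project, location, and model IDs are placeholders:

```python
# Sketch of an online classification call to a deployed AutoML Natural Language
# model. Project, location, and model IDs are hypothetical.
from google.cloud import automl_v1 as automl

client = automl.PredictionServiceClient()
model_name = automl.AutoMlClient.model_path("my-project", "us-central1", "TCN1234567890")

payload = {"text_snippet": {"content": "I was charged twice this month.",
                            "mime_type": "text/plain"}}
response = client.predict(name=model_name, payload=payload)

for result in response.payload:
    # e.g. "Billing Support 0.97"
    print(result.display_name, result.classification.score)
```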
Question 90

You need to quickly build and train a model to predict the sentiment of customer reviews with custom categories without writing code. You do not have enough data to train a model from scratch. The resulting model should have high predictive performance. Which service should you use?
Explanation:
AutoML Natural Language is a service that allows you to build and train custom natural language models without writing code. You can use AutoML Natural Language to perform sentiment analysis with custom categories, such as positive, negative, or neutral. You can also use pre-trained models or transfer learning to leverage existing knowledge and reduce the amount of data required to train a model from scratch. AutoML Natural Language provides a user-friendly interface and a powerful AutoML engine that optimizes your model for high predictive performance.
Cloud Natural Language API is a service that provides pre-trained models for common natural language tasks, such as sentiment analysis, entity analysis, and syntax analysis. However, it does not allow you to customize the categories or use your own data for training.
AI Hub pre-made Jupyter Notebooks are interactive documents that contain code, text, and visualizations for various machine learning scenarios. However, they require some coding skills and data preparation to use them effectively.
AI Platform Training built-in algorithms are pre-configured machine learning algorithms that you can use to train models on AI Platform. However, they do not support sentiment analysis as a natural language task.
Reference:
AutoML Natural Language documentation
Cloud Natural Language API documentation
AI Hub documentation
AI Platform Training documentation