Google Professional Machine Learning Engineer Practice Test - Questions Answers

You are training a TensorFlow model on a structured data set with 100 billion records stored in several CSV files. You need to improve the input/output execution performance. What should you do?

A. Load the data into BigQuery and read the data from BigQuery.
B. Load the data into Cloud Bigtable, and read the data from Bigtable.
C. Convert the CSV files into shards of TFRecords, and store the data in Cloud Storage.
D. Convert the CSV files into shards of TFRecords, and store the data in the Hadoop Distributed File System (HDFS).
Suggested answer: C

Explanation:

The input/output execution performance of a TensorFlow model depends on how efficiently the model can read and process the data from the data source. Reading and processing data from CSV files can be slow and inefficient, especially if the data is large and distributed. Therefore, to improve the input/output execution performance, one should use a more suitable data format and storage system.

One of the best options for improving the input/output execution performance is to convert the CSV files into shards of TFRecords, and store the data in Cloud Storage. TFRecord is a binary data format that can store a sequence of serialized TensorFlow examples. TFRecord has several advantages over CSV, such as:

Faster data loading: TFRecord can be read and processed faster than CSV, as it avoids the overhead of parsing and decoding the text data. TFRecord also supports compression and checksums, which can reduce the data size and ensure data integrity.

Better performance: TFRecord can improve the performance of the model, as it allows the model to access the data in a sequential and streaming manner, and leverage the tf.data API to build efficient data pipelines. TFRecord also supports sharding and interleaving, which can increase the parallelism and throughput of the data processing.

Easier integration: TFRecord integrates seamlessly with TensorFlow, as it is the native data format for TensorFlow. TFRecord also supports various types of data, such as images, text, audio, and video, and can store the data schema and metadata along with the data.
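
As a rough, illustrative sketch (not part of the original explanation), the snippet below shows one way to write rows into compressed TFRecord shards in Cloud Storage and read them back through a parallel tf.data pipeline; the bucket path, feature names, and shard size are hypothetical.

```python
import tensorflow as tf

# Hypothetical locations and schema; replace with your own bucket and columns.
OUTPUT_PATTERN = "gs://my-bucket/train/shard-{:05d}.tfrecord"

def rows_to_tfrecord_shards(rows, rows_per_shard=100_000):
    """Write (feature, label) rows into compressed TFRecord shards."""
    options = tf.io.TFRecordOptions(compression_type="GZIP")
    writer, shard = None, 0
    for i, (feature, label) in enumerate(rows):
        if i % rows_per_shard == 0:
            if writer:
                writer.close()
            writer = tf.io.TFRecordWriter(OUTPUT_PATTERN.format(shard), options)
            shard += 1
        example = tf.train.Example(features=tf.train.Features(feature={
            "feature": tf.train.Feature(float_list=tf.train.FloatList(value=[feature])),
            "label": tf.train.Feature(float_list=tf.train.FloatList(value=[label])),
        }))
        writer.write(example.SerializeToString())
    if writer:
        writer.close()

def parse_example(serialized):
    spec = {"feature": tf.io.FixedLenFeature([], tf.float32),
            "label": tf.io.FixedLenFeature([], tf.float32)}
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed["feature"], parsed["label"]

def make_dataset():
    # Interleave many shards in parallel and prefetch to keep the accelerators fed.
    files = tf.data.Dataset.list_files("gs://my-bucket/train/shard-*.tfrecord")
    return (files
            .interleave(lambda f: tf.data.TFRecordDataset(f, compression_type="GZIP"),
                        num_parallel_calls=tf.data.AUTOTUNE)
            .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(1024)
            .prefetch(tf.data.AUTOTUNE))
```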

Cloud Storage is a scalable and reliable object storage service that can store any amount of data. Cloud Storage has several advantages over other storage systems, such as:

High availability: Cloud Storage can provide high availability and durability for the data, as it replicates the data across multiple regions and zones, and supports versioning and lifecycle management. Cloud Storage also offers various storage classes, such as Standard, Nearline, Coldline, and Archive, to meet different performance and cost requirements.

Low latency: Cloud Storage can provide low latency and high bandwidth for the data, as it supports HTTP and HTTPS protocols, and integrates with other Google Cloud services, such as AI Platform, Dataflow, and BigQuery. Cloud Storage also supports resumable uploads and downloads, and parallel composite uploads, which can improve the data transfer speed and reliability.

Easy access: Cloud Storage can provide easy access and management for the data, as it supports various tools and libraries, such as gsutil, Cloud Console, and Cloud Storage Client Libraries. Cloud Storage also supports fine-grained access control and encryption, which can ensure the data security and privacy.

The other options are not as effective or feasible. Loading the data into BigQuery and reading the data from BigQuery is not recommended, as BigQuery is optimized for analytical SQL queries on large-scale data rather than for the high-throughput sequential reads that a TensorFlow input pipeline needs. Loading the data into Cloud Bigtable and reading the data from Bigtable is not ideal, as Cloud Bigtable is mainly designed for low-latency and high-throughput key-value operations on sparse and wide tables, and does not support complex data types or schemas. Converting the CSV files into shards of TFRecords and storing the data in the Hadoop Distributed File System (HDFS) is not optimal, as HDFS is not as well integrated with TensorFlow on Google Cloud and requires additional configuration and infrastructure, such as a Hadoop or Spark cluster, that you would have to provision and maintain.

You have deployed multiple versions of an image classification model on AI Platform. You want to monitor the performance of the model versions over time. How should you perform this comparison?

A. Compare the loss performance for each model on a held-out dataset.
B. Compare the loss performance for each model on the validation data.
C. Compare the receiver operating characteristic (ROC) curve for each model using the What-If Tool.
D. Compare the mean average precision across the models using the Continuous Evaluation feature.
Suggested answer: D

Explanation:

The performance of an image classification model can be measured by various metrics, such as accuracy, precision, recall, F1-score, and mean average precision (mAP). These metrics are calculated by comparing the predicted labels with the true labels of the images, for example via the confusion matrix.

One of the best ways to monitor the performance of multiple versions of an image classification model on AI Platform is to compare the mean average precision across the models using the Continuous Evaluation feature. Mean average precision is a metric that summarizes the precision and recall of a model across different confidence thresholds and classes. Mean average precision is especially useful for multi-class and multi-label image classification problems, where the model has to assign one or more labels to each image from a set of possible labels. Mean average precision can range from 0 to 1, where a higher value indicates a better performance.
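
To make the metric concrete, here is a small illustrative sketch (not taken from the exam material) that computes a macro-averaged mean average precision with scikit-learn; the labels and scores are invented for the example.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical ground truth (one-hot per class) and predicted scores for 4 images, 3 classes.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
y_score = np.array([[0.9, 0.05, 0.05],
                    [0.2, 0.7, 0.1],
                    [0.1, 0.3, 0.6],
                    [0.6, 0.3, 0.1]])

# Average precision per class, then the mean across classes (macro mAP).
per_class_ap = [average_precision_score(y_true[:, c], y_score[:, c])
                for c in range(y_true.shape[1])]
mean_average_precision = float(np.mean(per_class_ap))
print(f"mAP = {mean_average_precision:.3f}")
```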

Continuous Evaluation is a feature of AI Platform that allows you to automatically evaluate the performance of your deployed models using online prediction requests and responses. Continuous Evaluation can help you monitor the quality and consistency of your models over time, and detect any issues or anomalies that may affect the model performance. Continuous Evaluation can also provide various evaluation metrics and visualizations, such as accuracy, precision, recall, F1-score, ROC curve, and confusion matrix, for different types of models, such as classification, regression, and object detection.

To compare the mean average precision across the models using the Continuous Evaluation feature, you need to complete the following steps:

Enable the online prediction logging for each model version that you want to evaluate. This will allow AI Platform to collect the prediction requests and responses from your models and store them in BigQuery.

Create an evaluation job for each model version that you want to evaluate. This will allow AI Platform to compare the predicted labels and the true labels of the images, and calculate the evaluation metrics, such as mean average precision. You need to specify the BigQuery table that contains the prediction logs, the data schema, the label column, and the evaluation interval.

View the evaluation results for each model version on the AI Platform Models page in the Google Cloud console. You can see the mean average precision and other metrics for each model version over time, and compare them using charts and tables. You can also filter the results by different classes and confidence thresholds.

The other options are not as effective or feasible. Comparing the loss performance for each model on a held-out dataset or on the validation data is not a good idea, as the loss function may not reflect the actual performance of the model on the online prediction data, and may vary depending on the choice of the loss function and the optimization algorithm. Comparing the receiver operating characteristic (ROC) curve for each model using the What-If Tool is not possible, as the What-If Tool does not support image data or multi-class classification problems.

Your team trained and tested a DNN regression model with good results. Six months after deployment, the model is performing poorly due to a change in the distribution of the input data. How should you address the input differences in production?

A. Create alerts to monitor for skew, and retrain the model.
B. Perform feature selection on the model, and retrain the model with fewer features.
C. Retrain the model, and select an L2 regularization parameter with a hyperparameter tuning service.
D. Perform feature selection on the model, and retrain the model on a monthly basis with fewer features.
Suggested answer: A

Explanation:

The performance of a DNN regression model can degrade over time due to a change in the distribution of the input data. This phenomenon is known as data drift or concept drift, and it can affect the accuracy and reliability of the model predictions. Data drift can be caused by various factors, such as seasonal changes, population shifts, market trends, or external events.

To address the input differences in production, one should create alerts to monitor for skew, and retrain the model. Skew is a measure of how much the input data in production differs from the input data used for training the model. Skew can be detected by comparing the statistics and distributions of the input features in the training and production data, such as mean, standard deviation, histogram, or quantiles. Alerts can be set up to notify the model developers or operators when the skew exceeds a certain threshold, indicating a significant change in the input data.
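
As an illustrative sketch only (not part of the original answer), training/serving skew on a single numeric feature could be flagged with a two-sample Kolmogorov-Smirnov test; the feature values and alert threshold below are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical samples of one input feature from training data and recent production traffic.
train_feature = np.random.normal(loc=0.0, scale=1.0, size=10_000)
prod_feature = np.random.normal(loc=0.4, scale=1.2, size=10_000)  # distribution has shifted

statistic, p_value = ks_2samp(train_feature, prod_feature)

SKEW_THRESHOLD = 0.1  # chosen per feature from historical variation
if statistic > SKEW_THRESHOLD:
    # In production this would raise an alert (e.g., via Cloud Monitoring) and kick off retraining.
    print(f"Skew detected: KS statistic={statistic:.3f}, p={p_value:.3g} -- trigger retraining")
```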

When an alert is triggered, the model should be retrained with the latest data that reflects the current distribution of the input features. Retraining the model can help the model adapt to the new data and improve its performance. Retraining the model can be done manually or automatically, depending on the frequency and severity of the data drift. Retraining the model can also involve updating the model architecture, hyperparameters, or optimization algorithm, if necessary.

The other options are not as effective or feasible. Performing feature selection on the model and retraining the model with fewer features is not a good idea, as it may reduce the expressiveness and complexity of the model, and ignore some important features that may affect the output. Retraining the model and selecting an L2 regularization parameter with a hyperparameter tuning service is not relevant, as L2 regularization is a technique to prevent overfitting, not data drift. Retraining the model on a monthly basis with fewer features is not optimal, as it may not capture the timely changes in the input data, and may compromise the model performance.

You manage a team of data scientists who use a cloud-based backend system to submit training jobs. This system has become very difficult to administer, and you want to use a managed service instead. The data scientists you work with use many different frameworks, including Keras, PyTorch, Theano, scikit-learn, and custom libraries. What should you do?

A. Use the AI Platform custom containers feature to receive training jobs using any framework.
B. Configure Kubeflow to run on Google Kubernetes Engine and receive training jobs through TFJob.
C. Create a library of VM images on Compute Engine, and publish these images on a centralized repository.
D. Set up Slurm workload manager to receive jobs that can be scheduled to run on your cloud infrastructure.
Suggested answer: A

Explanation:

A cloud-based backend system is a system that runs on a cloud platform and provides services or resources to other applications or users. A cloud-based backend system can be used to submit training jobs, which are tasks that involve training a machine learning model on a given dataset using a specific framework and configuration.

However, a cloud-based backend system can also have some drawbacks, such as:

High maintenance: A cloud-based backend system may require a lot of administration and management, such as provisioning, scaling, monitoring, and troubleshooting the cloud resources and services. This can be time-consuming and costly, and may distract from the core business objectives.

Low flexibility: A cloud-based backend system may not support all the frameworks and libraries that the data scientists need to use for their training jobs. This can limit the choices and capabilities of the data scientists, and affect the quality and performance of their models.

Poor integration: A cloud-based backend system may not integrate well with other cloud services or tools that the data scientists need to use for their machine learning workflows, such as data processing, model deployment, or model monitoring. This can create compatibility and interoperability issues, and reduce the efficiency and productivity of the data scientists.

Therefore, it may be better to use a managed service instead of a cloud-based backend system to submit training jobs. A managed service is a service that is provided and operated by a third-party provider, and offers various benefits, such as:

Low maintenance: A managed service handles the administration and management of the cloud resources and services, and abstracts away the complexity and details of the underlying infrastructure. This can save time and money, and allow the data scientists to focus on their core tasks.

High flexibility: A managed service can support multiple frameworks and libraries that the data scientists need to use for their training jobs, and allow them to customize and configure their training environments and parameters. This can enhance the choices and capabilities of the data scientists, and improve the quality and performance of their models.

Easy integration: A managed service can integrate seamlessly with other cloud services or tools that the data scientists need to use for their machine learning workflows, and provide a unified and consistent interface and experience. This can solve the compatibility and interoperability issues, and increase the efficiency and productivity of the data scientists.

One of the best options for using a managed service to submit training jobs is to use the AI Platform custom containers feature to receive training jobs using any framework. AI Platform is a Google Cloud service that provides a platform for building, deploying, and managing machine learning models. AI Platform supports various machine learning frameworks, such as TensorFlow, PyTorch, scikit-learn, and XGBoost, and provides various features, such as hyperparameter tuning, distributed training, online prediction, and model monitoring.

The AI Platform custom containers feature allows the data scientists to use any framework or library that they want for their training jobs, and package their training application and dependencies as a Docker container image. The data scientists can then submit their training jobs to AI Platform, and specify the container image and the training parameters. AI Platform will run the training jobs on the cloud infrastructure, and handle the scaling, logging, and monitoring of the training jobs. The data scientists can also use the AI Platform features to optimize, deploy, and manage their models.
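
As a hedged sketch of what this can look like in practice (the project ID, container image, and arguments below are hypothetical), a training job that uses a custom container can be submitted to AI Platform Training from Python via the discovery-based API client:

```python
from googleapiclient import discovery

PROJECT_ID = "my-project"                                 # hypothetical project
IMAGE_URI = "gcr.io/my-project/pytorch-trainer:latest"    # container built by the data scientist

ml = discovery.build("ml", "v1")
job = {
    "jobId": "custom_container_training_001",
    "trainingInput": {
        "region": "us-central1",
        "scaleTier": "CUSTOM",
        "masterType": "n1-standard-8",
        # The custom container holds the framework of choice (PyTorch, Theano, scikit-learn, ...).
        "masterConfig": {"imageUri": IMAGE_URI},
        "args": ["--epochs", "10", "--batch-size", "64"],
    },
}

request = ml.projects().jobs().create(parent=f"projects/{PROJECT_ID}", body=job)
response = request.execute()
print(response)
```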

The other options are not as suitable or feasible. Configuring Kubeflow to run on Google Kubernetes Engine and receiving training jobs through TFJob is not ideal, as TFJob is specific to TensorFlow training jobs and would not cover the other frameworks and libraries the team uses, and running Kubeflow on GKE still leaves you administering the cluster. Creating a library of VM images on Compute Engine and publishing these images on a centralized repository is not optimal, as Compute Engine is a low-level service that requires a lot of administration and management, and does not provide the features and integrations of AI Platform. Setting up Slurm workload manager to receive jobs that can be scheduled to run on your cloud infrastructure is not relevant, as Slurm is a tool for managing and scheduling jobs on a cluster of nodes, and does not provide a managed service for training jobs.

You are developing a Kubeflow pipeline on Google Kubernetes Engine. The first step in the pipeline is to issue a query against BigQuery. You plan to use the results of that query as the input to the next step in your pipeline. You want to achieve this in the easiest way possible. What should you do?

A. Use the BigQuery console to execute your query and then save the query results into a new BigQuery table.
B. Write a Python script that uses the BigQuery API to execute queries against BigQuery. Execute this script as the first step in your Kubeflow pipeline.
C. Use the Kubeflow Pipelines domain-specific language to create a custom component that uses the Python BigQuery client library to execute queries.
D. Locate the Kubeflow Pipelines repository on GitHub, find the BigQuery Query Component, copy that component's URL, and use it to load the component into your pipeline. Use the component to execute queries against BigQuery.
Suggested answer: D

Explanation:

Kubeflow is an open source platform for developing, orchestrating, deploying, and running scalable and portable machine learning workflows on Kubernetes. Kubeflow Pipelines is a component of Kubeflow that allows you to build and manage end-to-end machine learning pipelines using a graphical user interface or a Python-based domain-specific language (DSL). Kubeflow Pipelines can help you automate and orchestrate your machine learning workflows, and integrate with various Google Cloud services and tools.

One of the Google Cloud services that you can use with Kubeflow Pipelines is BigQuery, which is a serverless, scalable, and cost-effective data warehouse that allows you to run fast and complex queries on large-scale data. BigQuery can help you analyze and prepare your data for machine learning, and store and manage your machine learning models.

To execute a query against BigQuery as the first step in your Kubeflow pipeline, and use the results of that query as the input to the next step in your pipeline, the easiest way to do that is to use the BigQuery Query Component, which is a pre-built component that you can find in the Kubeflow Pipelines repository on GitHub. The BigQuery Query Component allows you to run a SQL query on BigQuery, and output the results as a table or a file. You can use the component's URL to load the component into your pipeline, and specify the query and the output parameters. You can then use the output of the component as the input to the next step in your pipeline, such as a data processing or a model training step.
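
A minimal sketch of how this can look with the Kubeflow Pipelines SDK is shown below; the component URL placeholder, project ID, query, and parameter names are illustrative and should be checked against the component.yaml of the release you actually load.

```python
import kfp
from kfp import components, dsl

# Placeholder URL: point this at the BigQuery Query component.yaml in the
# kubeflow/pipelines GitHub repository for the release you are using.
BQ_COMPONENT_URL = "https://raw.githubusercontent.com/kubeflow/pipelines/<release>/components/gcp/bigquery/query/component.yaml"
bigquery_query_op = components.load_component_from_url(BQ_COMPONENT_URL)

@dsl.pipeline(name="bq-first-step", description="Query BigQuery, then consume the result downstream.")
def pipeline(project_id: str = "my-project"):
    # Step 1: run the SQL query; the component writes the result to an output location.
    # Parameter names vary by component version -- check the component.yaml you loaded.
    query_task = bigquery_query_op(
        query="SELECT * FROM `my-project.my_dataset.training_data`",
        project_id=project_id,
        output_gcs_path="gs://my-bucket/bq-output/data.csv",
    )
    # Step 2 (hypothetical downstream component) would consume the query output, e.g.:
    # train_task = train_op(input_path=query_task.outputs["output_gcs_path"])

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(pipeline, "bq_pipeline.yaml")
```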

The other options are not as easy or feasible. Using the BigQuery console to execute your query and then save the query results into a new BigQuery table is not a good idea, as it does not integrate with your Kubeflow pipeline, and requires manual intervention and duplication of data. Writing a Python script that uses the BigQuery API to execute queries against BigQuery is not ideal, as it requires writing custom code and handling authentication and error handling. Using the Kubeflow Pipelines DSL to create a custom component that uses the Python BigQuery client library to execute queries is not optimal, as it requires creating and packaging a Docker container image for the component, and testing and debugging the component.

You are developing ML models with AI Platform for image segmentation on CT scans. You frequently update your model architectures based on the newest available research papers, and have to rerun training on the same dataset to benchmark their performance. You want to minimize computation costs and manual intervention while having version control for your code. What should you do?

A. Use Cloud Functions to identify changes to your code in Cloud Storage and trigger a retraining job.
B. Use the gcloud command-line tool to submit training jobs on AI Platform when you update your code.
C. Use Cloud Build linked with Cloud Source Repositories to trigger retraining when new code is pushed to the repository.
D. Create an automated workflow in Cloud Composer that runs daily and looks for changes in code in Cloud Storage using a sensor.
Suggested answer: C

Explanation:

Developing ML models with AI Platform for image segmentation on CT scans requires a lot of computation and experimentation, as image segmentation is a complex and challenging task that involves assigning a label to each pixel in an image. Image segmentation can be used for various medical applications, such as tumor detection, organ segmentation, or lesion localization.

To minimize the computation costs and manual intervention while having version control for the code, one should use Cloud Build linked with Cloud Source Repositories to trigger retraining when new code is pushed to the repository. Cloud Build is a service that executes your builds on Google Cloud Platform infrastructure. Cloud Build can import source code from Cloud Source Repositories, Cloud Storage, GitHub, or Bitbucket, execute a build to your specifications, and produce artifacts such as Docker containers or Java archives.

Cloud Build allows you to set up automated triggers that start a build when changes are pushed to a source code repository. You can configure triggers to filter the changes based on the branch, tag, or file path.

Cloud Source Repositories is a service that provides fully managed private Git repositories on Google Cloud Platform. Cloud Source Repositories allows you to store, manage, and track your code using the Git version control system. You can also use Cloud Source Repositories to connect to other Google Cloud services, such as Cloud Build, Cloud Functions, or Cloud Run.

To use Cloud Build linked with Cloud Source Repositories to trigger retraining when new code is pushed to the repository, you need to do the following steps:

Create a Cloud Source Repository for your code, and push your code to the repository. You can use the Cloud SDK, Cloud Console, or Cloud Source Repositories API to create and manage your repository.

Create a Cloud Build trigger for your repository, and specify the build configuration and the trigger settings. You can use the Cloud SDK, Cloud Console, or Cloud Build API to create and manage your trigger.

Specify the steps of the build in a YAML or JSON file, such as installing the dependencies, running the tests, building the container image, and submitting the training job to AI Platform; a sketch of such a configuration follows these steps. You can also use the Cloud Build predefined or custom build steps to simplify your build configuration.

Push your new code to the repository, and the trigger will start the build automatically. You can monitor the status and logs of the build using the Cloud SDK, Cloud Console, or Cloud Build API.
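
For illustration only, the build configuration mentioned in step 3 could look roughly like the following, shown here as the Python/JSON equivalent of a cloudbuild.yaml file; the images, staging bucket, package path, and job naming are hypothetical placeholders.

```python
# Hypothetical Cloud Build configuration, equivalent to a cloudbuild.yaml file.
# Each step runs in a builder container; the last step submits an AI Platform training job.
build_config = {
    "steps": [
        {   # Install dependencies and run unit tests before training.
            "name": "python:3.7",
            "entrypoint": "bash",
            "args": ["-c", "pip install -r requirements.txt && pytest tests/"],
        },
        {   # Submit the retraining job with the newly pushed code.
            "name": "gcr.io/cloud-builders/gcloud",
            "args": [
                "ai-platform", "jobs", "submit", "training",
                "segmentation_train_$SHORT_SHA",   # $SHORT_SHA is populated for triggered builds
                "--region", "us-central1",
                "--package-path", "trainer",
                "--module-name", "trainer.task",
                "--staging-bucket", "gs://my-staging-bucket",
            ],
        },
    ],
    "timeout": "3600s",
}
```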

The other options are not as easy or feasible. Using Cloud Functions to identify changes to your code in Cloud Storage and trigger a retraining job is not ideal, as Cloud Functions has limitations on the memory, CPU, and execution time, and does not provide a user interface for managing and tracking your builds. Using the gcloud command-line tool to submit training jobs on AI Platform when you update your code is not optimal, as it requires manual intervention and does not leverage the benefits of Cloud Build and its integration with Cloud Source Repositories. Creating an automated workflow in Cloud Composer that runs daily and looks for changes in code in Cloud Storage using a sensor is not relevant, as Cloud Composer is mainly designed for orchestrating complex workflows across multiple systems, and does not provide a version control system for your code.

Your organization's call center has asked you to develop a model that analyzes customer sentiments in each call. The call center receives over one million calls daily, and data is stored in Cloud Storage. The data collected must not leave the region in which the call originated, and no Personally Identifiable Information (PII) can be stored or analyzed. The data science team has a third-party tool for visualization and access which requires a SQL ANSI-2011 compliant interface. You need to select components for data processing and for analytics. How should the data pipeline be designed?

A. 1 = Dataflow, 2 = BigQuery
B. 1 = Pub/Sub, 2 = Datastore
C. 1 = Dataflow, 2 = Cloud SQL
D. 1 = Cloud Function, 2 = Cloud SQL
Suggested answer: A

Explanation:

A data pipeline is a set of steps or processes that move data from one or more sources to one or more destinations, usually for the purpose of analysis, transformation, or storage. A data pipeline can be designed using various components, such as data sources, data processing tools, data storage systems, and data analytics tools.

To design a data pipeline for analyzing customer sentiments in each call, one should consider the following requirements and constraints:

The call center receives over one million calls daily, and data is stored in Cloud Storage. This implies that the data is large, unstructured, and distributed, and requires a scalable and efficient data processing tool that can handle various types of data formats, such as audio, text, or image.

The data collected must not leave the region in which the call originated, and no Personally Identifiable Information (PII) can be stored or analyzed. This implies that the data is sensitive and subject to data privacy and compliance regulations, and requires a secure and reliable data storage system that can enforce data encryption, access control, and regional policies.

The data science team has a third-party tool for visualization and access which requires a SQL ANSI-2011 compliant interface. This implies that the data analytics tool is external and independent of the data pipeline, and requires a standard and compatible data interface that can support SQL queries and operations.

One of the best options for selecting components for data processing and for analytics is to use Dataflow for data processing and BigQuery for analytics. Dataflow is a fully managed service for executing Apache Beam pipelines for data processing, such as batch or stream processing, extract-transform-load (ETL), or data integration. BigQuery is a serverless, scalable, and cost-effective data warehouse that allows you to run fast and complex queries on large-scale data.

Using Dataflow and BigQuery has several advantages for this use case:

Dataflow can process large and unstructured data from Cloud Storage in a parallel and distributed manner, and apply various transformations, such as converting audio to text, extracting sentiment scores, or anonymizing PII. Dataflow can also handle both batch and stream processing, which can enable real-time or near-real-time analysis of the call data.

BigQuery can store and analyze the processed data from Dataflow in a secure and reliable way, and enforce data encryption, access control, and regional policies. BigQuery can also support SQL ANSI-2011 compliant interface, which can enable the data science team to use their third-party tool for visualization and access. BigQuery can also integrate with various Google Cloud services and tools, such as AI Platform, Data Studio, or Looker.

Dataflow and BigQuery can work seamlessly together, as they are both part of the Google Cloud ecosystem, and support various data formats, such as CSV, JSON, Avro, or Parquet. Dataflow and BigQuery can also leverage the benefits of Google Cloud infrastructure, such as scalability, performance, and cost-effectiveness.
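
As a rough, non-authoritative sketch of such a pipeline (the bucket, table, schema, and the two transform helpers are hypothetical placeholders), an Apache Beam job run on Dataflow could read call transcripts from Cloud Storage, redact PII, score sentiment, and write the results to BigQuery:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def redact_pii(record):
    # Placeholder: in practice this step could call Cloud DLP to strip PII before storage.
    record["transcript"] = record["transcript"].replace("<PII>", "[REDACTED]")
    return record

def score_sentiment(record):
    # Placeholder: in practice this could call the Natural Language API or a custom model.
    record["sentiment"] = 0.0
    return record

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",               # keep processing in the region the calls originated
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadTranscripts" >> beam.io.ReadFromText("gs://my-bucket/transcripts/*.jsonl")
     | "ParseJson" >> beam.Map(json.loads)
     | "RedactPII" >> beam.Map(redact_pii)
     | "ScoreSentiment" >> beam.Map(score_sentiment)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:call_center.sentiments",
           schema="call_id:STRING,sentiment:FLOAT,transcript:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```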

The other options are not as suitable or feasible. Using Pub/Sub for data processing and Datastore for analytics is not ideal, as Pub/Sub is mainly designed for event-driven and asynchronous messaging, not data processing, and Datastore is mainly designed for low-latency and high-throughput key-value operations, not analytics. Using Cloud Functions for data processing and Cloud SQL for analytics is not optimal, as Cloud Functions has limitations on memory, CPU, and execution time and does not support complex data processing, and Cloud SQL is a relational database service that may not scale well for analytics on this volume of call data. Pairing Dataflow with Cloud SQL has the same problem on the analytics side: Cloud SQL is not designed for large-scale analytical workloads, whereas BigQuery provides the required SQL ANSI-2011 compliant interface at scale.

You work for an online retail company that is creating a visual search engine. You have set up an end-to-end ML pipeline on Google Cloud to classify whether an image contains your company's product. Expecting the release of new products in the near future, you configured a retraining functionality in the pipeline so that new data can be fed into your ML models. You also want to use AI Platform's continuous evaluation service to ensure that the models have high accuracy on your test data set. What should you do?

A. Keep the original test dataset unchanged even if newer products are incorporated into retraining.
B. Extend your test dataset with images of the newer products when they are introduced to retraining.
C. Replace your test dataset with images of the newer products when they are introduced to retraining.
D. Update your test dataset with images of the newer products when your evaluation metrics drop below a pre-decided threshold.
Suggested answer: B

Explanation:

The test dataset is used to evaluate the performance of the ML model on unseen data. It should reflect the distribution of the data that the model will encounter in production. Therefore, if the retraining data includes new products, the test dataset should also be extended with images of those products to ensure that the model can generalize well to them. Keeping the original test dataset unchanged or replacing it entirely with images of the new products would not capture the diversity of the data that the model needs to handle. Updating the test dataset only when the evaluation metrics drop below a threshold would be reactive rather than proactive, and might result in poor user experience if the model fails to recognize the new products. Reference:

Continuous evaluation documentation

Preparing and using test sets

You are responsible for building a unified analytics environment across a variety of on-premises data marts. Your company is experiencing data quality and security challenges when integrating data across the servers, caused by the use of a wide range of disconnected tools and temporary solutions. You need a fully managed, cloud-native data integration service that will lower the total cost of work and reduce repetitive work. Some members on your team prefer a codeless interface for building Extract, Transform, Load (ETL) processes. Which service should you use?

A. Dataflow
B. Dataprep
C. Apache Flink
D. Cloud Data Fusion
Suggested answer: D

Explanation:

Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. It provides a graphical interface to increase time efficiency and reduce complexity, and allows users to easily create and explore data pipelines using a code-free, point-and-click visual interface. Cloud Data Fusion also supports a broad range of data sources and formats, including on-premises data marts, and ensures data quality and security by using built-in transformation capabilities and Cloud Data Loss Prevention. Cloud Data Fusion lowers the total cost of ownership by handling performance, scalability, availability, security, and compliance needs automatically. Reference:

Cloud Data Fusion documentation

Cloud Data Fusion overview

You want to rebuild your ML pipeline for structured data on Google Cloud. You are using PySpark to conduct data transformations at scale, but your pipelines are taking over 12 hours to run. To speed up development and pipeline run time, you want to use a serverless tool and SQL syntax. You have already moved your raw data into Cloud Storage. How should you build the pipeline on Google Cloud while meeting the speed and processing requirements?

A. Use Data Fusion's GUI to build the transformation pipelines, and then write the data into BigQuery.
B. Convert your PySpark code into SparkSQL queries to transform the data, and then run your pipeline on Dataproc to write the data into BigQuery.
C. Ingest your data into Cloud SQL, convert your PySpark commands into SQL queries to transform the data, and then use federated queries from BigQuery for machine learning.
D. Ingest your data into BigQuery using BigQuery Load, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.
Suggested answer: D

Explanation:

BigQuery is a serverless, scalable, and cost-effective data warehouse that allows users to run SQL queries on large volumes of data. BigQuery Load is a tool that can ingest data from Cloud Storage into BigQuery tables. BigQuery SQL is a dialect of SQL that supports many of the same functions and operations as PySpark, such as window functions, aggregate functions, joins, and subqueries. By using BigQuery Load and BigQuery SQL, you can rebuild your ML pipeline for structured data on Google Cloud without having to manage any servers or clusters, and with faster performance and lower cost than using PySpark on Dataproc. You can also use BigQuery ML to create and evaluate ML models using SQL commands. Reference:

BigQuery documentation

BigQuery Load documentation

BigQuery SQL reference

BigQuery ML documentation
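
To make the suggested approach for this question concrete, here is a hedged sketch using the BigQuery Python client; the project, bucket, table names, and columns are hypothetical, and the SQL stands in for whatever transformations the PySpark job performed.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# 1. Ingest the raw CSV files from Cloud Storage into a BigQuery table (BigQuery Load).
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/*.csv",
    "my-project.analytics.raw_events",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to finish

# 2. Re-express the PySpark transformations as a SQL query and write the result to a new table.
transform_sql = """
SELECT
  user_id,
  DATE(event_timestamp) AS event_date,
  COUNT(*) AS events,
  AVG(purchase_value) AS avg_purchase_value
FROM `my-project.analytics.raw_events`
GROUP BY user_id, event_date
"""
query_job = client.query(
    transform_sql,
    job_config=bigquery.QueryJobConfig(
        destination="my-project.analytics.features",
        write_disposition="WRITE_TRUNCATE",
    ),
)
query_job.result()
```

A BigQuery ML model could then be trained directly on the resulting table with a CREATE MODEL statement, keeping the whole pipeline serverless and SQL-based.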
