Google Professional Data Engineer Practice Test - Questions Answers, Page 23

You receive data files in CSV format monthly from a third party. You need to cleanse this data, but every third month the schema of the files changes. Your requirements for implementing these transformations include:

Executing the transformations on a schedule

Enabling non-developer analysts to modify transformations

Providing a graphical tool for designing transformations

What should you do?

A. Use Cloud Dataprep to build and maintain the transformation recipes, and execute them on a scheduled basis
B. Load each month's CSV data into BigQuery, and write a SQL query to transform the data to a standard schema. Merge the transformed tables together with a SQL query
C. Help the analysts write a Cloud Dataflow pipeline in Python to perform the transformation. The Python code should be stored in a revision control system and modified as the incoming data's schema changes
D. Use Apache Spark on Cloud Dataproc to infer the schema of the CSV file before creating a Dataframe. Then implement the transformations in Spark SQL before writing the data out to Cloud Storage and loading into BigQuery
Suggested answer: A

Explanation:

Cloud Dataprep lets you define a target schema that captures the names of columns, the order of columns, column data types, data type formats, and example rows of data. A dataset associated with a target is expected to conform to the requirements of that schema; where there are differences between the target schema and the dataset schema, a validation indicator (or schema tag) is displayed.

https://cloud.google.com/dataprep/docs/html/Overview-of-RapidTarget_136155049
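
For contrast, option B would trade the graphical tool for SQL and code. A minimal sketch of that approach, assuming the google-cloud-bigquery client library and placeholder bucket, dataset, and table names:

```python
# Hedged sketch of option B's first step: load a monthly CSV into BigQuery
# with schema autodetection, then transform with SQL afterwards.
# Bucket, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,       # assume the files carry a header row
    autodetect=True,           # tolerate the schema changing every third month
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/monthly/2024-06.csv",   # hypothetical file path
    "example_dataset.raw_monthly",               # hypothetical staging table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish

# A follow-up SQL query would map the detected columns onto the standard schema.
```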

You are migrating your on-premises Apache Hadoop deployment to Cloud Dataproc. Hive is the primary tool in use, and the data format is Optimized Row Columnar (ORC). All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster's local Hadoop Distributed File System (HDFS) to maximize performance. What are two ways to start using Hive in Cloud Dataproc? (Choose two.)

A. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally.
B. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.
C. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.
D. Leverage the Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate the external Hive tables to native ones.
E. Load the ORC files into BigQuery. Leverage the BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate the external Hive tables to native ones.
Suggested answer: B, C
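
The copy step in option C can also be expressed programmatically. A hedged sketch, assuming the google-cloud-dataproc client library and placeholder project, cluster, and path names, that submits a Hadoop DistCp job to move the ORC files from Cloud Storage into the cluster's HDFS:

```python
# Hedged sketch: submit a DistCp job to copy ORC files from Cloud Storage
# into HDFS (the copy step in option C). Project, region, cluster, and
# paths are placeholders.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "example-cluster"},
    "hadoop_job": {
        "main_class": "org.apache.hadoop.tools.DistCp",
        "args": ["gs://example-bucket/orc/", "hdfs:///warehouse/orc/"],
    },
}

operation = client.submit_job_as_operation(
    request={"project_id": "example-project", "region": region, "job": job}
)
result = operation.result()  # block until the copy completes
print(result.status.state)
```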

You are working on a linear regression model in BigQuery ML to predict a customer's likelihood of purchasing your company's products. Your model uses a city name variable as a key predictive component. In order to train and serve the model, your data must be organized in columns. You want to prepare your data using the least amount of coding while maintaining the predictive variables.

What should you do?

A. Use SQL in BigQuery to transform the city column using a one-hot encoding method, and make each city a column with binary values.
B. Create a new view with BigQuery that does not include the column with city information.
C. Use Cloud Data Fusion to assign each city to a region labeled as 1, 2, 3, 4, or 5, and then use that number to represent the city in the model.
D. Use TensorFlow to create a categorical variable with a vocabulary list. Create the vocabulary file, and upload it as part of your model to BigQuery ML.
Suggested answer: C
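
Worth noting: BigQuery ML one-hot encodes non-numeric (STRING) features automatically for linear models, so the city column can typically be passed through without manual encoding. A minimal sketch, assuming placeholder dataset, table, and column names:

```python
# Hedged sketch: train a BigQuery ML linear regression that keeps the city
# name as a STRING feature (BigQuery ML one-hot encodes non-numeric features
# for linear models). Dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

create_model_sql = """
CREATE OR REPLACE MODEL `example_dataset.purchase_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['purchased']) AS
SELECT
  city,            -- categorical feature, encoded automatically
  age,
  total_spend,
  purchased        -- label column
FROM `example_dataset.customer_features`
"""

client.query(create_model_sql).result()  # wait for training to finish
```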

You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times.

Which service should you use to manage the execution of these jobs?

A. Cloud Scheduler
B. Cloud Dataflow
C. Cloud Functions
D. Cloud Composer
Suggested answer: A
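
For reference, option D (Cloud Composer) would express this kind of interdependent, retried workflow as an Airflow DAG. A minimal sketch, assuming the apache-airflow-providers-google package and placeholder task IDs, schedule, and commands:

```python
# Hedged sketch of option D: a Cloud Composer (Airflow) DAG with ordered,
# interdependent steps and a fixed retry count. IDs, schedule, and commands
# are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

default_args = {
    "retries": 3,                           # retry failed steps a fixed number of times
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="nightly_batch",
    schedule_interval="0 2 * * *",          # run on a schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    prepare = BashOperator(task_id="prepare", bash_command="./prepare.sh ")

    load_to_bq = BigQueryInsertJobOperator(
        task_id="load_to_bq",
        configuration={"query": {"query": "SELECT 1", "useLegacySql": False}},
    )

    prepare >> load_to_bq                   # enforce execution order
```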

You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose?

A. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.
B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
C. Use the Cloud Vision API to detect damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.
D. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.
Suggested answer: A
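
Option C's wiring could look like the sketch below: a Cloud Function triggered by new images that calls the Vision API and flags suspect packages via Pub/Sub. The damage-keyword heuristic, project, and topic names are placeholders, since Vision's stock labels are not damage-specific.

```python
# Hedged sketch of option C: a Cloud Function triggered by new package images
# in Cloud Storage that calls the Vision API and flags suspect images for
# human review via Pub/Sub. The "damage" keyword check and names are
# placeholders.
import json

from google.cloud import pubsub_v1, vision

vision_client = vision.ImageAnnotatorClient()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "damaged-package-review")


def inspect_package_image(event, context):
    """Background Cloud Function for google.storage.object.finalize events."""
    image_uri = f"gs://{event['bucket']}/{event['name']}"
    image = vision.Image(source=vision.ImageSource(image_uri=image_uri))

    response = vision_client.label_detection(image=image)
    labels = {label.description.lower() for label in response.label_annotations}

    if labels & {"dent", "tear", "crushed"}:   # hypothetical heuristic
        publisher.publish(topic_path, json.dumps({"image": image_uri}).encode())
```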

You are migrating your data warehouse to BigQuery. You have migrated all of your data into tables in a dataset. Multiple users from your organization will be using the data. They should only see certain tables based on their team membership. How should you set user permissions?

A. Assign the users/groups data viewer access at the table level for each table
B. Create SQL views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the SQL views
C. Create authorized views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the authorized views
D. Create authorized views for each team in datasets created for each team. Assign the authorized views data viewer access to the dataset in which the data resides. Assign the users/groups data viewer access to the datasets in which the authorized views reside
Suggested answer: A
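
Options C and D both rely on authorized views. A hedged sketch of the pattern with the google-cloud-bigquery client library, using placeholder project, dataset, view, and group names: create the team view in its own dataset, grant the team viewer access on that dataset, and authorize the view against the source dataset:

```python
# Hedged sketch of the authorized-view pattern: a team-specific view is
# authorized to read the source dataset, and analysts only get access to the
# view's dataset. Project, dataset, and group names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

# 1. Create the team view in its own dataset.
view = bigquery.Table("example-project.team_sales_views.orders_view")
view.view_query = """
SELECT order_id, amount
FROM `example-project.warehouse.orders`
WHERE region = 'EMEA'
"""
view = client.create_table(view, exists_ok=True)

# 2. Give the team data-viewer access on the view's dataset only.
views_dataset = client.get_dataset("example-project.team_sales_views")
entries = list(views_dataset.access_entries)
entries.append(bigquery.AccessEntry("READER", "groupByEmail", "sales-team@example.com"))
views_dataset.access_entries = entries
client.update_dataset(views_dataset, ["access_entries"])

# 3. Authorize the view against the source dataset so it can read the tables.
source_dataset = client.get_dataset("example-project.warehouse")
entries = list(source_dataset.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])
```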

You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To separate storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?

A. Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory
B. Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS
C. Allocate more CPU cores to the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up
D. Allocate additional network interface cards (NICs), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage
Suggested answer: A
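
Option B amounts to provisioning workers with enough fast local storage that the job's intermediate data can stay on native HDFS. A hedged sketch, assuming the google-cloud-dataproc client library; machine types, disk sizes, and names are placeholders:

```python
# Hedged sketch of option B: create a Dataproc cluster whose workers have
# large, fast local storage so shuffle/intermediate data for the I/O-heavy
# job can live on local HDFS. All values are placeholders.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "example-project",
    "cluster_name": "io-heavy-cluster",
    "config": {
        "worker_config": {
            "num_instances": 4,
            "machine_type_uri": "n2-highmem-8",
            "disk_config": {
                "boot_disk_type": "pd-ssd",
                "boot_disk_size_gb": 1000,
                "num_local_ssds": 2,
            },
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "example-project", "region": region, "cluster": cluster}
)
operation.result()  # wait until the cluster is ready
```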

You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data warehouse will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate the existing training pipelines to Google Cloud. What should you do?

A. Use Cloud ML Engine for training existing Spark ML models
B. Rewrite your models in TensorFlow, and start using Cloud ML Engine
C. Use Cloud Dataproc for training existing Spark ML models, but start reading data directly from BigQuery
D. Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery
Suggested answer: C

Explanation:

https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml
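
The linked tutorial pairs Cloud Dataproc with the spark-bigquery connector. A minimal PySpark sketch of option C, assuming the connector is available on the cluster and using placeholder table and feature names:

```python
# Hedged PySpark sketch of option C: keep the existing Spark ML pipeline on
# Dataproc and read training data directly from BigQuery through the
# spark-bigquery connector (assumed to be installed on the cluster).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("ctr-training").getOrCreate()

# Table and column names are placeholders.
df = (
    spark.read.format("bigquery")
    .option("table", "example-project.ads.click_training")
    .load()
)

features = VectorAssembler(
    inputCols=["impressions", "position", "hour_of_day"], outputCol="features"
)
train_df = features.transform(df).select("features", "ctr")

model = LinearRegression(featuresCol="features", labelCol="ctr").fit(train_df)
```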

You work for a global shipping company. You want to train a model on 40 TB of data to predict which ships in each geographic region are likely to cause delivery delays on any given day. The model will be based on multiple attributes collected from multiple sources. Telemetry data, including location in GeoJSON format, will be pulled from each ship and loaded every hour. You want to have a dashboard that shows how many and which ships are likely to cause delays within a region. You want to use a storage solution that has native functionality for prediction and geospatial processing. Which storage solution should you use?

A. BigQuery
B. Cloud Bigtable
C. Cloud Datastore
D. Cloud SQL for PostgreSQL
Suggested answer: A
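
BigQuery's native geography functions can parse the GeoJSON telemetry directly, which is what makes it a fit for both the dashboard and the prediction model. A hedged sketch of a per-region count, with placeholder project, table, and column names:

```python
# Hedged sketch: use BigQuery's native geospatial functions to parse the
# GeoJSON ship locations and count ships per region for the dashboard.
# Table, column, and region names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  r.region_name,
  COUNT(DISTINCT t.ship_id) AS ships_in_region
FROM `example-project.shipping.telemetry` AS t
JOIN `example-project.shipping.regions` AS r
  ON ST_WITHIN(ST_GEOGFROMGEOJSON(t.location_geojson), r.region_geog)
WHERE t.ingest_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY r.region_name
"""

for row in client.query(sql).result():
    print(row.region_name, row.ships_in_region)
```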

You operate an IoT pipeline built around Apache Kafka that normally receives around 5000 messages per second. You want to use Google Cloud Platform to create an alert as soon as the moving average over 1 hour drops below 4000 messages per second. What should you do?

A. Consume the stream of data in Cloud Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
B. Consume the stream of data in Cloud Dataflow using Kafka IO. Set a fixed time window of 1 hour. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
C. Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to Cloud Bigtable. Use Cloud Scheduler to run a script every hour that counts the number of rows created in Cloud Bigtable in the last hour. If that number falls below 4000, send an alert.
D. Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five minutes that counts the number of rows created in BigQuery in the last hour. If that number falls below 4000, send an alert.
Suggested answer: C
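
The heart of option A is a one-hour window that slides every five minutes. A hedged Apache Beam (Python) sketch, assuming the cross-language Kafka IO and placeholder broker, topic, and alert sink:

```python
# Hedged sketch of option A: read from Kafka in Dataflow/Beam, apply a
# one-hour sliding window every five minutes, count messages per window, and
# emit an alert when the count implies an average below 4000 messages/second.
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

THRESHOLD = 4000 * 3600  # 4000 msg/s averaged over a 1-hour window


def alert_if_low(count):
    if count < THRESHOLD:
        yield f"ALERT: only {count} messages in the last hour"


with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "ReadKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},  # placeholder
            topics=["iot-events"],                                 # placeholder
        )
        | "Window" >> beam.WindowInto(window.SlidingWindows(size=3600, period=300))
        | "Count" >> beam.CombineGlobally(beam.combiners.CountCombineFn()).without_defaults()
        | "CheckThreshold" >> beam.FlatMap(alert_if_low)
        | "EmitAlert" >> beam.Map(print)  # stand-in for a real alerting sink
    )
```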