Google Professional Data Engineer Practice Test - Questions Answers, Page 19

You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other.

You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?

A. cron
B. Cloud Composer
C. Cloud Scheduler
D. Workflow Templates on Cloud Dataproc

Suggested answer: B
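
For context, a Cloud Composer (Apache Airflow) DAG along these lines could express the daily schedule and the dependency between the Dataproc and Dataflow steps. This is only a minimal sketch: the project, region, bucket, jar, and template paths are placeholders, and a real pipeline would use whichever operators match its jobs.

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="daily_pipeline",
    schedule_interval="@daily",      # the pipeline runs every day
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    dataproc_transform = DataprocSubmitJobOperator(
        task_id="dataproc_transform",
        project_id="my-project",     # placeholder project
        region="us-central1",
        job={
            "placement": {"cluster_name": "etl-cluster"},
            "spark_job": {
                "main_class": "com.example.Transform",           # placeholder job
                "jar_file_uris": ["gs://my-bucket/transform.jar"],
            },
        },
    )
    dataflow_load = DataflowTemplatedJobStartOperator(
        task_id="dataflow_load",
        project_id="my-project",
        location="us-central1",
        template="gs://my-bucket/templates/load_to_bq",          # placeholder template
    )
    dataproc_transform >> dataflow_load   # Dataflow runs only after the Dataproc job succeeds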

You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work in progress on your clusters. What should you do?

A. Increase the cluster size with more non-preemptible workers.
B. Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
C. Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a script to preserve work.
D. Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.

Suggested answer: D

Explanation:

Reference https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/flex
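
As an illustration only, resizing the secondary (preemptible) worker pool with a graceful-decommission timeout might look like the sketch below using the google-cloud-dataproc client; the project, region, cluster name, worker count, and timeout are all placeholders. The timeout matters when workers are removed: Dataproc waits for in-flight work to finish before decommissioning a node.

from google.cloud import dataproc_v1

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

operation = client.update_cluster(
    request={
        "project_id": "my-project",
        "region": "us-central1",
        "cluster_name": "etl-cluster",
        # Resize the preemptible (secondary) worker group.
        "cluster": {"config": {"secondary_worker_config": {"num_instances": 10}}},
        "update_mask": {"paths": ["config.secondary_worker_config.num_instances"]},
        # Wait up to an hour for running tasks before decommissioning nodes.
        "graceful_decommission_timeout": {"seconds": 3600},
    }
)
operation.result()  # block until the resize operation finishes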

You work for a shipping company that uses handheld scanners to read shipping labels. Your company has strict data privacy standards that prohibit scanners from transmitting recipients' personally identifiable information (PII) to analytics systems, as doing so would violate user privacy rules. You want to quickly build a scalable solution using cloud-native managed services to prevent exposure of PII to the analytics systems. What should you do?

A. Create an authorized view in BigQuery to restrict access to tables with sensitive data.
B. Install a third-party data validation tool on Compute Engine virtual machines to check the incoming data for sensitive information.
C. Use Stackdriver logging to analyze the data passed through the total pipeline to identify transactions that may contain sensitive information.
D. Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.

Suggested answer: D
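
A rough sketch of such a Cloud Function (Pub/Sub-triggered, 1st-gen signature) is shown below; the project ID, bucket names, and info types are placeholders, and a production version would tune likelihood thresholds and probably de-identify rather than just route the data.

import base64
from google.cloud import dlp_v2, storage

dlp = dlp_v2.DlpServiceClient()
gcs = storage.Client()
PARENT = "projects/my-project"                     # placeholder project

def scan_message(event, context):
    payload = base64.b64decode(event["data"]).decode("utf-8")
    response = dlp.inspect_content(
        request={
            "parent": PARENT,
            "inspect_config": {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
                "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
            },
            "item": {"value": payload},
        }
    )
    # Quarantine messages with likely PII for review; pass the rest downstream.
    bucket = "quarantine-bucket" if response.result.findings else "clean-bucket"
    gcs.bucket(bucket).blob(context.event_id).upload_from_string(payload)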

You have developed three data processing jobs. One executes a Cloud Dataflow pipeline that transforms data uploaded to Cloud Storage and writes results to BigQuery. The second ingests data from on-premises servers and uploads it to Cloud Storage. The third is a Cloud Dataflow pipeline that gets information from third-party data providers and uploads the information to Cloud Storage. You need to be able to schedule and monitor the execution of these three workflows and manually execute them when needed. What should you do?

A. Create a Directed Acyclic Graph in Cloud Composer to schedule and monitor the jobs.
B. Use Stackdriver Monitoring and set up an alert with a Webhook notification to trigger the jobs.
C. Develop an App Engine application to schedule and request the status of the jobs using GCP API calls.
D. Set up cron jobs in a Compute Engine instance to schedule and monitor the pipelines using GCP API calls.

Suggested answer: A
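
A skeleton of the kind of DAG this implies is sketched below; the BashOperator tasks are stand-ins for the real Dataflow and transfer operators, and all IDs are placeholders. Composer records each run's status in the Airflow UI, and any DAG can also be triggered manually on demand.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="three_job_workflows",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    ingest_onprem = BashOperator(
        task_id="ingest_onprem_to_gcs",
        bash_command="echo 'upload on-premises data to Cloud Storage'",
    )
    ingest_thirdparty = BashOperator(
        task_id="ingest_thirdparty_to_gcs",
        bash_command="echo 'run Dataflow third-party ingest'",
    )
    transform_to_bq = BashOperator(
        task_id="dataflow_gcs_to_bigquery",
        bash_command="echo 'run Dataflow GCS-to-BigQuery transform'",
    )
    # The transform depends on both ingest workflows.
    [ingest_onprem, ingest_thirdparty] >> transform_to_bq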

You have Cloud Functions written in Node.js that pull messages from Cloud Pub/Sub and send the data to BigQuery. You observe that the message processing rate on the Pub/Sub topic is orders of magnitude higher than anticipated, but there is no error logged in Stackdriver Log Viewer. What are the two most likely causes of this problem? Choose 2 answers.

A. Publisher throughput quota is too small.
B. Total outstanding messages exceed the 10-MB maximum.
C. Error handling in the subscriber code is not handling run-time errors properly.
D. The subscriber code cannot keep up with the messages.
E. The subscriber code does not acknowledge the messages that it pulls.

Suggested answer: C, E

Explanation:

If run-time errors are swallowed by faulty error handling, or messages are simply never acknowledged, Cloud Pub/Sub redelivers them; repeated redelivery inflates the observed processing rate while nothing appears in the logs. A subscriber that merely cannot keep up would produce a backlog, not a higher processing rate.
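
To make the acknowledgement point concrete, a streaming-pull subscriber along these lines acks only after successful processing; process() and the project/subscription IDs are placeholders. Anything that is nacked, or never acked, is redelivered.

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message):
    try:
        process(message.data)   # placeholder for the real processing / BigQuery insert
        message.ack()           # acknowledge only after processing succeeds
    except Exception:
        # Unacked or nacked messages are redelivered by Pub/Sub, which is
        # exactly what inflates the observed processing rate.
        message.nack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull.result()  # block and keep pulling messages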

You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?

A. Add a SideInput that returns a Boolean if the element is corrupt.
B. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
C. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
D. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.

Suggested answer: B
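
A minimal sketch of that ParDo in the Python Beam SDK follows; is_corrupt() stands in for whatever validation the records actually need, and the topic and table names are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class DropCorrupt(beam.DoFn):
    def process(self, element):
        if not is_corrupt(element):   # placeholder validation function
            yield element             # corrupt elements are simply not emitted

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/iot")
     | "DropCorrupt" >> beam.ParDo(DropCorrupt())
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery("my-project:iot.readings"))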

You have historical data covering the last three years in BigQuery and a data pipeline that delivers new data to BigQuery daily. You have noticed that when the Data Science team runs a query filtered on a date column and limited to 30–90 days of data, the query scans the entire table. You also noticed that your bill is increasing more quickly than you expected. You want to resolve the issue as cost-effectively as possible while maintaining the ability to conduct SQL queries.

What should you do?

A. Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE type.
B. Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.
C. Modify your pipeline to maintain the last 30–90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.
D. Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.

Suggested answer: A
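
One way to apply the partitioning fix, sketched with the BigQuery Python client and placeholder dataset, table, and column names:

from google.cloud import bigquery

client = bigquery.Client()
ddl = """
CREATE TABLE mydataset.events_partitioned
PARTITION BY DATE(event_timestamp)        -- TIMESTAMP column used for partitioning
AS SELECT * FROM mydataset.events         -- copy the existing three years of history
"""
client.query(ddl).result()

# A query filtered on event_timestamp to the last 30-90 days now prunes
# partitions instead of scanning the full table, which is what reduces cost.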

You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?

A. Deploy small Kafka clusters in your data centers to buffer events.
B. Have the data acquisition devices publish data to Cloud Pub/Sub.
C. Establish a Cloud Interconnect between all remote data centers and Google.
D. Write a Cloud Dataflow pipeline that aggregates all data in session windows.

Suggested answer: B
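
For illustration, a device- or collector-side publisher could look like the sketch below; the project, topic, payload, and attribute are placeholders. The client library batches publishes and retries transient failures, which helps absorb flaky connectivity.

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "vehicle-events")

future = publisher.publish(
    topic_path,
    b'{"vehicle_id": "v-123", "event": "door_open"}',   # placeholder payload (bytes)
    source="scanner-eu-1",                              # optional message attribute
)
print(future.result())  # message ID once Pub/Sub acknowledges the publish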

You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems. Which solutions should you choose?

A. Cloud Speech-to-Text API
B. Cloud Natural Language API
C. Dialogflow Enterprise Edition
D. Cloud AutoML Natural Language

Suggested answer: C
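
A minimal sketch of passing a recognized utterance to Dialogflow to get the order intent and its parameters; the project, session ID, and utterance are placeholders (Dialogflow Enterprise Edition has since been folded into Dialogflow ES, but the detect-intent flow is the same idea).

from google.cloud import dialogflow_v2 as dialogflow

sessions = dialogflow.SessionsClient()
session = sessions.session_path("my-project", "session-123")   # placeholder session

response = sessions.detect_intent(
    request={
        "session": session,
        "query_input": {
            "text": {"text": "order two large pizzas", "language_code": "en-US"},
        },
    }
)
print(response.query_result.intent.display_name)   # matched intent, e.g. an order intent
print(response.query_result.parameters)            # extracted entities to send to the backend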

Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?

A. Cloud Dataflow
B. Cloud Composer
C. Cloud Dataprep
D. Cloud Dataproc

Suggested answer: B

Explanation:

Cloud Composer runs Apache Airflow, whose operators can call services across cloud providers, so a single DAG can orchestrate the whole hybrid pipeline; Cloud Dataflow, Dataprep, and Dataproc are processing services, not orchestrators.
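
As a sketch of the cross-cloud orchestration, an Airflow DAG in Composer can mix provider packages, for example an Amazon-provider transfer task followed by a Google operator; the bucket names, project, and table are placeholders, and a real pipeline would use whichever operators match its services.

from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_gcs import S3ToGCSOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="hybrid_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    copy_from_s3 = S3ToGCSOperator(
        task_id="copy_s3_to_gcs",
        bucket="partner-s3-bucket",               # placeholder S3 bucket (other cloud)
        dest_gcs="gs://landing-bucket/daily/",    # placeholder GCS destination
    )
    load_to_bq = BigQueryInsertJobOperator(
        task_id="load_to_bigquery",
        configuration={
            "load": {
                "sourceUris": ["gs://landing-bucket/daily/*"],
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "analytics",
                    "tableId": "daily_events",
                },
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
            }
        },
    )
    copy_from_s3 >> load_to_bq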