
Google Professional Data Engineer Practice Test - Questions Answers, Page 17

You are the head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery, with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don't get slots to execute their queries, and you need to correct this. You'd like to avoid introducing new projects to your account.

What should you do?

A. Convert your batch BQ queries into interactive BQ queries.
B. Create an additional project to overcome the 2K on-demand per-project quota.
C. Switch to flat-rate pricing and establish a hierarchical priority model for your projects.
D. Increase the number of concurrent slots per project on the Quotas page in the Cloud Console.
Suggested answer: C

Explanation:

Reference: https://cloud.google.com/blog/products/gcp/busting-12-myths-about-bigquery
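
As a rough illustration of option C, the sketch below uses the BigQuery Reservation API (Python client google-cloud-bigquery-reservation) to carve purchased flat-rate slots into a per-business-unit reservation and assign a project to it. It assumes a capacity commitment has already been bought; the admin project, location, reservation name, slot count, and assignee project are made-up examples.

# Minimal sketch, assuming flat-rate slots have already been purchased in "admin-project".
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = "projects/admin-project/locations/US"  # administration project for reservations

# Reserve 1,000 of the purchased slots for one business unit.
finance = client.create_reservation(
    parent=parent,
    reservation_id="finance-bu",
    reservation=reservation.Reservation(slot_capacity=1000, ignore_idle_slots=False),
)

# Assign an existing project to the reservation; its queries now draw on reserved
# (and idle) slots instead of the 2K on-demand per-project quota.
client.create_assignment(
    parent=finance.name,
    assignment=reservation.Assignment(
        job_type=reservation.Assignment.JobType.QUERY,
        assignee="projects/finance-analytics",  # hypothetical business-unit project
    ),
)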

You have an on-premises Apache Kafka cluster with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring, to avoid deploying Kafka Connect plugins.

What should you do?

A. Deploy a Kafka cluster on GCE VM instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
B. Deploy a Kafka cluster on GCE VM instances with the PubSub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
C. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Source connector. Use a Dataflow job to read from PubSub and write to GCS.
D. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Sink connector. Use a Dataflow job to read from PubSub and write to GCS.
Suggested answer: A
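
To illustrate the last step of option A (reading the mirrored topics and landing them in GCS via the Dataproc path), here is a minimal PySpark Structured Streaming sketch. It assumes the Kafka SQL connector and the GCS connector are available on the Dataproc cluster; the broker address, topic, and bucket names are invented for the example.

# Minimal sketch: consume the topics mirrored to the Kafka cluster on GCE and
# write the raw log lines to Cloud Storage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-mirror-to-gcs").getOrCreate()

logs = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-gce-1:9092")  # mirror cluster on GCE
    .option("subscribe", "web-app-logs")
    .load()
    .selectExpr("CAST(value AS STRING) AS value")  # the text sink expects one string column
)

query = (
    logs.writeStream.format("text")
    .option("path", "gs://example-logs-bucket/kafka/web-app-logs/")
    .option("checkpointLocation", "gs://example-logs-bucket/checkpoints/web-app-logs/")
    .start()
)
query.awaitTermination()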

You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload consisting of many shuffling operations, and the initial data are Parquet files (200-400 MB each on average). You see some performance degradation after the migration to Dataproc, so you'd like to optimize for it. Keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with only 2 non-preemptible workers) for this workload.

What should you do?

A. Increase the size of your Parquet files to ensure they are at least 1 GB each.
B. Switch to the TFRecord format (approx. 200 MB per file) instead of Parquet files.
C. Switch from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
D. Switch from HDDs to SSDs, and override the preemptible VM configuration to increase the boot disk size.
Suggested answer: C

Explanation:

To optimize a complex, shuffle-heavy Spark job on Dataproc under the cost constraints of preemptible VMs, switching from HDDs to SSDs and using HDFS as an intermediate storage layer can significantly improve performance. Here's why option C is the best choice:

Performance of SSDs: SSDs provide much faster read and write speeds than HDDs, which is crucial for performance-intensive operations like shuffling in Spark jobs. Using SSDs reduces I/O bottlenecks during the shuffle phase and improves overall job performance.

Intermediate storage with HDFS: Copying data from Google Cloud Storage (GCS) to HDFS for intermediate storage can reduce latency compared to reading directly from GCS. HDFS provides better locality and faster data access within the Dataproc cluster, which can significantly improve the efficiency of shuffling and other I/O operations.
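
As a rough sketch of the cluster side of option C, the snippet below recreates the Dataproc cluster with SSD boot disks on both the two non-preemptible workers and the preemptible (secondary) workers, using the Dataproc Python client. Project, cluster name, region, machine types, and disk sizes are illustrative assumptions. Once the cluster is up, the input Parquet files would be staged into HDFS (for example with hadoop distcp), the Spark job run against HDFS, and the results copied back to GCS.

# Minimal sketch, assuming the google-cloud-dataproc client library.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

ssd = {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500}
cluster = {
    "project_id": "example-project",
    "cluster_name": "spark-etl-ssd",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-8", "disk_config": ssd},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-8", "disk_config": ssd},
        # Preemptible (secondary) workers keep costs down; give them SSDs as well.
        "secondary_worker_config": {"num_instances": 8, "disk_config": ssd},
    },
}

operation = client.create_cluster(
    request={"project_id": "example-project", "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is ready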

Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve the reliability of the pipeline (including being able to reprocess all failing data).

What should you do?

A. Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.
B. Add a try/catch block to your DoFn that transforms the data, extract erroneous rows from logs.
C. Add a try/catch block to your DoFn that transforms the data, write erroneous rows to PubSub directly from the DoFn.
D. Add a try/catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to PubSub later.
Suggested answer: C
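
To make option C concrete, here is a minimal Beam (Python) sketch of a DoFn that wraps its transformation in try/except and publishes rows it cannot process to a dead-letter Pub/Sub topic directly from the DoFn, so they can be replayed later. The topic path, field names, and the JSON parsing stand in for the real transformation logic and are assumptions.

import json

import apache_beam as beam
from google.cloud import pubsub_v1


class TransformWithDeadLetter(beam.DoFn):
    def __init__(self, dead_letter_topic):
        self._topic = dead_letter_topic
        self._publisher = None

    def setup(self):
        # One publisher client per worker, created outside the hot path.
        self._publisher = pubsub_v1.PublisherClient()

    def process(self, raw_row):
        try:
            record = json.loads(raw_row)  # stand-in for the real transformation
            yield {"user": record["user"], "amount": float(record["amount"])}
        except Exception:
            # Ship the offending row to the dead-letter topic for later reprocessing.
            self._publisher.publish(self._topic, raw_row.encode("utf-8"))


# Usage inside the pipeline (topic path is made up):
# rows | beam.ParDo(TransformWithDeadLetter("projects/example-project/topics/etl-dead-letter"))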

You're training a model to predict housing prices based on an available dataset of real estate properties. Your plan is to train a fully connected neural net, and you've discovered that the dataset contains the latitude and longitude of each property. Real estate professionals have told you that the location of the property is highly influential on price, so you'd like to engineer a feature that incorporates this physical dependency.

What should you do?

A. Provide latitude and longitude as input vectors to your neural net.
B. Create a numeric column from a feature cross of latitude and longitude.
C. Create a feature cross of latitude and longitude, bucketize it at the minute level, and use L1 regularization during optimization.
D. Create a feature cross of latitude and longitude, bucketize it at the minute level, and use L2 regularization during optimization.
Suggested answer: B

Explanation:

To engineer a feature that incorporates the physical dependency of location on housing prices, creating a numeric column from a feature cross of latitude and longitude is the most effective approach. Here's why option B is the best choice:

Feature crosses combine multiple features into a single feature that captures the interaction between them. For location data, a feature cross of latitude and longitude can capture spatial dependencies that affect housing prices, allowing the neural network to learn location-related patterns more effectively than raw latitude and longitude values.

Numerical representation: Converting the feature cross into a numeric column simplifies the input to the neural network and can improve the model's ability to learn from the data. It ensures that the combined information from latitude and longitude is used in a meaningful way.

Model training: Using a numeric column for the feature cross also helps regularize the model and prevent overfitting, which is crucial for good generalization on unseen data.
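
As a small illustration of the feature-cross idea (using TensorFlow's legacy feature-column API; bucket boundaries and sizes are made-up, untuned values), latitude and longitude can be bucketized, crossed, and fed to the network through an indicator column:

import tensorflow as tf

latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")

# Coarse one-degree buckets over an assumed region of interest.
lat_buckets = tf.feature_column.bucketized_column(
    latitude, boundaries=[float(b) for b in range(32, 43)]
)
lon_buckets = tf.feature_column.bucketized_column(
    longitude, boundaries=[float(b) for b in range(-124, -113)]
)

# The cross captures "this particular cell on the map", which is what drives price.
location = tf.feature_column.crossed_column(
    [lat_buckets, lon_buckets], hash_bucket_size=1000
)
location_feature = tf.feature_column.indicator_column(location)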

You are deploying MariaDB SQL databases on GCE VM Instances and need to configure monitoring and alerting. You want to collect metrics including network connections, disk IO and replication status from MariaDB with minimal development effort and use StackDriver for dashboards and alerts.

What should you do?

A. Install the OpenCensus Agent and create a custom metric collection application with a StackDriver exporter.
B. Place the MariaDB instances in an Instance Group with a Health Check.
C. Install the StackDriver Logging Agent and configure fluentd in_tail plugin to read MariaDB logs.
D. Install the StackDriver Agent and configure the MySQL plugin.
Suggested answer: C

You work for a bank. You have a labelled dataset that contains information on already granted loan applications and whether those applications defaulted. You have been asked to train a model to predict default rates for credit applicants.

What should you do?

A. Increase the size of the dataset by collecting additional data.
B. Train a linear regression to predict a credit default risk score.
C. Remove the bias from the data and collect applications that have been declined loans.
D. Match loan applicants with their social profiles to enable feature engineering.
Suggested answer: B
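
A minimal sketch of option B, assuming a labelled CSV of already granted loans; the file name and column names are invented for illustration. The linear model's continuous output serves as the credit default risk score.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

loans = pd.read_csv("granted_loans.csv")  # labelled, already granted applications
features = loans[["loan_amount", "income", "term_months", "credit_history_len"]]
labels = loans["defaulted"]               # 1 if the loan defaulted, else 0

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
risk_scores = model.predict(X_test)       # higher value = higher predicted default risk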

You need to migrate a 2TB relational database to Google Cloud Platform. You do not have the resources to significantly refactor the application that uses this database, and cost to operate is of primary concern.

Which service do you select for storing and serving your data?

A. Cloud Spanner
B. Cloud Bigtable
C. Cloud Firestore
D. Cloud SQL
Suggested answer: D
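
Serving the migrated database from Cloud SQL needs little more than a new connection setup in the application. The sketch below uses the Cloud SQL Python Connector with SQLAlchemy; the instance connection name, credentials, and query are illustrative assumptions.

import sqlalchemy
from google.cloud.sql.connector import Connector

connector = Connector()

def getconn():
    # "project:region:instance" is the Cloud SQL instance connection name (made up here).
    return connector.connect(
        "example-project:us-central1:orders-db",
        "pymysql",
        user="app_user",
        password="change-me",
        db="orders",
    )

pool = sqlalchemy.create_engine("mysql+pymysql://", creator=getconn)

with pool.connect() as conn:
    total = conn.execute(sqlalchemy.text("SELECT COUNT(*) FROM orders")).scalar()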

You're using Bigtable for a real-time application, and you have a heavy load that is a mix of reads and writes. You've recently identified an additional use case and need to run an hourly analytical job to calculate certain statistics across the whole database. You need to ensure the reliability of both your production application and the analytical workload.

What should you do?

A. Export a Bigtable dump to GCS and run your analytical job on top of the exported files.
B. Add a second cluster to the existing instance with multi-cluster routing; use the live-traffic app profile for your regular workload and the batch-analytics profile for the analytics workload.
C. Add a second cluster to the existing instance with single-cluster routing; use the live-traffic app profile for your regular workload and the batch-analytics profile for the analytics workload.
D. Increase the size of your existing cluster twice and execute your analytics workload on the new, resized cluster.
Suggested answer: B
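
One way to wire this up with the Bigtable admin client (Python) is sketched below: add a second cluster to the instance, then create separate app profiles so production traffic and the hourly analytics job are routed independently. Instance, cluster, zone, and profile names are made up, and the routing choice per profile is an illustrative assumption.

from google.cloud import bigtable
from google.cloud.bigtable import enums

client = bigtable.Client(project="example-project", admin=True)
instance = client.instance("realtime-instance")

# Second cluster that the analytics workload can be steered to.
analytics_cluster = instance.cluster(
    "analytics-cluster",
    location_id="us-central1-b",
    serve_nodes=3,
    default_storage_type=enums.StorageType.SSD,
)
analytics_cluster.create()

# App profile for the production application (multi-cluster routing, as in option B).
instance.app_profile(
    "live-traffic",
    routing_policy_type=enums.RoutingPolicyType.ANY,
).create()

# App profile that pins the hourly statistics job to the analytics cluster.
instance.app_profile(
    "batch-analytics",
    routing_policy_type=enums.RoutingPolicyType.SINGLE,
    cluster_id="analytics-cluster",
    allow_transactional_writes=False,
).create()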

You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery. The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to BigQuery for analysis.

Which job type and transforms should this pipeline use?

A. Batch job, PubSubIO, side-inputs
B. Streaming job, PubSubIO, JdbcIO, side-outputs
C. Streaming job, PubSubIO, BigQueryIO, side-inputs
D. Streaming job, PubSubIO, BigQueryIO, side-outputs
Suggested answer: C
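
A minimal Beam (Python) sketch of option C: a streaming job that reads events from Pub/Sub, enriches them with a small BigQuery reference table passed as a side input, and writes the enriched rows back to BigQuery. Topic, table, and field names are invented for the example.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # run with the Dataflow runner in practice

with beam.Pipeline(options=options) as p:
    # Small, static reference data: it fits in memory, so AsDict works as a side input.
    reference = (
        p
        | "ReadReference" >> beam.io.ReadFromBigQuery(
            table="example-project:reference.product_catalog"
        )
        | "KeyByProduct" >> beam.Map(lambda row: (row["product_id"], row["category"]))
    )

    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/example-project/topics/events"
        )
        | "Parse" >> beam.Map(json.loads)
        | "Enrich" >> beam.Map(
            lambda event, catalog: {**event, "category": catalog.get(event["product_id"])},
            catalog=beam.pvalue.AsDict(reference),
        )
        | "WriteEnriched" >> beam.io.WriteToBigQuery(
            "example-project:analytics.enriched_events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )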