
Google Professional Data Engineer Practice Test - Questions Answers, Page 27

You've migrated a Hadoop job from an on-premises cluster to Dataproc and Cloud Storage. Your Spark job is a complex analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB in size each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. Your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with only 2 non-preemptible workers) for this workload. What should you do?

A.
Switch from HDDs to SSDs, and override the preemptible VMs' configuration to increase the boot disk size.
B.
Increase the size of your Parquet files to ensure that they are at least 1 GB each.
C.
Switch to TFRecords format (approximately 200 MB per file) instead of Parquet files.
D.
Switch from HDDs to SSDs, copy the initial data from Cloud Storage to Hadoop Distributed File System (HDFS), run the Spark job, and copy the results back to Cloud Storage.
Suggested answer: A
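
Shuffle-heavy Spark jobs spill intermediate data to the workers' boot disks, so HDD-backed preemptible workers can easily become the bottleneck. Below is a minimal sketch, using the `google-cloud-dataproc` Python client, of how such a cluster could be defined with SSD boot disks and larger disks on the preemptible (secondary) workers; the project, region, cluster name, worker counts, and disk sizes are placeholder assumptions, not values taken from the question.

```python
from google.cloud import dataproc_v1

# Placeholder project/region values; adjust for your environment.
PROJECT = "my-project"
REGION = "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": "shuffle-heavy-spark",
    "config": {
        # Keep only 2 non-preemptible workers, as in the question.
        "worker_config": {
            "num_instances": 2,
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 1000},
        },
        # Preemptible (secondary) workers: override the default disk config so
        # shuffle spills land on larger SSD boot disks instead of small HDDs.
        "secondary_worker_config": {
            "num_instances": 10,
            "preemptibility": "PREEMPTIBLE",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 1000},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
)
operation.result()  # blocks until the cluster is running
```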

Your company currently runs a large on-premises cluster using Spark, Hive, and Hadoop Distributed File System (HDFS) in a colocation facility. The cluster is designed to support peak usage on the system; however, many jobs are batch in nature, and usage of the cluster fluctuates quite dramatically.

Your company is eager to move to the cloud to reduce the overhead associated with on-premises infrastructure and maintenance and to benefit from the cost savings. They are also hoping to modernize their existing infrastructure to use more serverless offerings in order to take advantage of the cloud. Because of the timing of their contract renewal with the colocation facility, they have only 2 months for their initial migration. How should you recommend they approach the upcoming migration strategy so they can maximize their cost savings in the cloud while still executing the migration in time?

A.
Migrate the workloads to Dataproc plus HDFS; modernize later.
B.
Migrate the workloads to Dataproc plus Cloud Storage; modernize later.
C.
Migrate the Spark workload to Dataproc plus HDFS, and modernize the Hive workload for BigQuery.
D.
Modernize the Spark workload for Dataflow and the Hive workload for BigQuery.
Suggested answer: D

You are collecting IoT sensor data from millions of devices across the world and storing the data in BigQuery. Your access pattern is based on recent data, filtered by location_id and device_version, with the following query:

You want to optimize your queries for cost and performance. How should you structure your data?

A.
Partition table data by create_date, location_id, and device_version.
B.
Partition table data by create_date; cluster table data by location_id and device_version.
C.
Cluster table data by create_date, location_id, and device_version.
D.
Cluster table data by create_date; partition by location_id and device_version.
Suggested answer: C

You want to optimize your queries for cost and performance. How should you structure your data?

A.
Partition table data by create_date, location_id, and device_version.
B.
Partition table data by create_date; cluster table data by location_id and device_version.
C.
Cluster table data by create_date, location_id, and device_version.
D.
Cluster table data by create_date; partition by location_id and device_version.
Suggested answer: B
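
Answer B corresponds to the usual BigQuery layout for this access pattern: partition on the date column that restricts scans to recent data, then cluster on the columns used in the filter. A minimal sketch with the `google-cloud-bigquery` Python client follows; the table name and schema are hypothetical, chosen only to match the columns mentioned in the question.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical schema matching the columns referenced in the question.
schema = [
    bigquery.SchemaField("create_date", "TIMESTAMP"),
    bigquery.SchemaField("location_id", "STRING"),
    bigquery.SchemaField("device_version", "STRING"),
    bigquery.SchemaField("sensor_reading", "FLOAT"),
]

table = bigquery.Table("my-project.iot.sensor_data", schema=schema)

# Partition on the timestamp column so queries over recent data prune
# old partitions instead of scanning the whole table.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="create_date",
)

# Cluster on the columns used in the WHERE clause to reduce the bytes
# scanned within each partition.
table.clustering_fields = ["location_id", "device_version"]

client.create_table(table)
```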

A live TV show asks viewers to cast votes using their mobile phones. The event generates a large volume of data during a 3-minute period. You are in charge of the voting infrastructure and must ensure that the platform can handle the load and that all votes are processed. You must display partial results while voting is open. After voting closes, you need to count the votes exactly once while optimizing cost. What should you do?

A.
Create a Memorystore instance with a high availability (HA) configuration.
B.
Write votes to a Pub/Sub topic, and have Cloud Functions subscribe to it and write the votes to BigQuery.
C.
Write votes to a Pub/Sub topic and load them into both Bigtable and BigQuery via a Dataflow pipeline. Query Bigtable for real-time results and BigQuery for later analysis. Shut down the Bigtable instance when voting concludes.
D.
Create a Cloud SQL for PostgreSQL database with a high availability (HA) configuration and multiple read replicas.
Suggested answer: C
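
The suggested answer describes a single Dataflow pipeline that fans votes out from Pub/Sub to both Bigtable (low-latency partial results) and BigQuery (the final count, queried after voting closes). Below is a minimal Apache Beam (Python) sketch of that branching; the topic, table names, vote schema, and Bigtable row layout are assumptions introduced for illustration only.

```python
import datetime
import json

import apache_beam as beam
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud.bigtable import row as bt_row


def to_bigtable_row(vote):
    # Illustrative row key: candidate id plus event timestamp.
    key = f"{vote['candidate_id']}#{vote['timestamp']}".encode("utf-8")
    direct_row = bt_row.DirectRow(row_key=key)
    direct_row.set_cell("votes", b"count", b"1",
                        timestamp=datetime.datetime.utcnow())
    return direct_row


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    votes = (
        p
        | "ReadVotes" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/votes")
        | "Parse" >> beam.Map(json.loads)
    )

    # Branch 1: Bigtable for low-latency partial results while voting is open.
    (
        votes
        | "ToBigtableRow" >> beam.Map(to_bigtable_row)
        | "WriteBigtable" >> WriteToBigTable(
            project_id="my-project",
            instance_id="votes-instance",
            table_id="votes",
        )
    )

    # Branch 2: BigQuery for the final count after voting closes.
    (
        votes
        | "WriteBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:voting.votes",
            schema="candidate_id:STRING,timestamp:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```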

You are updating the code for a subscriber to a Pub/Sub feed. You are concerned that upon deployment the subscriber may erroneously acknowledge messages, leading to message loss. Your subscriber is not set up to retain acknowledged messages. What should you do to ensure that you can recover from errors after deployment?

A.
Use Cloud Build for your deployment. If an error occurs after deployment, use a Seek operation to locate a timestamp logged by Cloud Build at the start of the deployment.
B.
Create a Pub/Sub snapshot before deploying new subscriber code. Use a Seek operation to re-deliver messages that became available after the snapshot was created.
C.
Set up the Pub/Sub emulator on your local machine. Validate the behavior of your new subscriber logic before deploying it to production.
D.
Enable dead lettering on the Pub/Sub topic to capture messages that aren't successfully acknowledged. If an error occurs after deployment, re-deliver any messages captured by the dead-letter queue.
Suggested answer: B
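
Answer B relies on Pub/Sub's snapshot and seek operations: a snapshot taken before the rollout captures the subscription's acknowledgment state, and seeking back to it re-delivers anything the faulty subscriber acknowledged afterwards. A minimal sketch with the `google-cloud-pubsub` Python client is shown below; the project, subscription, and snapshot names are placeholders.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()

subscription_path = subscriber.subscription_path("my-project", "my-subscription")
snapshot_path = subscriber.snapshot_path("my-project", "pre-deploy-snapshot")

# 1. Before deploying the new subscriber code, snapshot the subscription.
#    The snapshot retains all messages unacknowledged at this point, plus
#    any messages published afterwards.
subscriber.create_snapshot(
    request={"name": snapshot_path, "subscription": subscription_path}
)

# 2. Deploy the new subscriber code ...

# 3. If the new code acknowledged messages it should not have, seek the
#    subscription back to the snapshot so those messages are re-delivered.
subscriber.seek(
    request={"subscription": subscription_path, "snapshot": snapshot_path}
)
```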

Government regulations in the banking industry mandate the protection of clients' personally identifiable information (PII). Your company requires PII to be access controlled, encrypted, and compliant with major data protection standards. In addition to using Cloud Data Loss Prevention (Cloud DLP), you want to follow Google-recommended practices and use service accounts to control access to PII. What should you do?

A.
Assign the required Identity and Access Management (IAM) roles to every employee, and create a single service account to access protected resources.
B.
Use one service account to access a Cloud SQL database, and use separate service accounts for each human user.
C.
Use Cloud Storage to comply with major data protection standards. Use one service account shared by all users.
D.
Use Cloud Storage to comply with major data protection standards. Use multiple service accounts attached to IAM groups to grant the appropriate access to each group.
Suggested answer: D

You are migrating a table to BigQuery and are deciding on the data model. Your table stores information related to purchases made across several store locations, and includes information like the time of the transaction, items purchased, the store ID, and the city and state in which the store is located. You frequently query this table to see how many of each item were sold over the past 30 days and to look at purchasing trends by state, city, and individual store. You want to model this table to minimize query time and cost. What should you do?

A.
Partition by transaction time; cluster by state first, then city, then store ID.
B.
Partition by transaction time; cluster by store ID first, then city, then state.
C.
Top-level cluster by state first, then city, then store ID.
D.
Top-level cluster by store ID first, then city, then state.
Suggested answer: C

You are building a data pipeline on Google Cloud. You need to prepare data using a casual method for a machine-learning process. You want to support a logistic regression model. You also need to monitor and adjust for null values, which must remain real-valued and cannot be removed. What should you do?

A.
Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataproc job.
B.
Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 0 using a Cloud Dataprep job.
C.
Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataprep job.
D.
Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 0 using a custom script.
Suggested answer: C

You have an Oracle database deployed in a VM as part of a Virtual Private Cloud (VPC) network. You want to replicate and continuously synchronize 50 tables to BigQuery. You want to minimize the need to manage infrastructure. What should you do?

A.
Create a Datastream service from Oracle to BigQuery, use a private connectivity configuration to the same VPC network, and a connection profile to BigQuery.
B.
Create a Pub/Sub subscription to write to BigQuery directly. Deploy the Debezium Oracle connector to capture changes in the Oracle database, and sink them to the Pub/Sub topic.
C.
Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle change data capture (CDC), and Dataflow to stream the Kafka topic to BigQuery.
D.
Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle change data capture (CDC), and the Kafka Connect Google BigQuery Sink Connector.
Suggested answer: A

Explanation:

Datastream is a serverless, scalable, and reliable change data capture (CDC) service that streams data changes from relational databases such as Oracle and MySQL into Google Cloud destinations such as BigQuery and Cloud Storage. Datastream supports private connectivity to the source database over a VPC network, and it provides a connection profile to BigQuery, which simplifies the configuration and management of the data replication.

Reference:

Datastream overview

Creating a Datastream stream

Using Datastream with BigQuery
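
For reference, the setup described in answer A can also be driven through the `google-cloud-datastream` Python client. The sketch below is an assumption-heavy outline: the project, region, credentials, table selection, and the pre-created private connectivity configuration are placeholders, and the field names follow the `datastream_v1` types as best recalled here, so they should be checked against the current client library before use.

```python
from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()
parent = "projects/my-project/locations/us-central1"  # placeholder

# Oracle connection profile, reached over the existing VPC through a
# previously created Datastream private connectivity configuration.
oracle_profile = datastream_v1.ConnectionProfile(
    display_name="oracle-source",
    oracle_profile=datastream_v1.OracleProfile(
        hostname="10.0.0.5",        # the VM's private IP (placeholder)
        port=1521,
        username="datastream",
        password="change-me",
        database_service="ORCL",
    ),
    private_connectivity=datastream_v1.PrivateConnectivity(
        private_connection=f"{parent}/privateConnections/vpc-peering-conn",
    ),
)
client.create_connection_profile(
    parent=parent,
    connection_profile=oracle_profile,
    connection_profile_id="oracle-source",
).result()

# BigQuery connection profile for the destination.
bq_profile = datastream_v1.ConnectionProfile(
    display_name="bq-destination",
    bigquery_profile=datastream_v1.BigQueryProfile(),
)
client.create_connection_profile(
    parent=parent,
    connection_profile=bq_profile,
    connection_profile_id="bq-destination",
).result()

# Stream tying the two profiles together; the 50 tables to replicate would be
# listed in the Oracle source config's include objects.
stream = datastream_v1.Stream(
    display_name="oracle-to-bq",
    source_config=datastream_v1.SourceConfig(
        source_connection_profile=f"{parent}/connectionProfiles/oracle-source",
        oracle_source_config=datastream_v1.OracleSourceConfig(),
    ),
    destination_config=datastream_v1.DestinationConfig(
        destination_connection_profile=f"{parent}/connectionProfiles/bq-destination",
        bigquery_destination_config=datastream_v1.BigQueryDestinationConfig(),
    ),
)
client.create_stream(parent=parent, stream=stream, stream_id="oracle-to-bq").result()
```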
