
Google Professional Data Engineer Practice Test - Questions Answers, Page 34

You work for a large real estate firm and are preparing 6 TB of home sales data to be used for machine learning. You will use SQL to transform the data and use BigQuery ML to create a machine learning model. You plan to use the model for predictions against a raw dataset that has not been transformed. How should you set up your workflow in order to prevent skew at prediction time?

A.
When creating your model, use BigQuery's TRANSFORM clause to define preprocessing steps. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any transformations on the raw input data.
B.
When creating your model, use BigQuery's TRANSFORM clause to define preprocessing steps. Before requesting predictions, use a saved query to transform your raw input data, and then use ML.EVALUATE.
C.
Use a BigQuery view to define your preprocessing logic. When creating your model, use the view as your model training data. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any transformations on the raw input data.
D.
Preprocess all data using Dataflow. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any further transformations on the input data.
Suggested answer: A

Explanation:

https://cloud.google.com/bigquery-ml/docs/bigqueryml-transform

Using the TRANSFORM clause, you can specify all preprocessing during model creation. The preprocessing is automatically applied during the prediction and evaluation phases of machine learning.
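
For illustration, the sketch below (using the google-cloud-bigquery Python client, with hypothetical project, dataset, table, and column names) declares the preprocessing in a TRANSFORM clause at model creation; BigQuery then applies the same preprocessing automatically when the model is used, so raw rows can be passed to ML.PREDICT or ML.EVALUATE without any manual transformation.

from google.cloud import bigquery

client = bigquery.Client()

# Train a model whose preprocessing lives in the TRANSFORM clause.
client.query(
    """
    CREATE OR REPLACE MODEL `my-project.sales.price_model`
    TRANSFORM(
      ML.STANDARD_SCALER(square_feet) OVER () AS square_feet_scaled,
      ML.QUANTILE_BUCKETIZE(year_built, 10) OVER () AS year_built_bucket,
      label
    )
    OPTIONS(model_type = 'linear_reg', input_label_cols = ['label']) AS
    SELECT square_feet, year_built, sale_price AS label
    FROM `my-project.sales.home_sales`
    """
).result()

# Prediction on raw, untransformed rows; the TRANSFORM preprocessing is applied automatically.
rows = client.query(
    """
    SELECT *
    FROM ML.PREDICT(MODEL `my-project.sales.price_model`,
                    (SELECT square_feet, year_built FROM `my-project.sales.raw_homes`))
    """
).result()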

You have a data pipeline with a Dataflow job that aggregates and writes time series metrics to Bigtable. You notice that data is slow to update in Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data. What should you do?

Choose 2 answers

A.
Configure your Dataflow pipeline to use local execution.
B.
Modify your Dataflow pipeline to use the Flatten transform before writing to Bigtable.
C.
Modify your Dataflow pipeline to use the CoGroupByKey transform before writing to Bigtable.
D.
Increase the maximum number of Dataflow workers by setting maxNumWorkers in PipelineOptions.
E.
Increase the number of nodes in the Bigtable cluster.
Suggested answer: D, E

Explanation:

https://cloud.google.com/bigtable/docs/performance#performance-write-throughput

https://cloud.google.com/dataflow/docs/reference/pipeline-options
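
As a rough sketch of answers D and E in Python (project, instance, and cluster IDs are hypothetical): maxNumWorkers is the Java option name, and its equivalent in the Beam Python SDK is max_num_workers on WorkerOptions; Bigtable write throughput scales with the number of cluster nodes, adjustable through the admin client.

from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions
from google.cloud import bigtable

# Answer D: raise the autoscaling ceiling for the Dataflow job.
options = PipelineOptions()
options.view_as(WorkerOptions).max_num_workers = 50  # equivalent to maxNumWorkers in Java

# Answer E: add nodes to the Bigtable cluster to increase write throughput.
client = bigtable.Client(project="my-project", admin=True)
cluster = client.instance("metrics-instance").cluster("metrics-cluster")
cluster.reload()
cluster.serve_nodes = 6  # illustrative node count
cluster.update()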

One of your encryption keys stored in Cloud Key Management Service (Cloud KMS) was exposed. You need to re-encrypt all of your CMEK-protected Cloud Storage data that used that key, and then delete the compromised key. You also want to reduce the risk of objects getting written without customer-managed encryption key (CMEK) protection in the future. What should you do?

A.
Rotate the Cloud KMS key version. Continue to use the same Cloud Storage bucket.
B.
Create a new Cloud KMS key. Set the default CMEK key on the existing Cloud Storage bucket to the new one.
C.
Create a new Cloud KMS key. Create a new Cloud Storage bucket. Copy all objects from the old bucket to the new bucket while specifying the new Cloud KMS key in the copy command.
D.
Create a new Cloud KMS key. Create a new Cloud Storage bucket configured to use the new key as the default CMEK key. Copy all objects from the old bucket to the new bucket without specifying a key.
Suggested answer: C

Explanation:

To re-encrypt all of your CMEK-protected Cloud Storage data after a key has been exposed, and to ensure future writes are protected with a new key, creating a new Cloud KMS key and a new Cloud Storage bucket is the best approach. Here's why option C is the best choice:

Re-encryption of Data:

By creating a new Cloud Storage bucket and copying all objects from the old bucket to the new bucket while specifying the new Cloud KMS key, you ensure that all data is re-encrypted with the new key.

This process effectively re-encrypts the data, removing any dependency on the compromised key.

Ensuring CMEK Protection:

Creating a new bucket and setting the new CMEK as the default ensures that all future objects written to the bucket are automatically protected with the new key.

This reduces the risk of objects being written without CMEK protection.

Deletion of Compromised Key:

Once the data has been copied and re-encrypted, the old key can be safely deleted from Cloud KMS, eliminating the risk associated with the compromised key.

Steps to Implement:

Create a New Cloud KMS Key:

Create a new encryption key in Cloud KMS to replace the compromised key.

Create a New Cloud Storage Bucket:

Create a new Cloud Storage bucket and set the default CMEK to the new key.

Copy and Re-encrypt Data:

Use the gsutil tool to copy data from the old bucket to the new bucket while specifying the new Cloud KMS key through the encryption_key configuration option:

gsutil -o 'GSUtil:encryption_key=projects/PROJECT_ID/locations/LOCATION/keyRings/KEY_RING/cryptoKeys/NEW_KEY' cp -r gs://old-bucket/* gs://new-bucket/

Delete the Old Key:

After ensuring all data is copied and re-encrypted, delete the compromised key from Cloud KMS.

Cloud KMS Documentation

Cloud Storage Encryption

Re-encrypting Data in Cloud Storage
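
The same copy-and-re-encrypt step can also be scripted with the google-cloud-storage Python client. A minimal sketch, assuming hypothetical bucket names and key path, that rewrites each object into the new bucket under the new CMEK:

from google.cloud import storage

NEW_KMS_KEY = "projects/my-project/locations/us/keyRings/data-ring/cryptoKeys/new-key"

client = storage.Client()
old_bucket = client.bucket("old-bucket")
new_bucket = client.bucket("new-bucket")

for blob in client.list_blobs(old_bucket):
    # Pin the destination object to the new CMEK while copying.
    dest = new_bucket.blob(blob.name, kms_key_name=NEW_KMS_KEY)
    token, rewritten, total = dest.rewrite(blob)
    while token is not None:  # large objects may need several rewrite calls
        token, rewritten, total = dest.rewrite(blob, token=token)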

You are designing the architecture of your application to store data in Cloud Storage. Your application consists of pipelines that read data from a Cloud Storage bucket that contains raw data, and write the data to a second bucket after processing. You want to design an architecture with Cloud Storage resources that are capable of being resilient if a Google Cloud regional failure occurs. You want to minimize the recovery point objective (RPO) if a failure occurs, with no impact on applications that use the stored data. What should you do?

A.
Adopt two regional Cloud Storage buckets, and update your application to write the output to both buckets.
B.
Adopt multi-regional Cloud Storage buckets in your architecture.
C.
Adopt two regional Cloud Storage buckets, and create a daily task to copy from one bucket to the other.
D.
Adopt a dual-region Cloud Storage bucket, and enable turbo replication in your architecture.
Suggested answer: D

Explanation:

To ensure resilience and minimize the recovery point objective (RPO) with no impact on applications, using a dual-region bucket with turbo replication is the best approach. Here's why option D is the best choice:

Dual-Region Buckets:

Dual-region buckets store data redundantly across two distinct geographic regions, providing high availability and durability.

This setup ensures that data remains available even if one region experiences a failure.

Turbo Replication:

Turbo replication ensures that data is replicated between the two regions within 15 minutes, aligning with the requirement to minimize the recovery point objective (RPO).

This feature provides near real-time replication, significantly reducing the risk of data loss.

No Impact on Applications:

Applications continue to access the dual-region bucket without any changes, ensuring seamless operation even during a regional failure.

The dual-region setup transparently handles failover, providing uninterrupted access to data.

Steps to Implement:

Create a Dual-Region Bucket:

Create a dual-region Cloud Storage bucket in the Google Cloud Console, selecting appropriate regions (e.g., us-central1 and us-east1).

Enable Turbo Replication:

Enable turbo replication to ensure rapid data replication between the selected regions.

Configure Applications:

Ensure that applications read and write to the dual-region bucket, benefiting from its high availability and durability.

Test Failover:

Simulate a regional failure to verify that the dual-region bucket and turbo replication meet the required RPO and ensure data resilience.

Google Cloud Storage Dual-Region

Turbo Replication in Google Cloud Storage
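
A minimal sketch of this setup with the google-cloud-storage Python client, assuming a hypothetical bucket name and the predefined NAM4 (us-central1 + us-east1) dual-region; the bucket's rpo property controls turbo replication:

from google.cloud import storage

client = storage.Client()

# Create the bucket in a dual-region location.
bucket = client.create_bucket("pipeline-data-bucket", location="NAM4")

# Enable turbo replication (15-minute RPO); "DEFAULT" reverts to standard replication.
bucket.rpo = "ASYNC_TURBO"
bucket.patch()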

You are using Workflows to call an API that returns a 1 KB JSON response, apply some complex business logic on this response, wait for the logic to complete, and then perform a load from a Cloud Storage file to BigQuery. The Workflows standard library does not have sufficient capabilities to perform your complex logic, and you want to use Python's standard library instead. You want to optimize your workflow for simplicity and speed of execution. What should you do?

A.
Invoke a Cloud Function instance that uses Python to apply the logic on your JSON file.
B.
Invoke a subworkflow in Workflows to apply the logic on your JSON file.
C.
Create a Cloud Composer environment and run the logic in Cloud Composer.
D.
Create a Dataproc cluster, and use PySpark to apply the logic on your JSON file.
Suggested answer: A

Explanation:

To apply complex business logic on a JSON response using Python's standard library within a Workflow, invoking a Cloud Function is the most efficient and straightforward approach. Here's why option A is the best choice:

Cloud Functions:

Cloud Functions provide a lightweight, serverless execution environment for running code in response to events. They support Python and can easily integrate with Workflows.

This approach ensures simplicity and speed of execution, as Cloud Functions can be invoked directly from a Workflow and handle the complex logic required.

Flexibility and Simplicity:

Using Cloud Functions allows you to leverage Python's extensive standard library and ecosystem, making it easier to implement and maintain the complex business logic.

Cloud Functions abstract the underlying infrastructure, allowing you to focus on the application logic without worrying about server management.

Performance:

Cloud Functions are optimized for fast execution and can handle the processing of the JSON response efficiently.

They are designed to scale automatically based on demand, ensuring that your workflow remains performant.

Steps to Implement:

Write the Cloud Function:

Develop a Cloud Function in Python that processes the JSON response and applies the necessary business logic.

Deploy the function to Google Cloud.

Invoke Cloud Function from Workflow:

Modify your Workflow to call the Cloud Function using an HTTP request or the Cloud Functions connector, for example:

main:
  steps:
    - callCloudFunction:
        call: http.post
        args:
          url: https://REGION-PROJECT_ID.cloudfunctions.net/FUNCTION_NAME
          body:
            key: value
          # add an OIDC auth block here if the function requires authentication

Process Results:

Handle the response from the Cloud Function and proceed with the next steps in the Workflow, such as loading data into BigQuery.

Google Cloud Functions Documentation

Using Workflows with Cloud Functions

Workflows Standard Library

You are using BigQuery with a regional dataset that includes a table with the daily sales volumes. This table is updated multiple times per day. You need to protect your sales table in case of regional failures with a recovery point objective (RPO) of less than 24 hours, while keeping costs to a minimum. What should you do?

A.
Schedule a daily BigQuery snapshot of the table.
B.
Schedule a daily export of the table to a Cloud Storage dual-region or multi-region bucket.
C.
Schedule a daily copy of the dataset to a backup region.
D.
Modify the ETL job to load the data into both the current region and a backup region.

Suggested answer: A
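
For reference, a daily table snapshot (option A) can be created with standard SQL and run on a schedule. A minimal sketch with the BigQuery Python client, assuming hypothetical project, dataset, and table names:

from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()
suffix = datetime.now(timezone.utc).strftime("%Y%m%d")

# Snapshot the sales table; the snapshot expires after 7 days to limit storage costs.
client.query(
    f"""
    CREATE SNAPSHOT TABLE `my-project.sales.daily_sales_{suffix}`
    CLONE `my-project.sales.daily_sales`
    OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY))
    """
).result()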

Your infrastructure team has set up an interconnect link between Google Cloud and the on-premises network. You are designing a high-throughput streaming pipeline to ingest streaming data from an Apache Kafka cluster hosted on-premises. You want to store the data in BigQuery, with as minimal latency as possible. What should you do?

A.
Use a proxy host in the VPC in Google Cloud connecting to Kafka. Write a Dataflow pipeline, read data from the proxy host, and write the data to BigQuery.
B.
Set up a Kafka Connect bridge between Kafka and Pub/Sub. Use a Google-provided Dataflow template to read the data from Pub/Sub, and write the data to BigQuery.
C.
Set up a Kafka Connect bridge between Kafka and Pub/Sub. Write a Dataflow pipeline, read the data from Pub/Sub, and write the data to BigQuery.
D.
Use Dataflow, write a pipeline that reads the data from Kafka, and writes the data to BigQuery.
Suggested answer: C

Explanation:

Here's a detailed breakdown of why this solution is optimal and why others fall short:

Why Option C is the Best Solution:

Kafka Connect Bridge: This bridge acts as a reliable and scalable conduit between your on-premises Kafka cluster and Google Cloud's Pub/Sub messaging service. It handles the complexities of securely transferring data over the interconnect link.

Pub/Sub as a Buffer: Pub/Sub serves as a highly scalable buffer, decoupling the Kafka producer from the Dataflow consumer. This is crucial for handling fluctuations in message volume and ensuring smooth data flow even during spikes.

Custom Dataflow Pipeline: Writing a custom Dataflow pipeline gives you the flexibility to implement any necessary transformations or enrichments to the data before it's written to BigQuery. This is often required in real-world streaming scenarios.

Minimal Latency: By using Pub/Sub as a buffer and Dataflow for efficient processing, you minimize the latency between the data being produced in Kafka and being available for querying in BigQuery.

Why Other Options Are Not Ideal:

Option A: Using a proxy host introduces an additional point of failure and can create a bottleneck, especially with high-throughput streaming.

Option B: While Google-provided Dataflow templates can be helpful, they might lack the customization needed for specific transformations or handling complex data structures.

Option D: Dataflow doesn't natively connect to on-premises Kafka clusters. Directly reading from Kafka would require complex networking configurations and could lead to performance issues.

Additional Considerations:

Schema Management: Ensure that the schema of the data being produced in Kafka is compatible with the schema expected in BigQuery. Consider using tools like Schema Registry for schema evolution management.

Monitoring: Set up robust monitoring and alerting to detect any issues in the pipeline, such as message backlogs or processing errors.

By following Option C, you leverage the strengths of Kafka Connect, Pub/Sub, and Dataflow to create a high-throughput, low-latency streaming pipeline that seamlessly integrates your on-premises Kafka data with BigQuery.
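
A skeleton of the custom Dataflow pipeline in option C, written with the Apache Beam Python SDK; the subscription and table names are hypothetical, and the messages are assumed to be JSON produced by the Kafka Connect bridge:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

SUBSCRIPTION = "projects/my-project/subscriptions/kafka-bridge-sub"  # hypothetical
TABLE = "my-project:analytics.kafka_events"                          # hypothetical

def run():
    options = PipelineOptions(streaming=True)  # add Dataflow runner options when deploying
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                TABLE,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )

if __name__ == "__main__":
    run()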

Your organization uses a multi-cloud data storage strategy, storing data in Cloud Storage and in Amazon Web Services' (AWS) S3 storage buckets. All data resides in US regions. You want to query up-to-date data by using BigQuery, regardless of which cloud the data is stored in. You need to allow users to query the tables from BigQuery without giving direct access to the data in the storage buckets. What should you do?

A.
Set up a BigQuery Omni connection to the AWS S3 bucket data. Create BigLake tables over the Cloud Storage and S3 data, and query the data using BigQuery directly.
B.
Set up a BigQuery Omni connection to the AWS S3 bucket data. Create external tables over the Cloud Storage and S3 data, and query the data using BigQuery directly.
C.
Use the Storage Transfer Service to copy data from the AWS S3 buckets to Cloud Storage buckets. Create BigLake tables over the Cloud Storage data, and query the data using BigQuery directly.
D.
Use the Storage Transfer Service to copy data from the AWS S3 buckets to Cloud Storage buckets. Create external tables over the Cloud Storage data, and query the data using BigQuery directly.
Suggested answer: B

Explanation:

BigQuery Omni enables you to run BigQuery analytics directly on data stored in AWS S3 buckets without having to move or copy the data. This provides several benefits:

Reduced Data Movement Costs: Eliminates the need to egress data from AWS, potentially saving significant costs.

Real-Time Analytics: Allows you to query data in AWS S3 in real-time, providing up-to-date insights.

Simplified Architecture: Reduces the complexity of managing data pipelines and ETL processes.

Here's a breakdown of the steps involved in using BigQuery Omni:

Set up a BigQuery Omni connection: This involves configuring the connection between your Google Cloud project and your AWS S3 bucket. This connection establishes the secure link for BigQuery to access the data in AWS S3.

Create external tables: BigQuery external tables are a way to query data residing in external storage systems, such as AWS S3, without having to import the data into BigQuery. This enables you to directly query the data in its original location.

Query the data using BigQuery: Once the external tables are created, you can use standard SQL queries to analyze the data stored in both Cloud Storage and AWS S3, just as if it were native BigQuery data.

Why other options are not suitable:

Option A: BigLake tables are designed for storing large volumes of structured data within BigQuery itself, not for directly querying data in external storage systems.

Option C and D: While the Storage Transfer Service is useful for moving data between cloud providers, it introduces unnecessary data movement and latency if the goal is to simply query the data in its original location.

Key Points:

BigQuery Omni extends BigQuery's capabilities to analyze data stored in other cloud providers, such as AWS.

External tables provide a way to query data in external storage systems without having to import it into BigQuery.

By leveraging BigQuery Omni and external tables, you can efficiently and cost-effectively query data stored in multiple cloud environments using a single tool, BigQuery.
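
As a hedged sketch of option B, the external table over S3 is created with a CREATE EXTERNAL TABLE statement that references the BigQuery Omni connection; the connection name, dataset, and S3 path below are hypothetical, and the dataset is assumed to live in the matching Omni AWS region:

from google.cloud import bigquery

client = bigquery.Client()

# The connection must already exist in the aws-us-east-1 Omni region.
client.query(
    """
    CREATE EXTERNAL TABLE `my-project.omni_dataset.s3_sales`
    WITH CONNECTION `aws-us-east-1.s3_connection`
    OPTIONS (
      format = 'PARQUET',
      uris = ['s3://my-aws-bucket/sales/*']
    )
    """
).result()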

You have thousands of Apache Spark jobs running in your on-premises Apache Hadoop cluster. You want to migrate the jobs to Google Cloud. You want to use managed services to run your jobs instead of maintaining a long-lived Hadoop cluster yourself. You have a tight timeline and want to keep code changes to a minimum. What should you do?

A.
Copy your data to Compute Engine disks. Manage and run your jobs directly on those instances.
B.
Move your data to Cloud Storage. Run your jobs on Dataproc.
C.
Move your data to BigQuery. Convert your Spark scripts to a SQL-based processing approach.
D.
Rewrite your jobs in Apache Beam. Run your jobs in Dataflow.
Suggested answer: B

Explanation:

Dataproc's Compatibility with Apache Spark: Dataproc is a managed service for running Hadoop and Spark clusters on Google Cloud. This means it is designed to seamlessly run Apache Spark jobs with minimal code changes. Your existing Spark jobs should run on Dataproc with little to no modification.

Cloud Storage as a Scalable Data Lake: Cloud Storage provides a highly scalable and durable storage solution for your data. It's designed to handle large volumes of data that Spark jobs typically process.

Minimizing Operational Overhead: By using Dataproc, you eliminate the need to manage and maintain a Hadoop cluster yourself. Google Cloud handles the infrastructure, allowing you to focus on your data processing tasks.

Tight Timeline and Minimal Code Changes: This option directly addresses the requirements of the question. It offers a quick and easy way to migrate your Spark jobs to Google Cloud with minimal disruption to your existing codebase.

Why other options are not suitable:

A. Copy your data to Compute Engine disks. Manage and run your jobs directly on those instances: This option requires you to manage the underlying infrastructure yourself, which contradicts the requirement of using managed services.

C. Move your data to BigQuery. Convert your Spark scripts to a SQL-based processing approach: While BigQuery is a powerful data warehouse, converting Spark scripts to SQL would require substantial code changes and might not be feasible within a tight timeline.

D. Rewrite your jobs in Apache Beam. Run your jobs in Dataflow: Rewriting jobs in Apache Beam would be a significant undertaking and not suitable for a quick migration with minimal code changes.
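
Existing Spark jobs typically run unchanged on Dataproc once their data lives in Cloud Storage. The sketch below submits one such job with the google-cloud-dataproc Python client; the project, region, cluster, class, and jar paths are hypothetical:

from google.cloud import dataproc_v1 as dataproc

project_id, region, cluster_name = "my-project", "us-central1", "spark-migration-cluster"

job_client = dataproc.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Submit an existing Spark job, with its data now read from Cloud Storage instead of HDFS.
job = {
    "placement": {"cluster_name": cluster_name},
    "spark_job": {
        "main_class": "org.example.SalesReport",
        "jar_file_uris": ["gs://my-bucket/jars/sales-report.jar"],
        "args": ["gs://my-bucket/input/", "gs://my-bucket/output/"],
    },
}
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
operation.result()  # wait for the job to finish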

You work for a farming company. You have one BigQuery table named sensors, which is about 500 MB and contains the list of your 5000 sensors, with columns for id, name, and location. This table is updated every hour. Each sensor generates one metric every 30 seconds along with a timestamp, which you want to store in BigQuery. You want to run an analytical query on the data once a week for monitoring purposes. You also want to minimize costs. What data model should you use?

A.
1. Create a metrics column in the sensors table. 2. Set RECORD type and REPEATED mode for the metrics column. 3. Use an UPDATE statement every 30 seconds to add new metrics.
B.
1. Create a metrics column in the sensors table. 2. Set RECORD type and REPEATED mode for the metrics column. 3. Use an INSERT statement every 30 seconds to add new metrics.
C.
1. Create a metrics table partitioned by timestamp. 2. Create a sensorId column in the metrics table that points to the id column in the sensors table. 3. Use an INSERT statement every 30 seconds to append new metrics to the metrics table. 4. Join the two tables, if needed, when running the analytical query.
D.
1. Create a metrics table partitioned by timestamp. 2. Create a sensorId column in the metrics table that points to the id column in the sensors table. 3. Use an UPDATE statement every 30 seconds to append new metrics to the metrics table. 4. Join the two tables, if needed, when running the analytical query.
Suggested answer: C

Explanation:

For a farming company with a sensor data table updated every 30 seconds, the goal is to minimize costs while facilitating weekly analytical queries. The best data model will effectively manage data storage, update frequency, and query performance.

Partitioned Metrics Table:

Creating a metrics table partitioned by timestamp optimizes query performance and storage costs.

Partitioning by timestamp allows for efficient querying, especially for time-based analyses.

Sensor ID Reference:

Including a sensor_id column in the metrics table that points to the id column in the sensors table ensures data normalization.

This structure avoids redundancy and maintains a clear relationship between sensors and their metrics.

Using INSERT Statements:

Using INSERT statements to append new metrics every 30 seconds is efficient and cost-effective.

INSERT operations are more suitable than UPDATE operations for adding new data entries, especially at high frequencies.

Joining Tables for Analysis:

When running analytical queries, joining the partitioned metrics table with the sensors table as needed provides a comprehensive view of the data.

This approach leverages BigQuery's powerful JOIN capabilities while keeping the data model normalized and efficient.

Google Data Engineer Reference:

BigQuery Partitioned Tables

BigQuery Best Practices

Efficient Data Partitioning

BigQuery Data Modeling

Using this data model, the farming company can manage its sensor data effectively, minimize costs, and perform weekly analytical queries with high efficiency.
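
A minimal sketch of the option C data model with the BigQuery Python client; the project, dataset, table, and column names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()
METRICS_TABLE = "my-project.farm.metrics"   # hypothetical
SENSORS_TABLE = "my-project.farm.sensors"   # hypothetical

# One-time setup: a metrics table partitioned by the metric timestamp (step 1).
client.query(
    f"""
    CREATE TABLE IF NOT EXISTS `{METRICS_TABLE}` (
      sensor_id STRING,
      value FLOAT64,
      timestamp TIMESTAMP
    )
    PARTITION BY DATE(timestamp)
    """
).result()

# Every 30 seconds: append the new readings (step 3); INSERT-style appends avoid
# the cost and quota issues of frequent UPDATE statements.
rows = [{"sensor_id": "sensor-42", "value": 21.7, "timestamp": "2024-01-01T00:00:00Z"}]
errors = client.insert_rows_json(METRICS_TABLE, rows)  # empty list on success

# Weekly monitoring query: join the partitioned metrics with the sensors table (step 4).
weekly = client.query(
    f"""
    SELECT s.name, s.location, AVG(m.value) AS avg_value
    FROM `{METRICS_TABLE}` m
    JOIN `{SENSORS_TABLE}` s ON s.id = m.sensor_id
    WHERE m.timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY s.name, s.location
    """
).result()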
