
Google Professional Data Engineer Practice Test - Questions Answers, Page 29

You are designing a Dataflow pipeline for a batch processing job. You want to mitigate multiple zonal failures at job submission time. What should you do?

A. Specify a worker region by using the --region flag.
B. Set the pipeline staging location as a regional Cloud Storage bucket.
C. Submit duplicate pipelines in two different zones by using the --zone flag.
D. Create an Eventarc trigger to resubmit the job in case of zonal failure when submitting the job.
Suggested answer: A

Explanation:

When you submit a batch job with only a worker region (the --region flag) and no worker zone, Dataflow's automatic zone placement selects the best available zone within that region at job submission time, which mitigates the impact of zonal failures. Submitting duplicate pipelines pinned to specific zones with the --zone flag, changing the staging bucket location, or resubmitting the job with an Eventarc trigger does not provide this protection.
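
As a minimal, hedged sketch (the project ID and bucket are placeholders), a Beam pipeline submitted this way names only a region, leaving zone selection to the Dataflow service:

    # Minimal sketch, assuming the Apache Beam Python SDK and valid GCP credentials;
    # "my-project" and "gs://my-bucket" are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # placeholder project ID
        temp_location="gs://my-bucket/temp",  # placeholder staging/temp bucket
        region="us-central1",                 # worker region only; no zone is pinned,
                                              # so Dataflow picks an available zone
    )

    with beam.Pipeline(options=options) as pipeline:
        _ = pipeline | beam.Create(["record"]) | beam.Map(print)
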
You are designing the architecture to process your data from Cloud Storage to BigQuery by using Dataflow. The network team provided you with the Shared VPC network and subnetwork to be used by your pipelines. You need to enable the deployment of the pipeline on the Shared VPC network. What should you do?

A. Assign the compute.networkUser role to the Dataflow service agent.
B. Assign the compute.networkUser role to the service account that executes the Dataflow pipeline.
C. Assign the dataflow.admin role to the Dataflow service agent.
D. Assign the dataflow.admin role to the service account that executes the Dataflow pipeline.
Suggested answer: B

Explanation:

To use a Shared VPC network for a Dataflow pipeline, you need to specify the subnetwork parameter with the full URL of the subnetwork and grant the service account that executes the pipeline the compute.networkUser role in the host project. This role allows the service account to use subnetworks in the Shared VPC network. The Dataflow service agent does not need this role, as it only creates and manages the resources for the pipeline; it does not execute it. The dataflow.admin role is not related to network access; it grants permissions to create and delete Dataflow jobs and resources. Reference:

Specify a network and subnetwork | Cloud Dataflow | Google Cloud

How to config dataflow Pipeline to use a Shared VPC?
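
For illustration, a hedged sketch of the pipeline options involved (the project IDs, region, subnetwork, and service account email are placeholders): the Shared VPC subnetwork is passed as a full URL, and the job runs as a worker service account that must hold compute.networkUser in the host project:

    # Sketch only: project IDs, region, subnetwork, and service account are placeholders.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="service-project-id",         # placeholder service project
        region="us-central1",
        temp_location="gs://my-bucket/temp",  # placeholder bucket
        # Full URL of the Shared VPC subnetwork in the host project:
        subnetwork=(
            "https://www.googleapis.com/compute/v1/projects/host-project-id"
            "/regions/us-central1/subnetworks/shared-subnet"
        ),
        # Worker service account that must hold roles/compute.networkUser in the host project:
        service_account_email="dataflow-worker@service-project-id.iam.gserviceaccount.com",
    )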

You are building an ELT solution in BigQuery by using Dataform. You need to perform uniqueness and null value checks on your final tables. What should you do to efficiently integrate these checks into your pipeline?

A. Build Dataform assertions into your code.
B. Write a Spark-based stored procedure.
C. Build BigQuery user-defined functions (UDFs).
D. Create Dataplex data quality tasks.
Suggested answer: A

Explanation:

Dataform assertions are data quality tests that find rows that violate one or more rules specified in the query. If the query returns any rows, the assertion fails. Dataform runs assertions every time it updates your SQL workflow and alerts you if any assertions fail. You can create assertions for all Dataform table types: tables, incremental tables, views, and materialized views. You can add built-in assertions to the config block of a table, such as nonNull and rowConditions, or create manual assertions with SQLX for advanced use cases. Dataform automatically creates views in BigQuery that contain the results of compiled assertion queries, which you can inspect to debug failing assertions. Dataform assertions are an efficient way to integrate data quality checks into your ELT solution in BigQuery by using Dataform. Reference:

Test tables with assertions | Dataform | Google Cloud

Test data quality with assertions | Dataform

Data quality tests and documenting datasets | Dataform

Data quality testing with SQL assertions | Dataform

You are designing a real-time system for a ride-hailing app that identifies areas with high demand for rides to effectively reroute available drivers to meet the demand. The system ingests data from multiple sources to Pub/Sub, processes the data, and stores the results for visualization and analysis in real-time dashboards. The data sources include driver location updates every 5 seconds and app-based booking events from riders. The data processing involves real-time aggregation of supply and demand data for the last 30 seconds, every 2 seconds, and storing the results in a low-latency system for visualization. What should you do?

A. Group the data by using a tumbling window in a Dataflow pipeline, and write the aggregated data to Memorystore.
B. Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to Memorystore.
C. Group the data by using a session window in a Dataflow pipeline, and write the aggregated data to BigQuery.
D. Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to BigQuery.
Suggested answer: B

Explanation:

A hopping window is a type of sliding window that advances by a fixed period of time, producing overlapping windows. This is suitable for the scenario where the system needs to aggregate data for the last 30 seconds, every 2 seconds, and provide real-time updates. A Dataflow pipeline can implement the hopping window logic using Apache Beam, and process both streaming and batch data sources. Memorystore is a low-latency, in-memory data store that can serve the aggregated data to the visualization layer. BigQuery is not a good choice for this scenario, as it is not optimized for low-latency queries and frequent updates.
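
To make the windowing concrete, here is a small sketch that runs locally on the Beam DirectRunner with made-up toy events (in the real pipeline the elements would come from Pub/Sub and the results would be written to Memorystore); it counts events per area over a 30-second window that advances every 2 seconds:

    # Sketch with hard-coded toy events: (area_id, event time in seconds).
    import apache_beam as beam
    from apache_beam.transforms.window import SlidingWindows, TimestampedValue

    events = [("area-1", 0), ("area-1", 5), ("area-2", 12), ("area-1", 31)]

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(events)
            | "AttachTimestamps" >> beam.Map(lambda e: TimestampedValue((e[0], 1), e[1]))
            | "HoppingWindow" >> beam.WindowInto(SlidingWindows(size=30, period=2))  # 30 s window, every 2 s
            | "CountPerArea" >> beam.CombinePerKey(sum)
            | beam.Map(print)
        )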

You orchestrate ETL pipelines by using Cloud Composer. One of the tasks in the Apache Airflow directed acyclic graph (DAG) relies on a third-party service. You want to be notified when the task does not succeed. What should you do?

A. Configure a Cloud Monitoring alert on the sla_missed metric associated with the task at risk to trigger a notification.
B. Assign a function with notification logic to the sla_miss_callback parameter for the operator responsible for the task at risk.
C. Assign a function with notification logic to the on_retry_callback parameter for the operator responsible for the task at risk.
D. Assign a function with notification logic to the on_failure_callback parameter for the operator responsible for the task at risk.
Suggested answer: D

Explanation:

By assigning a function with notification logic to the on_failure_callback parameter, you can customize the action that is taken when a task fails in your DAG [1]. For example, you can send an email, a Slack message, or a PagerDuty alert to notify yourself or your team about the task failure [2]. This option is more flexible and reliable than configuring a Cloud Monitoring alert on the sla_missed metric, which only triggers when a task misses its scheduled deadline [3]. The sla_miss_callback parameter is also related to the sla_missed metric, and it is executed when the task instance has not succeeded and the time is past the task's scheduled execution date plus its SLA [4]. The on_retry_callback parameter is executed before a task is retried [4]. These options are not suitable for notifying when a task does not succeed, as they depend on the task's schedule and retry settings, which may not reflect the actual task completion status. Reference:

1: Callbacks | Cloud Composer | Google Cloud

2: How to Send an Email on Task Failure in Airflow - Astronomer

3: Monitoring SLA misses | Cloud Composer | Google Cloud

4: BaseOperator | Apache Airflow Documentation
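
As an illustrative sketch (the DAG ID is made up, the bash_command simulates a failing call, and the callback body is a placeholder for real email, Slack, or PagerDuty logic), the callback is attached through the operator's on_failure_callback parameter:

    # Sketch: the bash_command stands in for the flaky third-party call, and the
    # callback body is a placeholder for real notification logic.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    def notify_failure(context):
        ti = context["task_instance"]
        # Placeholder: send an email, Slack message, or page here instead of printing.
        print(f"Task {ti.task_id} in DAG {ti.dag_id} failed on {context['ds']}")

    with DAG(
        dag_id="etl_with_third_party_call",   # placeholder DAG ID
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        call_service = BashOperator(
            task_id="call_third_party_service",
            bash_command="exit 1",               # simulated failure
            on_failure_callback=notify_failure,  # runs only when the task ends in the failed state
        )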

You have a streaming pipeline that ingests data from Pub/Sub in production. You need to update this streaming pipeline with improved business logic. You need to ensure that the updated pipeline reprocesses the previous two days of delivered Pub/Sub messages. What should you do?

Choose 2 answers

A. Use Pub/Sub Seek with a timestamp.
B. Use the Pub/Sub subscription clear-retry-policy flag.
C. Create a new Pub/Sub subscription two days before the deployment.
D. Use the Pub/Sub subscription retain-acked-messages flag.
E. Use Pub/Sub Snapshot capture two days before the deployment.
Suggested answer: A, E

Explanation:

To update a streaming pipeline with improved business logic and reprocess the previous two days of delivered Pub/Sub messages, you should use Pub/Sub Seek with a timestamp and Pub/Sub Snapshot capture two days before the deployment. Pub/Sub Seek allows you to replay or purge messages in a subscription based on a time or a snapshot. Pub/Sub Snapshot allows you to capture the state of a subscription at a given point in time and replay messages from that point. By using these features, you can ensure that the updated pipeline can process the messages that were delivered in the past two days without losing any data. Reference:

Pub/Sub Seek

Pub/Sub Snapshot
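
A hedged sketch with the google-cloud-pubsub client (the project, subscription, and snapshot IDs are placeholders): the snapshot is captured before the deployment, and the subscription is then sought back to it so the updated pipeline reprocesses the retained messages:

    # Sketch only: IDs are placeholders; draining the old job and launching the
    # updated job are omitted.
    from google.cloud import pubsub_v1

    project_id = "my-project"
    subscription_id = "orders-sub"
    snapshot_id = "pre-deploy-snapshot"

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_id)
    snapshot_path = subscriber.snapshot_path(project_id, snapshot_id)

    # 1. Before the deployment: capture the subscription's state.
    subscriber.create_snapshot(
        request={"name": snapshot_path, "subscription": subscription_path}
    )

    # 2. After deploying the updated pipeline: rewind the subscription so the new
    #    pipeline reprocesses everything delivered since the snapshot was taken.
    subscriber.seek(request={"subscription": subscription_path, "snapshot": snapshot_path})

    # Seeking to a timestamp instead (answer A) works similarly, provided the
    # subscription retains acknowledged messages:
    # subscriber.seek(request={"subscription": subscription_path, "time": two_days_ago})
    # (two_days_ago would be a protobuf Timestamp set two days in the past)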

You stream order data by using a Dataflow pipeline, and write the aggregated result to Memorystore. You provisioned a Memorystore for Redis instance with Basic Tier and 4 GB capacity, which is used by 40 clients for read-only access. You are expecting the number of read-only clients to increase significantly to a few hundred, and you need to be able to support the demand. You want to ensure that read and write access availability is not impacted, and any changes you make can be deployed quickly. What should you do?

A. Create multiple new Memorystore for Redis instances with Basic Tier (4 GB capacity). Modify the Dataflow pipeline and new clients to use all instances.
B. Create a new Memorystore for Redis instance with Standard Tier. Set capacity to 4 GB and read replicas to No read replicas (high availability only). Delete the old instance.
C. Create a new Memorystore for Memcached instance. Set a minimum of three nodes, and memory per node to 4 GB. Modify the Dataflow pipeline and all clients to use the Memcached instance. Delete the old instance.
D. Create a new Memorystore for Redis instance with Standard Tier. Set capacity to 5 GB and create multiple read replicas. Delete the old instance.
Suggested answer: D

Explanation:

The Basic Tier of Memorystore for Redis provides a standalone Redis instance that is not replicated and does not support read replicas. This means that it cannot scale horizontally to handle more read requests, and it does not provide high availability or automatic failover. If the number of read-only clients increases significantly, the Basic Tier instance may not be able to handle the demand and may impact the read and write access availability. Therefore, option A is not a good solution, as it would require creating multiple Basic Tier instances and modifying the Dataflow pipeline and the clients to distribute the load among them. This would increase the complexity and the management overhead of the solution.

The Standard Tier of Memorystore for Redis provides a highly available Redis instance that supports replication and read replicas. Replication ensures that the data is backed up in another zone and can fail over automatically in case of a primary node failure. Read replicas allow scaling the read throughput by adding up to five replicas to an instance and using them for read-only queries. The Standard Tier also supports in-transit encryption and maintenance windows. Therefore, option D is the best solution, as it would create a new Standard Tier instance with a higher capacity (5 GB) and multiple read replicas to handle the increased demand. The old instance can be deleted after migrating the data to the new instance.

Option B is not a good solution, as it would create a new Standard Tier instance with the same capacity (4 GB) and no read replicas. This would not improve the read throughput or the availability of the solution. Option C is not a good solution, as it would create a new Memorystore for Memcached instance, which is a different service that uses a different protocol and data model than Redis. This would require changing the code of the Dataflow pipeline and the clients to use the Memcached protocol and data structures, which would take more time and effort than migrating to a new Redis instance. Reference:

Redis tier capabilities | Memorystore for Redis | Google Cloud

Pricing | Memorystore for Redis | Google Cloud

What is Memorystore? | Google Cloud Blog

Working with GCP Memorystore - Simple Talk - Redgate Software
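
As a small hedged sketch (the endpoint IP addresses are placeholders for the values shown on the instance's details page), with a Standard Tier instance that has read replicas, the Dataflow pipeline keeps writing to the primary endpoint while the many read-only clients connect to the instance's read endpoint, which spreads reads across the replicas:

    # Sketch with the redis-py client; the endpoint IPs are placeholders.
    import redis

    PRIMARY_ENDPOINT = "10.0.0.3"  # placeholder: primary (read/write) endpoint
    READ_ENDPOINT = "10.0.0.4"     # placeholder: read endpoint backed by the replicas

    # Writer (for example, the Dataflow pipeline) uses the primary endpoint.
    writer = redis.Redis(host=PRIMARY_ENDPOINT, port=6379)
    writer.set("area-1:demand", 42)

    # The hundreds of read-only clients use the read endpoint; reads served by
    # replicas may lag the primary slightly.
    reader = redis.Redis(host=READ_ENDPOINT, port=6379)
    print(reader.get("area-1:demand"))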

You are troubleshooting your Dataflow pipeline that processes data from Cloud Storage to BigQuery. You have discovered that the Dataflow worker nodes cannot communicate with one another. Your networking team relies on Google Cloud network tags to define firewall rules. You need to identify the issue while following Google-recommended networking security practices. What should you do?

A. Determine whether your Dataflow pipeline has a custom network tag set.
B. Determine whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 for the Dataflow network tag.
C. Determine whether your Dataflow pipeline is deployed with the external IP address option enabled.
D. Determine whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 on the subnet used by Dataflow workers.
Suggested answer: B

Explanation:

Dataflow worker nodes need to communicate with each other and with the Dataflow service on TCP ports 12345 and 12346. These ports are used for data shuffling and streaming engine communication. By default, Dataflow assigns a network tag called dataflow to the worker nodes, and creates a firewall rule that allows traffic on these ports for the dataflow network tag. However, if you use a custom network tag for your Dataflow pipeline, you need to create a firewall rule that allows traffic on these ports for your custom network tag. Otherwise, the worker nodes will not be able to communicate with each other and the Dataflow service, and the pipeline will fail.

Therefore, the best way to identify the issue is to determine whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 for the Dataflow network tag. If there is no such firewall rule, or if the firewall rule does not match the network tag used by your Dataflow pipeline, you need to create or update the firewall rule accordingly.

Option A is not a good solution, as determining whether your Dataflow pipeline has a custom network tag set does not tell you whether there is a firewall rule that allows traffic on the required ports for that network tag. You need to check the firewall rule as well.

Option C is not a good solution, as determining whether your Dataflow pipeline is deployed with the external IP address option enabled does not tell you whether there is a firewall rule that allows traffic on the required ports for the Dataflow network tag. The external IP address option determines whether the worker nodes can access resources on the public internet, but it does not affect the internal communication between the worker nodes and the Dataflow service.

Option D is not a good solution, as determining whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 on the subnet used by Dataflow workers does not tell you whether the firewall rule applies to the Dataflow network tag. The firewall rule should be based on the network tag, not the subnet, as the network tag is more specific and secure. Reference:

Dataflow network tags | Cloud Dataflow | Google Cloud

Dataflow firewall rules | Cloud Dataflow | Google Cloud

Dataflow network configuration | Cloud Dataflow | Google Cloud

Dataflow Streaming Engine | Cloud Dataflow | Google Cloud
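
For illustration only, a hedged sketch with the google-cloud-compute client (the project, network, and rule names are placeholders, and in practice the networking team would own this rule): it allows TCP 12345-12346 between VMs that carry the dataflow network tag:

    # Sketch only: project, network, and rule names are placeholders.
    from google.cloud import compute_v1

    firewall = compute_v1.Firewall()
    firewall.name = "allow-dataflow-shuffle"                         # placeholder rule name
    firewall.network = "projects/my-project/global/networks/my-vpc"  # placeholder network
    firewall.direction = "INGRESS"
    firewall.source_tags = ["dataflow"]   # traffic from tagged Dataflow workers...
    firewall.target_tags = ["dataflow"]   # ...to tagged Dataflow workers
    allowed = compute_v1.Allowed()
    allowed.I_p_protocol = "tcp"          # field name as generated by the client library
    allowed.ports = ["12345-12346"]
    firewall.allowed = [allowed]

    client = compute_v1.FirewallsClient()
    operation = client.insert(project="my-project", firewall_resource=firewall)
    operation.result()                    # wait for the insert operation to complete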

You are configuring networking for a Dataflow job. The data pipeline uses custom container images with the libraries that are required for the transformation logic preinstalled. The data pipeline reads the data from Cloud Storage and writes the data to BigQuery. You need to ensure cost-effective and secure communication between the pipeline and Google APIs and services. What should you do?

A. Leave external IP addresses assigned to worker VMs while enforcing firewall rules.
B. Disable external IP addresses and establish a Private Service Connect endpoint IP address.
C. Disable external IP addresses from worker VMs and enable Private Google Access.
D. Enable Cloud NAT to provide outbound internet connectivity while enforcing firewall rules.
Suggested answer: C

Explanation:

Private Google Access allows VMs without external IP addresses to communicate with Google APIs and services over internal routes. This reduces the cost and increases the security of the data pipeline. Custom container images can be stored in Container Registry, which supports Private Google Access. Dataflow supports Private Google Access for both batch and streaming jobs. Reference:

Private Google Access overview

Using Private Google Access and Cloud NAT

Using custom containers with Dataflow
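
A hedged sketch of the relevant pipeline options (the project, bucket, subnetwork, and container image names are placeholders): with --no_use_public_ips the workers receive only internal IPs, so the subnetwork they use must have Private Google Access enabled for them to reach Cloud Storage and BigQuery:

    # Sketch only: the identifiers below are placeholders.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=us-central1",
        "--temp_location=gs://my-bucket/temp",
        "--no_use_public_ips",     # workers get internal IP addresses only
        "--subnetwork=regions/us-central1/subnetworks/private-subnet",  # subnet with Private Google Access
        "--sdk_container_image=us-central1-docker.pkg.dev/my-project/repo/beam-custom:latest",
    ])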

You work for a large ecommerce company. You store your customers' order data in Bigtable. You have a garbage collection policy set to delete the data after 30 days, and the number of versions is set to 1. When the data analysts run a query to report total customer spending, the analysts sometimes see customer data that is older than 30 days. You need to ensure that the analysts do not see customer data older than 30 days while minimizing cost and overhead. What should you do?

A. Set the expiring values of the column families to 30 days and set the number of versions to 2.
B. Use a timestamp range filter in the query to fetch the customer's data for a specific range.
C. Set the expiring values of the column families to 29 days and keep the number of versions to 1.
D. Schedule a job daily to scan the data in the table and delete data older than 30 days.
Suggested answer: B

Explanation:

By using a timestamp range filter in the query, you can ensure that the analysts only see the customer data that is within the desired time range, regardless of the garbage collection policy [1]. This option is the most cost-effective and simple way to avoid fetching data that is marked for deletion by garbage collection, as it does not require changing the existing policy or creating additional jobs. You can use the Bigtable client libraries or the cbt CLI to apply a timestamp range filter to your read requests [2].

Option A is not effective, as it increases the number of versions to 2, which may cause more data to be retained and increase the storage costs. Option C is not reliable, as it reduces the expiring values to 29 days, which may not match the actual data arrival and usage patterns. Option D is not efficient, as it requires scheduling a job daily to scan and delete the data, which may incur additional overhead and complexity. Moreover, none of these options guarantee that the data older than 30 days will be immediately deleted, as garbage collection is an asynchronous process that can take up to a week to remove the data [3]. Reference:

1: Filters | Cloud Bigtable Documentation | Google Cloud

2: Read data | Cloud Bigtable Documentation | Google Cloud

3: Garbage collection overview | Cloud Bigtable Documentation | Google Cloud
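
A hedged sketch with the google-cloud-bigtable client (the project, instance, and table IDs are placeholders) that restricts a read to cells written within the last 30 days:

    # Sketch only: project, instance, and table IDs are placeholders.
    import datetime

    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters

    client = bigtable.Client(project="my-project")
    table = client.instance("orders-instance").table("customer-orders")

    cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=30)
    last_30_days = row_filters.TimestampRangeFilter(
        row_filters.TimestampRange(start=cutoff)  # only cells written at or after the cutoff
    )

    for row in table.read_rows(filter_=last_30_days):
        for family, columns in row.cells.items():
            for column, cells in columns.items():
                for cell in cells:
                    print(row.row_key, family, column, cell.value, cell.timestamp)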
