Google Professional Data Engineer Practice Test - Questions Answers, Page 36

You recently deployed several data processing jobs into your Cloud Composer 2 environment. You notice that some tasks are failing in Apache Airflow. On the monitoring dashboard, you see an increase in the total workers' memory usage, and there were worker pod evictions. You need to resolve these errors. What should you do?

Choose 2 answers

A.
Increase the directed acyclic graph (DAG) file parsing interval.
B.
Increase the memory available to the Airflow workers.
C.
Increase the maximum number of workers and reduce worker concurrency.
D.
Increase the memory available to the Airflow triggerer.
E.
Increase the Cloud Composer 2 environment size from medium to large.
Suggested answer: B, C

Explanation:

To resolve issues related to increased memory usage and worker pod evictions in your Cloud Composer 2 environment, the following steps are recommended:

Increase Memory Available to Airflow Workers:

By increasing the memory allocated to Airflow workers, you can handle more memory-intensive tasks, reducing the likelihood of pod evictions due to memory limits.

Increase Maximum Number of Workers and Reduce Worker Concurrency:

Increasing the number of workers allows the workload to be distributed across more pods, preventing any single pod from becoming overwhelmed.

Reducing worker concurrency limits the number of tasks that each worker can handle simultaneously, thereby lowering the memory consumption per worker.

Steps to Implement:

Increase Worker Memory:

Modify the configuration settings in Cloud Composer to allocate more memory to Airflow workers. This can be done through the environment configuration settings.

Adjust Worker and Concurrency Settings:

Increase the maximum number of workers in the Cloud Composer environment settings.

Reduce the concurrency setting for Airflow workers to ensure that each worker handles fewer tasks at a time, thus consuming less memory per worker.

Cloud Composer Worker Configuration

Scaling Airflow Workers
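
As a rough sketch of the steps above, the snippet below uses the Cloud Composer API Python client to raise worker memory and the maximum worker count, and to lower Celery worker concurrency through an Airflow configuration override. The environment name, memory size, worker counts, and concurrency value are placeholder assumptions; verify the exact field mask paths against the current Composer API before relying on this.

from google.cloud.orchestration.airflow import service_v1
from google.protobuf import field_mask_pb2

# Placeholder environment name -- substitute your project, region, and environment.
ENV_NAME = "projects/my-project/locations/us-central1/environments/my-composer-env"

client = service_v1.EnvironmentsClient()

environment = service_v1.Environment(
    name=ENV_NAME,
    config=service_v1.EnvironmentConfig(
        workloads_config=service_v1.WorkloadsConfig(
            worker=service_v1.WorkloadsConfig.WorkerResource(
                memory_gb=8,   # assumed target: more memory per Airflow worker
                max_count=6,   # assumed target: more workers to spread the load
            )
        ),
        software_config=service_v1.SoftwareConfig(
            airflow_config_overrides={
                # Fewer concurrent tasks per worker lowers memory use per pod.
                "celery-worker_concurrency": "6",
            }
        ),
    ),
)

operation = client.update_environment(
    request=service_v1.UpdateEnvironmentRequest(
        name=ENV_NAME,
        environment=environment,
        # Only the listed fields are changed; everything else stays as-is.
        update_mask=field_mask_pb2.FieldMask(
            paths=[
                "config.workloads_config.worker.memory_gb",
                "config.workloads_config.worker.max_count",
                "config.software_config.airflow_config_overrides.celery-worker_concurrency",
            ]
        ),
    )
)
operation.result()  # block until the environment update completes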

You want to encrypt the customer data stored in BigQuery. You need to implement for-user crypto-deletion on data stored in your tables. You want to adopt native features in Google Cloud to avoid custom solutions. What should you do?

A.
Create a customer-managed encryption key (CMEK) in Cloud KMS. Associate the key to the table while creating the table.
B.
Create a customer-managed encryption key (CMEK) in Cloud KMS. Use the key to encrypt data before storing in BigQuery.
C.
Implement Authenticated Encryption with Associated Data (AEAD) BigQuery functions while storing your data in BigQuery.
D.
Encrypt your data during ingestion by using a cryptographic library supported by your ETL pipeline.
Suggested answer: A

Explanation:

To implement for-user crypto-deletion and ensure that customer data stored in BigQuery is encrypted, using native Google Cloud features, the best approach is to use Customer-Managed Encryption Keys (CMEK) with Cloud Key Management Service (KMS). Here's why:

Customer-Managed Encryption Keys (CMEK):

CMEK allows you to manage your own encryption keys using Cloud KMS. These keys provide additional control over data access and encryption management.

Associating a CMEK with a BigQuery table ensures that data is encrypted with a key you manage.

For-User Crypto-Deletion:

For-user crypto-deletion can be achieved by disabling or destroying the CMEK. Once the key is disabled or destroyed, the data encrypted with that key cannot be decrypted, effectively rendering it unreadable.

Native Integration:

Using CMEK with BigQuery is a native feature, avoiding the need for custom encryption solutions. This simplifies the management and implementation of encryption and decryption processes.

Steps to Implement:

Create a CMEK in Cloud KMS:

Set up a new customer-managed encryption key in Cloud KMS.

Associate the CMEK with BigQuery Tables:

When creating a new table in BigQuery, specify the CMEK to be used for encryption.

This can be done through the BigQuery console, CLI, or API.

BigQuery and CMEK

Cloud KMS Documentation

Encrypting Data in BigQuery
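
A minimal sketch of these steps with the BigQuery Python client is shown below. The project, dataset, table, schema, and Cloud KMS key name are placeholders; the BigQuery service account must also be granted the Cloud KMS CryptoKey Encrypter/Decrypter role on the key before table creation succeeds.

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder resource names -- substitute your own project, dataset, and key.
table_id = "my-project.customer_dataset.customers"
kms_key_name = (
    "projects/my-project/locations/us/keyRings/bq-keyring/cryptoKeys/customer-key"
)

schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("email", "STRING"),
]

table = bigquery.Table(table_id, schema=schema)
# Associate the CMEK with the table at creation time.
table.encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name=kms_key_name
)

table = client.create_table(table)
print(f"Created {table.table_id} encrypted with {table.encryption_configuration.kms_key_name}")

Destroying or disabling the key in Cloud KMS then renders the table's data unreadable, which is what provides the crypto-deletion behavior.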

You are designing a data mesh on Google Cloud by using Dataplex to manage data in BigQuery and Cloud Storage. You want to simplify data asset permissions. You are creating a customer virtual lake with two user groups:

* Data engineers, who require full data lake access

* Analytic users, who require access to curated data

You need to assign access rights to these two groups. What should you do?

A.
1. Grant the dataplex.dataOwner role to the data engineer group on the customer data lake. 2. Grant the dataplex.dataReader role to the analytic user group on the customer curated zone.
B.
1. Grant the dataplex.dataReader role to the data engineer group on the customer data lake. 2. Grant the dataplex.dataOwner role to the analytic user group on the customer curated zone.
C.
1. Grant the bigquery.dataOwner role on BigQuery datasets and the storage.objectCreator role on Cloud Storage buckets to data engineers. 2. Grant the bigquery.dataViewer role on BigQuery datasets and the storage.objectViewer role on Cloud Storage buckets to analytic users.
D.
1. Grant the bigquery.dataViewer role on BigQuery datasets and the storage.objectViewer role on Cloud Storage buckets to data engineers. 2. Grant the bigquery.dataOwner role on BigQuery datasets and the storage.objectEditor role on Cloud Storage buckets to analytic users.
Suggested answer: A

Explanation:

When designing a data mesh on Google Cloud using Dataplex to manage data in BigQuery and Cloud Storage, it is essential to simplify data asset permissions while ensuring that each user group has the appropriate access levels. Here's why option A is the best choice:

Data Engineer Group:

Data engineers require full access to the data lake to manage and operate data assets comprehensively. Granting the dataplex.dataOwner role to the data engineer group on the customer data lake ensures they have the necessary permissions to create, modify, and delete data assets within the lake.

Analytic User Group:

Analytic users need access to curated data but do not require full control over all data assets. Granting the dataplex.dataReader role to the analytic user group on the customer curated zone provides read-only access to the curated data, enabling them to analyze the data without the ability to modify or delete it.

Steps to Implement:

Grant Data Engineer Permissions:

Assign the dataplex.dataOwner role to the data engineer group on the customer data lake to ensure full access and management capabilities.

Grant Analytic User Permissions:

Assign the dataplex.dataReader role to the analytic user group on the customer curated zone to provide read-only access to curated data.

Dataplex IAM Roles and Permissions

Managing Access in Dataplex
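
The two grants can also be scripted, for example with the Dataplex Python client, assuming it exposes the standard get_iam_policy/set_iam_policy methods; the lake, zone, and group names below are placeholders, and gcloud or the console achieve the same result.

from google.cloud import dataplex_v1
from google.iam.v1 import iam_policy_pb2, policy_pb2

client = dataplex_v1.DataplexServiceClient()

# Placeholder resource names -- replace with your project, location, lake, and zone.
lake = "projects/my-project/locations/us-central1/lakes/customer-data-lake"
curated_zone = f"{lake}/zones/curated-zone"


def add_binding(resource: str, role: str, member: str) -> None:
    """Read-modify-write the IAM policy on a Dataplex resource."""
    policy = client.get_iam_policy(
        request=iam_policy_pb2.GetIamPolicyRequest(resource=resource)
    )
    policy.bindings.append(policy_pb2.Binding(role=role, members=[member]))
    client.set_iam_policy(
        request=iam_policy_pb2.SetIamPolicyRequest(resource=resource, policy=policy)
    )


# Data engineers: full access on the whole lake.
add_binding(lake, "roles/dataplex.dataOwner", "group:data-engineers@example.com")
# Analytic users: read-only access on the curated zone only.
add_binding(curated_zone, "roles/dataplex.dataReader", "group:analytic-users@example.com")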

Your car factory is pushing machine measurements as messages into a Pub/Sub topic in your Google Cloud project. A Dataflow streaming job that you wrote with the Apache Beam SDK reads these messages, sends an acknowledgment to Pub/Sub, applies some custom business logic in a DoFn instance, and writes the result to BigQuery. You want to ensure that if your business logic fails on a message, the message will be sent to a Pub/Sub topic that you want to monitor for alerting purposes. What should you do?

A.
Use an exception handling block in your Dataflow DoFn code to push the messages that failed to be transformed through a side output and to a new Pub/Sub topic. Use Cloud Monitoring to monitor the topic/num_unacked_messages_by_region metric on this new topic.
B.
Enable retaining of acknowledged messages in your Pub/Sub pull subscription. Use Cloud Monitoring to monitor the subscription/num_retained_acked_messages metric on this subscription.
C.
Enable dead lettering in your Pub/Sub pull subscription, and specify a new Pub/Sub topic as the dead letter topic. Use Cloud Monitoring to monitor the subscription/dead_letter_message_count metric on your pull subscription.
D.
Create a snapshot of your Pub/Sub pull subscription. Use Cloud Monitoring to monitor the snapshot/num_messages metric on this snapshot.
Suggested answer: C

Explanation:

To ensure that messages failing to process in your Dataflow job are sent to a Pub/Sub topic for monitoring and alerting, the best approach is to use Pub/Sub's dead-letter topic feature. Here's why option C is the best choice:

Dead-Letter Topic:

Pub/Sub's dead-letter topic feature allows messages that fail to be processed successfully to be redirected to a specified topic. This ensures that these messages are not lost and can be reviewed for debugging and alerting purposes.

Monitoring and Alerting:

By specifying a new Pub/Sub topic as the dead-letter topic, you can use Cloud Monitoring to track metrics such as subscription/dead_letter_message_count, providing visibility into the number of failed messages.

This allows you to set up alerts based on these metrics to notify the appropriate teams when failures occur.

Steps to Implement:

Enable Dead-Letter Topic:

Configure your Pub/Sub pull subscription to enable dead lettering and specify the new Pub/Sub topic for dead-letter messages.

Set Up Monitoring:

Use Cloud Monitoring to monitor the subscription/dead_letter_message_count metric on your pull subscription.

Configure alerts based on this metric to notify the team of any processing failures.

Pub/Sub Dead Letter Policy

Cloud Monitoring with Pub/Sub
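
A minimal sketch of the dead-letter setup with the Pub/Sub Python client is shown below; the project, subscription, and topic IDs are placeholders. Note that the Pub/Sub service account also needs publisher permission on the dead-letter topic and subscriber permission on the source subscription for forwarding to work.

from google.cloud import pubsub_v1

project_id = "my-project"                          # placeholder
subscription_id = "measurements-sub"               # existing pull subscription (placeholder)
dead_letter_topic_id = "measurements-dead-letter"  # new topic to monitor (placeholder)

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

# Create the topic that will receive failed messages.
dead_letter_topic = publisher.topic_path(project_id, dead_letter_topic_id)
publisher.create_topic(request={"name": dead_letter_topic})

subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Attach a dead-letter policy to the existing pull subscription.
update = pubsub_v1.types.Subscription(
    name=subscription_path,
    dead_letter_policy=pubsub_v1.types.DeadLetterPolicy(
        dead_letter_topic=dead_letter_topic,
        max_delivery_attempts=5,  # forward after 5 failed delivery attempts
    ),
)
subscriber.update_subscription(
    request={"subscription": update, "update_mask": {"paths": ["dead_letter_policy"]}}
)

Once messages start being forwarded, an alerting policy on the subscription/dead_letter_message_count metric covers the monitoring requirement.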

You are a BigQuery admin supporting a team of data consumers who run ad hoc queries and downstream reporting in tools such as Looker. All data and users are combined under a single organizational project. You recently noticed some slowness in query results and want to troubleshoot where the slowdowns are occurring. You think that there might be some job queuing or slot contention occurring as users run jobs, which slows down access to results. You need to investigate the query job information and determine where performance is being affected. What should you do?

A.
Use Cloud Monitoring to view BigQuery metrics and set up alerts that let you know when a certain percentage of slots were used.
B.
Use slot reservations for your project to ensure that you have enough query processing capacity and are able to allocate available slots to the slower queries.
C.
Use Cloud Logging to determine if any users or downstream consumers are changing or deleting access grants on tagged resources.
D.
Use available administrative resource charts to determine how slots are being used and how jobs are performing over time. Run a query on the INFORMATION_SCHEMA to review query performance.
Suggested answer: D

Explanation:

To troubleshoot query performance issues related to job queuing or slot contention in BigQuery, using administrative resource charts along with querying the INFORMATION_SCHEMA is the best approach. Here's why option D is the best choice:

Administrative Resource Charts:

BigQuery provides detailed resource charts that show slot usage and job performance over time. These charts help identify patterns of slot contention and peak usage times.

INFORMATION_SCHEMA Queries:

The INFORMATION_SCHEMA tables in BigQuery provide detailed metadata about query jobs, including execution times, slots consumed, and other performance metrics.

Running queries on INFORMATION_SCHEMA allows you to pinpoint specific jobs causing contention and analyze their performance characteristics.

Comprehensive Analysis:

Combining administrative resource charts with detailed queries on INFORMATION_SCHEMA provides a holistic view of the system's performance.

This approach enables you to identify and address the root causes of performance issues, whether they are due to slot contention, inefficient queries, or other factors.

Steps to Implement:

Access Administrative Resource Charts:

Use the Google Cloud Console to view BigQuery's administrative resource charts. These charts provide insights into slot utilization and job performance metrics over time.

Run INFORMATION_SCHEMA Queries:

Execute queries on BigQuery's INFORMATION_SCHEMA to gather detailed information about job performance. For example:

SELECT
  creation_time,
  job_id,
  user_email,
  query,
  total_slot_ms / 1000 AS slot_seconds,
  total_bytes_processed / (1024 * 1024 * 1024) AS processed_gb,
  total_bytes_billed / (1024 * 1024 * 1024) AS billed_gb
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND state = 'DONE'
ORDER BY
  slot_seconds DESC
LIMIT 100;

Analyze and Optimize:

Use the information gathered to identify bottlenecks, optimize queries, and adjust resource allocations as needed to improve performance.

Monitoring BigQuery Slots

BigQuery INFORMATION_SCHEMA

BigQuery Performance Best Practices


You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in a similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?

A.
Import the ORC files to Bigtable tables for the data scientist team.
B.
Import the ORC files to BigQuery tables for the data scientist team.
C.
Copy the ORC files on Cloud Storage, then deploy a Dataproc cluster for the data scientist team.
D.
Copy the ORC files on Cloud Storage, then create external BigQuery tables for the data scientist team.
Suggested answer: D
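
The suggested answer keeps the ORC files in Cloud Storage and exposes them to BigQuery as external tables, preserving the Hive partition columns. A minimal sketch with the BigQuery Python client follows; the project, dataset, table, and bucket paths are placeholders, and AUTO partition detection assumes the files keep their original key=value directory layout.

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names -- adjust to your project, dataset, and bucket layout.
table_id = "my-project.analytics.orc_measurements"
source_uri_prefix = "gs://my-orc-bucket/measurements/"

external_config = bigquery.ExternalConfig("ORC")
external_config.source_uris = [f"{source_uri_prefix}*"]

# Preserve the existing Hive partition columns from the directory layout.
hive_partitioning = bigquery.HivePartitioningOptions()
hive_partitioning.mode = "AUTO"
hive_partitioning.source_uri_prefix = source_uri_prefix
external_config.hive_partitioning = hive_partitioning

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table)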

You are migrating your on-premises data warehouse to BigQuery. As part of the migration, you want to facilitate cross-team collaboration to get the most value out of the organization's data. You need to design an architecture that would allow teams within the organization to securely publish, discover, and subscribe to read-only data in a self-service manner. You need to minimize costs while also maximizing data freshness What should you do?

A.
Create authorized datasets to publish shared data in the subscribing team's project.
B.
Create a new dataset for sharing in each individual team's project. Grant the subscribing team the bigquery.dataViewer role on the dataset.
C.
Use BigQuery Data Transfer Service to copy datasets to a centralized BigQuery project for sharing.
D.
Use Analytics Hub to facilitate data sharing.
Suggested answer: C

Explanation:

To provide a cost-effective storage and processing solution that allows data scientists to explore data similarly to using the on-premises HDFS cluster with SQL on the Hive query engine, deploying a Dataproc cluster is the best choice. Here's why:

Compatibility with Hive:

Dataproc is a fully managed Apache Spark and Hadoop service that provides native support for Hive, making it easy for data scientists to run SQL queries on the data as they would in an on-premises Hadoop environment.

This ensures that the transition to Google Cloud is smooth, with minimal changes required in the workflow.

Cost-Effective Storage:

Storing the ORC files in Cloud Storage is cost-effective and scalable, providing a reliable and durable storage solution that integrates seamlessly with Dataproc.

Cloud Storage allows you to store large datasets at a lower cost compared to other storage options.

Hive Integration:

Dataproc supports running Hive directly, which is essential for data scientists familiar with SQL on the Hive query engine.

This setup enables the use of existing Hive queries and scripts without significant modifications.

Steps to Implement:

Copy ORC Files to Cloud Storage:

Transfer the ORC files from the on-premises HDFS cluster to Cloud Storage, ensuring they are organized in a similar directory structure.

Deploy Dataproc Cluster:

Set up a Dataproc cluster configured to run Hive. Ensure that the cluster has access to the ORC files stored in Cloud Storage.

Configure Hive:

Configure Hive on Dataproc to read from the ORC files in Cloud Storage. This can be done by setting up external tables in Hive that point to the Cloud Storage location.

Provide Access to Data Scientists:

Grant the data scientist team access to the Dataproc cluster and the necessary permissions to interact with the Hive tables.

Dataproc Documentation

Hive on Dataproc

Google Cloud Storage Documentation
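
As a loose illustration of the Hive-on-Dataproc steps above, the sketch below submits a Hive job to an existing Dataproc cluster that defines an external table over the ORC files copied to Cloud Storage. The project, region, cluster name, bucket path, and table schema are all placeholder assumptions.

from google.cloud import dataproc_v1

project_id = "my-project"       # placeholder
region = "us-central1"          # placeholder
cluster_name = "hive-cluster"   # existing Dataproc cluster (placeholder)

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# External Hive table over the ORC files in Cloud Storage (schema and path are placeholders).
hive_query = """
CREATE EXTERNAL TABLE IF NOT EXISTS measurements (
  sensor_id STRING,
  reading DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION 'gs://my-orc-bucket/measurements/';
MSCK REPAIR TABLE measurements;
"""

job = dataproc_v1.Job(
    placement=dataproc_v1.JobPlacement(cluster_name=cluster_name),
    hive_job=dataproc_v1.HiveJob(
        query_list=dataproc_v1.QueryList(queries=[hive_query])
    ),
)

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
operation.result()  # wait for the Hive job to finish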

You have one BigQuery dataset which includes customers' street addresses. You want to retrieve all occurrences of street addresses from the dataset. What should you do?

A.
Create a deep inspection job on each table in your dataset with Cloud Data Loss Prevention and create an inspection template that includes the STREET_ADDRESS infoType.
B.
Create a de-identification job in Cloud Data Loss Prevention and use the masking transformation.
C.
Write a SQL query in BigQuery by using REGEXP_CONTAINS on all tables in your dataset to find rows where the word 'street' appears.
D.
Create a discovery scan configuration on your organization with Cloud Data Loss Prevention and create an inspection template that includes the STREET_ADDRESS infoType.
Suggested answer: A

Explanation:

To retrieve all occurrences of street addresses from a BigQuery dataset, the most effective and comprehensive method is to use Cloud Data Loss Prevention (DLP). Here's why option A is the best choice:

Cloud Data Loss Prevention (DLP):

Cloud DLP is designed to discover, classify, and protect sensitive information. It includes pre-defined infoTypes for various kinds of sensitive data, including street addresses.

Using Cloud DLP ensures thorough and accurate detection of street addresses based on advanced pattern recognition and contextual analysis.

Deep Inspection Job:

A deep inspection job allows you to scan entire tables for sensitive information.

By creating an inspection template that includes the STREET_ADDRESS infoType, you can ensure that all instances of street addresses are detected across your dataset.

Scalability and Accuracy:

Cloud DLP is scalable and can handle large datasets efficiently.

It provides a high level of accuracy in identifying sensitive data, reducing the risk of missing any occurrences.

Steps to Implement:

Set Up Cloud DLP:

Enable the Cloud DLP API in your Google Cloud project.

Create an Inspection Template:

Create an inspection template in Cloud DLP that includes the STREET_ADDRESS infoType.

Run Deep Inspection Jobs:

Create and run a deep inspection job for each table in your dataset using the inspection template.

Review the inspection job results to retrieve all occurrences of street addresses.

Cloud DLP Documentation

Creating Inspection Jobs
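
A minimal sketch of one such inspection job, using the Cloud DLP Python client against a single BigQuery table, is shown below; the project, dataset, and table IDs are placeholders. In practice you would repeat this per table in the dataset and typically add a SaveFindings action so the matched occurrences land in a BigQuery table for review.

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

# Placeholder identifiers -- replace with your project, dataset, and table.
project_id = "my-project"
parent = f"projects/{project_id}/locations/global"

inspect_job = dlp_v2.InspectJobConfig(
    storage_config=dlp_v2.StorageConfig(
        big_query_options=dlp_v2.BigQueryOptions(
            table_reference=dlp_v2.BigQueryTable(
                project_id=project_id,
                dataset_id="customer_dataset",
                table_id="customers",
            )
        )
    ),
    inspect_config=dlp_v2.InspectConfig(
        info_types=[dlp_v2.InfoType(name="STREET_ADDRESS")],
        include_quote=True,  # return the matched text with each finding
    ),
)

job = dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print(f"Started inspection job: {job.name}")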

You are migrating your on-premises data warehouse to BigQuery. One of the upstream data sources resides on a MySQL database that runs in your on-premises data center with no public IP addresses. You want to ensure that the data ingestion into BigQuery is done securely and does not go through the public internet. What should you do?

A.
Update your existing on-premises ETL tool to write to BigQuery by using the BigQuery Open Database Connectivity (ODBC) driver. Set up the proxy parameter in the simba.googlebigqueryodbc.ini file to point to your data center's NAT gateway.
B.
Use Datastream to replicate data from your on-premises MySQL database to BigQuery. Gather Datastream public IP addresses of the Google Cloud region that will be used to set up the stream. Add those IP addresses to the firewall allowlist of your on-premises data center. Use IP allowlisting as the connectivity method and Server-only as the encryption type when setting up the connection profile in Datastream.
C.
Use Datastream to replicate data from your on-premises MySQL database to BigQuery. Use Forward-SSH tunnel as the connectivity method to establish a secure tunnel between Datastream and your on-premises MySQL database through a tunnel server in your on-premises data center. Use None as the encryption type when setting up the connection profile in Datastream.
D.
Use Datastream to replicate data from your on-premises MySQL database to BigQuery. Set up Cloud Interconnect between your on-premises data center and Google Cloud. Use Private connectivity as the connectivity method and allocate an IP address range within your VPC network to the Datastream connectivity configuration. Use Server-only as the encryption type when setting up the connection profile in Datastream.
Suggested answer: D

Explanation:

To securely ingest data from an on-premises MySQL database into BigQuery without routing through the public internet, using Datastream with Private connectivity over Cloud Interconnect is the best approach. Here's why:

Datastream for Data Replication:

Datastream provides a managed service for data replication from various sources, including on-premises databases, to Google Cloud services like BigQuery.

Cloud Interconnect:

Cloud Interconnect establishes a private connection between your on-premises data center and Google Cloud, ensuring that data transfer occurs over a secure, private network rather than the public internet.

Private Connectivity:

Using Private connectivity with Datastream leverages the established Cloud Interconnect to securely connect your on-premises MySQL database with Google Cloud. This method ensures that the data does not traverse the public internet.

Encryption:

Using Server-only encryption ensures that data is encrypted in transit between Datastream and BigQuery, adding an extra layer of security.

Steps to Implement:

Set Up Cloud Interconnect:

Establish a Cloud Interconnect between your on-premises data center and Google Cloud to create a private connection.

Configure Datastream:

Set up Datastream to use Private connectivity as the connection method and allocate an IP address range within your VPC network.

Use Server-only encryption to ensure secure data transfer.

Create Connection Profile:

Create a connection profile in Datastream to define the connection parameters, including the use of Cloud Interconnect and Private connectivity.

Datastream Documentation

Cloud Interconnect Documentation

Setting Up Private Connectivity in Datastream
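
The private connectivity piece of this setup can be sketched with the Datastream Python client as shown below; the project, region, VPC, and /29 range are placeholders, and the Cloud Interconnect attachment itself is configured separately. The MySQL connection profile would then reference this private connection instead of public IP allowlisting.

from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()

# Placeholder project, region, and network details.
parent = "projects/my-project/locations/us-central1"

private_connection = datastream_v1.PrivateConnection(
    display_name="onprem-mysql-private-conn",
    vpc_peering_config=datastream_v1.VpcPeeringConfig(
        vpc="projects/my-project/global/networks/my-vpc",
        subnet="10.10.0.0/29",  # unused /29 range allocated for Datastream
    ),
)

operation = client.create_private_connection(
    parent=parent,
    private_connection_id="onprem-mysql-private-conn",
    private_connection=private_connection,
)
operation.result()  # wait for the private connection to be established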

The data analyst team at your company uses BigQuery for ad-hoc queries and scheduled SQL pipelines in a Google Cloud project with a slot reservation of 2000 slots. However, with the recent introduction of hundreds of new non time-sensitive SQL pipelines, the team is encountering frequent quota errors. You examine the logs and notice that approximately 1500 queries are being triggered concurrently during peak time. You need to resolve the concurrency issue. What should you do?

A.
Update SQL pipelines and ad-hoc queries to run as interactive query jobs.
B.
Increase the slot capacity of the project with baseline as 0 and maximum reservation size as 3000.
C.
Update SQL pipelines to run as batch queries, and run ad-hoc queries as interactive query jobs.
D.
Increase the slot capacity of the project with baseline as 2000 and maximum reservation size as 3000.
Suggested answer: C

Explanation:

To resolve the concurrency issue in BigQuery caused by the introduction of hundreds of non-time-sensitive SQL pipelines, the best approach is to differentiate the types of queries based on their urgency and resource requirements. Here's why option C is the best choice:

SQL Pipelines as Batch Queries:

Batch queries in BigQuery are designed for non-time-sensitive operations. They run in a lower priority queue and do not consume slots immediately, which helps to reduce the overall slot consumption during peak times.

By converting non-time-sensitive SQL pipelines to batch queries, you can significantly alleviate the pressure on slot reservations.

Ad-Hoc Queries as Interactive Queries:

Interactive queries are prioritized to run immediately and are suitable for ad-hoc analysis where users expect quick results.

Running ad-hoc queries as interactive jobs ensures that analysts can get their results without delay, improving productivity and user satisfaction.

Concurrency Management:

This approach helps balance the workload by leveraging BigQuery's ability to handle different types of queries efficiently, reducing the likelihood of encountering quota errors due to slot exhaustion.

Steps to Implement:

Identify Non-Time-Sensitive Pipelines:

Review and identify SQL pipelines that are not time-critical and can be executed as batch jobs.

Update Pipelines to Batch Queries:

Modify these pipelines to run as batch queries. This can be done by setting the priority of the query job to BATCH.

Ensure Ad-Hoc Queries are Interactive:

Ensure that all ad-hoc queries are submitted as interactive jobs, allowing them to run with higher priority and immediate slot allocation.

BigQuery Batch Queries

BigQuery Slot Allocation and Management
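
A minimal sketch of the split, using the BigQuery Python client, is shown below; the dataset, procedure, and query text are placeholders. The only change needed for the pipelines is submitting their jobs with BATCH priority.

from google.cloud import bigquery

client = bigquery.Client()

# Non time-sensitive pipeline query runs at BATCH priority: it queues until
# idle slots are available instead of competing with interactive jobs.
batch_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)
batch_job = client.query(
    "CALL my_dataset.refresh_daily_aggregates();",  # placeholder pipeline statement
    job_config=batch_config,
)

# Ad-hoc analyst query stays at the default INTERACTIVE priority.
interactive_job = client.query("SELECT COUNT(*) FROM my_dataset.events")

batch_job.result()        # blocks until the batch job eventually runs
print(list(interactive_job.result()))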
