Google Professional Data Engineer Practice Test - Questions Answers, Page 37

You have several different unstructured data sources, within your on-premises data center as well as in the cloud. The data is in various formats, such as Apache Parquet and CSV. You want to centralize this data in Cloud Storage. You need to set up an object sink for your data that allows you to use your own encryption keys. You want to use a GUI-based solution. What should you do?

A. Use Cloud Data Fusion to move files into Cloud Storage.
B. Use Storage Transfer Service to move files into Cloud Storage.
C. Use Dataflow to move files into Cloud Storage.
D. Use BigQuery Data Transfer Service to move files into BigQuery.
Suggested answer: A

Explanation:

To centralize unstructured data from various sources into Cloud Storage using a GUI-based solution while allowing the use of your own encryption keys, Cloud Data Fusion is the most suitable option. Here's why:

Cloud Data Fusion:

Cloud Data Fusion is a fully managed, cloud-native data integration service that helps in building and managing ETL pipelines with a visual interface.

It supports a wide range of data sources and formats, including Apache Parquet and CSV, and provides a user-friendly GUI for pipeline creation and management.

Custom Encryption Keys:

Cloud Data Fusion allows the use of customer-managed encryption keys (CMEK) for data encryption, ensuring that your data is securely stored according to your encryption policies.
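As a complement to the GUI flow, the default customer-managed key can also be set on the destination Cloud Storage bucket itself. The following is a minimal sketch; the bucket name and Cloud KMS key path are placeholders:

gsutil kms encryption \
  -k projects/[PROJECT]/locations/[LOCATION]/keyRings/[KEY_RING]/cryptoKeys/[KEY] \
  gs://[BUCKET_NAME]

Objects that the Data Fusion pipelines write into this bucket are then encrypted with your key by default.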

Centralizing Data:

Cloud Data Fusion simplifies the process of moving data from on-premises and cloud sources into Cloud Storage, providing a centralized repository for your unstructured data.

Steps to Implement:

Set Up Cloud Data Fusion:

Deploy a Cloud Data Fusion instance and configure it to connect to your various data sources.

Create ETL Pipelines:

Use the GUI to create data pipelines that extract data from your sources and load it into Cloud Storage. Configure the pipelines to use your custom encryption keys.

Run and Monitor Pipelines:

Execute the pipelines and monitor their performance and data movement through the Cloud Data Fusion dashboard.

Cloud Data Fusion Documentation

Using Customer-Managed Encryption Keys (CMEK)

You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could be stored anywhere in the United States. You need to have a recovery process in place in case of a catastrophic single region failure. You need an approach with a maximum of 15 minutes of data loss (RPO=15 mins). You want to ensure that there is minimal latency when reading the data. What should you do?

A. 1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions. 2. Enable turbo replication. 3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the us-south1 region. 4. In case of a regional failure, redeploy your Dataproc cluster to the us-south1 region and continue reading from the same bucket.
B. 1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions. 2. Enable turbo replication. 3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the same region. 4. In case of a regional failure, redeploy the Dataproc clusters to the us-south1 region and read from the same bucket.
C. 1. Create a Cloud Storage bucket in the US multi-region. 2. Run the Dataproc cluster in a zone in the us-central1 region, reading data from the US multi-region bucket. 3. In case of a regional failure, redeploy the Dataproc cluster to the us-central2 region and continue reading from the same bucket.
D. 1. Create two regional Cloud Storage buckets, one in the us-central1 region and one in the us-south1 region. 2. Have the upstream process write data to the us-central1 bucket. Use the Storage Transfer Service to copy data hourly from the us-central1 bucket to the us-south1 bucket. 3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in that region. 4. In case of regional failure, redeploy your Dataproc clusters to the us-south1 region and read from the bucket in that region instead.
Suggested answer: B

Explanation:

To ensure data recovery with minimal data loss and low latency in case of a single region failure, the best approach is to use a dual-region bucket with turbo replication. Here's why option B is the best choice:

Dual-Region Bucket:

A dual-region bucket provides geo-redundancy by replicating data across two regions, ensuring high availability and resilience against regional failures.

The chosen regions (us-central1 and us-south1) provide geographic diversity within the United States.

Turbo Replication:

Turbo replication ensures that data is replicated between the two regions within 15 minutes, meeting the Recovery Point Objective (RPO) of 15 minutes.

This minimizes data loss in case of a regional failure.

Running Dataproc Cluster:

Running the Dataproc cluster in the same region as the primary data storage (us-central1) ensures minimal latency for normal operations.

In case of a regional failure, redeploying the Dataproc cluster to the secondary region (us-south1) ensures continuity with minimal data loss.

Steps to Implement:

Create a Dual-Region Bucket:

Set up a dual-region bucket in the Google Cloud Console, selecting us-central1 and us-south1 regions.

Enable turbo replication to ensure rapid data replication between the regions.
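If you prefer the command line over the console for this step, a dual-region bucket with turbo replication can be created in one command. This is a sketch assuming the gcloud storage surface; the bucket name is a placeholder:

gcloud storage buckets create gs://[BUCKET_NAME] \
  --location=US \
  --placement=us-central1,us-south1 \
  --rpo=ASYNC_TURBO

The --rpo=ASYNC_TURBO setting is what enables turbo replication and provides the 15-minute replication target between the two configured regions.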

Deploy Dataproc Cluster:

Deploy the Dataproc cluster in the us-central1 region to read data from the bucket located in the same region for optimal performance.

Set Up Failover Plan:

Plan for redeployment of the Dataproc cluster to the us-south1 region in case of a failure in the us-central1 region.

Ensure that the failover process is well-documented and tested to minimize downtime and data loss.

Google Cloud Storage Dual-Region

Turbo Replication in Google Cloud Storage

Dataproc Documentation

Different teams in your organization store customer and performance data in BigQuery. Each team needs to keep full control of their collected data, be able to query data within their projects, and be able to exchange their data with other teams. You need to implement an organization-wide solution, while minimizing operational tasks and costs. What should you do?

A. Create a BigQuery scheduled query to replicate all customer data into team projects.
B. Enable each team to create materialized views of the data they need to access in their projects.
C. Ask each team to publish their data in Analytics Hub. Direct the other teams to subscribe to them.
D. Ask each team to create authorized views of their data. Grant the bigquery.jobUser role to each team.
Suggested answer: C

Explanation:

To enable different teams to manage their own data while allowing data exchange across the organization, using Analytics Hub is the best approach. Here's why option C is the best choice:

Analytics Hub:

Analytics Hub allows teams to publish their data as data exchanges, making it easy for other teams to discover and subscribe to the data they need.

This approach maintains each team's control over their data while facilitating easy and secure data sharing across the organization.

Data Publishing and Subscribing:

Teams can publish datasets they control, allowing them to manage access and updates independently.

Other teams can subscribe to these published datasets, ensuring they have access to the latest data without duplicating efforts.

Minimized Operational Tasks and Costs:

This method reduces the need for complex replication or data synchronization processes, minimizing operational overhead.

By centralizing data sharing through Analytics Hub, it also reduces storage costs associated with duplicating large datasets.

Steps to Implement:

Set Up Analytics Hub:

Enable Analytics Hub in your Google Cloud project.

Provide training to teams on how to publish and subscribe to data exchanges.

Publish Data:

Each team publishes their datasets in Analytics Hub, configuring access controls and metadata as needed.

Subscribe to Data:

Teams that need access to data from other teams can subscribe to the relevant data exchanges, ensuring they always have up-to-date data.

Analytics Hub Documentation

Publishing Data in Analytics Hub

Subscribing to Data in Analytics Hub

You are deploying a batch pipeline in Dataflow. This pipeline reads data from Cloud Storage, transforms the data, and then writes the data into BigQuery. The security team has enabled an organizational constraint in Google Cloud, requiring all Compute Engine instances to use only internal IP addresses and no external IP addresses. What should you do?

A. Ensure that the firewall rules allow access to Cloud Storage and BigQuery. Use Dataflow with only internal IPs.
B. Ensure that your workers have network tags to access Cloud Storage and BigQuery. Use Dataflow with only internal IP addresses.
C. Create a VPC Service Controls perimeter that contains the VPC network and add Dataflow, Cloud Storage, and BigQuery as allowed services in the perimeter. Use Dataflow with only internal IP addresses.
D. Ensure that Private Google Access is enabled in the subnetwork. Use Dataflow with only internal IP addresses.
Suggested answer: D

Explanation:

To deploy a batch pipeline in Dataflow that adheres to the organizational constraint of using only internal IP addresses, ensuring Private Google Access is the most effective solution. Here's why option D is the best choice:

Private Google Access:

Private Google Access allows resources in a VPC network that do not have external IP addresses to access Google APIs and services through internal IP addresses.

This ensures compliance with the organizational constraint of using only internal IPs while allowing Dataflow to access Cloud Storage and BigQuery.

Dataflow with Internal IPs:

Dataflow can be configured to use only internal IP addresses for its worker nodes, ensuring that no external IP addresses are assigned.

This configuration ensures secure and compliant communication between Dataflow, Cloud Storage, and BigQuery.

Firewall and Network Configuration:

Enabling Private Google Access requires ensuring the correct firewall rules and network configurations to allow internal traffic to Google Cloud services.

Steps to Implement:

Enable Private Google Access:

Enable Private Google Access on the subnetwork used by the Dataflow pipeline:

gcloud compute networks subnets update [SUBNET_NAME] \
  --region [REGION] \
  --enable-private-ip-google-access

Configure Dataflow:

Configure the Dataflow job to use only internal IP addresses. The example below launches a Dataflow template; placeholders are in brackets:

gcloud dataflow jobs run [JOB_NAME] \
  --gcs-location [TEMPLATE_PATH] \
  --region [REGION] \
  --network [VPC_NETWORK] \
  --subnetwork [SUBNETWORK] \
  --disable-public-ips

For pipelines launched directly from the Java or Python SDK, the equivalent pipeline option is usePublicIps=false (Java) or --no_use_public_ips (Python).

Verify Access:

Ensure that firewall rules allow the necessary traffic from the Dataflow workers to Cloud Storage and BigQuery using internal IPs.
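As one concrete check, Dataflow workers that run with internal IPs only still need a firewall rule allowing traffic between the workers themselves (used for shuffle). A sketch, where the rule name and network are placeholders and the dataflow network tag is the tag applied to Dataflow worker VMs:

gcloud compute firewall-rules create [RULE_NAME] \
  --network=[VPC_NETWORK] \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:12345-12346 \
  --source-tags=dataflow \
  --target-tags=dataflow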

Private Google Access Documentation

Configuring Dataflow to Use Internal IPs

VPC Firewall Rules

You currently use a SQL-based tool to visualize your data stored in BigQuery. The data visualizations require the use of outer joins and analytic functions. Visualizations must be based on data that is no less than 4 hours old. Business users are complaining that the visualizations are too slow to generate. You want to improve the performance of the visualization queries while minimizing the maintenance overhead of the data preparation pipeline. What should you do?

A. Create materialized views with the allow_non_incremental_definition option set to true for the visualization queries. Specify the max_staleness parameter to 4 hours and the enable_refresh parameter to true. Reference the materialized views in the data visualization tool.
B. Create views for the visualization queries. Reference the views in the data visualization tool.
C. Create materialized views for the visualization queries. Use the incremental updates capability of BigQuery materialized views to handle changed data automatically. Reference the materialized views in the data visualization tool.
D. Create a Cloud Function instance to export the visualization query results as Parquet files to a Cloud Storage bucket. Use Cloud Scheduler to trigger the Cloud Function every 4 hours. Reference the Parquet files in the data visualization tool.
Suggested answer: C

Explanation:

To improve the performance of visualization queries while minimizing maintenance overhead, using materialized views is the most effective solution. Here's why option C is the best choice:

Materialized Views:

Materialized views store the results of a query physically, allowing for faster access compared to regular views which execute the query each time it is accessed.

They can be automatically refreshed to reflect changes in the underlying data.

Incremental Updates:

The incremental updates capability of BigQuery materialized views ensures that only the changed data is processed during refresh operations, significantly improving performance and reducing computation costs.

This feature helps maintain up-to-date data in the materialized view with minimal processing time, which is crucial for data that needs to be no less than 4 hours old.

Performance and Maintenance:

By using materialized views, you can pre-compute and store the results of complex queries involving outer joins and analytic functions, resulting in faster query performance for data visualizations.

This approach also reduces the maintenance overhead, as BigQuery handles the incremental updates and refreshes automatically.

Steps to Implement:

Create Materialized Views:

Define materialized views for the visualization queries with the necessary configurations:

CREATE MATERIALIZED VIEW project.dataset.view_name AS
SELECT ...
FROM ...
WHERE ...

Enable Incremental Updates:

Ensure that the materialized views are set up to handle incremental updates automatically.

Update the data visualization tool to reference the materialized views instead of running the original queries directly.

BigQuery Materialized Views

Optimizing Query Performance

Your company's customer_order table in BigQuery stores the order history for 10 million customers, with a table size of 10 PB. You need to create a dashboard for the support team to view the order history. The dashboard has two filters, countryname and username. Both are string data types in the BigQuery table. When a filter is applied, the dashboard fetches the order history from the table and displays the query results. However, the dashboard is slow to show the results when applying the filters to the following query:

How should you redesign the BigQuery table to support faster access?

A. Cluster the table by country field, and partition by username field.
B. Partition the table by country and username fields.
C. Cluster the table by country and username fields.
D. Partition the table by _PARTITIONTIME.
Suggested answer: C

Explanation:

To improve the performance of querying a large BigQuery table with filters on countryname and username, clustering the table by these fields is the most effective approach. Here's why option C is the best choice:

Clustering in BigQuery:

Clustering organizes data based on the values in specified columns. This can significantly improve query performance by reducing the amount of data scanned during query execution.

Clustering by countryname and username means that data is physically sorted and stored together based on these fields, allowing BigQuery to quickly locate and read only the relevant data for queries using these filters.

Filter Efficiency:

With the table clustered by countryname and username, queries that filter on these columns can benefit from efficient data retrieval, reducing the amount of data processed and speeding up query execution.

This directly addresses the performance issue of the dashboard queries that apply filters on these fields.

Steps to Implement:

Redesign the Table:

Create a new table with clustering on countryname and username:

CREATE TABLE project.dataset.new_table
CLUSTER BY countryname, username AS
SELECT * FROM project.dataset.customer_order;

Migrate Data:

Transfer the existing data from the original table to the new clustered table.

Update Queries:

Modify the dashboard queries to reference the new clustered table.

BigQuery Clustering Documentation

Optimizing Query Performance

You need to connect multiple applications with dynamic public IP addresses to a Cloud SQL instance. You configured users with strong passwords and enforced the SSL connection to your Cloud SQL instance. You want to use Cloud SQL public IP and ensure that you have secured connections. What should you do?

A. Add all application networks to Authorized Network and regularly update them.
B. Add CIDR 0.0.0.0/0 network to Authorized Network. Use Identity and Access Management (IAM) to add users.
C. Leave the Authorized Network empty. Use Cloud SQL Auth proxy on all applications.
D. Add CIDR 0.0.0.0/0 network to Authorized Network. Use Cloud SQL Auth proxy on all applications.
Suggested answer: C

Explanation:

To securely connect multiple applications with dynamic public IP addresses to a Cloud SQL instance using public IP, the Cloud SQL Auth proxy is the best solution. This proxy provides secure, authorized connections to Cloud SQL instances without the need to configure authorized networks or deal with IP whitelisting complexities.

Cloud SQL Auth Proxy:

The Cloud SQL Auth proxy provides secure, encrypted connections to Cloud SQL.

It uses IAM permissions and SSL to authenticate and encrypt the connection, ensuring data security in transit.

By using the proxy, you avoid the need to constantly update authorized networks as the proxy handles dynamic IP addresses seamlessly.

Authorized Network Configuration:

Leaving the authorized network empty means no IP addresses are explicitly whitelisted, relying solely on the Auth proxy for secure connections.

This approach simplifies network management and enhances security by not exposing the Cloud SQL instance to public IP ranges.

Dynamic IP Handling:

Applications with dynamic IP addresses can securely connect through the proxy without the need to modify authorized networks.

The proxy authenticates connections using IAM, making it ideal for environments where application IPs change frequently.
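To make this concrete, here is a minimal sketch of starting the Cloud SQL Auth proxy (v2) next to an application; the instance connection name and port are placeholders, and the environment must provide IAM credentials (for example, Application Default Credentials):

./cloud-sql-proxy --port 3306 [PROJECT_ID]:[REGION]:[INSTANCE_NAME]

The application then connects to 127.0.0.1:3306 as if the database were local, while the proxy handles IAM authorization and encryption of the connection to the instance.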

Google Data Engineer Reference:

Using Cloud SQL Auth Proxy

Cloud SQL Security Overview

Setting up the Cloud SQL Auth Proxy

By using the Cloud SQL Auth proxy, you ensure secure, authorized connections for applications with dynamic public IPs without the need for complex network configurations.

You are creating the CI/CD cycle for the code of the directed acyclic graphs (DAGs) running in Cloud Composer. Your team has two Cloud Composer instances: one instance for development and another instance for production. Your team is using a Git repository to maintain and develop the code of the DAGs. You want to deploy the DAGs automatically to Cloud Composer when a certain tag is pushed to the Git repository. What should you do?

A. 1. Use Cloud Build to build a container and the Kubernetes Pod Operator to deploy the code of the DAG to the Google Kubernetes Engine (GKE) cluster of the development instance for testing. 2. If the tests pass, copy the code to the Cloud Storage bucket of the production instance.
B. 1. Use Cloud Build to copy the code of the DAG to the Cloud Storage bucket of the development instance for DAG testing. 2. If the tests pass, use Cloud Build to build a container with the code of the DAG and the KubernetesPodOperator to deploy the container to the Google Kubernetes Engine (GKE) cluster of the production instance.
C. 1. Use Cloud Build to build a container with the code of the DAG and the KubernetesPodOperator to deploy the code to the Google Kubernetes Engine (GKE) cluster of the development instance for testing. 2. If the tests pass, use the KubernetesPodOperator to deploy the container to the GKE cluster of the production instance.
D. 1. Use Cloud Build to copy the code of the DAG to the Cloud Storage bucket of the development instance for DAG testing. 2. If the tests pass, use Cloud Build to copy the code to the bucket of the production instance.
Suggested answer: C

You have two projects where you run BigQuery jobs:

* One project runs production jobs that have strict completion time SLAs. These are high priority jobs that must have the required compute resources available when needed. These jobs generally never go below a 300 slot utilization, but occasionally spike up an additional 500 slots.

* The other project is for users to run ad-hoc analytical queries. This project generally never uses more than 200 slots at a time. You want these ad-hoc queries to be billed based on how much data users scan rather than by slot capacity.

You need to ensure that both projects have the appropriate compute resources available. What should you do?

A. Create a single Enterprise Edition reservation for both projects. Set a baseline of 300 slots. Enable autoscaling up to 700 slots.
B. Create two reservations, one for each of the projects. For the SLA project, use an Enterprise Edition reservation with a baseline of 300 slots and enable autoscaling up to 500 slots. For the ad-hoc project, configure on-demand billing.
C. Create two Enterprise Edition reservations, one for each of the projects. For the SLA project, set a baseline of 300 slots and enable autoscaling up to 500 slots. For the ad-hoc project, set a reservation baseline of 0 slots and set the ignore_idle_slots flag to False.
D. Create two Enterprise Edition reservations, one for each of the projects. For the SLA project, set a baseline of 800 slots. For the ad-hoc project, enable autoscaling up to 200 slots.
Suggested answer: B

Explanation:

To ensure that both production jobs with strict SLAs and ad-hoc queries have appropriate compute resources available while adhering to cost efficiency, setting up separate reservations and billing models for each project is the best approach. Here's why option B is the best choice:

Separate Reservations for SLA and Ad-hoc Projects:

Creating two separate reservations allows for dedicated resource management tailored to the needs of each project.

The production project requires guaranteed slots with the ability to scale up as needed, while the ad-hoc project benefits from on-demand billing based on data scanned.

Enterprise Edition Reservation for SLA Project:

Setting a baseline of 300 slots ensures that the SLA project has the minimum required resources.

Enabling autoscaling up to 500 additional slots allows the project to handle occasional spikes in workload without compromising on SLAs.

On-Demand Billing for Ad-hoc Project:

Using on-demand billing for the ad-hoc project ensures cost efficiency, as users are billed based on the amount of data scanned rather than reserved slot capacity.

This model suits the less predictable and often lower-utilization nature of ad-hoc queries.

Steps to Implement:

Set Up Enterprise Edition Reservation for SLA Project:

Create a reservation with a baseline of 300 slots.

Enable autoscaling to allow up to an additional 500 slots as needed.
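As a sketch of that setup with the bq CLI (the administration project, location, and reservation names are placeholders, and exact flag names can vary by CLI version):

bq mk --project_id=[ADMIN_PROJECT] --location=US \
  --reservation --edition=ENTERPRISE \
  --slots=300 --autoscale_max_slots=500 \
  sla_reservation

bq mk --project_id=[ADMIN_PROJECT] --location=US \
  --reservation_assignment \
  --reservation_id=[ADMIN_PROJECT]:US.sla_reservation \
  --assignee_type=PROJECT --assignee_id=[SLA_PROJECT] \
  --job_type=QUERY

The ad-hoc project is simply left without an assignment, so its queries fall back to on-demand, per-data-scanned billing.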

Configure On-Demand Billing for Ad-hoc Project:

Ensure that the ad-hoc project is set up to use on-demand billing, which charges based on data scanned by the queries.

Monitor and Adjust:

Continuously monitor the usage and performance of both projects to ensure that the configurations meet the needs and make adjustments as necessary.

BigQuery Slot Reservations

BigQuery On-Demand Pricing

You are administering a BigQuery on-demand environment. Your business intelligence tool is submitting hundreds of queries each day that aggregate a large (50 TB) sales history fact table at the day and month levels. These queries have a slow response time and are exceeding cost expectations. You need to decrease response time, lower query costs, and minimize maintenance. What should you do?

A. Build materialized views on top of the sales table to aggregate data at the day and month level.
B. Build authorized views on top of the sales table to aggregate data at the day and month level.
C. Enable BI Engine and add your sales table as a preferred table.
D. Create a scheduled query to build sales day and sales month aggregate tables on an hourly basis.
Suggested answer: A

Explanation:

To improve response times and reduce costs for frequent queries aggregating a large sales history fact table, materialized views are a highly effective solution. Here's why option A is the best choice:

Materialized Views:

Materialized views store the results of a query physically and update them periodically, offering faster query responses for frequently accessed data.

They are designed to improve performance for repetitive and expensive aggregation queries by precomputing the results.

Efficiency and Cost Reduction:

By building materialized views at the day and month level, you significantly reduce the computation required for each query, leading to faster response times and lower query costs.

Materialized views also reduce the need for on-demand query execution, which can be costly when dealing with large datasets.

Minimized Maintenance:

Materialized views in BigQuery are managed automatically, with updates handled by the system, reducing the maintenance burden on your team.

Steps to Implement:

Identify Aggregation Queries:

Analyze the existing queries to identify common aggregation patterns at the day and month levels.

Create Materialized Views:

Create materialized views in BigQuery for the identified aggregation patterns. For example:

CREATE MATERIALIZED VIEW project.dataset.sales_daily_summary AS
SELECT
  DATE(transaction_time) AS day,
  SUM(amount) AS total_sales
FROM
  project.dataset.sales
GROUP BY
  day;

CREATE MATERIALIZED VIEW project.dataset.sales_monthly_summary AS
SELECT
  EXTRACT(YEAR FROM transaction_time) AS year,
  EXTRACT(MONTH FROM transaction_time) AS month,
  SUM(amount) AS total_sales
FROM
  project.dataset.sales
GROUP BY
  year, month;

Query Using Materialized Views:

Update existing queries to use the materialized views instead of directly querying the base table.
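For example, a dashboard query can read the precomputed daily summary directly; this sketch uses the hypothetical view defined above and can be run with the bq CLI:

bq query --use_legacy_sql=false \
'SELECT day, total_sales
 FROM project.dataset.sales_daily_summary
 WHERE day BETWEEN DATE "2024-01-01" AND DATE "2024-01-31"
 ORDER BY day'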

BigQuery Materialized Views

Optimizing Query Performance
