Google Professional Data Engineer Practice Test - Questions Answers, Page 30


You have 100 GB of data stored in a BigQuery table. This data is outdated and will only be accessed one or two times a year for analytics with SQL. For backup purposes, you want to store this data to be immutable for 3 years. You want to minimize storage costs. What should you do?

A.
1. Create a BigQuery table clone. 2. Query the clone when you need to perform analytics.
B.
1. Create a BigQuery table snapshot. 2. Restore the snapshot when you need to perform analytics.
C.
1. Perform a BigQuery export to a Cloud Storage bucket with archive storage class. 2. Enable versioning on the bucket. 3. Create a BigQuery external table on the exported files.
D.
1. Perform a BigQuery export to a Cloud Storage bucket with archive storage class. 2. Set a locked retention policy on the bucket. 3. Create a BigQuery external table on the exported files.
Suggested answer: D

Explanation:

This option stores the data in the lowest-cost storage option, because the archive storage class has the lowest price per GB among the Cloud Storage classes. It also keeps the data immutable for 3 years, because a locked retention policy prevents objects from being deleted or overwritten until the retention period expires. You can still query the data with SQL by creating a BigQuery external table that references the exported files in the Cloud Storage bucket.

Option A is incorrect because a BigQuery table clone keeps the data in BigQuery storage, so it does not reduce storage costs and does not make the data immutable. Option B is incorrect because a BigQuery table snapshot likewise remains in BigQuery storage and does not meet the cost or immutability goals. Option C is incorrect because enabling versioning on the bucket does not make the data immutable, since object versions can still be deleted or overwritten by anyone with the appropriate permissions, and retaining multiple versions increases storage costs because each version is charged separately.

Reference:

Exporting table data | BigQuery | Google Cloud

Storage classes | Cloud Storage | Google Cloud

Retention policies and retention periods | Cloud Storage | Google Cloud

Federated queries | BigQuery | Google Cloud
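
As a rough GoogleSQL sketch of option D (project, dataset, and bucket names are illustrative; the archive storage class and the locked retention policy are configured on the bucket itself, outside of SQL):

SQL
-- Export the outdated table to the archive-class bucket.
EXPORT DATA OPTIONS (
  uri = 'gs://archive-backup-bucket/outdated_table/*.parquet',
  format = 'PARQUET'
) AS
SELECT * FROM `my-project.mydataset.outdated_table`;
-- Query the exported files later through an external table.
CREATE EXTERNAL TABLE `my-project.mydataset.outdated_table_archive`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://archive-backup-bucket/outdated_table/*.parquet']
);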

You have designed an Apache Beam processing pipeline that reads from a Pub/Sub topic with a message retention duration of one day and writes to a Cloud Storage bucket. You need to select a bucket location and processing strategy to prevent data loss in case of a regional outage, with an RPO of 15 minutes. What should you do?

A.
1. Use a regional Cloud Storage bucket. 2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs. 3. Seek the subscription back in time by one day to recover the acknowledged messages. 4. Start the Dataflow job in a secondary region and write to a bucket in the same region.
B.
1. Use a multi-regional Cloud Storage bucket. 2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs. 3. Seek the subscription back in time by 60 minutes to recover the acknowledged messages. 4. Start the Dataflow job in a secondary region.
C.
1. Use a dual-region Cloud Storage bucket. 2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs. 3. Seek the subscription back in time by 15 minutes to recover the acknowledged messages. 4. Start the Dataflow job in a secondary region.
D.
1. Use a dual-region Cloud Storage bucket with turbo replication enabled. 2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs. 3. Seek the subscription back in time by 60 minutes to recover the acknowledged messages. 4. Start the Dataflow job in a secondary region.
Suggested answer: C

Explanation:

A dual-region Cloud Storage bucket is a type of bucket that stores data redundantly across two regions within the same continent. This provides higher availability and durability than a regional bucket, which stores data in a single region. A dual-region bucket also provides lower latency and higher throughput than a multi-regional bucket, which stores data across multiple regions within a continent or across continents. A dual-region bucket with turbo replication enabled is a premium option that offers even faster replication across regions, but it is more expensive and not necessary for this scenario.

By using a dual-region Cloud Storage bucket, you ensure that your data is protected from a regional outage and remains accessible from either region with low latency and high performance. You can monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs, and then seek the subscription back in time by 15 minutes to recover the acknowledged messages. Seeking a subscription lets you replay messages from a Pub/Sub topic that were published within the message retention duration, which is one day in this case. Seeking back by 15 minutes meets the RPO of 15 minutes, which is the maximum amount of data loss acceptable for your business. You can then start the Dataflow job in a secondary region and write to the same dual-region bucket, which resumes processing of the messages and prevents data loss.

Option A is not a good solution, as using a regional Cloud Storage bucket does not provide any redundancy or protection from regional outages. If the region where the bucket is located experiences an outage, you will not be able to access your data or write new data to the bucket. Seeking the subscription back in time by one day is also unnecessary and inefficient, as it will replay all the messages from the past day, even though you only need to recover the messages from the past 15 minutes.

Option B is not a good solution, as using a multi-regional Cloud Storage bucket does not provide the best performance or cost-efficiency for this scenario. A multi-regional bucket stores data across multiple regions within a continent or across continents, which provides higher availability and durability than a dual-region bucket, but also higher latency and lower throughput. A multi-regional bucket is more suitable for serving data to a global audience, not for processing data with Dataflow within a single continent. Seeking the subscription back in time by 60 minutes is also unnecessary and inefficient, as it will replay more messages than needed to meet the RPO of 15 minutes.

Option D is not a good solution, as using a dual-region Cloud Storage bucket with turbo replication enabled does not provide any additional benefit for this scenario, but only increases the cost. Turbo replication is a premium option that offers faster replication across regions, but it is not required to meet the RPO of 15 minutes. Seeking the subscription back in time by 60 minutes is also unnecessary and inefficient, as it will replay more messages than needed to meet the RPO of 15 minutes.

Reference:

Storage locations | Cloud Storage | Google Cloud

Dataflow metrics | Cloud Dataflow | Google Cloud

Seeking a subscription | Cloud Pub/Sub | Google Cloud

Recovery point objective (RPO) | Acronis

You have data located in BigQuery that is used to generate reports for your company. You have noticed that some weekly executive report fields do not conform to company format standards; for example, report errors include different telephone formats and different country code identifiers. This is a frequent issue, so you need to create a recurring job to normalize the data. You want a quick solution that requires no coding. What should you do?

A.
Use Cloud Data Fusion and Wrangler to normalize the data, and set up a recurring job.
B.
Use BigQuery and GoogleSQL to normalize the data, and schedule recurring queries in BigQuery.
C.
Create a Spark job and submit it to Dataproc Serverless.
D.
Use Dataflow SQL to create a job that normalizes the data, and after the first run of the job, schedule the pipeline to execute recurrently.
Suggested answer: A

Explanation:

Cloud Data Fusion is a fully managed, cloud-native data integration service that allows you to build and manage data pipelines with a graphical interface. Wrangler is a feature of Cloud Data Fusion that enables you to interactively explore, clean, and transform data using a spreadsheet-like UI. You can use Wrangler to normalize the data in BigQuery by applying various directives, such as parsing, formatting, replacing, and validating data. You can also preview the results and export the wrangled data to BigQuery or other destinations. You can then set up a recurring job in Cloud Data Fusion to run the Wrangler pipeline on a schedule, such as weekly or daily. This way, you can create a quick and code-free solution to normalize the data for your reports.

Reference:

Cloud Data Fusion overview

Wrangler overview

Wrangle data from BigQuery

Scheduling pipelines

You are migrating a large number of files from a public HTTPS endpoint to Cloud Storage. The files are protected from unauthorized access using signed URLs. You created a TSV file that contains the list of object URLs and started a transfer job by using Storage Transfer Service. You notice that the job ran for a long time and eventually failed. Checking the logs of the transfer job reveals that the job was running fine until one point, and then it failed due to HTTP 403 errors on the remaining files. You verified that there were no changes to the source system. You need to fix the problem to resume the migration process. What should you do?

A.
Set up Cloud Storage FUSE, and mount the Cloud Storage bucket on a Compute Engine instance. Remove the completed files from the TSV file. Use a shell script to iterate through the TSV file and download the remaining URLs to the FUSE mount point.
B.
Update the file checksums in the TSV file from MD5 to SHA256. Remove the completed files from the TSV file and rerun the Storage Transfer Service job.
C.
Renew the TLS certificate of the HTTPS endpoint. Remove the completed files from the TSV file and rerun the Storage Transfer Service job.
D.
Create a new TSV file for the remaining files by generating signed URLs with a longer validity period. Split the TSV file into multiple smaller files and submit them as separate Storage Transfer Service jobs in parallel.
Suggested answer: D

Explanation:

A signed URL is a URL that provides limited permission and time to access a resource on a web server. It is often used to grant temporary access to protected files without requiring authentication. Storage Transfer Service is a service that allows you to transfer data from external sources, such as HTTPS endpoints, to Cloud Storage buckets. You can use a TSV file to specify the list of URLs to transfer. In this scenario, the most likely cause of the HTTP 403 errors is that the signed URLs have expired before the transfer job could complete. This could happen if the signed URLs have a short validity period or the transfer job takes a long time due to the large number of files or network latency. To fix the problem, you need to create a new TSV file for the remaining files by generating new signed URLs with a longer validity period. This will ensure that the URLs do not expire before the transfer job finishes. You can use the Cloud Storage tools or your own program to generate signed URLs. Additionally, you can split the TSV file into multiple smaller files and submit them as separate Storage Transfer Service jobs in parallel. This will speed up the transfer process and reduce the risk of errors.

Reference:

Signed URLs | Cloud Storage Documentation

V4 signing process with Cloud Storage tools

V4 signing process with your own program

Using a URL list file

What Is a 403 Forbidden Error (and How Can I Fix It)?

You want to store your team's shared tables in a single dataset to make data easily accessible to various analysts. You want to make this data readable but unmodifiable by analysts. At the same time, you want to provide the analysts with individual workspaces in the same project, where they can create and store tables for their own use, without the tables being accessible by other analysts. What should you do?

A.
Give analysts the BigQuery Data Viewer role at the project level. Create one other dataset, and give the analysts the BigQuery Data Editor role on that dataset.
B.
Give analysts the BigQuery Data Viewer role at the project level. Create a dataset for each analyst, and give each analyst the BigQuery Data Editor role at the project level.
C.
Give analysts the BigQuery Data Viewer role on the shared dataset. Create a dataset for each analyst, and give each analyst the BigQuery Data Editor role at the dataset level for their assigned dataset.
D.
Give analysts the BigQuery Data Viewer role on the shared dataset. Create one other dataset and give the analysts the BigQuery Data Editor role on that dataset.
Suggested answer: C

Explanation:

The BigQuery Data Viewer role allows users to read data and metadata from tables and views, but not to modify or delete them. By giving analysts this role on the shared dataset, you can ensure that they can access the data for analysis, but not change it. The BigQuery Data Editor role allows users to create, update, and delete tables and views, as well as read and write data. By giving analysts this role at the dataset level for their assigned dataset, you can provide them with individual workspaces where they can store their own tables and views, without affecting the shared dataset or other analysts' datasets. This way, you can achieve both data protection and data isolation for your team.

Reference:

BigQuery IAM roles and permissions

Basic roles and permissions
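
A minimal GoogleSQL sketch of option C, assuming illustrative project, dataset, and principal names:

SQL
-- Read-only access to the shared dataset for the whole analyst group.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my-project.shared_reporting`
TO 'group:analysts@example.com';
-- Edit access for one analyst on that analyst's personal workspace dataset only.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my-project.workspace_alice`
TO 'user:alice@example.com';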

Your company's data platform ingests CSV file dumps of booking and user profile data from upstream sources into Cloud Storage. The data analyst team wants to join these datasets on the email field, which is available in both datasets, to perform analysis. However, personally identifiable information (PII) should not be accessible to the analysts. You need to de-identify the email field in both datasets before loading them into BigQuery for the analysts. What should you do?

A.
1. Create a pipeline to de-identify the email field by using recordTransformations in Cloud Data Loss Prevention (Cloud DLP) with masking as the de-identification transformation type. 2. Load the booking and user profile data into a BigQuery table.
B.
1. Create a pipeline to de-identify the email field by using recordTransformations in Cloud DLP with format-preserving encryption with FFX as the de-identification transformation type. 2. Load the booking and user profile data into a BigQuery table.
C.
1. Load the CSV files from Cloud Storage into a BigQuery table, and enable dynamic data masking. 2. Create a policy tag with the email mask as the data masking rule. 3. Assign the policy to the email field in both tables. 4. Assign the Identity and Access Management bigquerydatapolicy.maskedReader role for the BigQuery tables to the analysts.
D.
1. Load the CSV files from Cloud Storage into a BigQuery table, and enable dynamic data masking. 2. Create a policy tag with the default masking value as the data masking rule. 3. Assign the policy to the email field in both tables. 4. Assign the Identity and Access Management bigquerydatapolicy.maskedReader role for the BigQuery tables to the analysts.
Suggested answer: B

Explanation:

Cloud DLP is a service that helps you discover, classify, and protect your sensitive data. It supports various de-identification techniques, such as masking, redaction, tokenization, and encryption. Format-preserving encryption (FPE) with FFX is a technique that encrypts sensitive data while preserving its original format and length. This allows you to join the encrypted data on the same field without revealing the actual values. FPE with FFX also supports partial encryption, which means you can encrypt only a portion of the data, such as the domain name of an email address. By using Cloud DLP to de-identify the email field with FPE with FFX, you can ensure that the analysts can join the booking and user profile data on the email field without accessing the PII. You can create a pipeline to de-identify the email field by using recordTransformations in Cloud DLP, which allows you to specify the fields and the de-identification transformations to apply to them. You can then load the de-identified data into a BigQuery table for analysis.

Reference:

De-identify sensitive data | Cloud Data Loss Prevention Documentation

Format-preserving encryption with FFX | Cloud Data Loss Prevention Documentation

De-identify and re-identify data with the Cloud DLP API

De-identify data in a pipeline

You are creating a data model in BigQuery that will hold retail transaction data. Your two largest tables, sales_transaction_header and sales_transaction_line, have a tightly coupled immutable relationship. These tables are rarely modified after load and are frequently joined when queried. You need to model the sales_transaction_header and sales_transaction_line tables to improve the performance of data analytics queries. What should you do?

A.
Create a sales_transaction table that stores the sales_transaction_header and sales_transaction_line data as a JSON data type.
B.
Create a sales_transaction table that holds the sales_transaction_header information as rows and the sales_transaction_line rows as nested and repeated fields.
C.
Create a sales_transaction table that holds the sales_transaction_header and sales_transaction_line information as rows, duplicating the sales_transaction_header data for each line.
D.
Create separate sales_transaction_header and sales_transaction_line tables and, when querying, specify the sales_transaction_line table first in the WHERE clause.
Suggested answer: B

Explanation:

BigQuery supports nested and repeated fields, which are complex data types that can represent hierarchical and one-to-many relationships within a single table. By using nested and repeated fields, you can denormalize your data model and reduce the number of joins required for your queries. This can improve the performance and efficiency of your data analytics queries, as joins can be expensive and require shuffling data across nodes. Nested and repeated fields also preserve the data integrity and avoid data duplication. In this scenario, the sales_transaction_header and sales_transaction_line tables have a tightly coupled immutable relationship, meaning that each header row corresponds to one or more line rows, and the data is rarely modified after load. Therefore, it makes sense to create a single sales_transaction table that holds the sales_transaction_header information as rows and the sales_transaction_line rows as nested and repeated fields. This way, you can query the sales transaction data without joining two tables, and use dot notation or array functions to access the nested and repeated fields. For example, the sales_transaction table could have the following schema:

Field name           Type       Mode
id                   INTEGER    NULLABLE
order_time           TIMESTAMP  NULLABLE
customer_id          INTEGER    NULLABLE
line_items           RECORD     REPEATED
line_items.sku       STRING     NULLABLE
line_items.quantity  INTEGER    NULLABLE
line_items.price     FLOAT      NULLABLE
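
A DDL sketch of this schema (the dataset name is assumed), declaring line_items as an array of structs so each header row carries its own line items:

SQL
CREATE TABLE mydataset.sales_transaction (
  id INT64,
  order_time TIMESTAMP,
  customer_id INT64,
  -- Nested and repeated field holding the transaction lines.
  line_items ARRAY<STRUCT<sku STRING, quantity INT64, price FLOAT64>>
);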

To query the total amount of each order, you could use the following SQL statement:

SQL
SELECT id,
       SUM(item.quantity * item.price) AS total_amount
FROM sales_transaction,
     UNNEST(line_items) AS item
GROUP BY id;


Use nested and repeated fields

BigQuery explained: Working with joins, nested & repeated data

Arrays in BigQuery --- How to improve query performance and optimise storage

You have created an external table for Apache Hive partitioned data that resides in a Cloud Storage bucket, which contains a large number of files. You notice that queries against this table are slow. You want to improve the performance of these queries. What should you do?

A.
Migrate the Hive partitioned data objects to a multi-region Cloud Storage bucket.
B.
Create an individual external table for each Hive partition by using a common table name prefix. Use wildcard table queries to reference the partitioned data.
C.
Change the storage class of the Hive partitioned data objects from Coldline to Standard.
D.
Upgrade the external table to a BigLake table. Enable metadata caching for the table.
Suggested answer: D

Explanation:

BigLake is a Google Cloud service that allows you to query structured data in external data stores such as Cloud Storage, Amazon S3, and Azure Blob Storage with access delegation and governance. BigLake tables extend the capabilities of BigQuery to data lakes and enable a flexible, open lakehouse architecture. By upgrading an external table to a BigLake table, you can improve query performance by leveraging the BigQuery Storage API, which supports data format conversion, predicate pushdown, column projection, and metadata caching. Metadata caching reduces the number of metadata requests to the external data store and speeds up query execution. To upgrade an external table to a BigLake table, you associate it with a Cloud resource connection and enable metadata caching through the metadata_cache_mode and max_staleness table options. For example, on a table that has already been upgraded to a BigLake table (the table name is illustrative):

SQL
ALTER TABLE mydataset.hive_partitioned_data
SET OPTIONS (
  metadata_cache_mode = 'AUTOMATIC',
  max_staleness = INTERVAL 4 HOUR
);


Introduction to BigLake tables

Upgrade an external table to BigLake

BigQuery storage API

Your startup has a web application that currently serves customers out of a single region in Asia. You are targeting funding that will allow your startup to serve customers globally. Your current goal is to optimize for cost, and your post-funding goal is to optimize for global presence and performance. You must use a native JDBC driver. What should you do?

A.
Use Cloud Spanner to configure a single-region instance initially, and then configure multi-region Cloud Spanner instances after securing funding.
B.
Use a Cloud SQL for PostgreSQL highly available instance first, and Bigtable with US, Europe, and Asia replication after securing funding.
C.
Use a Cloud SQL for PostgreSQL zonal instance first, and Bigtable with US, Europe, and Asia replication after securing funding.
D.
Use a Cloud SQL for PostgreSQL zonal instance first, and Cloud SQL for PostgreSQL with highly available configuration after securing funding.
Suggested answer: A

Explanation:

Cloud Spanner provides a native JDBC driver and supports both regional and multi-region instance configurations. Starting with a single-region (regional) configuration keeps costs low, and moving to a multi-region configuration after funding provides global presence, higher availability, and lower read latency for a worldwide user base. The other options do not meet both requirements: Cloud SQL remains a regional service, and Bigtable does not offer a native JDBC driver.

https://cloud.google.com/spanner/docs/instance-configurations#tradeoffs_regional_versus_multi-region_configurations
