Google Professional Data Engineer Practice Test - Questions and Answers, Page 35

You migrated your on-premises Apache Hadoop Distributed File System (HDFS) data lake to Cloud Storage. The data scientist team needs to process the data by using Apache Spark and SQL. Security policies need to be enforced at the column level. You need a cost-effective solution that can scale into a data mesh. What should you do?

A.
1. Deploy a long-living Dataproc cluster with Apache Hive and Ranger enabled. 2. Configure Ranger for column-level security. 3. Process with Dataproc Spark or Hive SQL.
B.
1. Define a BigLake table. 2. Create a taxonomy of policy tags in Data Catalog. 3. Add policy tags to columns. 4. Process with the Spark-BigQuery connector or BigQuery SQL.
C.
1. Load the data to BigQuery tables. 2. Create a taxonomy of policy tags in Data Catalog. 3. Add policy tags to columns. 4. Process with the Spark-BigQuery connector or BigQuery SQL.
D.
1. Apply an Identity and Access Management (IAM) policy at the file level in Cloud Storage. 2. Define a BigQuery external table for SQL processing. 3. Use Dataproc Spark to process the Cloud Storage files.
Suggested answer: B

Explanation:

A BigLake table lets the data stay in Cloud Storage while being governed through BigQuery, which is exactly what is needed to enforce column-level security without another migration.

BigLake Tables:

A BigLake table is defined over the existing Cloud Storage objects, so the data that was just migrated does not have to be copied or reloaded into BigQuery-managed storage.

Because access is brokered by BigQuery, fine-grained controls such as column-level and row-level security can be enforced on data that physically remains in Cloud Storage.

Column-Level Security with Policy Tags:

Create a taxonomy of policy tags in Data Catalog and attach the tags to the sensitive columns of the BigLake table.

Only principals granted access on a policy tag (the Fine-Grained Reader role) can read the tagged columns, which enforces the security policy at the column level.

Processing with Spark and SQL:

The data science team can process the data with the Spark-BigQuery connector from Dataproc (including Dataproc Serverless) and with BigQuery SQL; both access paths respect the policy tags.

Cost Effectiveness and Data Mesh:

There is no long-running cluster to keep alive and no duplicate copy of the data, which keeps costs low, and BigLake tables can be governed with Dataplex as the design grows into a data mesh.

Why the other options fall short: a long-living Dataproc cluster with Hive and Ranger (option A) is costly to operate and manage, loading everything into native BigQuery tables (option C) duplicates the data that was just migrated to Cloud Storage, and file-level IAM policies (option D) cannot enforce security at the column level.

Google Data Engineer

Reference:

BigLake Documentation

Introduction to Column-Level Access Control in BigQuery
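
To make the processing path concrete, here is a minimal PySpark sketch of reading such a table through the Spark-BigQuery connector; the project, dataset, and table names are hypothetical, and the connector is assumed to be available on the Dataproc image or supplied as a package.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-biglake-table").getOrCreate()

# Read the BigLake table through the Spark-BigQuery connector. Columns that are
# protected by policy tags are only readable if the caller holds the
# Fine-Grained Reader role on the corresponding tag.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.analytics.clickstream_biglake")  # hypothetical name
    .load()
)

df.select("event_id", "event_timestamp").show(10)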

You are administering shared BigQuery datasets that contain views used by multiple teams in your organization. The marketing team is concerned about the variability of their monthly BigQuery analytics spend using the on-demand billing model. You need to help the marketing team establish a consistent BigQuery analytics spend each month. What should you do?

A.
Create a BigQuery Standard pay-as-you-go reservation with a baseline of 0 slots and autoscaling set to 500 for the marketing team, and bill them back accordingly.
B.
Create a BigQuery reservation with a baseline of 500 slots with no autoscaling for the marketing team, and bill them back accordingly.
C.
Establish a BigQuery quota for the marketing team, and limit the maximum number of bytes scanned each day.
D.
Create a BigQuery Enterprise reservation with a baseline of 250 slots and autoscaling set to 500 for the marketing team, and bill them back accordingly.
Suggested answer: B

Explanation:

To help the marketing team establish a consistent BigQuery analytics spend each month, you can use BigQuery reservations to allocate dedicated slots for their queries. This provides predictable costs by reserving a fixed amount of compute resources.

BigQuery Reservations:

BigQuery Reservations allow you to purchase dedicated query processing capacity in the form of slots.

By reserving slots, you can control costs and ensure that the marketing team has the necessary resources for their queries without unexpected increases in spending.

Baseline Slots:

Setting a baseline of 500 slots without autoscaling ensures a consistent allocation of resources.

This provides a predictable monthly cost, as the marketing team will be billed for the reserved slots regardless of actual usage.

Billing Back:

The marketing team's usage can be billed back based on the fixed reservation cost, ensuring budget predictability.

This approach avoids the variability associated with on-demand billing, where costs can fluctuate based on query volume and complexity.

No Autoscaling:

By not enabling autoscaling, you prevent additional costs from being incurred due to temporary increases in query demand.

This fixed reservation ensures that the marketing team only uses the allocated 500 slots, maintaining a consistent monthly spend.

Google Data Engineer

Reference:

BigQuery Reservations Documentation

BigQuery Slot Reservations

Managing BigQuery Costs

Using a fixed reservation of 500 slots provides the marketing team with predictable costs and the necessary resources for their queries without unexpected billing variability.
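
As a rough illustration of how such a reservation might be created programmatically, here is a minimal sketch using the google-cloud-bigquery-reservation Python client; the admin project, location, and reservation ID are hypothetical, and edition and commitment details are omitted.

from google.cloud import bigquery_reservation_v1

client = bigquery_reservation_v1.ReservationServiceClient()
parent = "projects/billing-admin-project/locations/US"  # hypothetical admin project

# A fixed 500-slot baseline with no autoscaling yields a flat, predictable cost
# that can be charged back to the marketing team.
reservation = bigquery_reservation_v1.Reservation(
    slot_capacity=500,
    ignore_idle_slots=False,
)
client.create_reservation(
    parent=parent,
    reservation_id="marketing",
    reservation=reservation,
)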

You need to create a SQL pipeline. The pipeline runs an aggregate SQL transformation on a BigQuery table every two hours and appends the result to another existing BigQuery table. You need to configure the pipeline to retry if errors occur. You want the pipeline to send an email notification after three consecutive failures. What should you do?

A.
Create a BigQuery scheduled query to run the SQL transformation with schedule options that repeats every two hours, and enable email notifications.
B.
Use the BigQueryUpsertTableOperator in Cloud Composer, set the retry parameter to three, and set the email_on_failure parameter to true.
C.
Use the BigQueryInsertJobOperator in Cloud Composer, set the retry parameter to three, and set the email_on_failure parameter to true.
D.
Create a BigQuery scheduled query to run the SQL transformation with schedule options that repeats every two hours, and enable notifications to a Pub/Sub topic. Use Pub/Sub and Cloud Functions to send an email after three failed executions.
Suggested answer: D

Explanation:

To create a robust and resilient SQL pipeline in BigQuery that handles retries and failure notifications, consider the following:

BigQuery Scheduled Queries: This feature allows you to schedule recurring queries in BigQuery. It is a straightforward way to run SQL transformations on a regular basis without requiring extensive setup.

Error Handling and Retries: BigQuery scheduled queries run at the specified interval and can emit basic notifications, but they do not natively support conditional alerting such as sending an email only after three consecutive failures. This is where additional Google Cloud services like Pub/Sub and Cloud Functions come into play.

Pub/Sub for Notifications: By configuring a BigQuery scheduled query to publish messages to a Pub/Sub topic upon failure, you can create a decoupled and scalable notification system.

Cloud Functions: Cloud Functions can subscribe to the Pub/Sub topic and implement logic to count consecutive failures. After detecting three consecutive failures, the Cloud Function can then send an email notification using a service like SendGrid or Gmail API.

Implementation Steps:

Set up a BigQuery Scheduled Query:

Create a scheduled query in BigQuery to run your SQL transformation every two hours.

Configure the scheduled query to publish a notification to a Pub/Sub topic in case of a failure.

Create a Pub/Sub Topic:

Create a Pub/Sub topic that will receive messages from the scheduled query.

Develop a Cloud Function:

Write a Cloud Function that subscribes to the Pub/Sub topic.

Implement logic in the Cloud Function to track failure messages. If three consecutive failure messages are detected, the function sends an email notification.
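
A minimal sketch of that Cloud Function, assuming the scheduled-query run notifications arrive as Pub/Sub messages containing a state field, and that a (hypothetical) Firestore document is used to track the consecutive-failure count; the actual email delivery is left as a stub.

import base64
import json

from google.cloud import firestore

db = firestore.Client()
state_ref = db.collection("pipeline_state").document("aggregate_transform")  # hypothetical


def handle_run_notification(event, context):
    """Pub/Sub-triggered Cloud Function invoked for each scheduled-query run."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    run_state = payload.get("state", "")  # assumed to be e.g. "SUCCEEDED" or "FAILED"

    snapshot = state_ref.get()
    data = snapshot.to_dict() if snapshot.exists else {}
    failures = data.get("consecutive_failures", 0)

    if run_state == "FAILED":
        failures += 1
        if failures >= 3:
            send_alert_email(payload)  # stub: SendGrid, Gmail API, etc.
            failures = 0
    else:
        failures = 0

    state_ref.set({"consecutive_failures": failures})


def send_alert_email(payload):
    # Placeholder for the email provider integration.
    print(f"ALERT: three consecutive scheduled-query failures: {payload}")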

BigQuery Scheduled Queries

Pub/Sub Documentation

Cloud Functions Documentation

SendGrid Email API

Gmail API

You are planning to load some of your existing on-premises data into BigQuery on Google Cloud. You want to either stream or batch-load data, depending on your use case. Additionally, you want to mask some sensitive data before loading into BigQuery. You need to do this in a programmatic way while keeping costs to a minimum. What should you do?

A.
Use the BigQuery Data Transfer Service to schedule your migration. After the data is populated in BigQuery, use the connection to the Cloud Data Loss Prevention (Cloud DLP) API to de-identify the necessary data.
B.
Create your pipeline with Dataflow through the Apache Beam SDK for Python, customizing separate options within your code for streaming, batch processing, and Cloud DLP. Select BigQuery as your data sink.
C.
Use Cloud Data Fusion to design your pipeline, use the Cloud DLP plug-in to de-identify data within your pipeline, and then move the data into BigQuery.
D.
Set up Datastream to replicate your on-premises data to BigQuery.
Suggested answer: B

Explanation:

To load on-premises data into BigQuery while masking sensitive data, we need a solution that offers flexibility for both streaming and batch processing, as well as data masking capabilities. Here's a detailed explanation of why option B is the best choice:

Apache Beam and Dataflow:

Apache Beam SDK provides a unified programming model for both batch and stream data processing.

Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines, offering scalability and ease of use.

Customization for Different Use Cases:

By using the Apache Beam SDK, you can write custom pipelines that can handle both streaming and batch processing within the same framework.

This allows you to switch between streaming and batch modes based on your use case without changing the core logic of your data pipeline.

Data Masking with Cloud DLP:

Google Cloud Data Loss Prevention (DLP) API can be integrated into your Apache Beam pipeline to de-identify and mask sensitive data programmatically before loading it into BigQuery.

This ensures that sensitive data is handled securely and complies with privacy requirements.

Cost Efficiency:

Using Dataflow can be cost-effective because it is a fully managed service, reducing the operational overhead associated with managing your own infrastructure.

The pay-as-you-go model ensures you only pay for the resources you consume, which can help keep costs under control.

Implementation Steps:

Set up Apache Beam Pipeline:

Write a pipeline using the Apache Beam SDK for Python that reads data from your on-premises storage.

Add transformations for data processing, including the integration with Cloud DLP for data masking.

Configure Dataflow:

Deploy the Apache Beam pipeline on Google Cloud Dataflow.

Customize the pipeline options for both streaming and batch use cases.

Load Data into BigQuery:

Set BigQuery as the sink for your data in the Apache Beam pipeline.

Ensure the processed and masked data is loaded into the appropriate BigQuery tables.
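
A minimal Apache Beam sketch of this pattern, with hypothetical paths, project, and table names; the DLP call masks a single hypothetical 'email' field, and a production pipeline would reuse the DLP client in a DoFn's setup() rather than creating one per element.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import dlp_v2

PROJECT = "my-project"  # hypothetical


def mask_email(record):
    # Illustrative: de-identify the 'email' field with Cloud DLP character masking.
    client = dlp_v2.DlpServiceClient()
    response = client.deidentify_content(
        request={
            "parent": f"projects/{PROJECT}",
            "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {"primitive_transformation": {"character_mask_config": {"masking_character": "#"}}}
                    ]
                }
            },
            "item": {"value": record.get("email", "")},
        }
    )
    record["email"] = response.item.value
    return record


def run():
    # streaming=True would switch the same pipeline to the streaming use case.
    options = PipelineOptions(streaming=False)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/exports/*.json")  # hypothetical
            | "Parse" >> beam.Map(json.loads)
            | "MaskPII" >> beam.Map(mask_email)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:analytics.customers",  # hypothetical
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()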

Apache Beam Documentation

Google Cloud Dataflow Documentation

Google Cloud DLP Documentation

BigQuery Documentation

Your company operates in three domains: airlines, hotels, and ride-hailing services. Each domain has two teams: analytics and data science, which create data assets in BigQuery with the help of a central data platform team. However, as each domain is evolving rapidly, the central data platform team is becoming a bottleneck. This is causing delays in deriving insights from data, and resulting in stale data when pipelines are not kept up to date. You need to design a data mesh architecture by using Dataplex to eliminate the bottleneck. What should you do?

A.
1. Create one lake for each team. Inside each lake, create one zone for each domain. 2. Attach each of the BigQuery datasets created by the individual teams as assets to the respective zone. 3. Have the central data platform team manage all zones' data assets.
B.
1. Create one lake for each team. Inside each lake, create one zone for each domain. 2. Attach each of the BigQuery datasets created by the individual teams as assets to the respective zone. 3. Direct each domain to manage their own zone's data assets.
C.
1. Create one lake for each domain. Inside each lake, create one zone for each team. 2. Attach each of the BigQuery datasets created by the individual teams as assets to the respective zone. 3. Direct each domain to manage their own lake's data assets.
D.
1. Create one lake for each domain. Inside each lake, create one zone for each team. 2. Attach each of the BigQuery datasets created by the individual teams as assets to the respective zone. 3. Have the central data platform team manage all lakes' data assets.
Suggested answer: C

Explanation:

To design a data mesh architecture using Dataplex to eliminate bottlenecks caused by a central data platform team, consider the following:

Data Mesh Architecture:

Data mesh promotes a decentralized approach where domain teams manage their own data pipelines and assets, increasing agility and reducing bottlenecks.

Dataplex Lakes and Zones:

Lakes in Dataplex are logical containers for managing data at scale, and zones are subdivisions within lakes for organizing data based on domains, teams, or other criteria.

Domain and Team Management:

By creating a lake for each domain and zones for each of its teams, every domain can independently manage its own data assets without relying on the central data platform team.

This setup aligns with the principles of data mesh, promoting ownership and reducing delays in data processing and insights.

Implementation Steps:

Create Lakes and Zones:

Create a separate lake in Dataplex for each domain (airlines, hotels, ride-hailing).

Within each lake, create zones for the analytics and data science teams.

Attach BigQuery Datasets:

Attach the BigQuery datasets created by the respective teams as assets to their corresponding zones.

Decentralized Management:

Allow each domain to manage its own lake's data assets, giving the domain teams the autonomy to update and maintain their pipelines without depending on the central team.
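
As an illustration of the structure, here is a sketch using the google-cloud-dataplex Python client to create one domain lake, a team zone inside it, and a BigQuery dataset asset; all names are hypothetical and the field names follow the dataplex_v1 client as best understood, so treat it as a starting point rather than a definitive recipe.

from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
parent = "projects/my-project/locations/us-central1"  # hypothetical

# One lake per domain, e.g. the airlines domain.
lake = client.create_lake(
    parent=parent,
    lake_id="airlines",
    lake=dataplex_v1.Lake(display_name="Airlines domain"),
).result()

# One zone per team inside the domain lake.
zone = client.create_zone(
    parent=lake.name,
    zone_id="analytics",
    zone=dataplex_v1.Zone(
        type_=dataplex_v1.Zone.Type.CURATED,
        resource_spec=dataplex_v1.Zone.ResourceSpec(
            location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
        ),
    ),
).result()

# Attach the team's BigQuery dataset to the zone as an asset, so the domain
# team owns and manages it from their own lake.
client.create_asset(
    parent=zone.name,
    asset_id="airlines-analytics-datasets",
    asset=dataplex_v1.Asset(
        resource_spec=dataplex_v1.Asset.ResourceSpec(
            type_=dataplex_v1.Asset.ResourceSpec.Type.BIGQUERY_DATASET,
            name="projects/my-project/datasets/airlines_analytics",  # hypothetical
        )
    ),
).result()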

Dataplex Documentation

BigQuery Documentation

Data Mesh Principles

You have important legal hold documents in a Cloud Storage bucket. You need to ensure that these documents are not deleted or modified. What should you do?

A.
Set a retention policy. Lock the retention policy.
B.
Set a retention policy. Set the default storage class to Archive for long-term digital preservation.
C.
Enable the Object Versioning feature. Add a lifecycle rule.
D.
Enable the Object Versioning feature. Create a copy in a bucket in a different region.
Suggested answer: A

Explanation:

To ensure that important legal hold documents in a Cloud Storage bucket are not deleted or modified, the most effective method is to set and lock a retention policy. Here's why this is the best choice:

Retention Policy:

A retention policy defines a retention period during which objects in the bucket cannot be deleted or modified. This ensures data immutability.

Once a retention policy is set and locked, it cannot be removed or reduced, providing strong protection against accidental or malicious deletions.

Locking the Retention Policy:

Locking a retention policy ensures that the retention period cannot be changed. This action is permanent and guarantees that the specified retention period will be enforced.

Steps to Implement:

Set the Retention Policy:

Define a retention period for the bucket to ensure that all objects are protected for the required duration.

Lock the Retention Policy:

Lock the retention policy to prevent any modifications, ensuring the immutability of the documents.
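
A minimal sketch with the google-cloud-storage Python client; the bucket name and the seven-year period are hypothetical.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("legal-hold-docs")  # hypothetical bucket name

# Set a retention period (here roughly seven years, expressed in seconds).
bucket.retention_period = 7 * 365 * 24 * 60 * 60
bucket.patch()

# Locking is irreversible: once locked, the retention period cannot be reduced
# or removed, and objects cannot be deleted or overwritten until they age out.
bucket.lock_retention_policy()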

Cloud Storage Retention Policy Documentation

How to Set a Retention Policy

You are running your BigQuery project in the on-demand billing model and are executing a change data capture (CDC) process that ingests data. The CDC process loads 1 GB of data every 10 minutes into a temporary table, and then performs a merge into a 10 TB target table. This process is very scan intensive and you want to explore options to enable a predictable cost model. You need to create a BigQuery reservation based on utilization information gathered from BigQuery Monitoring and apply the reservation to the CDC process. What should you do?

A.
Create a BigQuery reservation for the job.
B.
Create a BigQuery reservation for the service account running the job.
C.
Create a BigQuery reservation for the dataset.
D.
Create a BigQuery reservation for the project.
Suggested answer: D

Explanation:

https://cloud.google.com/blog/products/data-analytics/manage-bigquery-costs-with-custom-quotas.

Here's why creating a BigQuery reservation for the project is the most suitable solution:

Project-Level Reservation: Reservation slots are assigned to a project (or to a folder or organization), which means the reserved slots (processing capacity) are shared across all jobs and queries running within that project. Since your CDC process is a significant contributor to your BigQuery usage, assigning a reservation to the project ensures that the CDC merges always have access to the necessary capacity, regardless of other activities in the project.

Predictable Cost Model: Reservations provide a fixed, predictable cost model. Instead of paying the on-demand price for each query, you pay a fixed monthly fee for the reserved slots. This eliminates the variability of costs associated with on-demand billing, making it easier to budget and forecast your BigQuery expenses.

BigQuery Monitoring: You can use BigQuery Monitoring to analyze the historical usage patterns of your CDC process and other queries within your project. This information helps you determine the appropriate amount of slots to reserve, ensuring that you have enough capacity to handle your workload while optimizing costs.

Why other options are not suitable:

A. Create a BigQuery reservation for the job: BigQuery does not support reservations at the individual job level; reservation assignments are made to projects, folders, or organizations.

B. Create a BigQuery reservation for the service account running the job: Reservation assignments target projects, folders, or organizations, not individual users or service accounts, so this is not possible. A project-level assignment covers all jobs within the project, regardless of the service account used.

C. Create a BigQuery reservation for the dataset: BigQuery does not support reservations at the dataset level.

By creating a BigQuery reservation for your project based on your utilization analysis, you can achieve a predictable cost model while ensuring that your CDC process and other queries have the necessary resources to run smoothly.
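
To ground the utilization-gathering step, here is a sketch that estimates recent slot usage from the jobs timeline and then assigns the project to an existing reservation; the region qualifier, project IDs, and reservation name are hypothetical.

from google.cloud import bigquery, bigquery_reservation_v1

bq = bigquery.Client(project="cdc-project")  # hypothetical

# Approximate hourly slot usage over the last week from the jobs timeline.
usage_sql = """
SELECT
  TIMESTAMP_TRUNC(period_start, HOUR) AS hour,
  SUM(period_slot_ms) / (1000 * 3600) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE period_start > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY hour
ORDER BY hour
"""
for row in bq.query(usage_sql).result():
    print(row.hour, round(row.avg_slots, 1))

# Assign the whole project to an existing reservation so the scan-heavy CDC
# merges run on reserved slots instead of on-demand capacity.
res_client = bigquery_reservation_v1.ReservationServiceClient()
reservation_name = "projects/billing-admin-project/locations/US/reservations/cdc"  # hypothetical
assignment = bigquery_reservation_v1.Assignment(
    assignee="projects/cdc-project",
    job_type=bigquery_reservation_v1.Assignment.JobType.QUERY,
)
res_client.create_assignment(parent=reservation_name, assignment=assignment)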

A web server sends click events to a Pub/Sub topic as messages. The web server includes an eventTimestamp attribute in the messages, which is the time when the click occurred. You have a Dataflow streaming job that reads from this Pub/Sub topic through a subscription, applies some transformations, and writes the result to another Pub/Sub topic for use by the advertising department. The advertising department needs to receive each message within 30 seconds of the corresponding click occurrence, but they report receiving the messages late. Your Dataflow job's system lag is about 5 seconds, and the data freshness is about 40 seconds. Inspecting a few messages shows no more than 1 second of lag between their eventTimestamp and publishTime. What is the problem and what should you do?

A.
The advertising department is causing delays when consuming the messages. Work with the advertising department to fix this.
B.
Messages in your Dataflow job are processed in less than 30 seconds, but your job cannot keep up with the backlog in the Pub/Sub subscription. Optimize your job or increase the number of workers to fix this.
C.
The web server is not pushing messages fast enough to Pub/Sub. Work with the web server team to fix this.
D.
Messages in your Dataflow job are taking more than 30 seconds to process. Optimize your job or increase the number of workers to fix this.
Suggested answer: B

Explanation:

To ensure that the advertising department receives messages within 30 seconds of the click occurrence, and given the current system lag and data freshness metrics, the issue likely lies in the processing capacity of the Dataflow job. Here's why option B is the best choice:

System Lag and Data Freshness:

The system lag of 5 seconds indicates that Dataflow itself is processing messages relatively quickly.

However, the data freshness of 40 seconds suggests a significant delay before processing begins, indicating a backlog.

Backlog in Pub/Sub Subscription:

A backlog occurs when the rate of incoming messages exceeds the rate at which the Dataflow job can process them, causing delays.

Optimizing the Dataflow Job:

To handle the incoming message rate, the Dataflow job needs to be optimized or scaled up by increasing the number of workers, ensuring it can keep up with the message inflow.

Steps to Implement:

Analyze the Dataflow Job:

Inspect the Dataflow job metrics to identify bottlenecks and inefficiencies.

Optimize Processing Logic:

Optimize the transformations and operations within the Dataflow pipeline to improve processing efficiency.

Increase Number of Workers:

Scale the Dataflow job by increasing the number of workers to handle the higher load, reducing the backlog.
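
As an illustration of the scaling lever, here is a sketch of relaunching the streaming pipeline with more autoscaling headroom using the Beam Python SDK; the project, region, job name, topic, and subscription are hypothetical, and the worker counts are illustrative rather than recommendations.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",             # hypothetical
    region="us-central1",             # hypothetical
    job_name="click-events-enriched",
    max_num_workers=20,               # allow autoscaling to add workers and drain the backlog
    autoscaling_algorithm="THROUGHPUT_BASED",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadClicks" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/clicks-sub"
        )
        | "Transform" >> beam.Map(lambda msg: msg)  # placeholder for the real transformations
        | "Publish" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/clicks-enriched"
        )
    )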

Dataflow Monitoring

Scaling Dataflow Jobs

You have a BigQuery dataset named 'customers'. All tables will be tagged by using a Data Catalog tag template named 'gdpr'. The template contains one mandatory field, 'has_sensitive_data', with a boolean value. All employees must be able to do a simple search and find tables in the dataset that have either true or false in the 'has_sensitive_data' field. However, only the Human Resources (HR) group should be able to see the data inside the tables for which 'has_sensitive_data' is true. You give the all employees group the bigquery.metadataViewer and bigquery.connectionUser roles on the dataset. You want to minimize configuration overhead. What should you do next?

A.
Create the 'gdpr' tag template with private visibility. Assign the bigquery.dataViewer role to the HR group on the tables that contain sensitive data.
B.
Create the 'gdpr' tag template with private visibility. Assign the datacatalog.tagTemplateViewer role on this tag to the all employees group, and assign the bigquery.dataViewer role to the HR group on the tables that contain sensitive data.
C.
Create the 'gdpr' tag template with public visibility. Assign the bigquery.dataViewer role to the HR group on the tables that contain sensitive data.
D.
Create the 'gdpr' tag template with public visibility. Assign the datacatalog.tagTemplateViewer role on this tag to the all employees group, and assign the bigquery.dataViewer role to the HR group on the tables that contain sensitive data.
Suggested answer: D

Explanation:

To ensure that all employees can search and find tables with GDPR tags while restricting data access to sensitive tables only to the HR group, follow these steps:

Data Catalog Tag Template:

Use Data Catalog to create a tag template named 'gdpr' with a boolean field 'has_sensitive_data'. Set the visibility to public so all employees can see the tags.

Roles and Permissions:

Assign the datacatalog.tagTemplateViewer role to the all employees group. This role allows users to view the tags and search for tables based on the 'has_sensitive_data' field.

Assign the bigquery.dataViewer role to the HR group specifically on tables that contain sensitive data. This ensures only HR can access the actual data in these tables.

Steps to Implement:

Create the GDPR Tag Template:

Define the tag template in Data Catalog with the necessary fields and set visibility to public.

Assign Roles:

Grant the datacatalog.tagTemplateViewer role to the all employees group for visibility into the tags.

Grant the bigquery.dataViewer role to the HR group on tables marked as having sensitive data.
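
A sketch of the tag-template step with the google-cloud-datacatalog Python client; the project and location are hypothetical, and the is_publicly_readable flag is included on the assumption that it is the field controlling public visibility in the client version used.

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
parent = "projects/my-project/locations/us-central1"  # hypothetical

template = datacatalog_v1.TagTemplate()
template.display_name = "gdpr"
template.is_publicly_readable = True  # assumed field name for public visibility
template.fields["has_sensitive_data"] = datacatalog_v1.TagTemplateField(
    display_name="Has sensitive data",
    type_=datacatalog_v1.FieldType(
        primitive_type=datacatalog_v1.FieldType.PrimitiveType.BOOL
    ),
    is_required=True,
)

client.create_tag_template(
    parent=parent,
    tag_template_id="gdpr",
    tag_template=template,
)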

Data Catalog Documentation

Managing Access Control in BigQuery

IAM Roles in Data Catalog

You are architecting a data transformation solution for BigQuery. Your developers are proficient with SQL and want to use the ELT development technique. In addition, your developers need an intuitive coding environment and the ability to manage SQL as code. You need to identify a solution for your developers to build these pipelines. What should you do?

A.
Use Cloud Composer to load data and run SQL pipelines by using the BigQuery job operators.
B.
Use Dataflow jobs to read data from Pub/Sub, transform the data, and load the data to BigQuery.
C.
Use Dataform to build, manage, and schedule SQL pipelines.
D.
Use Data Fusion to build and execute ETL pipelines.
Suggested answer: C

Explanation:

To architect a data transformation solution for BigQuery that aligns with the ELT development technique and provides an intuitive coding environment for SQL-proficient developers, Dataform is an optimal choice. Here's why:

ELT Development Technique:

ELT (Extract, Load, Transform) is a process where data is first extracted and loaded into a data warehouse, and then transformed using SQL queries. This is different from ETL, where data is transformed before being loaded into the data warehouse.

BigQuery supports ELT, allowing developers to write SQL transformations directly in the data warehouse.

Dataform:

Dataform is a development environment designed specifically for data transformations in BigQuery and other SQL-based warehouses.

It provides tools for managing SQL as code, including version control and collaborative development.

Dataform integrates well with existing development workflows and supports scheduling and managing SQL-based data pipelines.

Intuitive Coding Environment:

Dataform offers an intuitive and user-friendly interface for writing and managing SQL queries.

It includes features like SQLX, a SQL dialect that extends standard SQL with features for modularity and reusability, which simplifies the development of complex transformation logic.

Managing SQL as Code:

Dataform supports version control systems like Git, enabling developers to manage their SQL transformations as code.

This allows for better collaboration, code reviews, and version tracking.

Dataform Documentation

BigQuery Documentation

Managing ELT Pipelines with Dataform
