Google Professional Data Engineer Practice Test - Questions Answers, Page 28

You want to schedule a number of sequential load and transformation jobs. Data files will be added to a Cloud Storage bucket by an upstream process, and there is no fixed schedule for when the new data arrives. Next, a Dataproc job is triggered to perform some transformations and write the data to BigQuery. You then need to run additional transformation jobs in BigQuery. The transformation jobs are different for every table, and these jobs might take hours to complete. You need to determine the most efficient and maintainable workflow to process hundreds of tables and provide the freshest data to your end users. What should you do?

A.
1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Cloud Storage, Dataproc, and BigQuery operators. 2. Use a single shared DAG for all tables that need to go through the pipeline. 3. Schedule the DAG to run hourly.
B.
1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Dataproc and BigQuery operators. 2. Create a separate DAG for each table that needs to go through the pipeline. 3. Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG.
C.
1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Cloud Storage, Dataproc, and BigQuery operators. 2. Create a separate DAG for each table that needs to go through the pipeline. 3. Schedule the DAGs to run hourly.
D.
1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Dataproc and BigQuery operators. 2. Use a single shared DAG for all tables that need to go through the pipeline. 3. Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG.
Suggested answer: B

Explanation:

This option is the most efficient and maintainable workflow for your use case, as it allows you to process each table independently and trigger the DAGs only when new data arrives in the Cloud Storage bucket. By using the Dataproc and BigQuery operators, you can easily orchestrate the load and transformation jobs for each table, and leverage the scalability and performance of these services [1][2]. By creating a separate DAG for each table, you can customize the transformation logic and parameters for each table, and avoid the complexity and overhead of a single shared DAG [3]. By using a Cloud Storage object trigger, you can launch a Cloud Function that triggers the DAG for the corresponding table, ensuring that the data is processed as soon as possible and reducing the idle time and cost of running the DAGs on a fixed schedule [4].

Option A is not efficient, as it runs the DAG hourly regardless of the data arrival, and it uses a single shared DAG for all tables, which makes it harder to maintain and debug. Option C is also not efficient, as it runs the DAGs hourly and does not leverage the Cloud Storage object trigger. Option D is not maintainable, as it uses a single shared DAG for all tables, and it does not use the Cloud Storage operator, which can simplify the data ingestion from the bucket. Reference:

[1]: Dataproc Operator | Cloud Composer | Google Cloud

[2]: BigQuery Operator | Cloud Composer | Google Cloud

[3]: Choose Workflows or Cloud Composer for service orchestration | Workflows | Google Cloud

[4]: Cloud Storage Object Trigger | Cloud Functions Documentation | Google Cloud

[5]: Triggering DAGs | Cloud Composer | Google Cloud

[6]: Cloud Storage Operator | Cloud Composer | Google Cloud
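
For illustration, here is a minimal sketch of what one per-table DAG from option B could look like. It assumes the Google provider package available in Cloud Composer; the project, region, cluster, bucket, and table names are hypothetical placeholders, and the DAG is deliberately left unscheduled so that a Cloud Storage-triggered Cloud Function can start it (for example through the Airflow REST API), passing the path of the newly arrived file.

```python
# Hypothetical per-table DAG: triggered externally (no schedule) by a Cloud Function
# that fires on a Cloud Storage object-finalize event and starts a DAG run.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

PROJECT_ID = "my-project"      # placeholder
REGION = "us-central1"         # placeholder
TABLE = "sales_orders"         # one DAG per table

with DAG(
    dag_id=f"load_and_transform_{TABLE}",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,    # no fixed schedule; triggered per file arrival
    catchup=False,
) as dag:

    # Step 1: Dataproc job transforms the newly arrived file and writes to BigQuery.
    transform_with_dataproc = DataprocSubmitJobOperator(
        task_id="dataproc_transform",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": "etl-cluster"},  # placeholder cluster
            "pyspark_job": {
                "main_python_file_uri": f"gs://my-bucket/jobs/{TABLE}_transform.py",
                "args": ["--input", "{{ dag_run.conf.get('object_path', '') }}"],
            },
        },
    )

    # Step 2: table-specific SQL transformation inside BigQuery.
    transform_in_bigquery = BigQueryInsertJobOperator(
        task_id="bigquery_transform",
        configuration={
            "query": {
                "query": f"CALL `{PROJECT_ID}.transforms.{TABLE}_post_load`()",
                "useLegacySql": False,
            }
        },
    )

    transform_with_dataproc >> transform_in_bigquery
```

In this sketch, the triggering Cloud Function would pass the finalized object path in the DAG run configuration, so each run processes exactly the file that just arrived.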

You maintain ETL pipelines. You notice that a streaming pipeline running on Dataflow is taking a long time to process incoming data, which causes output delays. You also notice that the pipeline graph was automatically optimized by Dataflow and merged into one step. You want to identify where the potential bottleneck is occurring. What should you do?

A.
Insert a Reshuffle operation after each processing step, and monitor the execution details in the Dataflow console.
B.
Log debug information in each ParDo function, and analyze the logs at execution time.
C.
Insert output sinks after each key processing step, and observe the writing throughput of each block.
D.
Verify that the Dataflow service accounts have appropriate permissions to write the processed data to the output sinks.
Suggested answer: A

Explanation:

A Reshuffle operation is a way to force Dataflow to split the pipeline into multiple stages, which can help isolate the performance of each step and identify bottlenecks. By monitoring the execution details in the Dataflow console, you can see the time, CPU, memory, and disk usage of each stage, as well as the number of elements and bytes processed. This can help you diagnose where the pipeline is slowing down and optimize it accordingly. Reference:

[1]: Reshuffling your data

[2]: Monitoring pipeline performance using the Dataflow monitoring interface

[3]: Optimizing pipeline performance
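
As a minimal Apache Beam (Python) sketch of this diagnostic technique, the pipeline below inserts Reshuffle between hypothetical processing steps; the source, parsing, and enrichment logic are stand-ins, and the point is only that each Reshuffle forces a stage boundary so that the execution details in the Dataflow console show per-step timing instead of one fused step.

```python
# Sketch: inserting Reshuffle between steps to break fusion for diagnosis.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_line(line: str) -> dict:
    # Hypothetical parsing step.
    fields = line.split(",")
    return {"id": fields[0], "value": fields[1]}


def enrich(record: dict) -> dict:
    # Hypothetical enrichment step.
    record["value_doubled"] = float(record["value"]) * 2
    return record


with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | "ReadLines" >> beam.Create(["a,1", "b,2"])   # stand-in for the real source
        | "Parse" >> beam.Map(parse_line)
        | "BreakFusion1" >> beam.Reshuffle()           # forces a separate stage
        | "Enrich" >> beam.Map(enrich)
        | "BreakFusion2" >> beam.Reshuffle()
        | "Print" >> beam.Map(print)                   # stand-in for the real sink
    )
```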

Your organization is modernizing its IT services and migrating to Google Cloud. You need to organize the data that will be stored in Cloud Storage and BigQuery. You need to enable a data mesh approach to share the data between sales, product design, and marketing departments. What should you do?

A.
1. Create a project for storage of the data for your organization. 2. Create a central Cloud Storage bucket with three folders to store the files for each department. 3. Create a central BigQuery dataset with tables prefixed with the department name. 4. Give viewer rights on the storage project to the users of your departments.
B.
1. Create a project for storage of the data for each of your departments. 2. Enable each department to create Cloud Storage buckets and BigQuery datasets. 3. Create user groups for authorized readers for each bucket and dataset. 4. Enable the IT team to administer the user groups to add or remove users as the departments request.
C.
1. Create multiple projects for storage of the data for each of your departments' applications. 2. Enable each department to create Cloud Storage buckets and BigQuery datasets. 3. Publish the data that each department shares in Analytics Hub. 4. Enable all departments to discover and subscribe to the data they need in Analytics Hub.
D.
1. Create multiple projects for storage of the data for each of your departments' applications. 2. Enable each department to create Cloud Storage buckets and BigQuery datasets. 3. In Dataplex, map each department to a data lake and the Cloud Storage buckets, and map the BigQuery datasets to zones. 4. Enable each department to own and share the data of their data lakes.
Suggested answer: C

Explanation:

Implementing a data mesh approach involves treating data as a product and enabling decentralized data ownership and architecture. The steps outlined in option C support this approach by creating separate projects for each department, which aligns with the principle of domain-oriented decentralized data ownership. By allowing departments to create their own Cloud Storage buckets and BigQuery datasets, it promotes autonomy and self-service. Publishing the data in Analytics Hub facilitates data sharing and discovery across departments, enabling a collaborative environment where data can be easily accessed and utilized by different parts of the organization.

Architecture and functions in a data mesh - Google Cloud

Professional Data Engineer Certification Exam Guide | Learn - Google Cloud

Build a Data Mesh with Dataplex | Google Cloud Skills Boost

You have a network of 1000 sensors. The sensors generate time series data: one metric per sensor per second, along with a timestamp. You already have 1 TB of data, and expect the data to grow by 1 GB every day. You need to access this data in two ways. The first access pattern requires retrieving the metric from one specific sensor stored at a specific timestamp, with a median single-digit millisecond latency. The second access pattern requires running complex analytic queries on the data, including joins, once a day. How should you store this data?

A.
Store your data in Bigtable. Concatenate the sensor ID and timestamp and use it as the row key. Perform an export to BigQuery every day.
B.
Store your data in BigQuery. Concatenate the sensor ID and timestamp, and use it as the primary key.
C.
Store your data in Bigtable. Concatenate the sensor ID and metric, and use it as the row key. Perform an export to BigQuery every day.
D.
Store your data in BigQuery. Use the metric as a primary key.
Suggested answer: A

Explanation:

To store your data in a way that meets both access patterns, you should:

A) Store your data in Bigtable, concatenate the sensor ID and timestamp to use as the row key, and perform an export to BigQuery every day. This option allows you to leverage the high performance and scalability of Bigtable for low-latency point queries on sensor data, as well as the powerful analytics capabilities of BigQuery for complex queries on large datasets. By using the sensor ID and timestamp as the row key, you can ensure that your data is sorted and distributed evenly across Bigtable nodes, and that you can easily retrieve the metric for a specific sensor and time. By performing an export to BigQuery every day, you can transfer your data to a columnar storage format that is optimized for analytical queries, and take advantage of BigQuery's features such as partitioning, clustering, and caching.

B) Store your data in BigQuery, concatenate the sensor ID and timestamp, and use it as the primary key. This option is not optimal because BigQuery is not designed for low-latency point queries, and using a concatenated primary key may result in poor performance and high costs. BigQuery does not support primary keys natively, and you would have to use a unique constraint or a hash function to enforce uniqueness. Moreover, BigQuery charges by the amount of data scanned, so using a long and complex primary key may increase the query cost and complexity.

C) Store your data in Bigtable, concatenate the sensor ID and metric to use as the row key, and perform an export to BigQuery every day. This option is not optimal because using the sensor ID and metric as the row key may result in data skew and hotspots in Bigtable, as some sensors may generate more metrics than others, or some metrics may be more common than others. This may affect the performance and availability of Bigtable, as well as the efficiency of the export to BigQuery.

D) Store your data in BigQuery and use the metric as a primary key. This option is not optimal because using the metric as a primary key may result in data duplication and inconsistency in BigQuery, as multiple sensors may generate the same metric at different times, or the same sensor may generate different metrics at the same time. This may affect the accuracy and reliability of your analytical queries, as well as the query cost and complexity.
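
A brief sketch of the row-key pattern from option A, using the google-cloud-bigtable Python client; the project, instance, table, and column-family names are hypothetical and assumed to already exist.

```python
# Sketch: sensor_id#timestamp row keys for low-latency point lookups in Bigtable.
from datetime import datetime, timezone

from google.cloud import bigtable

client = bigtable.Client(project="my-project")     # placeholder project
instance = client.instance("sensor-instance")      # placeholder instance
table = instance.table("sensor_metrics")           # placeholder table


def row_key(sensor_id: str, ts: datetime) -> bytes:
    # Sensor ID first keeps each sensor's rows contiguous and avoids hotspotting
    # on a purely time-based prefix; the timestamp suffix keeps rows time-ordered.
    return f"{sensor_id}#{ts.strftime('%Y%m%d%H%M%S')}".encode()


# Write one metric (assumes a column family named "m" exists on the table).
ts = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
row = table.direct_row(row_key("sensor-0042", ts))
row.set_cell("m", b"value", b"23.7", timestamp=ts)
row.commit()

# Point read for one sensor at one timestamp.
result = table.read_row(row_key("sensor-0042", ts))
print(result.cells["m"][b"value"][0].value)
```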

You have a variety of files in Cloud Storage that your data science team wants to use in their models. Currently, users do not have a method to explore, cleanse, and validate the data in Cloud Storage. You are looking for a low-code solution that can be used by your data science team to quickly cleanse and explore data within Cloud Storage. What should you do?

A.
Load the data into BigQuery and use SQL to transform the data as necessary. Provide the data science team access to staging tables to explore the raw data.
B.
Provide the data science team access to Dataflow to create a pipeline to prepare and validate the raw data and load data into BigQuery for data exploration.
C.
Provide the data science team access to Dataprep to prepare, validate, and explore the data within Cloud Storage.
D.
Create an external table in BigQuery and use SQL to transform the data as necessary. Provide the data science team access to the external tables to explore the raw data.
Suggested answer: C

Explanation:

Dataprep is a low-code, serverless, and fully managed service that allows users to visually explore, cleanse, and validate data in Cloud Storage. It also provides features such as data profiling, data quality, data transformation, and data lineage. Dataprep is integrated with BigQuery, so users can easily export the prepared data to BigQuery for further analysis or modeling. Dataprep is a suitable solution for the data science team to quickly and easily work with the data in Cloud Storage, without having to write code or manage infrastructure. The other options are not as suitable as Dataprep for this use case, because they either require more coding, more infrastructure management, or more data movement. Loading the data into BigQuery, either directly or through Dataflow, would incur additional costs and latency, and may not provide the same level of data exploration and validation as Dataprep. Creating an external table in BigQuery would allow users to query the data in Cloud Storage, but would not provide the same level of data cleansing and transformation as Dataprep. Reference:

Dataprep overview

Dataprep features

Dataprep and BigQuery integration

You store and analyze your relational data in BigQuery on Google Cloud, with all data residing in US regions. You also have a variety of object stores across Microsoft Azure and Amazon Web Services (AWS), also in US regions. You want to query all your data in BigQuery daily with as little movement of data as possible. What should you do?

A.
Load files from AWS and Azure to Cloud Storage with Cloud Shell gsutil rsync arguments.
B.
Create a Dataflow pipeline to ingest files from Azure and AWS to BigQuery.
C.
Use the BigQuery Omni functionality and BigLake tables to query files in Azure and AWS.
D.
Use BigQuery Data Transfer Service to load files from Azure and AWS into BigQuery.
Suggested answer: C

Explanation:

BigQuery Omni is a multi-cloud analytics solution that lets you use the BigQuery interface to analyze data stored in other public clouds, such as AWS and Azure, without moving or copying the data. BigLake tables are a type of external table that let you query structured data in external data stores with access delegation. By using BigQuery Omni and BigLake tables, you can query data in AWS and Azure object stores directly from BigQuery, with minimal data movement and consistent performance. Reference:

[1]: Introduction to BigLake tables

[2]: Deep dive on how BigLake accelerates query performance

[3]: BigQuery Omni and BigLake (Analytics Data Federation on GCP)
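
As a sketch of what option C can look like in practice, the snippet below uses the BigQuery Python client to define a BigLake table over Parquet files in S3 and query it through BigQuery Omni. The project, dataset, connection, and bucket names are hypothetical, and it assumes an AWS connection and a dataset in the aws-us-east-1 BigQuery Omni region have already been created.

```python
# Sketch: querying AWS S3 data in place via BigQuery Omni and a BigLake table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# One-time DDL: BigLake table over Parquet files in S3, via a pre-created connection.
create_biglake_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS `my-project.aws_dataset.sales_events`
WITH CONNECTION `aws-us-east-1.my_aws_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-aws-bucket/sales/*']
);
"""
client.query(create_biglake_table).result()

# Daily analysis runs against the files in place; only results move, not the data.
daily_query = """
SELECT region, SUM(amount) AS total_sales
FROM `my-project.aws_dataset.sales_events`
GROUP BY region
"""
for row in client.query(daily_query).result():
    print(row.region, row.total_sales)
```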

Your business users need a way to clean and prepare data before using the data for analysis. Your business users are less technically savvy and prefer to work with graphical user interfaces to define their transformations. After the data has been transformed, the business users want to perform their analysis directly in a spreadsheet. You need to recommend a solution that they can use. What should you do?

A.
Use Dataprep to clean the data, and write the results to BigQuery. Analyze the data by using Connected Sheets.
B.
Use Dataprep to clean the data, and write the results to BigQuery. Analyze the data by using Looker Studio.
C.
Use Dataflow to clean the data, and write the results to BigQuery. Analyze the data by using Connected Sheets.
D.
Use Dataflow to clean the data, and write the results to BigQuery. Analyze the data by using Looker Studio.
Suggested answer: A

Explanation:

For business users who are less technically savvy and prefer graphical user interfaces, Dataprep is an ideal tool for cleaning and preparing data, as it offers a user-friendly interface for defining data transformations without the need for coding. Once the data is cleaned and prepared, writing the results to BigQuery allows for the storage and management of large datasets. Analyzing the data using Connected Sheets enables business users to work within the familiar environment of a spreadsheet, leveraging the power of BigQuery directly within Google Sheets. This solution aligns with the needs of the users and follows Google's recommended practices for data cleaning, preparation, and analysis.

Connected Sheets | Google Sheets | Google for Developers

Professional Data Engineer Certification Exam Guide | Learn - Google Cloud

Engineer Data in Google Cloud | Google Cloud Skills Boost - Qwiklabs

You are developing a model to identify the factors that lead to sales conversions for your customers. You have completed processing your data. You want to continue through the model development lifecycle. What should you do next?

A.
Use your model to run predictions on fresh customer input data.
B.
Test and evaluate your model on your curated data to determine how well the model performs.
C.
Monitor your model performance, and make any adjustments needed.
D.
Delineate what data will be used for testing and what will be used for training the model.
Suggested answer: B

Explanation:

After processing your data, the next step in the model development lifecycle is to test and evaluate your model on the curated data. This is crucial to determine the performance of the model and to understand how well it can predict sales conversions for your customers. The evaluation phase involves using various metrics and techniques to assess the accuracy, precision, recall, and other relevant performance indicators of the model. It helps in identifying any issues or areas for improvement before deploying the model in a production environment. Reference: The information provided here is verified by the Google Professional Data Engineer Certification Exam Guide and related resources, which outline the steps and best practices in the model development lifecycle.
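
As a minimal illustration of this test-and-evaluate step, the scikit-learn sketch below holds out a test set and reports standard classification metrics; the synthetic dataset and logistic regression model are stand-ins for the curated conversion data and whatever model type is actually being developed.

```python
# Sketch: hold out a test split and evaluate the model before any deployment.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in for the processed, curated customer data.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Hold out data the model has never seen for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy, precision, recall, and F1 on the held-out test set.
print(classification_report(y_test, model.predict(X_test)))
```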

You need to modernize your existing on-premises data strategy. Your organization currently uses:

* Apache Hadoop clusters for processing multiple large data sets, including on-premises Hadoop Distributed File System (HDFS) for data replication.

* Apache Airflow to orchestrate hundreds of ETL pipelines with thousands of job steps.

You need to set up a new architecture in Google Cloud that can handle your Hadoop workloads and requires minimal changes to your existing orchestration processes. What should you do?

A.
Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Convert your ETL pipelines to Dataflow.
B.
Use Bigtable for your large workloads, with connections to Cloud Storage to handle any HDFS use cases. Orchestrate your pipelines with Cloud Composer.
C.
Use Dataproc to migrate your Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Use Cloud Data Fusion to visually design and deploy your ETL pipelines.
D.
Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Orchestrate your pipelines with Cloud Composer.
Suggested answer: D

Explanation:

Dataproc is a fully managed service that allows you to run Apache Hadoop and Spark workloads on Google Cloud. It is compatible with the open source ecosystem, so you can migrate your existing Hadoop clusters to Dataproc with minimal changes. Cloud Storage is a scalable, durable, and cost-effective object storage service that can replace HDFS for storing and accessing data. Cloud Storage offers interoperability with Hadoop through connectors, so you can use it as a data source or sink for your Dataproc jobs. Cloud Composer is a fully managed service that allows you to create, schedule, and monitor workflows using Apache Airflow. It is integrated with Google Cloud services, such as Dataproc, BigQuery, Dataflow, and Pub/Sub, so you can orchestrate your ETL pipelines across different platforms. Cloud Composer is compatible with your existing Airflow code, so you can migrate your existing orchestration processes to Cloud Composer with minimal changes.

The other options are not as suitable as Dataproc and Cloud Composer for this use case, because they either require more changes to your existing code, or do not meet your requirements. Dataflow is a fully managed service that allows you to create and run scalable data processing pipelines using Apache Beam. However, Dataflow is not compatible with your existing Hadoop code, so you would need to rewrite your ETL pipelines using Beam. Bigtable is a fully managed NoSQL database service that can handle large and complex data sets. However, Bigtable is not compatible with your existing Hadoop code, so you would need to rewrite your queries and applications using Bigtable APIs. Cloud Data Fusion is a fully managed service that allows you to visually design and deploy data integration pipelines using a graphical interface. However, Cloud Data Fusion is not compatible with your existing Airflow code, so you would need to recreate your orchestration processes using the Cloud Data Fusion UI. Reference:

Dataproc overview

Cloud Storage connector for Hadoop

Cloud Composer overview
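
To illustrate why the orchestration changes are minimal, here is a hedged sketch of what an existing Airflow task might look like after the migration: the DAG structure stays the same, the job is submitted to Dataproc, and hdfs:// paths are replaced with gs:// paths served by the Cloud Storage connector. The project, region, cluster, bucket, and jar names are hypothetical.

```python
# Sketch: existing Airflow DAG step, re-pointed at Dataproc and Cloud Storage.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

with DAG(
    dag_id="nightly_hadoop_aggregation",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Previously a Spark job reading hdfs:// paths on-premises; the same jar now
    # reads and writes gs:// paths through the Cloud Storage connector.
    aggregate = DataprocSubmitJobOperator(
        task_id="aggregate_events",
        project_id="my-project",         # placeholder
        region="us-central1",            # placeholder
        job={
            "placement": {"cluster_name": "migrated-hadoop-cluster"},
            "spark_job": {
                "main_jar_file_uri": "gs://my-code-bucket/jars/aggregate-events.jar",
                "args": [
                    "--input", "gs://my-data-bucket/raw/events/{{ ds }}/*",
                    "--output", "gs://my-data-bucket/aggregated/{{ ds }}/",
                ],
            },
        },
    )
```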

You are running a Dataflow streaming pipeline with Streaming Engine and Horizontal Autoscaling enabled. You have set the maximum number of workers to 1000. The input of your pipeline is Pub/Sub messages with notifications from Cloud Storage. One of the pipeline transforms reads CSV files and emits an element for every CSV line. The job performance is low, the pipeline is using only 10 workers, and you notice that the autoscaler is not spinning up additional workers. What should you do to improve performance?

A.
Use Dataflow Prime, and enable Right Fitting to increase the worker resources.
B.
Update the job to increase the maximum number of workers.
C.
Enable Vertical Autoscaling to let the pipeline use larger workers.
D.
Change the pipeline code, and introduce a Reshuffle step to prevent fusion.
Suggested answer: D

Explanation:

When Dataflow fuses the pipeline into a single step, the parallelism of that fused stage is limited by the transform that produces the fewest elements. Here, the Pub/Sub notifications and the file-reading transform emit relatively few elements, so the fused downstream processing of the individual CSV lines cannot be distributed across workers, the existing workers stay underutilized, and the autoscaler sees no parallel backlog that would justify adding workers. Introducing a Reshuffle step after the transform that emits one element per CSV line prevents fusion, redistributes those elements across all available workers, and allows Horizontal Autoscaling to scale toward the configured maximum of 1000 workers. Increasing the maximum number of workers, enabling Vertical Autoscaling, or using Dataflow Prime Right Fitting does not address the fusion that is limiting parallelism.