Google Professional Data Engineer Practice Test - Questions Answers, Page 32

Your chemical company needs to manually check documentation for customer orders. You use a pull subscription in Pub/Sub so that sales agents get the details of each order. You must ensure that orders are not processed twice by different sales agents and that you do not add more complexity to this workflow. What should you do?

A. Create a transactional database that monitors the pending messages.
B. Create a new Pub/Sub push subscription to monitor the orders processed in the agent's system.
C. Use Pub/Sub exactly-once delivery in your pull subscription.
D. Use a Deduplicate PTransform in Dataflow before sending the messages to the sales agents.
Suggested answer: C

Explanation:

Pub/Sub exactly-once delivery is a feature that guarantees that subscriptions do not receive duplicate deliveries of messages based on a Pub/Sub-defined unique message ID. This feature is only supported by the pull subscription type, which is what you are using in this scenario. By enabling exactly-once delivery, you can ensure that each order is processed only once by a sales agent, and that no order is lost or duplicated. This also simplifies your workflow, as you do not need to create a separate database or subscription to monitor the pending or processed messages.

Reference:

Exactly-once delivery | Cloud Pub/Sub Documentation

Cloud Pub/Sub Exactly-once Delivery feature is now Generally Available (GA)
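
For illustration, here is a minimal sketch using the Python Pub/Sub client (assuming a recent google-cloud-pubsub version): it creates a pull subscription with exactly-once delivery enabled and acknowledges messages with ack_with_response(), which lets the subscriber confirm the ack was accepted. The project, topic, subscription names, and process_order function are hypothetical.

```python
from google.cloud import pubsub_v1

project_id = "my-project"  # hypothetical project ID
subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(project_id, "orders")                      # hypothetical topic
subscription_path = subscriber.subscription_path(project_id, "orders-pull")   # hypothetical subscription

# Create a pull subscription with exactly-once delivery enabled.
subscription = subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "enable_exactly_once_delivery": True,
    }
)
print("Created subscription:", subscription.name)


def process_order(data: bytes) -> None:
    # Placeholder for the sales-agent workflow that reviews the order documentation.
    print("Processing order payload:", data)


def callback(message):
    process_order(message.data)
    # With exactly-once delivery, ack_with_response() returns a future that raises
    # if Pub/Sub did not accept the ack, so the agent only marks the order done once.
    message.ack_with_response().result()


streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()  # blocks until the stream is cancelled or fails
```

Note that the exactly-once guarantee holds while the acknowledgment deadline has not expired, so agents should acknowledge promptly or extend the deadline for long-running reviews.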

You are part of a healthcare organization where data is organized and managed by respective data owners in various storage services. As a result of this decentralized ecosystem, discovering and managing data has become difficult. You need to quickly identify and implement a cost-optimized solution to assist your organization with the following:

* Data management and discovery

* Data lineage tracking

* Data quality validation

How should you build the solution?

A. Use BigLake to convert the current solution into a data lake architecture.
B. Build a new data discovery tool on Google Kubernetes Engine that helps with new source onboarding and data lineage tracking.
C. Use BigQuery to track data lineage, and use Dataprep to manage data and perform data quality validation.
D. Use Dataplex to manage data, track data lineage, and perform data quality validation.
Suggested answer: D

Explanation:

Dataplex is a Google Cloud service that provides a unified data fabric for data lakes and data warehouses. It enables data governance, management, and discovery across multiple data domains, zones, and assets. Dataplex supports data lineage tracking, which shows the origin and transformation of data over time, as well as data quality validation. It also integrates with Dataprep, a data preparation and quality tool that allows users to clean, enrich, and transform data using a visual interface, and to monitor data quality and detect anomalies using machine learning. Therefore, Dataplex is the most suitable solution for this scenario, as it meets all three requirements: data management and discovery, data lineage tracking, and data quality validation.

Reference:

Dataplex overview

Automate data governance, extend your data fabric with Dataplex-BigLake integration

Dataprep documentation
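
As a rough sketch of how such an organization might start standing up Dataplex programmatically, the snippet below uses the google-cloud-dataplex Python client to create a lake. The project, location, and lake names are hypothetical, and the exact method signatures should be verified against the installed client version.

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# Hypothetical project and region.
parent = "projects/my-project/locations/us-central1"

lake = dataplex_v1.Lake(
    display_name="Healthcare data",
    description="Central governance layer over decentralized storage",
)

# create_lake returns a long-running operation; result() waits for completion.
operation = client.create_lake(parent=parent, lake=lake, lake_id="healthcare-lake")
created_lake = operation.result()
print("Created lake:", created_lake.name)
```

Zones and assets pointing at the existing Cloud Storage buckets and BigQuery datasets can then be attached to the lake, and Dataplex discovery, lineage, and data quality scans run against them.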

Your team is building a data lake platform on Google Cloud. As a part of the data foundation design, you are planning to store all the raw data in Cloud Storage. You are expecting to ingest approximately 25 GB of data a day, and your billing department is worried about the increasing cost of storing old data. The current business requirements are:

* The old data can be deleted anytime

* You plan to use the visualization layer for current and historical reporting

* The old data should be available instantly when accessed

* There should not be any charges for data retrieval.

What should you do to optimize for cost?

A. Create the bucket with the Autoclass storage class feature.
B. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 90 days to coldline, and 365 days to archive storage class. Delete old data as needed.
C. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to coldline, 90 days to nearline, and 365 days to archive storage class. Delete old data as needed.
D. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 45 days to coldline, and 60 days to archive storage class. Delete old data as needed.
Suggested answer: A

Explanation:

Autoclass automatically moves objects between storage classes based on access patterns, without impacting performance or availability and without incurring retrieval charges. It continuously optimizes storage costs without the need to set specific lifecycle management policies, which satisfies the requirements for instant access and no retrieval charges.
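
A minimal sketch of enabling Autoclass at bucket creation with the Python Cloud Storage client; the bucket name and location are hypothetical, and the autoclass property assumes a recent version of the google-cloud-storage library.

```python
from google.cloud import storage

client = storage.Client()

# Hypothetical bucket name for the raw data lake.
bucket = client.bucket("my-raw-data-lake")
bucket.autoclass_enabled = True  # let Cloud Storage pick the class per object

new_bucket = client.create_bucket(bucket, location="US")
print("Autoclass enabled:", new_bucket.autoclass_enabled)
```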

You work for an airline and you need to store weather data in a BigQuery table. The weather data will be used as input to a machine learning model. The model only uses the last 30 days of weather data. You want to avoid storing unnecessary data and minimize costs. What should you do?

A. Create a BigQuery table where each record has an ingestion timestamp. Run a scheduled query to delete all the rows with an ingestion timestamp older than 30 days.
B. Create a BigQuery table partitioned by ingestion time. Set up partition expiration to 30 days.
C. Create a BigQuery table partitioned by the datetime value of the weather date. Set up partition expiration to 30 days.
D. Create a BigQuery table with a datetime column for the day the weather data refers to. Run a scheduled query to delete rows with a datetime value older than 30 days.
Suggested answer: B

Explanation:

Partitioning a table by ingestion time means that the data is divided into partitions based on the time when the data was loaded into the table. This allows you to delete or archive old data by setting a partition expiration policy. You can specify the number of days to keep the data in each partition, and BigQuery automatically deletes the data when it expires. This way, you can avoid storing unnecessary data and minimize costs.
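
As a concrete example, the table in option B could be created with the BigQuery Python client roughly as follows; the table ID and schema are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.weather.observations",  # hypothetical table ID
    schema=[
        bigquery.SchemaField("station_id", "STRING"),
        bigquery.SchemaField("temperature_c", "FLOAT"),
        bigquery.SchemaField("wind_speed_kmh", "FLOAT"),
    ],
)

# Omitting TimePartitioning.field makes this an ingestion-time partitioned table
# (_PARTITIONTIME pseudo-column); partitions older than 30 days are dropped automatically.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    expiration_ms=30 * 24 * 60 * 60 * 1000,  # 30 days
)

client.create_table(table)
```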

You have terabytes of customer behavioral data streaming from Google Analytics into BigQuery daily. Your customers' information, such as their preferences, is hosted on a Cloud SQL for MySQL database. Your CRM database is hosted on a Cloud SQL for PostgreSQL instance. The marketing team wants to use your customers' information from the two databases and the customer behavioral data to create marketing campaigns for yearly active customers. You need to ensure that the marketing team can run the campaigns over 100 times a day on typical days and up to 300 times during sales. At the same time, you want to keep the load on the Cloud SQL databases to a minimum. What should you do?

A. Create BigQuery connections to both Cloud SQL databases. Use BigQuery federated queries on the two databases and the Google Analytics data on BigQuery to run these queries.
B. Create streams in Datastream to replicate the required tables from both Cloud SQL databases to BigQuery for these queries.
C. Create a Dataproc cluster with Trino to establish connections to both Cloud SQL databases and BigQuery, to execute the queries.
D. Create a job on Apache Spark with Dataproc Serverless to query both Cloud SQL databases and the Google Analytics data on BigQuery for these queries.
Suggested answer: B

Explanation:

Datastream is a serverless change data capture (CDC) and replication service that allows you to stream data changes from relational databases such as MySQL, PostgreSQL, and Oracle (including Cloud SQL instances) into destinations such as BigQuery and Cloud Storage. Datastream captures and delivers database changes in near real time, with minimal impact on the source database performance. Datastream also preserves the schema and data types of the source database, and automatically creates and updates the corresponding tables in BigQuery.

By using Datastream, you can replicate the required tables from both Cloud SQL databases to BigQuery, and keep them in sync with the source databases. This way, you can reduce the load on the Cloud SQL databases, as the marketing team can run their queries on the BigQuery tables instead of the Cloud SQL tables. You can also leverage the scalability and performance of BigQuery to query the customer behavioral data from Google Analytics and the customer information from the replicated tables. You can run the queries as frequently as needed, without worrying about the impact on the Cloud SQL databases.

Option A is not a good solution, as BigQuery federated queries allow you to query external data sources such as Cloud SQL databases, but they do not reduce the load on the source databases. In fact, federated queries may increase the load on the source databases, as they need to execute the query statements on the external data sources and return the results to BigQuery. Federated queries also have some limitations, such as data type mappings, quotas, and performance issues.

Option C is not a good solution, as creating a Dataproc cluster with Trino would require more resources and management overhead than using Datastream. Trino is a distributed SQL query engine that can connect to multiple data sources, such as Cloud SQL and BigQuery, and execute queries across them. However, Trino requires a Dataproc cluster to run, which means you need to provision, configure, and monitor the cluster nodes. You also need to install and configure the Trino connector for Cloud SQL and BigQuery, and write the queries in Trino SQL dialect. Moreover, Trino does not replicate or sync the data from Cloud SQL to BigQuery, so the load on the Cloud SQL databases would still be high.

Option D is not a good solution, as creating a job on Apache Spark with Dataproc Serverless would require more coding and processing power than using Datastream. Apache Spark is a distributed data processing framework that can read and write data from various sources, such as Cloud SQL and BigQuery, and perform complex transformations and analytics on them. Dataproc Serverless is a serverless Spark service that allows you to run Spark jobs without managing clusters. However, Spark requires you to write code in Python, Scala, Java, or R, and use the Spark connector for Cloud SQL and BigQuery to access the data sources. Spark also does not replicate or sync the data from Cloud SQL to BigQuery, so the load on the Cloud SQL databases would still be high.

Reference:

Datastream overview | Datastream | Google Cloud

Datastream concepts | Datastream | Google Cloud

Datastream quickstart | Datastream | Google Cloud

Introduction to federated queries | BigQuery | Google Cloud

Trino overview | Dataproc Documentation | Google Cloud

Dataproc Serverless overview | Dataproc Documentation | Google Cloud

Apache Spark overview | Dataproc Documentation | Google Cloud
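
To illustrate the end state, once Datastream has replicated the Cloud SQL tables into BigQuery, the campaign queries can run entirely against BigQuery. A minimal sketch with the BigQuery Python client follows; the project, dataset, table, and column names are hypothetical placeholders for the replicated MySQL/PostgreSQL tables and the Google Analytics export.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical datasets: `replica_mysql` and `replica_pg` hold the
# Datastream-replicated Cloud SQL tables; `analytics.events` is the GA export.
query = """
SELECT
  c.customer_id,
  c.preferences,
  crm.segment,
  COUNT(*) AS events_last_year
FROM `my-project.analytics.events` AS e
JOIN `my-project.replica_mysql.customers` AS c
  ON e.user_id = c.customer_id
JOIN `my-project.replica_pg.crm_accounts` AS crm
  ON c.customer_id = crm.customer_id
WHERE e.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
GROUP BY c.customer_id, c.preferences, crm.segment
"""

# The marketing team can run this hundreds of times a day without touching Cloud SQL.
for row in client.query(query).result():
    print(row.customer_id, row.events_last_year)
```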

You are running a streaming pipeline with Dataflow and are using hopping windows to group the data as the data arrives. You noticed that some data is arriving late but is not being marked as late data, which is resulting in inaccurate aggregations downstream. You need to find a solution that allows you to capture the late data in the appropriate window. What should you do?

A. Change your windowing function to session windows to define your windows based on certain activity.
B. Change your windowing function to tumbling windows to avoid overlapping window periods.
C. Expand your hopping window so that the late data has more time to arrive within the grouping.
D. Use watermarks to define the expected data arrival window. Allow late data as it arrives.
Suggested answer: D

Explanation:

Watermarks are a way of tracking the progress of time in a streaming pipeline. They are used to determine when a window can be closed and the results emitted. Watermarks can be either event-time based or processing-time based. Event-time watermarks track the progress of time based on the timestamps of the data elements, while processing-time watermarks track the progress of time based on the system clock. Event-time watermarks are more accurate, but they require the data source to provide reliable timestamps. Processing-time watermarks are simpler, but they can be affected by system delays or backlogs.

By using watermarks, you can define the expected data arrival window for each windowing function. You can also specify how to handle late data, which is data that arrives after the watermark has passed. You can either discard late data, or allow late data and update the results as new data arrives. Allowing late data requires you to use triggers to control when the results are emitted.

In this case, using watermarks and allowing late data is the best solution to capture the late data in the appropriate window. Changing the windowing function to session windows or tumbling windows will not solve the problem of late data, as they still rely on watermarks to determine when to close the windows. Expanding the hopping window might reduce the amount of late data, but it will also change the semantics of the windowing function and the results.

Streaming pipelines | Cloud Dataflow | Google Cloud

Windowing | Apache Beam
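
A minimal Apache Beam (Python) sketch of this approach: hopping (sliding) windows combined with an allowed-lateness setting and a late-firing trigger, so late elements still update their window's result instead of being dropped. The Pub/Sub subscription path is hypothetical, and the window sizes and lateness bound are illustrative.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import trigger


def run():
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (
            p
            # Hypothetical subscription; Pub/Sub supplies the element timestamps.
            | "Read" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/events"
            )
            | "HoppingWindows" >> beam.WindowInto(
                window.SlidingWindows(size=300, period=60),  # 5-min windows every minute
                trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
                accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
                allowed_lateness=3600,  # keep windows open 1 hour past the watermark
            )
            | "CountPerWindow" >> beam.CombineGlobally(
                beam.combiners.CountCombineFn()
            ).without_defaults()
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run()
```

With allowed_lateness set, elements that arrive after the watermark but within the lateness bound are assigned to their original window and re-fire the aggregation rather than being discarded.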

You work for a large ecommerce company. You are using Pub/Sub to ingest the clickstream data to Google Cloud for analytics. You observe that when a new subscriber connects to an existing topic to analyze data, they are unable to subscribe to older data. For an upcoming yearly sale event in two months, you need a solution that, once implemented, will enable any new subscriber to read the last 30 days of data. What should you do?

A. Create a new topic, and publish the last 30 days of data each time a new subscriber connects to an existing topic.
B. Set the topic retention policy to 30 days.
C. Set the subscriber retention policy to 30 days.
D. Ask the source system to re-push the data to Pub/Sub, and subscribe to it.
Suggested answer: B

Explanation:

By setting the topic retention policy to 30 days, you can ensure that any new subscriber can access the messages that were published to the topic within the last 30 days [1]. This feature allows you to replay previously acknowledged messages or initialize new subscribers with historical data [2]. You can configure the topic retention policy by using the Cloud Console, the gcloud command-line tool, or the Pub/Sub API [1].

Option A is not efficient, as it requires creating a new topic and duplicating the data for each new subscriber, which would increase the storage costs and complexity. Option C is not effective, as it only affects the unacknowledged messages in a subscription, and does not allow new subscribers to access older data [3]. Option D is not feasible, as it depends on the source system's ability and willingness to re-push the data, and it may cause data duplication or inconsistency.

Reference:

1: Create a topic | Cloud Pub/Sub Documentation | Google Cloud

2: Replay and purge messages with seek | Cloud Pub/Sub Documentation | Google Cloud

3: When is a PubSub Subscription considered to be inactive?
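
A short sketch with the Python Pub/Sub client: it creates the topic with a 30-day message retention duration and shows how a subscription created later can seek back in time to replay retained history. The project, topic, and subscription names are hypothetical.

```python
import datetime

from google.cloud import pubsub_v1
from google.protobuf import duration_pb2, timestamp_pb2

project_id = "my-project"  # hypothetical project ID
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "clickstream")  # hypothetical topic

# Retain published messages on the topic itself for 30 days, independent of subscriptions.
topic = publisher.create_topic(
    request={
        "name": topic_path,
        "message_retention_duration": duration_pb2.Duration(seconds=30 * 24 * 60 * 60),
    }
)
print("Created topic with retention:", topic.message_retention_duration)

# A subscriber created later can replay retained history by seeking to a past timestamp.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "sale-analytics")  # hypothetical
subscriber.create_subscription(request={"name": subscription_path, "topic": topic_path})

seek_time = timestamp_pb2.Timestamp()
seek_time.FromDatetime(datetime.datetime.utcnow() - datetime.timedelta(days=30))
subscriber.seek(request={"subscription": subscription_path, "time": seek_time})
```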

You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. What should you do?

A. Use Analytics Hub to control data access, and provide third-party companies with access to the dataset.
B. Create a Dataflow job that reads the data in frequent time intervals and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.
C. Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.
D. Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.
Suggested answer: A

Explanation:

Analytics Hub is a service that allows you to securely share and discover data assets across your organization and with external partners. You can use Analytics Hub to create and manage data assets, such as BigQuery datasets, views, and queries, and control who can access them. You can also browse and use data assets that others have shared with you. By using Analytics Hub, you can keep the costs of data sharing low and ensure that the data is current, as the data assets are not copied or moved, but rather referenced from their original sources.

Your new customer has requested daily reports that show their net consumption of Google Cloud compute resources and who used the resources. You need to quickly and efficiently generate these daily reports. What should you do?

A. Do daily exports of Cloud Logging data to BigQuery. Create views filtering by project, log type, resource, and user.
B. Filter data in Cloud Logging by project, resource, and user; then export the data in CSV format.
C. Filter data in Cloud Logging by project, log type, resource, and user; then import the data into BigQuery.
D. Export Cloud Logging data to Cloud Storage in CSV format. Cleanse the data using Dataprep, filtering by project, resource, and user.
Suggested answer: B

Explanation:

https://cloud.google.com/logging/docs/view/logs-explorer-interface?cloudshell=true
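
To illustrate the filtering step from the suggested answer, the Cloud Logging Python client accepts the same filter syntax as the Logs Explorer. The project ID, resource type, user email, and timestamp below are hypothetical.

```python
from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project ID

# Filter usage entries by resource and by the user who made the call.
log_filter = (
    'resource.type="gce_instance" '
    'AND protoPayload.authenticationInfo.principalEmail="analyst@example.com" '
    'AND timestamp>="2024-01-01T00:00:00Z"'
)

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.log_name, entry.payload)
```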

You have a table that contains millions of rows of sales data, partitioned by date. Various applications and users query this data many times a minute. The query requires aggregating values by using avg, max, and sum, and does not require joining to other tables. The required aggregations are only computed over the past year of data, though you need to retain full historical data in the base tables. You want to ensure that the query results always include the latest data from the tables, while also reducing computation cost, maintenance overhead, and duration. What should you do?

A. Create a materialized view to aggregate the base table data. Configure a partition expiration on the base table to retain only the last one year of partitions.
B. Create a materialized view to aggregate the base table data. Include a filter clause to specify the last one year of partitions.
C. Create a new table that aggregates the base table data. Include a filter clause to specify the last year of partitions. Set up a scheduled query to recreate the new table every hour.
D. Create a view to aggregate the base table data. Include a filter clause to specify the last year of partitions.
Suggested answer: B

Explanation:

A materialized view is a database object that contains the precomputed results of a query and is refreshed automatically as the base table changes. It can improve the performance and efficiency of queries that involve aggregations, joins, or filters. By creating a materialized view that aggregates the base table data and including a filter clause to specify the last one year of partitions, you can ensure that the query results always include the latest data from the tables, while also reducing computation cost, maintenance overhead, and duration. The materialized view will automatically refresh when the base table data changes, and will only use the partitions that match the filter clause. Option A is incorrect because it will delete the historical data from the base table, which is not desired. Option C is incorrect because it will create a redundant table that needs to be recreated by a scheduled query, which is more complex and costly than using a materialized view. Option D is incorrect because a view does not store any data, but only references the base table data, which means it will not reduce the computation cost or duration of the query.

Reference:

Materialized views, ML models in data warehouse - Google Cloud

Data Engineering with Google Cloud Platform - Packt Subscription
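
A rough sketch of the corresponding materialized view DDL, submitted through the BigQuery Python client. The project, dataset, table, and column names are hypothetical, and the date literal is a placeholder for the one-year boundary; BigQuery places restrictions on what can appear in a materialized view definition, so a literal date is used here for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical project, dataset, table, and column names.
ddl = """
CREATE MATERIALIZED VIEW `my-project.sales.daily_sales_mv`
AS
SELECT
  sale_date,
  AVG(amount) AS avg_amount,
  MAX(amount) AS max_amount,
  SUM(amount) AS total_amount
FROM `my-project.sales.transactions`
WHERE sale_date >= DATE '2024-01-01'  -- placeholder literal for the last year of partitions
GROUP BY sale_date
"""

client.query(ddl).result()
```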
