Google Professional Data Engineer Practice Test - Questions Answers, Page 17
Question 161
You are a head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don't get slots to execute their queries, and you need to correct this. You'd like to avoid introducing new projects to your account.
What should you do?
Explanation:
Reference https://cloud.google.com/blog/products/gcp/busting-12-myths-about-bigquery
Question 162
You have an on-premises Apache Kafka cluster with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring, to avoid deploying Kafka Connect plugins.
What should you do?
Question 163
You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the input data is Parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. You need to keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptible VMs (with only 2 non-preemptible workers) for this workload.
What should you do?
Explanation:
To optimize the performance of a complex Spark job on Dataproc that heavily relies on shuffling operations, and given the cost constraints of using preemptible VMs, switching from HDDs to SSDs and using HDFS as an intermediate storage layer can significantly improve performance. Here's why option C is the best choice:

Performance of SSDs: SSDs provide much faster read and write speeds compared to HDDs, which is crucial for performance-intensive operations like shuffling in Spark jobs. Using SSDs can reduce I/O bottlenecks during the shuffle phase of your Spark job, improving overall job performance.

Intermediate storage with HDFS: Copying data from Google Cloud Storage (GCS) to HDFS for intermediate storage can reduce latency compared to reading directly from GCS. HDFS provides better locality and faster data access within the Dataproc cluster, which can significantly improve the efficiency of shuffling and other I/O operations.
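As an illustration of the staging idea above, the following PySpark sketch reads the Parquet inputs from Cloud Storage once, stages them on the cluster's HDFS, and runs the shuffle-heavy work against the HDFS copy. The bucket, paths, and column name are illustrative assumptions, not values from the question; on the cluster side, SSDs are typically requested at creation time (for example with the --worker-boot-disk-type=pd-ssd and --num-worker-local-ssds flags of gcloud dataproc clusters create).

```python
# Minimal PySpark sketch of staging GCS input in HDFS before shuffle-heavy work.
# Bucket, paths, and the grouping column are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-heavy-job").getOrCreate()

# Read the 200-400 MB Parquet inputs once from Cloud Storage.
gcs_df = spark.read.parquet("gs://example-bucket/input/")

# Stage the data on the cluster's HDFS (backed by SSDs) so shuffle-heavy stages
# read from local disks instead of going back to GCS.
gcs_df.write.mode("overwrite").parquet("hdfs:///staging/input/")
staged_df = spark.read.parquet("hdfs:///staging/input/")

# Shuffle-heavy analytical work now operates on HDFS-resident data.
result = staged_df.groupBy("some_key").count()  # "some_key" is an assumed column

# Write the final output back to Cloud Storage.
result.write.mode("overwrite").parquet("gs://example-bucket/output/")
```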
Question 164
Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve the reliability of the pipeline (including being able to reprocess all failing data).
What should you do?
Question 165
You're training a model to predict housing prices based on an available dataset with real estate properties. Your plan is to train a fully connected neural net, and you've discovered that the dataset contains latitude and longitude of the property. Real estate professionals have told you that the location of the property is highly influential on price, so you'd like to engineer a feature that incorporates this physical dependency.
What should you do?
Explanation:
To engineer a feature that incorporates the physical dependency of location on housing prices for a neural network, creating a numeric column from a feature cross of latitude and longitude is the most effective approach. Here's why option B is the best choice:

Feature crosses: Feature crosses combine multiple features into a single feature that captures the interaction between them. For location data, a feature cross of latitude and longitude can capture spatial dependencies that affect housing prices. This approach allows the neural network to learn complex patterns related to geographic location more effectively than using raw latitude and longitude values.

Numerical representation: Converting the feature cross into a numeric column simplifies the input for the neural network and can improve the model's ability to learn from the data. This method ensures that the model can leverage the combined information from both latitude and longitude in a meaningful way.

Model training: Using a numeric column for the feature cross helps in regularizing the model and prevents overfitting, which is crucial for achieving good generalization on unseen data.
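As a rough illustration of this approach, the TensorFlow sketch below bucketizes latitude and longitude, crosses the buckets, and exposes the cross as a dense numeric input for a fully connected network. The boundary values, hash bucket size, and column names are illustrative assumptions; the tf.feature_column API is used here for brevity, although newer Keras preprocessing layers can express the same idea.

```python
# Hedged sketch of a latitude x longitude feature cross.
# Bucket boundaries, hash size, and column names are illustrative assumptions.
import tensorflow as tf

latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")

# Bucketize the raw coordinates so the cross captures grid cells rather than
# exact points; these boundaries stand in for real quantiles of the data.
lat_buckets = tf.feature_column.bucketized_column(
    latitude, boundaries=[34.0, 36.0, 38.0, 40.0, 42.0])
lon_buckets = tf.feature_column.bucketized_column(
    longitude, boundaries=[-124.0, -122.0, -120.0, -118.0, -116.0])

# Cross the two bucketized columns so each (lat-band, lon-band) cell becomes
# its own feature, then turn the sparse cross into dense numeric input.
lat_lon_cross = tf.feature_column.crossed_column(
    [lat_buckets, lon_buckets], hash_bucket_size=1000)
cross_numeric = tf.feature_column.indicator_column(lat_lon_cross)

# The dense feature can then feed a fully connected network, for example via
# tf.keras.layers.DenseFeatures([cross_numeric]).
```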
Question 166
You are deploying MariaDB SQL databases on GCE VM instances and need to configure monitoring and alerting. You want to collect metrics including network connections, disk I/O, and replication status from MariaDB with minimal development effort, and use Stackdriver for dashboards and alerts.
What should you do?
Question 167
You work for a bank. You have a labelled dataset that contains information on already granted loan applications and whether these applications have defaulted. You have been asked to train a model to predict default rates for credit applicants.
What should you do?
Question 168
You need to migrate a 2TB relational database to Google Cloud Platform. You do not have the resources to significantly refactor the application that uses this database, and cost to operate is of primary concern.
Which service do you select for storing and serving your data?
Question 169
You're using Bigtable for a real-time application, and you have a heavy load that is a mix of reads and writes. You've recently identified an additional use case and need to perform an hourly analytical job to calculate certain statistics across the whole database. You need to ensure both the reliability of your production application as well as the analytical workload.
What should you do?
Question 170
You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery. The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to BigQuery for analysis. Which job type and transforms should this pipeline use?
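Because the reference data fits in memory on a single worker, one common way to implement this kind of enrichment is to materialize the BigQuery table as a side input for the transform that processes the Pub/Sub stream. The Beam (Python SDK) sketch below assumes hypothetical project, topic, table, and field names, and assumes the output table already exists.

```python
# Hedged Beam (Python SDK) sketch: enrich a Pub/Sub stream with a small
# BigQuery table passed as a side input. All names are illustrative assumptions.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Small, static reference data: read once and expose it as a dict side input.
    reference = (
        p
        | "ReadReference" >> beam.io.ReadFromBigQuery(
            query="SELECT key, label FROM `example-project.dataset.reference`",
            use_standard_sql=True)
        | "ToKV" >> beam.Map(lambda row: (row["key"], row["label"]))
    )

    # Unbounded main input from Pub/Sub, enriched element by element.
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/example-project/topics/events")
        | "Parse" >> beam.Map(json.loads)
        | "Enrich" >> beam.Map(
            lambda event, ref: {**event, "label": ref.get(event["key"])},
            ref=beam.pvalue.AsDict(reference))
        | "WriteResults" >> beam.io.WriteToBigQuery(
            "example-project:dataset.enriched",  # assumed to exist already
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```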