Question 163 - Professional Data Engineer discussion

You've migrated a Hadoop job from an on-premises cluster to Dataproc and Cloud Storage (GCS). Your Spark job is a complex analytical workload with many shuffle operations, and the input data consists of Parquet files (on average 200-400 MB each). You see some performance degradation after the migration to Dataproc and would like to optimize for it. Keep in mind that your organization is very cost-sensitive, so you'd like to continue running Dataproc on preemptible VMs (with only 2 non-preemptible workers) for this workload.

What should you do?

A. Increase the size of your Parquet files so that each file is at least 1 GB.
B. Switch to TFRecord format (approx. 200 MB per file) instead of Parquet files.
C. Switch from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
D. Switch from HDDs to SSDs, and override the preemptible VM configuration to increase the boot disk size.
Suggested answer: C

Explanation:

To optimize the performance of a complex Spark job on Dataproc that relies heavily on shuffle operations, and given the cost constraint of using preemptible VMs, switching from HDDs to SSDs and using HDFS as an intermediate storage layer can significantly improve performance. Here's why option C is the best choice:

Performance of SSDs: SSDs provide much faster read and write speeds than HDDs, which is crucial for performance-intensive operations like shuffling in Spark jobs. Using SSDs reduces I/O bottlenecks during the shuffle phase of your Spark job, improving overall job performance.

Intermediate storage with HDFS: Copying data from Google Cloud Storage (GCS) to HDFS for intermediate storage can reduce latency compared to reading directly from GCS. HDFS provides better data locality and faster data access within the Dataproc cluster, which can significantly improve the efficiency of shuffling and other I/O operations.
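As a rough illustration of the cluster side of option C, the sketch below uses the google-cloud-dataproc Python client to request pd-ssd boot disks for both the two non-preemptible workers and the preemptible secondary workers. The project ID, region, cluster name, machine types, worker counts, and disk sizes are placeholders chosen for the example, not values given in the question.

```python
# Minimal sketch, assuming the google-cloud-dataproc library is installed
# (pip install google-cloud-dataproc). All names and sizes below are placeholders.
from google.cloud import dataproc_v1 as dataproc

project_id = "my-project"          # placeholder project ID
region = "us-central1"             # placeholder region
cluster_name = "shuffle-heavy"     # placeholder cluster name

# The cluster client must point at the regional Dataproc endpoint.
cluster_client = dataproc.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-4",
        },
        # The two non-preemptible primary workers from the scenario, on SSD boot disks.
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-8",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        # Secondary workers are preemptible by default on Dataproc; give them
        # SSD boot disks as well so shuffle spills don't hit slow HDDs.
        "secondary_worker_config": {
            "num_instances": 8,
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()
print(f"Cluster created: {result.cluster_name}")
```

Once the cluster is up, the GCS-to-HDFS staging step in option C could be done with DistCp before the Spark job runs (for example, `hadoop distcp gs://your-bucket/input hdfs:///input`, where the bucket and paths are again placeholders), and the results copied back to GCS the same way after the job finishes.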
