Question 261 - Professional Data Engineer discussion
You've migrated a Hadoop job from an on-premises cluster to Dataproc and Cloud Storage. Your Spark job is a complex analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB each). You see some performance degradation after the migration to Dataproc, so you'd like to optimize for it. Your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptible VMs (with only 2 non-preemptible workers) for this workload. What should you do?
A.
Switch from HDDs to SSDs; override the preemptible VMs configuration to increase the boot disk size.
B.
Increase the size of your Parquet files to ensure they are at least 1 GB each.
C.
Switch to TFRecords format (approx. 200 MB per file) instead of Parquet files.
D.
Switch from HDDs to SSDs; copy the initial data from Cloud Storage to the Hadoop Distributed File System (HDFS), run the Spark job, and copy the results back to Cloud Storage.
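
For context on what options A and D propose at the infrastructure level: on Dataproc, Spark writes shuffle data to each worker's local disk, and standard persistent disk throughput scales with disk size, so small HDD boot disks on preemptible workers are a common bottleneck for shuffle-heavy jobs. Below is a minimal sketch, assuming the google-cloud-dataproc Python client, of a cluster whose preemptible secondary workers are provisioned with larger SSD boot disks. The project ID, region, cluster name, worker count, and disk sizes are illustrative placeholders, not values from the question.

# Sketch (assumptions noted above): create a Dataproc cluster whose
# preemptible secondary workers get larger pd-ssd boot disks, as in option A.
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

# The Dataproc client must target a regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "shuffle-heavy-spark",  # placeholder name
    "config": {
        # The two non-preemptible primary workers from the question.
        "worker_config": {
            "num_instances": 2,
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        # Preemptible secondary workers. Spark shuffle data spills to the
        # boot disk by default, so its type and size bound shuffle throughput.
        "secondary_worker_config": {
            "num_instances": 8,  # illustrative count
            "preemptibility": "PREEMPTIBLE",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is running

One more point worth weighing against option D: Dataproc preemptible secondary workers do not run HDFS DataNodes, so copying the input data into HDFS would place it only on the 2 non-preemptible workers.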