Question 362 - Professional Data Engineer discussion


You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could be stored anywhere in the United States. You need to have a recovery process in place in case of a catastrophic single region failure. You need an approach with a maximum of 15 minutes of data loss (RPO=15 mins). You want to ensure that there is minimal latency when reading the data. What should you do?

A.
1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions.
2. Enable turbo replication.
3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the us-south1 region.
4. In case of a regional failure, redeploy your Dataproc cluster to the us-south1 region and continue reading from the same bucket.

B.
1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions.
2. Enable turbo replication.
3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the same region.
4. In case of a regional failure, redeploy the Dataproc clusters to the us-south1 region and read from the same bucket.

C.
1. Create a Cloud Storage bucket in the US multi-region.
2. Run the Dataproc cluster in a zone in the us-central1 region, reading data from the US multi-region bucket.
3. In case of a regional failure, redeploy the Dataproc cluster to the us-central2 region and continue reading from the same bucket.

D.
1. Create two regional Cloud Storage buckets, one in the us-central1 region and one in the us-south1 region.
2. Have the upstream process write data to the us-central1 bucket. Use the Storage Transfer Service to copy data hourly from the us-central1 bucket to the us-south1 bucket.
3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in that region.
4. In case of regional failure, redeploy your Dataproc clusters to the us-south1 region and read from the bucket in that region instead.

Suggested answer: B

Explanation:

To survive a single-region failure with at most 15 minutes of data loss and minimal read latency, the best approach is a dual-region bucket with turbo replication. Here's why option B is the best choice:

Dual-Region Bucket:

A dual-region bucket provides geo-redundancy by replicating data across two regions, ensuring high availability and resilience against regional failures.

The chosen regions (us-central1 and us-south1) provide geographic diversity within the United States.

Turbo Replication:

Turbo replication replicates newly written objects between the two regions within 15 minutes, meeting the 15-minute Recovery Point Objective (RPO).

This minimizes data loss in case of a regional failure.
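
As a minimal sketch, assuming the google-cloud-storage Python client and a hypothetical bucket name, you can verify that turbo replication is active (and enable it if it is not) by inspecting the bucket's rpo setting:

from google.cloud import storage
from google.cloud.storage.constants import RPO_ASYNC_TURBO

client = storage.Client()
bucket = client.get_bucket("example-dual-region-bucket")  # hypothetical bucket name

# rpo is "ASYNC_TURBO" when turbo replication is enabled, "DEFAULT" otherwise.
if bucket.rpo != RPO_ASYNC_TURBO:
    bucket.rpo = RPO_ASYNC_TURBO
    bucket.patch()  # enable turbo replication on the dual-region bucket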

Running Dataproc Cluster:

Running the Dataproc cluster in the same region as the primary data storage (us-central1) ensures minimal latency for normal operations.

In case of a regional failure, redeploying the Dataproc cluster to the secondary region (us-south1) restores processing with minimal data loss.

The other options fall short: option A reads cross-region during normal operation, adding latency; option C relies on a multi-region bucket, which carries no 15-minute replication guarantee; and option D's hourly Storage Transfer Service copy allows up to an hour of data loss, missing the RPO.

Steps to Implement:

Create a Dual-Region Bucket:

Set up a dual-region bucket in the Google Cloud Console, selecting us-central1 and us-south1 regions.

Enable turbo replication to ensure rapid data replication between the regions.
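
A minimal creation sketch, assuming the google-cloud-storage Python client; the project ID and bucket name are hypothetical placeholders:

from google.cloud import storage
from google.cloud.storage.constants import RPO_ASYNC_TURBO

client = storage.Client(project="example-project")  # hypothetical project ID

# A dual-region bucket uses the "US" location with an explicit two-region placement.
bucket = client.create_bucket(
    "example-dual-region-bucket",  # hypothetical bucket name
    location="US",
    data_locations=["US-CENTRAL1", "US-SOUTH1"],
)

# Turbo replication targets replication between the two regions within 15 minutes.
bucket.rpo = RPO_ASYNC_TURBO
bucket.patch()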

Deploy Dataproc Cluster:

Deploy the Dataproc cluster in the us-central1 region to read data from the bucket located in the same region for optimal performance.
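
As a sketch using the google-cloud-dataproc Python client; the project ID, cluster name, and machine sizes are hypothetical:

from google.cloud import dataproc_v1

region = "us-central1"
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "example-project",      # hypothetical project ID
    "cluster_name": "spark-processing",   # hypothetical cluster name
    "config": {
        "gce_cluster_config": {"zone_uri": f"{region}-a"},
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": "example-project", "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is ready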

Set Up Failover Plan:

Plan for redeployment of the Dataproc cluster to the us-south1 region in case of a failure in the us-central1 region.

Ensure that the failover process is well-documented and tested to minimize downtime and data loss.
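
Because the dual-region bucket keeps the same gs:// URI in both regions, failover only changes where the cluster runs; Spark jobs need no path changes. A hypothetical failover sketch with the same client library (names are placeholders):

from google.cloud import dataproc_v1

failover_region = "us-south1"
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{failover_region}-dataproc.googleapis.com:443"}
)

# Recreate the cluster in the secondary region; jobs keep reading the same
# bucket paths, since the dual-region bucket is already available there.
cluster = {
    "project_id": "example-project",         # hypothetical project ID
    "cluster_name": "spark-processing-dr",   # hypothetical DR cluster name
    "config": {"gce_cluster_config": {"zone_uri": f"{failover_region}-a"}},
}

operation = cluster_client.create_cluster(
    request={"project_id": "example-project", "region": failover_region, "cluster": cluster}
)
operation.result()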

References:

Google Cloud Storage Dual-Region

Turbo Replication in Google Cloud Storage

Dataproc Documentation

asked 18/09/2024 by Slavomir Ugrevic