ExamGecko
Question list
Search
Search

List of questions

Search

Related questions











Question 44 - DP-600 discussion

Report
Export

You are analyzing customer purchases in a Fabric notebook by using PySpanc You have the following DataFrames:

You need to join the DataFrames on the customer_id column. The solution must minimize data shuffling. You write the following code.

Which code should you run to populate the results DataFrame?

A)

B)

C)

D)

A.
Option A
Answers
A.
Option A
B.
Option B
Answers
B.
Option B
C.
Option C
Answers
C.
Option C
D.
Option D
Answers
D.
Option D
Suggested answer: A

Explanation:

The correct code to populate the results DataFrame with minimal data shuffling is Option A. Using the broadcast function in PySpark is a way to minimize data movement by broadcasting the smaller DataFrame (customers) to each node in the cluster. This is ideal when one DataFrame is much smaller than the other, as in this case with customers. Reference = You can refer to the official Apache Spark documentation for more details on joins and the broadcast hint.

asked 02/10/2024
Thomas Spring
28 questions
User
Your answer:
0 comments
Sorted by

Leave a comment first