ExamGecko
Question list
Search
Search

List of questions

Search

Related questions











Question 88 - DEA-C01 discussion

Report
Export

A retail company uses an Amazon Redshift data warehouse and an Amazon S3 bucket. The company ingests retail order data into the S3 bucket every day.

The company stores all order data at a single path within the S3 bucket. The data has more than 100 columns. The company ingests the order data from a third-party application that generates more than 30 files in CSV format every day. Each CSV file is between 50 and 70 MB in size.

The company uses Amazon Redshift Spectrum to run queries that select sets of columns. Users aggregate metrics based on daily orders. Recently, users have reported that the performance of the queries has degraded. A data engineer must resolve the performance issues for the queries.

Which combination of steps will meet this requirement with LEAST developmental effort? (Select TWO.)

A.

Configure the third-party application to create the files in a columnar format.

Answers
A.

Configure the third-party application to create the files in a columnar format.

B.

Develop an AWS Glue ETL job to convert the multiple daily CSV files to one file for each day.

Answers
B.

Develop an AWS Glue ETL job to convert the multiple daily CSV files to one file for each day.

C.

Partition the order data in the S3 bucket based on order date.

Answers
C.

Partition the order data in the S3 bucket based on order date.

D.

Configure the third-party application to create the files in JSON format.

Answers
D.

Configure the third-party application to create the files in JSON format.

E.

Load the JSON data into the Amazon Redshift table in a SUPER type column.

Answers
E.

Load the JSON data into the Amazon Redshift table in a SUPER type column.

Suggested answer: A, C

Explanation:

The performance issue in Amazon Redshift Spectrum queries arises due to the nature of CSV files, which are row-based storage formats. Spectrum is more optimized for columnar formats, which significantly improve performance by reducing the amount of data scanned. Also, partitioning data based on relevant columns like order date can further reduce the amount of data scanned, as queries can focus only on the necessary partitions.

A . Configure the third-party application to create the files in a columnar format:

Columnar formats (like Parquet or ORC) store data in a way that is optimized for analytical queries because they allow queries to scan only the columns required, rather than scanning all columns in a row-based format like CSV.

Amazon Redshift Spectrum works much more efficiently with columnar formats, reducing the amount of data that needs to be scanned, which improves query performance.

C . Partition the order data in the S3 bucket based on order date:

Partitioning the data on columns like order date allows Redshift Spectrum to skip scanning unnecessary partitions, leading to improved query performance.

By organizing data into partitions, you minimize the number of files Spectrum has to read, further optimizing performance.

Alternatives Considered:

B (Develop an AWS Glue ETL job): While consolidating files can improve performance by reducing the number of small files (which can be inefficient to process), it adds additional ETL complexity. Switching to a columnar format (Option A) and partitioning (Option C) provides more significant performance improvements with less development effort.

D and E (JSON-related options): Using JSON format or the SUPER type in Redshift introduces complexity and isn't as efficient as the proposed solutions, especially since JSON is not a columnar format.

Amazon Redshift Spectrum Documentation

Columnar Formats and Data Partitioning in S3

asked 29/10/2024
josh hill
37 questions
User
Your answer:
0 comments
Sorted by

Leave a comment first