ExamGecko
Question 147 - MLS-C01 discussion

A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span 5 to 10 columns only.

How should the Machine Learning Specialist transform the dataset to minimize query runtime?

A. Convert the records to Apache Parquet format
B. Convert the records to JSON format
C. Convert the records to GZIP CSV format
D. Convert the records to XML format
Suggested answer: A

Explanation:

To optimize the query performance of Athena, one of the best practices is to convert the data into a columnar format, such as Apache Parquet or Apache ORC. Columnar formats store data by columns rather than by rows, which allows Athena to scan only the columns that are relevant to the query, reducing the amount of data read and improving the query speed. Columnar formats also support compression and encoding schemes that can reduce the storage space and the data scanned per query, further enhancing the performance and reducing the cost.
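
A rough back-of-the-envelope estimate makes the effect concrete. The sketch below uses the figures from the question (800,000 records, about 1.5 MB each, 200 columns, up to 10 columns queried) and assumes, purely for illustration, that every column is the same width and that 1 MB is 10^6 bytes. The function name `estimate_scanned_bytes` is made up for this example, and the estimate ignores compression and encoding, which would shrink the columnar figure further.

```python
def estimate_scanned_bytes(num_records, record_bytes, total_columns, queried_columns):
    """Return (row_scan_bytes, columnar_scan_bytes).

    Assumption: all columns have the same width, so a columnar reader
    touches queried_columns / total_columns of the data.
    """
    row_scan = num_records * record_bytes
    columnar_scan = row_scan * queried_columns / total_columns
    return row_scan, columnar_scan


# Figures taken from the question; 1.5 MB is treated as 1.5e6 bytes.
row_bytes, col_bytes = estimate_scanned_bytes(800_000, 1.5e6, 200, 10)

print(f"row-oriented scan: {row_bytes / 1e12:.2f} TB")
print(f"columnar scan (10 of 200 columns): {col_bytes / 1e9:.0f} GB")
```

Under these assumptions a full row-oriented scan reads about 1.2 TB, while a query over 10 columns in a columnar layout reads about 60 GB, or 5% of the data.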

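In contrast, plaintext CSV files store data by rows, which means that Athena has to read every full row even if a query needs only a few columns. This increases the amount of data read and the query latency. Compressing the CSV files with GZIP (option C) reduces storage space and the bytes scanned, but the data remains row-oriented, so Athena must still decompress and parse all 200 columns of every record. JSON and XML (options B and D) are also row-oriented text formats and are more verbose than CSV, so they do not reduce the amount of data read.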
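
The difference between the two layouts can be illustrated with a toy model. The sketch below is not the Parquet file format and does not involve Athena; it only contrasts one gzip-compressed CSV blob with a hypothetical layout in which each column is compressed separately. The table contents, sizes, and helper names are all invented for this example.

```python
import gzip

NUM_ROWS, NUM_COLS = 1000, 200
WANTED = [0, 1, 2, 3, 4]  # the query needs 5 of the 200 columns

# Synthetic table: each cell encodes its row and column index.
table = [[f"r{r}c{c}" for c in range(NUM_COLS)] for r in range(NUM_ROWS)]

# Row-oriented layout: the whole table as one gzip-compressed CSV blob.
csv_text = "\n".join(",".join(row) for row in table)
gz_csv = gzip.compress(csv_text.encode())

# Toy column-oriented layout: one compressed blob per column.
column_blobs = [
    gzip.compress("\n".join(row[c] for row in table).encode())
    for c in range(NUM_COLS)
]


def read_from_csv(blob, wanted):
    """Decompress and parse every row, then keep only the wanted columns."""
    text = gzip.decompress(blob).decode()
    rows = [line.split(",") for line in text.split("\n")]
    return [[row[c] for c in wanted] for row in rows], len(text.encode())


def read_from_columns(blobs, wanted):
    """Decompress only the blobs for the wanted columns."""
    decoded = [gzip.decompress(blobs[c]).decode() for c in wanted]
    bytes_touched = sum(len(d.encode()) for d in decoded)
    cols = [d.split("\n") for d in decoded]
    return [list(values) for values in zip(*cols)], bytes_touched


csv_result, csv_bytes = read_from_csv(gz_csv, WANTED)
col_result, col_bytes = read_from_columns(column_blobs, WANTED)

print("same result:", csv_result == col_result)
print("bytes decompressed from gzip CSV:", csv_bytes)
print("bytes decompressed from column blobs:", col_bytes)
```

Both readers return the same values, but the CSV reader has to decompress the entire table to do so, while the column reader touches only the five blobs it needs.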

Therefore, the Machine Learning Specialist should transform the dataset to Apache Parquet format to minimize query runtime.
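
One possible way to perform the conversion is an Athena CREATE TABLE AS SELECT (CTAS) statement that writes the existing CSV-backed table out as Parquet. The helper below only builds the SQL text and does not call any AWS API. The table names and the S3 location are hypothetical placeholders, and the question itself does not prescribe a conversion method, so treat this as a sketch rather than the intended procedure.

```python
def build_parquet_ctas(source_table, target_table, s3_location):
    """Build an Athena CTAS statement that rewrites a table as Parquet.

    All three arguments are placeholders supplied by the caller; nothing
    here validates them or contacts AWS.
    """
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "    format = 'PARQUET',\n"
        f"    external_location = '{s3_location}'\n"
        ")\n"
        f"AS SELECT * FROM {source_table}"
    )


# Hypothetical names, for illustration only.
ctas_sql = build_parquet_ctas(
    source_table="records_csv",
    target_table="records_parquet",
    s3_location="s3://example-bucket/records-parquet/",
)
print(ctas_sql)
```

The resulting statement could be run from the Athena query editor; later queries would then target the new Parquet-backed table instead of the CSV one.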

References:

Top 10 Performance Tuning Tips for Amazon Athena

Columnar Storage Formats

Using compressions will reduce the amount of data scanned by Amazon Athena, and also reduce your S3 bucket storage. It's a Win-Win for your AWS bill. Supported formats: GZIP, LZO, SNAPPY (Parquet) and ZLIB.

asked 16/09/2024
Ange YAO