Question 85 - DEA-C01 discussion


A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.

Which solution will meet these requirements MOST cost-effectively?

A. Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

B. Write a PySpark ETL script. Host the script on an Amazon EMR cluster.

C. Write an AWS Glue PySpark job. Use Apache Spark to transform the data.

D. Write an AWS Glue Python shell job. Use pandas to transform the data.

Suggested answer: D

Explanation:

AWS Glue is a fully managed, serverless ETL service that can handle various data sources and formats, including .csv files in Amazon S3. AWS Glue offers two job types: PySpark jobs, which use Apache Spark to process large-scale data in parallel, and Python shell jobs, which run a Python script in a single execution environment for small-scale data. Because each S3 object is smaller than 100 MB, distributed processing is unnecessary, so a Python shell job is the more suitable and cost-effective choice. The job can use pandas, a popular Python library for data analysis, to transform the .csv data as needed.

The other options are less cost-effective. Writing a custom Python application and hosting it on an Amazon EKS cluster (option A) would require extra effort and resources to set up and manage the Kubernetes environment, in addition to the data ingestion and transformation logic itself. Writing a PySpark ETL script and hosting it on an Amazon EMR cluster (option B) would add the cost and complexity of provisioning and configuring the cluster, and Apache Spark is overkill for small files. An AWS Glue PySpark job (option C) would likewise incur unnecessary Spark overhead and charges for data of this size.
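To illustrate why a Python shell job fits this workload, here is a minimal sketch of such a job. The job parameters (input_bucket, input_key, output_bucket) and the transformation are hypothetical placeholders, not part of the exam question:

import io
import sys

import boto3
import pandas as pd
from awsglue.utils import getResolvedOptions  # available in the Glue job environment

# Hypothetical job parameters; pass them when starting the job, for example:
# --input_bucket my-uploads --input_key daily/users.csv --output_bucket my-curated
args = getResolvedOptions(sys.argv, ["input_bucket", "input_key", "output_bucket"])

s3 = boto3.client("s3")

# Each object is under 100 MB, so it fits comfortably in memory.
obj = s3.get_object(Bucket=args["input_bucket"], Key=args["input_key"])
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Placeholder transformation: drop fully empty rows and normalize column names.
df = df.dropna(how="all")
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Write the transformed data back to S3 as .csv.
buf = io.StringIO()
df.to_csv(buf, index=False)
s3.put_object(
    Bucket=args["output_bucket"],
    Key="transformed/" + args["input_key"],
    Body=buf.getvalue().encode("utf-8"),
)

A script like this runs as a Glue job with command name "pythonshell" and can be allocated as little as 0.0625 DPU, which is what makes it cheaper than a Spark-based Glue job or an EMR cluster for files of this size.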

References:

AWS Glue

Working with Python Shell Jobs

pandas

AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
