Question 6 - MLS-C01 discussion


A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. It must also be possible to explore the data using SQL.

Which storage scheme is MOST adapted to this scenario?

A. Store datasets as files in Amazon S3.
B. Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.
C. Store datasets as tables in a multi-node Amazon Redshift cluster.
D. Store datasets as global tables in Amazon DynamoDB.
Suggested answer: A

Explanation:

The best storage scheme for this scenario is to store datasets as files in Amazon S3. Amazon S3 is a scalable, cost-effective, and durable object storage service that can store any amount and type of data. Amazon S3 also supports querying data using SQL with Amazon Athena, a serverless interactive query service that can analyze data directly in S3. This way, the Data Science team can easily explore and analyze their datasets without having to load them into a database or a compute instance.
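To make the suggested answer concrete, here is a minimal sketch of how datasets stored as files in S3 could be explored with SQL through Athena. The bucket name, database, and column schema are hypothetical examples (not from the question), and the boto3 call is shown for illustration only; it requires valid AWS credentials to actually run.

```python
# Sketch only: bucket, database, and schema names below are hypothetical.

# Athena overlays a SQL table on files already sitting in S3;
# no data-loading step is needed.
create_table_sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS datasets.training_data (
    feature_1 DOUBLE,
    feature_2 DOUBLE,
    label     INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-ds-repo/training_data/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

# An exploratory query run directly against the S3 files.
explore_sql = "SELECT label, COUNT(*) AS n FROM datasets.training_data GROUP BY label"


def submit_athena_query(sql, output_location):
    """Submit a query with boto3 (needs AWS credentials; illustration only)."""
    import boto3  # imported here so the sketch reads without AWS configured

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Because Athena is serverless and charges only for data scanned, new datasets dropped into the S3 prefix become queryable without any cluster to resize or instance to manage, which matches the auto-scaling and cost requirements in the question.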

The other options are not as suitable for this scenario because:

Storing datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance would limit both the scalability and the availability of the data: an EBS volume is accessible only from within a single Availability Zone and has a maximum size of 16 TiB. EBS storage is also more expensive per GB than S3 and requires provisioning and managing EC2 instances.

Storing datasets as tables in a multi-node Amazon Redshift cluster would incur higher costs and complexity than using S3 and Athena. Amazon Redshift is a data warehouse service that is optimized for analytical queries over structured or semi-structured data. However, it requires setting up and maintaining a cluster of nodes, loading data into tables, and choosing the right distribution and sort keys for optimal performance. Moreover, Amazon Redshift charges for both storage and compute, while S3 and Athena only charge for the amount of data stored and scanned, respectively.

Storing datasets as global tables in Amazon DynamoDB would not be feasible for large training datasets. DynamoDB is a key-value and document database designed for fast, consistent performance at any scale, but it limits each item to 400 KB, which is far too small for typical dataset files. Also, DynamoDB does not support full SQL natively, so running SQL queries over DynamoDB data would require an additional service such as Amazon EMR or AWS Glue.

References:

Amazon S3 - Cloud Object Storage

Amazon Athena - Interactive SQL Queries for Data in Amazon S3

Amazon EBS - Amazon Elastic Block Store (EBS)

Amazon Redshift - Data Warehouse Solution - AWS

Amazon DynamoDB -- NoSQL Cloud Database Service

asked 16/09/2024
janet phillips