Question

A customer is collecting clickstream data using Amazon Kinesis and is grouping the events by IP address into 5-minute chunks stored in Amazon S3.

Many analysts in the company use Hive on Amazon EMR to analyze this data. Their queries always reference a single IP address. Data must be optimized for querying based on IP address using Hive running on Amazon EMR. What is the most efficient method to query the data with Hive?

KHALID ALSHAHRANI · Accepted Answer

Store an index of the files by IP address in the Amazon DynamoDB metadata store for EMRFS.

KHALID ALSHAHRANI · Answer

Store the Amazon S3 objects with the following naming scheme: bucket_name/source=ip_address/year=yy/month=mm/day=dd/hour=hh/filename.
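To make the naming scheme in this option concrete, here is a minimal Python sketch that builds such an object key for one 5-minute chunk. The function name `partitioned_key`, the sample IP address, and the file name are hypothetical and only illustrate the layout. Hive can map `key=value` path segments like these to partition columns when the table is declared with a matching `PARTITIONED BY` clause and the partitions are registered, so a query filtered on a single `source` value only needs to read that prefix.

```python
from datetime import datetime, timezone


def partitioned_key(ip_address: str, event_time: datetime, filename: str) -> str:
    """Build an S3 object key (without the bucket name) using key=value segments.

    Follows the scheme source=ip_address/year=yy/month=mm/day=dd/hour=hh/filename.
    """
    return (
        f"source={ip_address}/"
        f"year={event_time:%y}/"    # two-digit year, as in the scheme
        f"month={event_time:%m}/"
        f"day={event_time:%d}/"
        f"hour={event_time:%H}/"
        f"{filename}"
    )


# Hypothetical chunk written at 14:05 UTC on 9 March 2016.
example_key = partitioned_key(
    "203.0.113.7",
    datetime(2016, 3, 9, 14, 5, tzinfo=timezone.utc),
    "events-0001.gz",
)
print(example_key)
# source=203.0.113.7/year=16/month=03/day=09/hour=14/events-0001.gz
```

Putting the IP address first in the key means all objects for one address share a single prefix, which matches the stated access pattern of queries that always reference one IP address.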

KHALID ALSHAHRANI · Answer

Store the data in an HBase table with the IP address as the row key.

KHALID ALSHAHRANI · Answer

Store the events for an IP address as a single file in Amazon S3 and add metadata with keys: Hive_Partitioned_IPAddress.

Related questions

Question 28 - BDS-C00 discussion

Suggested answer: A