Monday, December 2, 2013

Data stores compatible with Amazon EMR

A number of different file systems and data stores can be used with Amazon EMR:

1. Hadoop Distributed File System (HDFS): HDFS resides on the local (ephemeral) disks of the cluster's EC2 instances. The obvious disadvantage is that this storage is ephemeral and is reclaimed when the cluster terminates. It is useful for caching the results produced by intermediate job-flow steps during a large EMR job.
2. Local (ephemeral) EC2 disk: Each EMR node comes with a local disk. This disk works well for temporary storage of data that is continually changing, such as buffers, caches, scratch data, and other temporary content.
3. S3 native: Used for job input (the data set to be reduced) and for job output (the results).
4. S3 block: Best avoided, as it does not perform as well as the other options.
5. HBase: HBase is an open source, non-relational, distributed database that runs on top of HDFS. HBase works with Hadoop/EMR, sharing its file system and serving as a direct input and output to EMR jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC).
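
In practice, the file-system options in the list above are usually told apart by the URI prefix on a job's input or output path. The Python sketch below is only an illustration of that idea, not anything EMR ships: the helper name `classify_store`, the bucket name, and the paths are all hypothetical, and the prefix table is my assumption based on the prefixes EMR has documented (`hdfs://`, `file://`, `s3n://`, `s3bfs://`). Check the prefixes against the documentation for your EMR version before relying on them.

```python
from urllib.parse import urlparse

# Assumed mapping from URI scheme to the stores listed in this post.
# These prefixes are an assumption; verify them for your EMR release.
STORE_BY_SCHEME = {
    "hdfs": "HDFS",
    "file": "Local (ephemeral) EC2 disk",
    "s3n": "S3 native",
    "s3bfs": "S3 block",
}


def classify_store(uri):
    """Return the name of the data store implied by the URI's scheme."""
    scheme = urlparse(uri).scheme
    return STORE_BY_SCHEME.get(scheme, "unknown")


# Hypothetical paths, for illustration only.
for path in [
    "s3n://example-bucket/input/",
    "hdfs:///user/hadoop/intermediate/",
    "file:///mnt/scratch/",
]:
    print(path, "->", classify_store(path))
```

A typical layout following the list above would read input from an `s3n://` path, keep intermediate step results on `hdfs://`, and write the final results back to `s3n://` so they survive after the cluster ends.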

More information here:
