Here is some information about EMR:
1. Job Flow : A Job Flow is the Amazon term for the end-to-end processing of data that occurs through a number of compute steps. A Job Flow is defined by the MapReduce application and its input and output parameters. An EMR cluster does not need a Job Flow, as data processing can also be done interactively using Hive, Pig, Impala, or some other language.
2. Task group : The task group is optional. Task group instances do not have HDFS storage, so data needs to be transferred to these nodes by the master node. The task group can offload heavy computational work from the core group instances.
3. S3 : Amazon S3 is used for the input and output storage of the data sets to be processed and analyzed.
4. AMIs : The EMR cluster nodes run AMIs maintained by Amazon. Amazon regularly updates the EC2 AMIs with newer releases of Hadoop, security patches, and more.
5. Map and reduce : The map procedure takes data as input and filters and sorts it down to a set of key/value pairs that are then processed by the reduce procedure. The reduce procedure performs a summary operation such as grouping, sorting, or counting the key/value pairs. For example, a map procedure can parse out the date and time and treat this data element as a key; a reduce procedure can then count the records for each date and time.
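As a minimal sketch of the map and reduce procedures described above (plain Python, not actual Hadoop code; the log format and function names are invented for illustration):

```python
from collections import defaultdict

def map_records(lines):
    # Map: parse the date out of each record and emit (date, 1) pairs.
    # Assumes a hypothetical log format "YYYY-MM-DD <rest of record>".
    for line in lines:
        date = line.split(" ", 1)[0]
        yield (date, 1)

def reduce_counts(pairs):
    # Reduce: group the key/value pairs by key and sum the counts per date.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

log_lines = [
    "2014-06-01 GET /index.html",
    "2014-06-01 GET /about.html",
    "2014-06-02 GET /index.html",
]
print(reduce_counts(map_records(log_lines)))
```

In a real cluster the framework shuffles the map output between nodes before the reduce runs; here the two functions are simply chained in one process to show the data flow.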
6. s3cmd : s3cmd is used at the OS command line to load data into S3.
7. Job Flow Scheduling : To schedule a Job Flow to run every hour, you can configure cron to execute a script that launches it.
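As an illustration of the two items above (the bucket name, file path, and script path are hypothetical), the S3 upload and the hourly cron schedule could look like:

```shell
# Load a local data file into S3 with s3cmd:
#   s3cmd put /var/data/latest.log s3://my-emr-input-bucket/logs/
#
# Crontab entry running a Job Flow launch script at the top of every hour:
0 * * * * /home/hadoop/run_jobflow.sh
```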
8. EMR technologies supported as steps : EMR supports several technologies that can be used as steps in the EMR cluster:
a. Hive : An open source data warehouse package. The Hive Query Language (HQL) is similar to RDBMS SQL, so Hive is best for organizations with strong SQL skills. Hive also has extensions that provide direct access to DynamoDB, so an EMR cluster can load data directly from DynamoDB.
b. Custom JAR : Uses the core Hadoop Java libraries preloaded into the EMR cluster.
c. Streaming : Allows you to write Amazon EMR Job Flows in Ruby, Perl, Python, PHP, R, Bash, or C++. You can convert an existing ETL job to run in EMR using streaming.
d. Pig : Pig is a data flow engine that is preloaded in the EMR cluster. Its scripting language, Pig Latin, is procedural rather than SQL-like, so it is a good fit for organizations with strong scripting skills.
e. HBase : HBase is an efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because recently accessed data is cached in memory.
9. Filter statement : A Map custom JAR application uses the filter statement, which is similar to a WHERE clause in a SQL statement.
10. GROUP : A Reduce custom JAR application uses the GROUP statement, which is similar to a GROUP BY clause in a SQL statement.
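Putting items 9 and 10 together (the sample records and field names are invented for illustration), the filter and GROUP steps behave like a SQL WHERE and GROUP BY:

```python
from collections import Counter

records = [
    {"page": "/index.html", "status": 200},
    {"page": "/about.html", "status": 404},
    {"page": "/index.html", "status": 200},
]

# Filter (map side): keep only successful requests,
# like "WHERE status = 200" in SQL.
filtered = [r for r in records if r["status"] == 200]

# Group (reduce side): count requests per page,
# like "GROUP BY page" in SQL.
grouped = Counter(r["page"] for r in filtered)
print(dict(grouped))  # {'/index.html': 2}
```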
11. Limited structure data and late binding : Unlike data warehousing solutions based on OLAP or an RDBMS, Amazon EMR clusters work with unstructured data and perform late binding of the schema, at read time.
12. Performance on small data sets : When running Hive queries against EMR, the run time will appear shockingly slow on small data sets when compared to running against a traditional RDBMS. The structured nature of the data sets and the indexing capabilities of the RDBMS make it faster. EMR (MapReduce) is made for large, unstructured data sets.
13. Mahout : Mahout, a machine learning library, is supported in EMR.