Friday, June 7, 2013

AWS EMR : Getting started for Oracle DBAs

Newer technologies such as MapReduce (AWS EMR, Hadoop) and NoSQL (MongoDB, AWS DynamoDB...) can be confusing to Oracle DBAs.  This blog post takes a quick look at AWS Elastic MapReduce (EMR) and attempts to demystify it for Oracle DBAs.  To borrow an analogy from the days before RDBMS products, MapReduce is like a mainframe batch job with no built-in restart capability.  MapReduce facilitates the processing of large volumes of data in one large batch.  This one large batch, however, is broken into tens or hundreds of smaller pieces of work that are processed by MapReduce worker nodes.  This makes MapReduce a great solution for processing web logs, sensor data, genome data, large volumes of transactions, telephone call detail records, vote ballots, and other cases where large volumes of data need to be processed once and the results stored.

MapReduce is a framework, so you have to write your application against an API in order to take advantage of it.  There are a number of implementations of this framework, including Apache Hadoop and AWS Elastic MapReduce (EMR).  Apache Hadoop has no native data store associated with it (although the Hadoop Distributed File System, HDFS, can be used natively).

As mentioned, you need to code your own application using the MapReduce framework.  AWS makes getting started with MapReduce easier by providing sample applications for EMR.  One of the five sample EMR applications, the CloudFront HTTP LogAnalyzer, is a Java application that uses Cascading to analyze and generate usage reports from Amazon CloudFront HTTP access logs.  You specify the EMR input source (the CloudFront log location in S3) in the JAR arguments, and you also specify the S3 bucket that will hold the results (output).
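To make the "one large batch broken into smaller pieces" idea concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python.  It does not use Hadoop, EMR, or Cascading, and it is not how the sample LogAnalyzer is written.  The log format, the sample data, and every function name below are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical, simplified log lines: "<date> <client-ip> <uri> <bytes>".
# Real CloudFront access logs carry more fields; this format is an assumption.
SAMPLE_LOG = [
    "2013-06-01 10.0.0.1 /index.html 512",
    "2013-06-01 10.0.0.2 /index.html 512",
    "2013-06-01 10.0.0.1 /logo.png 2048",
    "2013-06-02 10.0.0.3 /index.html 512",
]


def split_input(lines, pieces):
    """Break the one large batch into smaller pieces of work."""
    return [lines[i::pieces] for i in range(pieces)]


def map_phase(chunk):
    """Map: emit a (uri, bytes) pair for every log line in one chunk."""
    pairs = []
    for line in chunk:
        _date, _ip, uri, size = line.split()
        pairs.append((uri, int(size)))
    return pairs


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into (hits, total_bytes)."""
    return {uri: (len(sizes), sum(sizes)) for uri, sizes in grouped.items()}


def run_job(lines, pieces=2):
    """Run the whole job in one process; a real cluster would hand
    each chunk to a separate worker node."""
    chunks = split_input(lines, pieces)
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))
```

Calling `run_job(SAMPLE_LOG)` returns a dictionary keyed by URI.  In a real MapReduce implementation the framework handles the splitting, the shuffle, and the distribution across worker nodes; you typically supply only the map and reduce logic.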

For the CloudFront HTTP LogAnalyzer, the input and output files use S3.  However, HDFS and AWS DynamoDB are commonly used as input sources and sometimes as output destinations.  You may want to use DynamoDB as the output destination if you wish to load the results into Redshift or do future BI analysis on them.  You could also send the results to an AWS SQS queue to be processed later and written to S3, DynamoDB, RDS, or some other persistent data store.
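As a rough sketch of what "send the results somewhere" can look like outside the EMR job itself, the two helpers below push a results dictionary to S3 or to an SQS queue.  They assume you pass in clients created with the boto3 library (for example `boto3.client("s3")` and `boto3.client("sqs")`); the helper names, bucket, key, and queue URL are all hypothetical, and this is not part of the sample LogAnalyzer.

```python
import json


def results_to_s3(s3_client, bucket, key, results):
    """Write the whole result set to S3 as one JSON object.

    s3_client is assumed to be a boto3 S3 client; bucket and key
    are placeholders you would supply.
    """
    body = json.dumps(results, sort_keys=True)
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return body


def results_to_sqs(sqs_client, queue_url, results):
    """Send one SQS message per result row for later processing.

    sqs_client is assumed to be a boto3 SQS client; queue_url is a
    placeholder for your own queue.
    """
    sent = 0
    for uri, hits in sorted(results.items()):
        message = json.dumps({"uri": uri, "hits": hits})
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=message)
        sent += 1
    return sent
```

A separate consumer could then read the queue and load each row into DynamoDB, RDS, or another persistent store at its own pace.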
