Showing posts with label hadoop. Show all posts

Monday, December 2, 2013

Mahout and R


Mahout:
Machine learning library
Supports recommendation mining, clustering, classification, and frequent itemset mining

R:
Language and software environment for statistical computing and graphics
Open source

Mahout is neither R nor SAS.

SAS article on comparing the three: http://support.sas.com/resources/papers/Benchmark_R_Mahout_SAS.pdf

Data stores compatible with Amazon EMR

There are a number of different file systems that can be used

1. Hadoop Distributed File System (HDFS) : HDFS resides on EC2 local/ephemeral disk. The obvious disadvantage is that this storage is ephemeral and is reclaimed when the cluster ends. It is well suited to caching the results produced by intermediate job-flow steps during a large EMR job.
2. Local (ephemeral) EC2 disk :  Each EMR node comes with local disk.  This disk works well for temporary storage of data that is continually changing, such as buffers, caches, scratch data, and other temporary content.
3. S3 native : Used for input (data set to be reduced) and output/results.
4. S3 block : Avoid this option, as it is not as performant as the others.
5. HBase : HBase is an open source, non-relational, distributed database that runs on top of HDFS.  HBase works with Hadoop/EMR, sharing its file system and serving as a direct input and output to EMR jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC).
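As a sketch of how these stores combine in practice, a two-stage EMR job might read its durable input from S3, spill intermediate data to HDFS on the cluster's ephemeral disks, and write final results back to S3. The bucket and path names below are hypothetical, and the plain-data layout is an illustration of the pattern rather than any specific EMR API:

```python
# Sketch: pairing the storage options above in a two-stage EMR job.
# S3 holds durable input/output; HDFS (ephemeral disk) holds intermediates.
# All bucket and path names are made up for illustration.

def build_job_stages(input_s3, intermediate_hdfs, output_s3):
    """Return the two-stage layout described above as plain data."""
    return [
        {   # Stage 1: read raw data from S3, cache intermediates on HDFS
            "input": input_s3,
            "output": intermediate_hdfs,
        },
        {   # Stage 2: reduce the intermediates, persist results to S3
            "input": intermediate_hdfs,
            "output": output_s3,
        },
    ]

stages = build_job_stages(
    "s3://example-bucket/raw-logs/",   # durable input (S3 native)
    "hdfs:///tmp/intermediate/",       # reclaimed when the cluster ends
    "s3://example-bucket/results/",    # durable output (S3 native)
)
print(stages[1]["input"])  # hdfs:///tmp/intermediate/
```

The point is the URI schemes: anything you need after the cluster terminates goes to `s3://`, while `hdfs:///` paths exist only for the life of the cluster.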

More information here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html



Monday, September 9, 2013

Hadoop Cluster and Amazon EMR

Accenture calculated the total cost of ownership of a bare-metal Hadoop cluster and derived the capacity of nine different cloud-based Hadoop clusters at the matched TCO. The performance of each option was then compared by running three real-world Hadoop applications.  The full report is here:

http://www.accenture.com/us-en/Pages/insight-hadoop-deployment-comparison.aspx

The report compares the performance of a bare-metal Hadoop cluster against Amazon Elastic MapReduce (Amazon EMR).

Friday, June 7, 2013

AWS EMR : Getting started for Oracle DBAs


Newer technologies such as MapReduce (AWS EMR, Hadoop) and NoSQL (MongoDB, AWS DynamoDB...) can be confusing to Oracle DBAs. This blog post takes a quick look at AWS Elastic MapReduce (EMR) and attempts to demystify it for Oracle DBAs. Going back before RDBMS products, MapReduce is like a mainframe batch job with no restart ability built in. MapReduce facilitates the processing of large volumes of data in one large batch. This one large batch, however, is broken into tens or hundreds of smaller pieces of work and processed by MapReduce worker nodes. This makes MapReduce a great solution for processing web logs, sensor data, genome data, large volumes of transactions, telephone call detail records, vote ballots, and other cases where large volumes of data need to be processed once and the results stored.

MapReduce is a framework, so your application has to write to an API in order to take advantage of it. There are a number of implementations of this framework, including Apache Hadoop and AWS Elastic MapReduce (EMR). Apache Hadoop has no native data store associated with it (although the Hadoop Distributed File System, HDFS, can be used natively).

As mentioned, you need to code your own application using the MapReduce framework. AWS makes getting started easier by providing sample applications for EMR. One of the five sample EMR applications is a Java application for processing AWS CloudFront logs: a Java application that uses Cascading to analyze and generate usage reports from Amazon CloudFront HTTP access logs. You specify the EMR input source (the CloudFront log location in S3) in the JAR arguments, and you also specify the S3 bucket that will hold the results (output).
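To make the "many small pieces of work" idea concrete, here is a minimal, self-contained Python sketch of the MapReduce pattern: a mapper emits key/value pairs from each log line, the pairs are shuffled (grouped) by key, and a reducer aggregates each key's values. The log lines and field layout are invented for illustration; a real EMR job would implement the same mapper and reducer roles against the Hadoop/EMR APIs rather than in plain Python:

```python
from collections import defaultdict

# Hypothetical CloudFront-style log lines: "date time edge-location status url"
LOG_LINES = [
    "2013-06-07 10:00:01 IAD2 200 /index.html",
    "2013-06-07 10:00:02 SFO4 404 /missing.gif",
    "2013-06-07 10:00:03 IAD2 200 /index.html",
]

def mapper(line):
    """Emit (url, 1) for each request -- one small unit of work per line."""
    fields = line.split()
    yield fields[4], 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Aggregate each key's values into one final result."""
    return key, sum(values)

pairs = [pair for line in LOG_LINES for pair in mapper(line)]
results = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(results)  # {'/index.html': 2, '/missing.gif': 1}
```

In a real cluster the mapper calls run in parallel across worker nodes and the shuffle happens over the network, but the contract you code to is exactly these two functions.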


For the CloudFront HTTP LogAnalyzer, the input and output files use S3. However, HDFS and AWS DynamoDB are commonly used as input sources and are sometimes used as output targets. You may want to use DynamoDB as an output target if you wish to load the results into Redshift or do future BI analysis on them. You could also send the results to an AWS SQS queue to be handled later for processing into S3, DynamoDB, RDS, or some other persistent data store.