Showing posts with label dw. Show all posts

Friday, May 2, 2014

Redshift SQL query tools

      Here are two tools that can be used with Amazon Redshift to issue query commands:      

1. SQL Workbench:
The information required about your Redshift cluster is:
A. JDBC or ODBC connection string: an example JDBC connection string looks like this: jdbc:postgresql://redshiftexamplecluster.cvhv40lk8tel.us-east-1.redshift.amazonaws.com:5439/redexampletl?tcpKeepAlive=true
B. User ID : master
C. password : password
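As a sketch, the JDBC connection string above can be split into its host, port, and database parts with Python's standard library (the endpoint shown is the sample one from the example, not a real cluster):

```python
# Split the example Redshift JDBC URL into its parts.
# Illustrative only; the endpoint is the sample one, not a real cluster.
from urllib.parse import urlparse

jdbc_url = ("jdbc:postgresql://redshiftexamplecluster.cvhv40lk8tel"
            ".us-east-1.redshift.amazonaws.com:5439/redexampletl"
            "?tcpKeepAlive=true")

# Strip the "jdbc:" prefix so urlparse sees an ordinary URL.
url = urlparse(jdbc_url[len("jdbc:"):])

host = url.hostname              # the cluster endpoint
port = url.port                  # 5439, Redshift's default port
database = url.path.lstrip("/")  # the database name
print(host, port, database)
```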

2. Aginity: http://www.aginity.com
A. Server (endpoint): an example looks like this: lab.cfmvmxhkrtel.us-west-2.redshift.amazonaws.com
B. User ID : master
C. password : password
D. Database: databasename
E. Port: 5439

The nice thing about Aginity is that there is no JDBC driver to install. With SQL Workbench, you will need to download and install the PostgreSQL JDBC driver.

More on where to find the JDBC driver and how to configure it with SQL Workbench:
1. Download the PostgreSQL JDBC driver JAR.
2. Add the downloaded driver JAR to the PostgreSQL driver definition in SQL Workbench.






Monday, December 9, 2013

Redshift example

This is a great place to get started with Redshift: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html.  There are a couple of pieces of information in the step-by-step instructions that I wanted to elaborate on:
1. Creating a cluster subnet group is done in the Redshift area of the console as seen below. This is not evident in the instructions:


This of course assumes that a VPC with two public subnets and a route to an IGW has been created.

2. The copy command is issued from SQL Workbench:

copy venue from 's3://awssampledb/tickit/venue_pipe.txt' CREDENTIALS 'aws_access_key_id=<your access key>;aws_secret_access_key=<your secret key>' delimiter '|';

3. Make sure to set auto-commit; otherwise you need to commit each statement or block of statements, and the example does not include commit commands.
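To see why the commit matters, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a Redshift session (the behavior shown is generic SQL transaction behavior, not Redshift-specific; the venue table name follows the tutorial):

```python
# Minimal sketch: with auto-commit off, uncommitted work is discarded.
# sqlite3 stands in for a Redshift session; the principle is the same.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE venue (id INTEGER, name TEXT)")
conn.execute("INSERT INTO venue VALUES (1, 'Example Arena')")
conn.rollback()  # simulates ending the session without issuing COMMIT

count = conn.execute("SELECT COUNT(*) FROM venue").fetchone()[0]
print(count)  # 0 -- the insert was lost because it was never committed
```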



Wednesday, July 3, 2013

Redshift - New node and data distribution


What happens when a new node is added to a Redshift cluster?
A 2-node cluster will distribute data evenly between two nodes based on a hash of the DISTKEY. If a 3rd node is added, the data needs to be rebalanced amongst the 3 nodes. It’s not just a matter of sending all new data to the 3rd node, because that would require lookups to figure out where data is stored. Rather, the cluster needs to be rebalanced by redistributing the data between the 3 nodes. Redshift takes care of this automatically. Just add the nodes and the data moves.
Redshift redistributes the data as follows:
  • A ‘new’ set of nodes is created (in the above example, 3 nodes would be created)
  • Redshift moves the data from the 2-node cluster to the 3-node cluster, rebalancing the data during the copy
  • Users are then flicked across from the ‘old’ 2-node cluster to the ‘new’ 3-node cluster
This is an example of scalable cloud infrastructure — rather than having to ‘upgrade’ an existing system, it is much more efficient to provision a new system, copy data and then decommission the old system. This is a new way of looking at infrastructure that is quite different to the old way of thinking in terms of physical boxes.
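The effect of rebalancing can be sketched with a toy hash distribution. This is illustrative only: the node_for function and the use of MD5 are assumptions for the sketch, not Redshift's actual hash function.

```python
# Toy DISTKEY-style hash distribution: each row goes to hash(key) % nodes,
# so adding a node reassigns rows across all nodes, not just the new one.
# (Illustrative; MD5 here is an assumption, not Redshift's hash function.)
import hashlib

def node_for(distkey: str, num_nodes: int) -> int:
    h = int(hashlib.md5(distkey.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"venue-{i}" for i in range(1000)]
before = {k: node_for(k, 2) for k in keys}   # 2-node cluster
after = {k: node_for(k, 3) for k in keys}    # 3-node cluster
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} rows move when going from 2 to 3 nodes")
```

With a simple modulo scheme like this, most rows change nodes on resize, which is why Redshift copies the data to a freshly provisioned cluster rather than shuffling it in place.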