
Wednesday, December 11, 2013

Redshift data loading basics

Here are some basics when getting started with loading data into Redshift:
1. COPY command to load data from S3 (issue once connected to Redshift using SQLWorkbench or the psql tool):
copy orders from 's3://redshift-demo/tpc-h/100/orders/orders.tbl.' CREDENTIALS 'aws_access_key_id=<access key>;aws_secret_access_key=<secret key>' gzip delimiter '|' COMPUPDATE ON;
2. Compression encoding can be set either in the S3 COPY command or when the table is defined:
A. S3 copy: copy orders from 's3://redshift-demo/tpc-h/100/orders/orders.tbl.' CREDENTIALS 'aws_access_key_id=<access key>;aws_secret_access_key=<secret key>' gzip delimiter '|' COMPUPDATE ON;
B. Table definition: specify the encoding on the column itself, e.g. product_name char(20) encode bytedict
3. Full list of encoding types: http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html

A. None: raw
B. Bytedict: In byte dictionary encoding, a separate dictionary of unique values is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains up to 256 one-byte values that are stored as indexes to the original data values. If more than 256 values are stored in a single block, the extra values are written into the block in raw, uncompressed form. The process repeats for each disk block. This encoding is very effective when a column contains a limited number of unique values.
C. Delta: Delta encodings are very useful for datetime columns. Delta encoding compresses data by recording the difference between values that follow each other in the column.
D. LZO: LZO encoding provides a very high compression ratio with good performance. It works well for CHAR and VARCHAR columns that store very long character strings, especially free-form text such as product descriptions, user comments, or JSON strings.
E. Mostly: Mostly encodings are useful when the data type for a column is larger than most of the stored values require.
F. Runlength: Runlength encoding replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences (the length of the run).
G. Text255 and text32k: These encodings are useful for compressing VARCHAR columns in which the same words recur often.
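As a sketch, a table definition combining several of the encodings above might look like the following (the table and column names here are hypothetical, chosen only to illustrate where each encoding fits):

```sql
-- Hypothetical table illustrating per-column compression encodings
create table orders_demo (
    order_id     integer      encode delta,     -- sequential key, small deltas
    order_date   date         encode delta,     -- datetime-like column
    status       char(1)      encode runlength, -- few values, long runs
    quantity     integer      encode mostly8,   -- values usually fit in 1 byte
    product_name char(20)     encode bytedict,  -- limited set of unique values
    comment      varchar(200) encode lzo        -- free-form text
);
```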
4. Every table you create has three hidden columns appended to it: INSERT_XID, DELETE_XID, and ROW_ID (OID). A table with 3 user-defined columns contains 6 actual columns, and the user-defined columns are internally numbered as 0, 1, and 2. The INSERT_XID, DELETE_XID, and ROW_ID columns are numbered 3, 4, and 5, respectively, in this example.
5. ANALYZE command: By default, the COPY command performs an analysis after it loads data into an empty table. To build or update statistics, run the ANALYZE command against the entire current database, a single table, or one or more specific columns in a single table.
6. VACUUM command: In Redshift, every update is effectively a delete and an insert. Since Amazon Redshift does not automatically reclaim and reuse space that is freed when you delete rows from tables or update rows in tables, you should run the VACUUM command to reclaim space following deletes.
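Putting steps 5 and 6 together, the maintenance commands look like this (the orders table is the one loaded above; the column names are placeholders):

```sql
-- Rebuild statistics for the whole database, one table,
-- or specific columns of one table
analyze;
analyze orders;
analyze orders (o_orderdate, o_orderstatus);

-- Reclaim space and re-sort rows after heavy deletes/updates
vacuum orders;
```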

Tuesday, July 23, 2013

Redshift loading data and compression


Redshift is column-based, and columnar storage by its nature compresses well. Redshift also runs on high-disk-density, instance-based storage. You can tune the compression encoding for each column once you know your data better. More on this here: http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html
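Once some data is loaded, Redshift can suggest encodings for you; a quick way to check (using the orders table from the earlier post as an example) is:

```sql
-- Ask Redshift to recommend a compression encoding for each column,
-- based on a sample of the rows already loaded
analyze compression orders;
```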

Redshift is designed to load data quickly. The best approach is using the COPY command to load large amounts of data; using individual INSERT statements to populate a table might be prohibitively slow. Your data needs to be in the proper format for loading into your Amazon Redshift table. This section presents guidelines for preparing and verifying your data before the load and for validating a COPY statement before you execute it.

You should definitely break the input file into manageable chunks and load from gzipped micro-slices on S3.
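For example, if the file is split into gzipped parts that share a common S3 prefix, a single COPY pointed at that prefix loads all the parts in parallel across the cluster's slices (the bucket path and credentials below are placeholders, as in the earlier post):

```sql
-- orders.tbl.1.gz, orders.tbl.2.gz, ... all match the prefix
-- and are loaded in parallel
copy orders
from 's3://redshift-demo/tpc-h/100/orders/orders.tbl.'
CREDENTIALS 'aws_access_key_id=<access key>;aws_secret_access_key=<secret key>'
gzip delimiter '|';
```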

Be careful with ETL tools. Unless an ETL tool is integrated with Redshift/S3, it may not use the COPY command and will instead issue individual INSERT statements, which are far slower.

Here is a very good youtube video on Redshift and data loading:


Here is the place in the video that discusses the COPY command:

Tuesday, June 4, 2013

Bulk loading data to AWS from on premise


I spoke about bulk loading data to AWS in this blog post:


A couple other options I did not mention are:

1. Aspera - the Asperasoft company has developed a proprietary file transfer protocol based on UDP, which has been shown to deliver very high-speed file transfers over the Internet.

2. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html   Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy files in a distributed manner. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3. Since it is based on MapReduce, it is most applicable when you are moving HDFS and other MapReduce files, as you will already have MapReduce installed in your environment.