Tuesday, July 23, 2013

Redshift loading data and compression

AWS is column based so by virtue of this is compressed. Redshift runs on high disk density instance based storage for further compression. You can tweak the compression setting for columns once you know your data better. More on this here: http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html

Redshift is designed to load data in quickly.  The best approach is using the COPY command to load large amounts of data. Using individual INSERT statements to populate a table might be prohibitively slow. Your data needs to be in the proper format for loading into your Amazon Redshift table. This section presents guidelines for preparing and verifying your data before the load and for validating a COPY statement
before you execute it.

You should definitely break the input file into manageable chucks and load from gzipped micro-slices on S3. 

Be aware of using ETL tools. Unless an ETL tool is integrated with Redshift/S3 it may use the COPY command instead use insert statements.

Here is a very good youtube video on Redshift and data loading:

Here is the place in video that discusses the copy command:

No comments:

Post a Comment