
Monday, January 6, 2014

Redshift: optimizing query performance with compression, distribution key and sort key

Encoding/compression

To determine the correct compression encodings, first issue these commands to clean up dead space and refresh the statistics for the table:
vacuum orders;
analyze orders;
Then issue this command:
analyze compression orders;

Then create a table whose column encodings match the recommendations from the analyze compression output:
CREATE TABLE orders (
  o_orderkey int8 NOT NULL ENCODE MOSTLY32 PRIMARY KEY       ,
  o_custkey int8 NOT NULL ENCODE MOSTLY32 DISTKEY REFERENCES customer_v3(c_custkey),
  o_orderstatus char(1) NOT NULL ENCODE RUNLENGTH            ,
  o_totalprice numeric(12,2) NOT NULL ENCODE MOSTLY32        ,
  o_orderdate date NOT NULL ENCODE BYTEDICT SORTKEY          ,
  o_orderpriority char(15) NOT NULL ENCODE BYTEDICT          ,
  o_clerk char(15) NOT NULL ENCODE RAW                       ,
  o_shippriority int4 NOT NULL ENCODE RUNLENGTH              ,
  o_comment varchar(79) NOT NULL ENCODE TEXT255
);
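
To verify the resulting encodings, distribution key and sort key, you can query the pg_table_def catalog view (this assumes the orders table above and that its schema is on your search_path):
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'orders';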

Distributing data

Partition data using a distribution key. This spreads the data across the cluster to maximize the parallelization potential of queries. To help queries run fast, the distribution key should be a column that is regularly used in joins. This allows Redshift to co-locate the rows of the joined tables on the same nodes, reducing IO and network exchanges.
Redshift also uses the sort key to know in advance what range of values a column holds in a given block, and to skip reading an entire block if its values don't fall into the range of a query. Choosing sort key columns that are used in filters (i.e. where clauses) therefore helps execution.
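
For example, a query along these lines (a sketch against the orders and customer_v3 tables above; the date range is made up) joins on the distribution key, so matching rows are already co-located, and filters on the sort key, so Redshift can skip blocks whose o_orderdate values fall outside the predicate:
SELECT c.c_custkey, count(*) AS order_count, sum(o.o_totalprice) AS total_spent
FROM orders o
JOIN customer_v3 c ON o.o_custkey = c.c_custkey
WHERE o.o_orderdate BETWEEN '2013-01-01' AND '2013-03-31'
GROUP BY c.c_custkey;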

Compression depends directly on the data as it is stored on disk, and storage is affected by the distribution and sort options. Therefore, if you change the sort or distribution key, or create a new table that has the same data but different distribution and sort keys, you will need to rerun the vacuum, analyze and analyze compression statements.
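
As a sketch, one way to do that is a deep copy into a new table with the different keys, followed by the same maintenance steps; the table name and key choices below are purely illustrative:
CREATE TABLE orders_new
DISTKEY (o_orderkey)
SORTKEY (o_totalprice)
AS
SELECT * FROM orders;

vacuum orders_new;
analyze orders_new;
analyze compression orders_new;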

Tuesday, July 23, 2013

Redshift loading data and compression


Redshift is column based, and by virtue of this its data is compressed. It also runs on high-disk-density, instance-based storage for further compression gains. You can tweak the compression setting for each column once you know your data better. More on this here: http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html

Redshift is designed to load data in quickly. The best approach is to use the COPY command to load large amounts of data; using individual INSERT statements to populate a table might be prohibitively slow. Your data needs to be in the proper format for loading into your Amazon Redshift table. The AWS documentation gives guidelines for preparing and verifying your data before the load and for validating a COPY statement before you execute it.
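
A minimal COPY sketch, assuming pipe-delimited, gzipped data under a hypothetical S3 path and placeholder credentials:
copy orders
from 's3://mybucket/data/orders.gz'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
gzip;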

You should definitely break the input file into manageable chunks and load them as gzipped micro-slices from S3.
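
COPY loads every S3 object whose key begins with the given prefix, so splitting the input into gzipped parts with a common prefix lets a single command load them in parallel across the slices of the cluster (the bucket and file names below are hypothetical):
-- Illustrative S3 layout matched by the COPY prefix below:
--   s3://mybucket/load/orders_0001.gz
--   s3://mybucket/load/orders_0002.gz
--   s3://mybucket/load/orders_0003.gz
copy orders
from 's3://mybucket/load/orders_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
gzip;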

Be careful with ETL tools: unless an ETL tool is integrated with Redshift/S3, it may not use the COPY command and may instead issue individual insert statements.

Here is a very good YouTube video on Redshift and data loading:


Here is the place in the video that discusses the COPY command: