
Monday, January 6, 2014

Redshift : optimizing query performance with compression, distribution key and sort key

Encoding/compression

To determine the correct compression encodings, first issue these commands to clean up dead space and analyze the data in the table:
vacuum orders;
analyze orders;
Then issue this command:
analyze compression orders;

Then create a table that matches the results from the analyze compression statement:
CREATE TABLE orders (
  o_orderkey int8 NOT NULL ENCODE MOSTLY32 PRIMARY KEY       ,
  o_custkey int8 NOT NULL ENCODE MOSTLY32 DISTKEY REFERENCES customer_v3(c_custkey),
  o_orderstatus char(1) NOT NULL ENCODE RUNLENGTH            ,
  o_totalprice numeric(12,2) NOT NULL ENCODE MOSTLY32        ,
  o_orderdate date NOT NULL ENCODE BYTEDICT SORTKEY          ,
  o_orderpriority char(15) NOT NULL ENCODE BYTEDICT          ,
  o_clerk char(15) NOT NULL ENCODE RAW                       ,
  o_shippriority int4 NOT NULL ENCODE RUNLENGTH              ,
  o_comment varchar(79) NOT NULL ENCODE TEXT255
);
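
Because the encodings are fixed in the table definition, applying the recommendations typically means a deep copy into the newly defined table. A minimal sketch, assuming the existing data lives in a table named orders_old (a hypothetical name):
-- deep copy into the freshly encoded table, then refresh statistics
insert into orders select * from orders_old;
analyze orders;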

Distributing data

Partition data using a distribution key. This spreads the data across the cluster to maximize the parallelization potential of queries. To help queries run fast, the distribution key should be a column that is used in regular joins between tables. This allows Redshift to co-locate the data of these different entities, reducing IO and network exchanges.
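
For example, since the orders table above is distributed on o_custkey, giving the customer table the same distribution key makes customer-order joins co-located. A minimal sketch (this customer_v3 definition is illustrative):
-- both tables distributed on the customer key, so the join can run slice-locally
create table customer_v3 (c_custkey int8 not null distkey, c_name varchar(25) not null);
select c.c_name, count(*)
from customer_v3 c join orders o on o.o_custkey = c.c_custkey
group by c.c_name;
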
Redshift also uses the sort key to know in advance what values of a column are in a given block, and to skip reading an entire block if the values it contains don't fall into the range of a query. Using sort key columns that appear in filters (i.e. WHERE clauses) helps execution.
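
For example, with o_orderdate as the sort key of orders (as defined above), a date-range filter lets Redshift skip blocks whose stored date ranges fall outside the predicate; the query itself is illustrative:
-- only blocks containing early-1995 dates are read from disk
select count(*), sum(o_totalprice)
from orders
where o_orderdate between '1995-01-01' and '1995-03-31';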

Compression depends directly on the data as it is stored on disk, and storage is modified by the distribution and sort options. Therefore, if you change the sort or distribution key, or create a new table that has the same data but different distribution and sort keys, you will need to rerun the vacuum, analyze and analyze compression statements.
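
For instance, a sketch of that workflow (orders_new and its reduced column list are hypothetical):
-- recreate the data with a different distribution and sort key
create table orders_new (
  o_orderkey  int8 not null,
  o_custkey   int8 not null,
  o_orderdate date not null
) distkey(o_orderkey) sortkey(o_custkey);
insert into orders_new select o_orderkey, o_custkey, o_orderdate from orders;
-- rerun the three statements against the new layout
vacuum orders_new;
analyze orders_new;
analyze compression orders_new;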

Wednesday, December 11, 2013

Redshift data loading basics

Here are some basics when getting started with loading data into Redshift:
1. COPY command to load data from S3 (issue once connected to Redshift using SQLWorkbench or the psql tool):
copy orders from 's3://redshift-demo/tpc-h/100/orders/orders.tbl.' CREDENTIALS 'aws_access_key_id=<access key>;aws_secret_access_key=<secret key>' gzip delimiter '|' COMPUPDATE ON;
2. Compression encoding can be specified in the S3 COPY command or when the table is defined:
A. S3 copy: copy orders from 's3://redshift-demo/tpc-h/100/orders/orders.tbl.' CREDENTIALS 'aws_access_key_id=<access key>;aws_secret_access_key=<secret key>' gzip delimiter '|' COMPUPDATE ON;
B. Table definition, e.g.: create table product (product_name char(20) encode bytedict);
3. Full list of encoding types: http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html

A. Raw: no compression; data is stored as-is.
B. Bytedict: In byte dictionary encoding, a separate dictionary of unique values is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains up to 256 one-byte values that are stored as indexes to the original data values. If more than 256 values are stored in a single block, the extra values are written into the block in raw, uncompressed form. The process repeats for each disk block. This encoding is very effective when a column contains a limited number of unique values.
C. Delta: Delta encodings are very useful for datetime columns. Delta encoding compresses data by recording the difference between values that follow each other in the column.
D. LZO: LZO encoding provides a very high compression ratio with good performance. LZO encoding works well for CHAR and VARCHAR columns that store very long character strings, especially free-form text such as product descriptions, user comments, or JSON strings.
E. Mostly: Mostly encodings are useful when the data type for a column is larger than most of the stored values require.
F. Runlength: Runlength encoding replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences (the length of the run).
G. Text255 and text32k: These encodings are useful for compressing VARCHAR columns in which the same words recur often.
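
To make these concrete, here is an illustrative table definition (the table, columns, and encoding choices are hypothetical, matched to the descriptions above):
create table encoding_demo (
  event_id    int8         encode mostly32,  -- most values fit in 32 bits
  event_date  date         encode delta,     -- consecutive dates differ by small amounts
  status      char(1)      encode runlength, -- long runs of the same value
  category    char(15)     encode bytedict,  -- few distinct values per 1 MB block
  description varchar(300) encode lzo,       -- long free-form text
  remark      varchar(79)  encode text255    -- recurring words
);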
4. Every table you create has three hidden columns appended to it: INSERT_XID, DELETE_XID, and ROW_ID (OID). A table with 3 user-defined columns therefore contains 6 actual columns: the user-defined columns are internally numbered 0, 1, and 2, and the INSERT_XID, DELETE_XID, and ROW_ID columns are numbered 3, 4, and 5, respectively.
5. Analyze command: By default, the COPY command performs an analysis after it loads data into an empty table. To build or update statistics, run the ANALYZE command against the entire current database, a single table, or one or more specific columns in a single table (see the sketch after this list).
6. Vacuum command: In Redshift, every update is effectively a delete and an insert. Since Amazon Redshift does not automatically reclaim and reuse space that is freed when you delete rows from tables or update rows in tables, you should run the VACUUM command to reclaim space following deletes (a sketch follows the list).
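
To make the last two items concrete, a short sketch of the maintenance commands (the orders table is the one loaded above; the column list is illustrative):
analyze;                                  -- entire current database
analyze orders;                           -- a single table
analyze orders (o_orderdate, o_custkey);  -- specific columns of one table
vacuum orders;                            -- reclaim space and re-sort after deletes/updates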