Enterprise in the Cloud: columns

Wednesday, December 11, 2013

Redshift data loading basics

Here are some basics when getting started with loading data into Redshift:
1. S3 command to load data (issue once connected to Redshift using SQLWorkbench or the psql tool):
copy orders from 's3://redshift-demo/tpc-h/100/orders/orders.tbl.' CREDENTIALS 'aws_access_key_id=<access key>;aws_secret_access_key=<secret key>' gzip delimiter '|' COMPUPDATE ON;
2. Compression encoding can be done at the S3 copy command or when the table is defined:
A. S3 copy :copy orders from 's3://redshift-demo/tpc-h/100/orders/orders.tbl.' CREDENTIALS 'aws_access_key_id=<access key>;aws_secret_access_key=<secret key>' gzip delimiter '|' COMPUPDATE ON;
B. Table definition : product_name char(20) encode bytedict);
3. Full list of encoding types: http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html

A. None : raw

B. bytedict : In byte dictionary encoding, a separate dictionary of unique values is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains up to 256 one-byte values that are stored as indexes to the original data values. If more than 256 values are stored in a single block, the extra values are written into the block in raw, uncompressed form. The process repeats for each disk block. This encoding is very effective when a column contains a limited number of unique values.

C. Delta : Delta encodings are very useful for datetime columns. Delta encoding compresses data by recording the difference between values that follow each other in the column.

D. LZO : LZO encoding provides a very high compression ratio with good performance. LZO encoding works well for CHAR and VARCHAR columns that store very long character strings, especially free form text, such as product descriptions, user comments, or JSON strings.

E. Mostly encoding : Mostly encodings are useful when the data type for a column is larger than most of the stored values require.

F. Runlength encoding : Runlength encoding replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences (the length of the run).

G. Text255 and text32k encodings are useful for compressing VARCHAR columns in which the same words recur often.

4.Every table you create has three hidden columns appended to it: INSERT_XID, DELETE_XID, and ROW_ID (OID). A table with 3 user-defined columns contains 6 actual columns, and the user-defined columns are internally numbered as 0, 1, and 2. The INSERT_XID, DELETE_XID, and ROW_ID columns are numbered 3, 4, and 5, respectively, in this example.

5. Analyze command :By default, the COPY command performs an analysis after it loads data into an empty table. To build or update statistics, run the ANALYZE command against: the entire current database, single tab, or one or more specific columns in a single table

6. Vacuum command : In Redshift, every update is effectively a delete and an insert. Since Amazon Redshift does not automatically reclaim and reuse that is freed when you delete rows from tables or update rows in tables, you should run the VACUUM command to reclaim space following deletes.

Tuesday, July 23, 2013

Redshift Query performance

The items that impact performance of the queries against Redshift are:

1. Node type : This is one of two options which are one of the Redshift option types: extra large node (XL) or an eight extra large node (8XL).

2. Number of nodes : The number of nodes you choose depends on the size of your dataset and your desired query performance. Amazon Redshift distributes and executes queries in parallel across all nodes, you can increase query performance by adding nodes to your data cluster. You can monitor query performance in the Amazon Redshift Console and with Amazon Cloud Watch metrics.

3. Sort Key : Keep in mind that not all queries can be optimized by sort key. There is only one sort key for each table. The Redshift query optimizer uses sort order when it determines optimal query plans. If you do frequent range or equality filtering on one column, make this column the sort key. If you frequently join a table, specify the join column as both the sort key and the distribution key. More details here : http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html

4. Distribution key : There is one distribution key per table. If the table has a foreign key or another column that is frequently used as a join key, consider making that column the distribution key. In making this choice, take pairs of joined tables into account. You might get better results when you specify the joining columns as the distribution keys and the sort keys on both tables. This enables the query optimizer to select a faster merge join instead of a hash join when executing the query. If not many joins, then use the column in the group by clause. More on distribution keys here: http://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html

Keep in mind you want to have even distribution across nodes. You can issue the select from svv_diskuage to find out the distribution.
5. Column compression : This has an impact on query performance : http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html

6. Run queries in memory : Redshift supports the ability to run queries entirely from memory. This can obviously impact performance. More details here: http://docs.aws.amazon.com/redshift/latest/dg/c_troubleshooting_query_performance.html

7. Look at the query plan : More details can be found here : http://docs.aws.amazon.com/redshift/latest/dg/c-query-planning.html

8. Look at disk space usage : More details can be found here : http://docs.aws.amazon.com/redshift/latest/dg/c_managing_disk_space.html

9. Workload manager setting : By default, a cluster is configured with one queue that can run five queries concurrently. The workload management (WLM) details found here: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html

Documentation: http://docs.aws.amazon.com/redshift/latest/dg/c_redshift_system_overview.html
Video: http://www.youtube.com/watch?v=6hk0KvjrvfoBlog: http://aws.typepad.com/aws/2012/11/amazon-redshift-the-new-aws-data-warehouse.html