Monday, January 6, 2014

Redshift: optimizing query performance with compression, distribution keys and sort keys


In order to determine the correct compression, first issue these commands to clean up dead space and analyze the data in the table:
vacuum orders;
analyze orders;
Then issue this command:
analyze compression orders;
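On a large table, analyze compression can be told to base its recommendations on a sample of rows via the COMPROWS option, for example:

```sql
-- Recommend an encoding per column based on a 100,000-row sample.
analyze compression orders comprows 100000;
```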

Then create a table that matches the encodings recommended by the analyze compression statement (the name orders_v2 here is just an example):

create table orders_v2 (
  o_orderkey int8 NOT NULL ENCODE MOSTLY32 PRIMARY KEY       ,
  o_custkey int8 NOT NULL ENCODE MOSTLY32 DISTKEY REFERENCES customer_v3(c_custkey),
  o_orderstatus char(1) NOT NULL ENCODE RUNLENGTH            ,
  o_totalprice numeric(12,2) NOT NULL ENCODE MOSTLY32        ,
  o_orderdate date NOT NULL ENCODE BYTEDICT SORTKEY          ,
  o_orderpriority char(15) NOT NULL ENCODE BYTEDICT          ,
  o_clerk char(15) NOT NULL ENCODE RAW                       ,
  o_shippriority int4 NOT NULL ENCODE RUNLENGTH              ,
  o_comment varchar(79) NOT NULL ENCODE TEXT255
);
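Assuming the new table is named orders_v2 (an illustrative name), the existing rows can then be moved over with a deep copy, after which the maintenance statements are rerun on the new table:

```sql
-- Deep copy the existing rows into the newly encoded table.
insert into orders_v2 select * from orders;

-- Re-sort the data and refresh the planner statistics.
vacuum orders_v2;
analyze orders_v2;
```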

Distributing data

Distribute data using a distribution key. This spreads the rows across the slices of a cluster to maximize the parallelization potential of queries. To help joins run fast, the distribution key should be a column that is used to join regularly joined tables. This allows Redshift to co-locate the rows of these different entities on the same slices, reducing I/O and network exchanges.
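As a sketch of co-location, two regularly joined tables can share a distribution key (customer_v3 is simplified here to two columns for illustration):

```sql
-- Both tables are distributed on the customer key, so matching rows
-- land on the same slice and the join avoids network redistribution.
create table customer_v3 (
  c_custkey int8 NOT NULL DISTKEY,
  c_name    varchar(25) NOT NULL
);

select o.o_orderkey, c.c_name
from orders o
join customer_v3 c on o.o_custkey = c.c_custkey;
```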
Redshift also uses a sort key: it knows in advance what range of values of a sorted column is in a given block, and skips reading an entire block when the values it contains don't fall into the range of a query. Choosing sort key columns that are used in filters (i.e. where clauses) therefore speeds up execution.
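For example, with o_orderdate as the sort key of the orders table above, a date-range filter lets Redshift skip every block whose dates fall entirely outside the range:

```sql
-- Only blocks whose o_orderdate range overlaps Q1 1996 are read.
select count(*)
from orders
where o_orderdate between '1996-01-01' and '1996-03-31';
```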

Compression depends directly on the data as it is stored on disk, and that storage layout is shaped by the distribution and sort options. Therefore, if you change the sort or distribution key, or create a new table that has the same data but different distribution and sort keys, you will need to rerun the vacuum, analyze and analyze compression statements.
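A minimal sketch of that workflow, assuming a copy of orders distributed and sorted differently (orders_by_clerk is a hypothetical name):

```sql
-- Same data, different distribution and sort keys:
create table orders_by_clerk
  distkey(o_clerk) sortkey(o_orderkey)
as select * from orders;

-- The storage layout changed, so redo the maintenance
-- and re-check the recommended encodings.
vacuum orders_by_clerk;
analyze orders_by_clerk;
analyze compression orders_by_clerk;
```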
