Encoding/compression
To determine the correct compression encodings, first issue these commands to clean up dead space and analyze the data in the table:
vacuum orders;
analyze orders;
Then issue this command:
analyze compression orders;
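The analyze compression statement returns one row per column with a suggested encoding and an estimated size reduction. The shape below reflects Redshift's output columns; the encodings and percentages shown are illustrative, not actual results:

```
 table  |   column    | encoding | est_reduction_pct
--------+-------------+----------+-------------------
 orders | o_orderkey  | mostly32 |             40.00
 orders | o_orderdate | bytedict |             60.00
```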
Then create a table that matches the results from the analyze compression statement:
CREATE TABLE orders (
  o_orderkey     int8          NOT NULL ENCODE MOSTLY32 PRIMARY KEY,
  o_custkey      int8          NOT NULL ENCODE MOSTLY32 DISTKEY REFERENCES customer_v3(c_custkey),
  o_orderstatus  char(1)       NOT NULL ENCODE RUNLENGTH,
  o_totalprice   numeric(12,2) NOT NULL ENCODE MOSTLY32,
  o_orderdate    date          NOT NULL ENCODE BYTEDICT SORTKEY,
  o_orderpriority char(15)     NOT NULL ENCODE BYTEDICT,
  o_clerk        char(15)      NOT NULL ENCODE RAW,
  o_shippriority int4          NOT NULL ENCODE RUNLENGTH,
  o_comment      varchar(79)   NOT NULL ENCODE TEXT255
);
Distributing data
Partition data using a distribution key, which spreads rows across the cluster to maximize the parallelization potential of queries. To help queries run fast, the distribution key should be a column that appears in joins between regularly joined tables. This allows Redshift to co-locate the data of these different entities on the same nodes, reducing I/O and network exchanges.
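As a sketch of why this matters: if customer_v3 is also distributed on its customer key, a join on that key finds matching rows on the same node and needs no network shuffle (table and column names follow the example above; the assumption is that both tables use the customer key as DISTKEY):

```sql
-- Both tables are distributed on the customer key, so matching rows
-- are co-located and the join avoids redistributing data.
SELECT c.c_custkey, COUNT(*) AS order_count
FROM customer_v3 c
JOIN orders o ON o.o_custkey = c.c_custkey
GROUP BY c.c_custkey;
```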
Redshift also uses the sort key to know in advance what range of values each block of a column contains, and to skip reading an entire block when its values don't fall into the range requested by a query. Choosing sort key columns that are used in filters (i.e., WHERE clauses) therefore helps execution.
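For example, a range filter on the sort key column from the DDL above lets Redshift skip any block whose stored min/max dates fall outside the requested range:

```sql
-- o_orderdate is the sort key, so blocks whose date range lies
-- entirely outside 1995 are never read from disk.
SELECT COUNT(*)
FROM orders
WHERE o_orderdate BETWEEN '1995-01-01' AND '1995-12-31';
```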
Compression depends directly on the data as it is stored on disk, and that physical layout is shaped by the distribution and sort options. Therefore, if you change the sort or distribution key, or create a new table that holds the same data but uses different distribution and sort keys, you will need to rerun the vacuum, analyze, and analyze compression statements.