In order to determine the correct compression, first issue these commands to clean up dead space and analyze the data in the table:
Then issue this command:
analyze compression orders;
Then create a table that matches the results from the analyze compression statement:
Partition data using a distribution key. This allows data to be spread out on a cluster to maximize the parallelization potential of the queries. To help queries run fast, the distribution key should be a value that will be used in regularly joined tables. This allows Redshift to co-locate the data of these different entities, reducing IO and network exchanges.
Redshift also uses a specific sort column to know in advance what values of a column are in a given block, and to skip reading that entire block if the values it contains don’t fall into the range of a query. Using columns that are used in filters (i.e. where clauses) helps execution.
Compression depends directly on the data as it is stored on disk, and storage is modified by distribution and sort options. Therefore, if you change sort or distribution key or create a new table that has the same data but different distribution and sort keys you will need to rerun the vacuum, analyze ad analyze compression statements.