Tuesday, July 23, 2013

Redshift Query performance

The items that impact performance of the queries against Redshift are:
1. Node type : This is one of two options which are one of the Redshift option types: extra large node (XL)  or an eight extra large node (8XL).
2. Number of nodes : The number of nodes you choose depends on the size of your dataset and your desired query performance. Amazon Redshift distributes and executes queries in parallel across all nodes, you can increase query performance by adding nodes to your data cluster.  You can monitor query performance in the Amazon Redshift Console and with Amazon Cloud Watch metrics.
3. Sort Key : Keep in mind that not all queries can be optimized by sort key.   There is only one sort key for each table.   The Redshift query optimizer uses sort order when it determines optimal query plans.  If you do frequent range or equality filtering on one column, make this column the sort key.  If you frequently join a table, specify the join column as both the sort key and the distribution key.  More details here : http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
4. Distribution key :  There is one distribution key per table.  If the table has a foreign key or another column that is frequently used as a join key, consider making that column the distribution key. In making this choice, take pairs of joined tables into account. You might get better results when you specify the joining columns as the distribution keys and the sort keys on both tables. This enables the query optimizer to select a faster merge join instead of a hash join when executing the query.  If not many joins, then use the column in the group by clause.  More on distribution keys here: http://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
Keep in mind you want to have even distribution across nodes. You can issue the select from svv_diskuage to find out the distribution.
5. Column compression : This has an impact on query performance : http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html
6. Run queries in memory : Redshift supports the ability to run queries entirely from memory. This can obviously impact performance.  More details here: http://docs.aws.amazon.com/redshift/latest/dg/c_troubleshooting_query_performance.html
7. Look at the query plan : More details can be found here : http://docs.aws.amazon.com/redshift/latest/dg/c-query-planning.html
8. Look at disk space usage : More details can be found here : http://docs.aws.amazon.com/redshift/latest/dg/c_managing_disk_space.html
9.  Workload manager setting : By default, a cluster is configured with one queue that can run five queries concurrently.  The workload management (WLM) details found here: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html

Documentation: http://docs.aws.amazon.com/redshift/latest/dg/c_redshift_system_overview.html
Video: http://www.youtube.com/watch?v=6hk0KvjrvfoBlog: http://aws.typepad.com/aws/2012/11/amazon-redshift-the-new-aws-data-warehouse.html

No comments:

Post a Comment