Tuesday, July 23, 2013

Redshift schema migration


Indexes and array types are not supported in Redshift; primary keys and foreign keys can be declared but are not enforced (the query planner uses them only as hints). The distribution key, which determines how your data is distributed across the cluster, is a very important part of the schema definition. Check all queries against the table and choose the column that is joined most frequently as the distribution key to get the best performance. You can only specify one distribution key, so if you join against multiple columns at large scale, you may notice performance degradation. Also, specify the columns your range queries use most as the sort key on your table (a sort key can span multiple columns), as this helps performance.
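
As a minimal sketch (the page_views table and its columns are illustrative assumptions, not from any particular schema), a table defined with a distribution key on the most-joined column and a multi-column sort key might look like:

    create table page_views (
        view_id    bigint       not null,
        user_id    integer      not null,   -- joined most frequently: distribution key
        page_url   varchar(300),
        view_time  timestamp    not null    -- range-filtered: leading sort key column
    )
    distkey (user_id)
    sortkey (view_time, user_id);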

Redshift query performance

The items that impact performance of the queries against Redshift are:
1. Node type: There are two options: an extra large node (XL) or an eight extra large node (8XL).
2. Number of nodes: The number of nodes you choose depends on the size of your dataset and your desired query performance. Amazon Redshift distributes and executes queries in parallel across all nodes, so you can increase query performance by adding nodes to your cluster. You can monitor query performance in the Amazon Redshift Console and with Amazon CloudWatch metrics.
3. Sort key: Keep in mind that not all queries can be optimized by the sort key. There is only one sort key per table, though it can span multiple columns. The Redshift query optimizer uses sort order when it determines optimal query plans. If you do frequent range or equality filtering on one column, make that column the sort key. If you frequently join a table, specify the join column as both the sort key and the distribution key. More details here: http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
4. Distribution key: There is one distribution key per table. If the table has a foreign key or another column that is frequently used as a join key, consider making that column the distribution key. In making this choice, take pairs of joined tables into account: you may get better results when you specify the joining columns as both the distribution keys and the sort keys on both tables, because that enables the query optimizer to select a faster merge join instead of a hash join when executing the query. If the table is rarely joined, use a column that appears frequently in the group by clause. More on distribution keys here: http://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
Keep in mind you want even distribution across nodes. You can query the svv_diskusage system view to check the distribution (see the first example after this list).
5. Column compression: Compression reduces the amount of data read from disk, so it has a direct impact on query performance: http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html
6. Run queries in memory: Redshift can run queries entirely from memory; queries that avoid spilling intermediate results to disk perform noticeably better. More details here: http://docs.aws.amazon.com/redshift/latest/dg/c_troubleshooting_query_performance.html
7. Look at the query plan: Use EXPLAIN to see how a query will execute (see the second example after this list). More details can be found here: http://docs.aws.amazon.com/redshift/latest/dg/c-query-planning.html
8. Look at disk space usage: More details can be found here: http://docs.aws.amazon.com/redshift/latest/dg/c_managing_disk_space.html
9. Workload manager setting: By default, a cluster is configured with one queue that can run five queries concurrently. Workload management (WLM) details can be found here: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html
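
As a minimal sketch of the distribution check mentioned in item 4 (the page_views table name is hypothetical), counting blocks per slice for one column shows skew at a glance:

    -- blocks per slice for column 0 of a hypothetical table;
    -- heavily uneven counts suggest a poor distribution key
    select slice, count(*) as blocks
    from svv_diskusage
    where name = 'page_views' and col = 0
    group by slice
    order by blocks desc;

Roughly equal block counts across slices mean the distribution key is spreading the data well.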
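And for item 7, a sketch of inspecting a plan with EXPLAIN (both tables here are hypothetical); the output reveals whether Redshift chose a merge join or a hash join and where data gets redistributed between nodes:

    explain
    select u.user_id, count(*) as views
    from page_views v
    join users u on u.user_id = v.user_id
    group by u.user_id;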

Documentation: http://docs.aws.amazon.com/redshift/latest/dg/c_redshift_system_overview.html
Video: http://www.youtube.com/watch?v=6hk0Kvjrvfo
Blog: http://aws.typepad.com/aws/2012/11/amazon-redshift-the-new-aws-data-warehouse.html

Redshift loading data and compression


Redshift stores data by column rather than by row, which makes it very amenable to compression, since the values within a column tend to be similar. Redshift also runs on instances with high-density local storage, and column compression shrinks the footprint further. You can tweak the compression setting for each column once you know your data better. More on this here: http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html
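
Rather than hand-picking encodings, Redshift can recommend one per column by sampling existing data; a minimal sketch against the hypothetical table from earlier:

    -- samples the table's data and reports a suggested
    -- compression encoding for each column
    analyze compression page_views;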

Redshift is designed to load data in quickly. The best approach is using the COPY command to load large amounts of data; using individual INSERT statements to populate a table can be prohibitively slow. Your data needs to be in the proper format for loading into your Amazon Redshift table. The guidelines below cover preparing and verifying your data before the load and validating a COPY statement before you execute it.

You should definitely break the input file into manageable chunks and load the gzipped pieces from S3, so that COPY can pull them in parallel across the cluster's slices (a sketch follows).
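
A minimal COPY sketch under those assumptions (the bucket name, key prefix, and credential placeholders are all illustrative):

    -- loads every gzipped object whose key starts with the prefix,
    -- in parallel across the cluster's slices
    copy page_views
    from 's3://my-bucket/page_views/part_'
    credentials 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
    gzip
    delimiter '|';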

Be careful with ETL tools: unless a tool is integrated with Redshift/S3, it may not use the COPY command and instead issue individual INSERT statements, which are far slower.


Monday, July 22, 2013

AWS CloudFront: points of interest when running Oracle

When using a content delivery network such as AWS CloudFront, one of the first concerns is making sure the latest content is at the edge locations. The time-to-live (TTL) is set using the Cache-Control max-age directive, which lets you specify how long (in seconds) you want the object to remain in the cache before CloudFront fetches it again from the origin server. The minimum expiration time CloudFront supports is 0 seconds, the maximum is in the year 2038, and the default, if not set, is 24 hours. Setting it to 0 does not mean the web page will always go to the origin for its content; it means that CloudFront delegates authority for cache control to the origin, i.e. the origin server decides whether, and for how long, CloudFront caches the objects.
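
For example, an origin response header like the following (the one-hour value is just an illustration) tells CloudFront to keep the object at the edge for 3600 seconds before re-fetching it:

    Cache-Control: max-age=3600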


Using CloudFront with Oracle applications such as E-Business Suite, PeopleSoft, or Siebel would be an interesting exercise, as these products have very dynamic web pages. There is the potential of pointing CloudFront at an AWS ELB that sits in front of an Oracle Applications deployment, though it is not clear how much this would improve the performance of the application.

Moving EBS volumes between Linux and Windows


EBS volumes are portable between instances running different operating systems, but this does not mean the underlying file system format will be compatible. EBS is a block-level storage device; the volume must be formatted with a file system, which may or may not be usable across different EC2 instances depending on the OS of the instance.

Moving EBS volumes between Linux and Windows is where most of the compatibility issues arise. NTFS and EXT4 are among the most common file system formats.

One option is to use NTFS as your 'master' file system, then use a utility like ntfsprogs (http://en.wikipedia.org/wiki/Ntfsprogs) to work with the EBS volume from both Windows and Linux instances. If you don't need to write from Linux, you can simply mount the NTFS volume and read it there: Linux can natively read NTFS, but it cannot natively write to NTFS drives.

The other option is to use EXT4 (or another Linux file system) as your 'master' file system; the catch is that EXT4 is not natively supported on Windows.

Also, Ext2Fsd (http://www.ext2fsd.com) is a proven solution for reading EXT4 from Windows. For those who don't need NTFS, sharing file systems between Windows and Linux can be done using exFAT. Another potential solution, which supports both reading and writing NTFS on Linux, is ntfs-3g: http://sourceforge.net/projects/ntfs-3g/
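
As a minimal sketch (the device name /dev/xvdf1 and the mount point are assumptions about how the volume is attached), mounting an NTFS-formatted EBS volume read/write on Linux with ntfs-3g looks like:

    # create a mount point and mount the NTFS volume read/write
    sudo mkdir -p /mnt/ntfs-vol
    sudo mount -t ntfs-3g /dev/xvdf1 /mnt/ntfs-vol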

Thursday, July 18, 2013

SQS message retention and visibility

There are two important attributes (http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Query_QuerySetQueueAttributes.html) to set when working with AWS SQS. The first is the MessageRetentionPeriod and the other is the VisibilityTimeout; these are two very different and distinct attributes. The MessageRetentionPeriod is the length of time (in seconds) a message will stay in the queue unless it is deleted; the value can be from 60 (1 minute) to 1209600 (14 days). The VisibilityTimeout is the length of time during which other applications cannot see a message while one application is processing it; it can be set from 0 to 43200 seconds (12 hours). The longer you set it, the longer you expect a single consumer to be working on the message.

SQS automatically deletes messages that have been in a queue for more than the maximum message retention period. The default message retention period is 4 days, but you can set it to a value from 60 seconds to 1209600 seconds (14 days) with SetQueueAttributes, as sketched below.
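
As a sketch of the Query API linked above (the queue URL is an illustrative assumption), setting the visibility timeout to two minutes looks like the request below; MessageRetentionPeriod can be set the same way, e.g. with Attribute.Value=345600 for 4 days:

    https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue
        ?Action=SetQueueAttributes
        &Attribute.Name=VisibilityTimeout
        &Attribute.Value=120
        &Version=2012-11-05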

More on why messages are not deleted once read and the visibility of a message while it is being processed by an application:
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/AboutVT.html

Friday, July 12, 2013

Oracle Enterprise Manager Solutions on AWS

Listen to this webinar, presented by Amazon Web Services (AWS) and Apps Associates, an AWS Partner Network (APN) Advanced Consulting Partner, to hear how you can leverage the AWS platform to run a centralized OEM 12c environment to free up your administrative resources: