Showing posts with label disk. Show all posts

Thursday, June 26, 2014

AWS encrypting data at rest

Here is a good white paper on encrypting data at rest on AWS:
http://media.amazonwebservices.com/AWS_Securing_Data_at_Rest_with_Encryption.pdf

Amazon now offers native Amazon EBS encryption: http://aws.amazon.com/about-aws/whats-new/2014/05/21/Amazon-EBS-encryption-now-available/
S3 offers server-side encryption (SSE), client-side encryption, and SSE with keys that you manage: http://aws.amazon.com/blogs/aws/s3-encryption-with-your-keys/
All data in Amazon Glacier is encrypted automatically, and Amazon Redshift can encrypt all cluster data when encryption is enabled.
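As a quick illustration of the first two options, here is a sketch of the AWS CLI calls involved. The availability zone, sizes, bucket, and file names are placeholders; the commands are printed rather than executed so the sketch is safe to run anywhere.

```shell
# Print (not run) example commands for encrypting data at rest.
encrypt_examples() {
  # Create a new EBS volume with native encryption enabled
  echo 'aws ec2 create-volume --availability-zone us-east-1a --size 100 --encrypted'
  # Upload an object to S3 with server-side encryption (SSE-S3)
  echo 'aws s3 cp backup.tar s3://my-bucket/backup.tar --sse AES256'
}
encrypt_examples
```

To actually run them, drop the `echo` wrappers and substitute your own zone, volume size, and bucket.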

Thursday, April 24, 2014

AWS Import/Export file limits

AWS Import/Export can load data onto Amazon EBS volumes larger than 1TB, but doing so requires an intermediate step through S3. If your storage device’s capacity is less than or equal to the maximum Amazon EBS volume size of 1TB, its contents are loaded directly into an Amazon EBS snapshot, so in practice there is no size limit. AWS does not mount the file system on your storage device, nor does a file system need to be present; AWS Import/Export performs a block-for-block copy from your device to an Amazon EBS snapshot. If your storage device’s capacity exceeds 1TB, a device image is stored in your specified Amazon S3 log bucket. You can then create a RAID of EBS volumes using software such as Logical Volume Manager, and copy the image from Amazon S3 to this new volume.
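The final step for devices over 1TB can be sketched as follows. The device names, volume group name, and bucket name are assumptions, and the commands are printed rather than executed; they stripe two EBS volumes into one logical volume with LVM and then stream the device image down from S3.

```shell
# Print (not run) the commands to build an LVM stripe and restore the image.
build_lvm_target() {
  echo 'pvcreate /dev/sdf /dev/sdg'                          # register EBS volumes as LVM physical volumes
  echo 'vgcreate data_vg /dev/sdf /dev/sdg'                  # group them into a volume group
  echo 'lvcreate -i 2 -I 64 -l 100%FREE -n data_lv data_vg'  # stripe across both volumes
  # stream the device image from the S3 log bucket onto the new logical volume
  echo 'aws s3 cp s3://my-log-bucket/device-image.img - | dd of=/dev/data_vg/data_lv bs=1M'
}
build_lvm_target
```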

Wednesday, December 4, 2013

MySQL : EBS snapshots for backing up a logical volume manager

It is possible to use EBS snapshots to back up a MySQL database when the data is stored on a logical volume. You have to make sure all active/cached data is written to disk and that no writes happen to the data files while the snapshots are taken.

Snapshotting a striped volume:
Flush data to disk, lock tables, and freeze disk writes:
1. mysql -u root -p
(at the MySQL prompt)
A. FLUSH TABLES WITH READ LOCK;
B. SHOW MASTER STATUS; 
C. SYSTEM sudo xfs_freeze -f /data
Snapshot all EBS volumes that are part of the logical volume manager:

2. At the Linux prompt:
A. aws ec2 create-snapshot --volume-id vol-xxxxxxxx --description "Snapshot of /dev/sdf" 
B. aws ec2 create-snapshot --volume-id vol-xxxxxxxx --description "Snapshot of /dev/sdg"

Unfreeze disk writes and unlock tables:
3. mysql -u root -ppassw-lab awslab
(at the MySQL prompt)
A. SYSTEM sudo xfs_freeze -u /data 
B. UNLOCK TABLES;
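The whole procedure can be scripted in one pass. A caveat worth noting: FLUSH TABLES WITH READ LOCK is released as soon as the client disconnects, so the freeze and snapshot commands must be issued from inside the same MySQL session via SYSTEM. The sketch below emits the session as text (volume IDs and the /data mount point are placeholders as in the steps above); pipe its output into `mysql -u root -p` to run it.

```shell
# Emit the single-session backup script described in the steps above.
backup_session() {
  cat <<'SQL'
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
SYSTEM sudo xfs_freeze -f /data
SYSTEM aws ec2 create-snapshot --volume-id vol-xxxxxxxx --description "Snapshot of /dev/sdf"
SYSTEM aws ec2 create-snapshot --volume-id vol-xxxxxxxx --description "Snapshot of /dev/sdg"
SYSTEM sudo xfs_freeze -u /data
UNLOCK TABLES;
SQL
}
backup_session
```

Usage: `backup_session | mysql -u root -p`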

Monday, December 2, 2013

Data stores compatible with Amazon EMR

There are a number of different file systems that can be used:

1. Hadoop Distributed File System (HDFS) : HDFS resides on the EC2 instances' local/ephemeral disk. The obvious disadvantage is that this is ephemeral storage, which is reclaimed when the cluster ends. It is best used for caching the results produced by intermediate job-flow steps during a large EMR job.
2. Local (ephemeral) EC2 disk :  Each EMR node comes with local disk.  This disk works well for temporary storage of data that is continually changing, such as buffers, caches, scratch data, and other temporary content.
3. S3 native : Used for input (data set to be reduced) and output/results.
4. S3 block : Stay away from this option, as it is not as performant as the other options.
5. HBase : HBase is an open source, non-relational, distributed database that runs on top of HDFS.  HBase works with Hadoop/EMR, sharing its file system and serving as a direct input and output to EMR jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC).
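A typical layout combines these: input and output live in S3 native, while intermediate data stays on HDFS. As a sketch (bucket, script names, AMI version, and instance settings are assumptions; the command is printed rather than executed):

```shell
# Print (not run) an EMR streaming cluster whose input/output are S3 native.
emr_example() {
  echo 'aws emr create-cluster --name "wordcount" --ami-version 3.1.0 --instance-type m1.large --instance-count 3 --steps Type=STREAMING,Args=[-input,s3://my-bucket/input,-output,s3://my-bucket/output,-mapper,mapper.py,-reducer,reducer.py]'
}
emr_example
```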

More information here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html



Tuesday, July 23, 2013

Redshift Query performance

The items that impact performance of the queries against Redshift are:
1. Node type : Redshift offers two node types: extra large (XL) and eight extra large (8XL).
2. Number of nodes : The number of nodes you choose depends on the size of your dataset and your desired query performance. Because Amazon Redshift distributes and executes queries in parallel across all nodes, you can increase query performance by adding nodes to your data cluster. You can monitor query performance in the Amazon Redshift Console and with Amazon CloudWatch metrics.
3. Sort Key : Keep in mind that not all queries can be optimized by sort key.   There is only one sort key for each table.   The Redshift query optimizer uses sort order when it determines optimal query plans.  If you do frequent range or equality filtering on one column, make this column the sort key.  If you frequently join a table, specify the join column as both the sort key and the distribution key.  More details here : http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
4. Distribution key : There is one distribution key per table. If the table has a foreign key or another column that is frequently used as a join key, consider making that column the distribution key. In making this choice, take pairs of joined tables into account. You might get better results when you specify the joining columns as the distribution keys and the sort keys on both tables. This enables the query optimizer to select a faster merge join instead of a hash join when executing the query. If the table is not involved in many joins, use a column that frequently appears in the GROUP BY clause. More on distribution keys here: http://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
Keep in mind you want to have even distribution across nodes. You can query the SVV_DISKUSAGE system view to find out the distribution.
5. Column compression : This has an impact on query performance : http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html
6. Run queries in memory : Redshift can run queries entirely from memory, which can significantly improve performance. More details here: http://docs.aws.amazon.com/redshift/latest/dg/c_troubleshooting_query_performance.html
7. Look at the query plan : More details can be found here : http://docs.aws.amazon.com/redshift/latest/dg/c-query-planning.html
8. Look at disk space usage : More details can be found here : http://docs.aws.amazon.com/redshift/latest/dg/c_managing_disk_space.html
9. Workload manager setting : By default, a cluster is configured with one queue that can run five queries concurrently. Workload management (WLM) details can be found here: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html
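Items 3 and 4 above can be sketched as DDL, together with the SVV_DISKUSAGE check for even distribution. Table and column names are assumptions; the function emits the SQL so you can pipe it into a client connected to your cluster.

```shell
# Emit example Redshift SQL: sort/distribution keys plus a distribution check.
redshift_examples() {
  cat <<'SQL'
CREATE TABLE sales (
  sale_id   INTEGER,
  cust_id   INTEGER,        -- frequent join column -> distribution key
  sale_date DATE,           -- frequent range filter -> sort key
  amount    DECIMAL(10,2)
)
DISTKEY (cust_id)
SORTKEY (sale_date);

-- Check how many 1 MB blocks each slice holds, to verify even distribution
SELECT name, slice, COUNT(*) AS blocks
FROM svv_diskusage
WHERE name = 'sales'
GROUP BY name, slice
ORDER BY slice;
SQL
}
redshift_examples
```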

Documentation: http://docs.aws.amazon.com/redshift/latest/dg/c_redshift_system_overview.html
Video: http://www.youtube.com/watch?v=6hk0Kvjrvfo
Blog: http://aws.typepad.com/aws/2012/11/amazon-redshift-the-new-aws-data-warehouse.html

Thursday, June 27, 2013

Oracle RAC on AWS

Oracle Real Application Clusters (RAC) is not natively supported on AWS. The word natively is used because it is possible to run Oracle RAC in an AWS Direct Connect facility http://aws.amazon.com/directconnect/.

There are a number of options when migrating an Oracle RAC database to AWS. The option you use depends upon the reason RAC is being used. For HA and failover, AWS offers Multi-AZ capabilities, which can provide the same level of service. For very large databases that require high transaction throughput that cannot be achieved on a single-instance database, Direct Connect would be the solution. Details on these options are as follows:
1. RDS with Multi-AZ : Oracle RDS is the managed database service from AWS. Oracle RDS has built-in Multi-AZ capabilities. Because RDS is a managed service, AWS takes care of installation, configuration, and management of the secondary database, the replication between AZs, and the failover and failback of the database instance.
2. EC2 with Multi-AZ : Running on EC2 requires the customer or partner to install, configure, and manage the replication. Oracle Data Guard or GoldenGate can be used for replication.
3. Direct Connect : The AWS partner Datapipe runs RAC in a managed service model using Direct Connect.
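Option 1 can be sketched from the CLI. The identifiers, instance class, engine edition, and credentials below are placeholders, and the command is printed rather than executed; the key flag is --multi-az, which provisions the standby in a second AZ.

```shell
# Print (not run) a Multi-AZ Oracle RDS instance creation command.
rds_multiaz_example() {
  echo 'aws rds create-db-instance --db-instance-identifier orcl1 --engine oracle-se1 --license-model license-included --db-instance-class db.m3.large --allocated-storage 100 --master-username admin --master-user-password <password> --multi-az'
}
rds_multiaz_example
```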


Note: Remember that AWS RDS only supports databases up to 6 TB in size; this limit was raised from 2 TB to 6 TB in June 2015. https://aws.amazon.com/about-aws/whats-new/2015/06/amazon-rds-increases-storage-limits-to-6TB-for-piops-and-gp2/. Check the What's New web page for any updates related to Oracle on RDS.

Note:  The reason Oracle Real Application Clusters (RAC) is not supported on AWS is:
1. Multicast is not supported on the AWS network. An overlay network is possible on AWS: http://cloudconclave.blogspot.com/2013/06/overlay-networks-on-aws.html
2. AWS EBS is not a shared disk / clustered file system.
So, even if you use a solution such as Amazon EFS, GlusterFS, Zadara, SoftNAS, or a custom NFS server for shared disk, you cannot use RAC on AWS because you need multicast support. More on Amazon Elastic File System (EFS): https://aws.amazon.com/blogs/aws/amazon-elastic-file-system-shared-file-storage-for-amazon-ec2/

Wednesday, June 5, 2013

How to support on-premises and EBS volume backup to S3 with one tool

There are many solutions to move data (backup, replicate, and synchronize) between your on-premises environment and AWS S3. However, when you are looking for a solution that can back up your AWS EBS volumes to S3, there are not as many. There are some vendor-specific products (i.e., Oracle Secure Backup), but nothing that natively (at a file system or raw partition level) backs up EBS to S3. Such a solution would also have the added value of using the same tool/product to back up on-premises disk and EBS volumes using the same method. One company that offers such a solution is Ctera (http://www.ctera.com).

Monday, May 13, 2013

Distributed File Systems: Network, Distributed, or Clustered?

I oftentimes hear these three distinct file system types used as if they were one and the same. This presentation does a nice job of describing how they are different and how they are the same:

http://lvee.org/uploads/image_upload/file/273/savchenko-distributed-fs.pdf


Network File System: A single server (or at least the appearance of one) and multiple network clients.
Examples: NFS, CIFS

Clustered File System: Servers sharing the same local storage (usually a SAN at the block level); a shared-storage architecture.
Examples: GFS2, OCFS2

Distributed File System: “Shared nothing” model with independent servers; an intelligent-server architecture.
Examples: pNFS, AFS

Monday, April 22, 2013

AWS shared disk options


Here are the five options most often discussed when considering NAS/shared disk/storage on AWS:
  1. S3 : Sometimes NAS isn't the right solution to the problem; it's just something that's relatively easy to implement.
  2. GlusterFS, Lustre, OpenAFS : Implement a distributed file system (GlusterFS, Lustre, OpenAFS, etc.). Write performance can be below that of writing directly to EBS.
  3. S3-backed 'filesystem' : Use an S3-backed "filesystem" (such as s3fs or Danilo's yas3fs), which is definitely easier to implement. However, write performance could become an issue.
  4. NFS : You could just run NFS on another EC2 instance. However, this will not provide the fault tolerance and scalability that are built into a solution such as GlusterFS, or a solution such as Zadara. With Zadara you can have a central repository/shared file system on an NFS mount that will be accessible from EC2 machines. You can mount Zadara from EC2 via NFS or iSCSI.
  5. Of course, when you are running an Oracle database you will probably not use one of these options. This would be like putting your on-premises Oracle database storage on NFS.
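Option 4 is simple to stand up. As a sketch (the private IP, export path, and mount point are placeholders; the commands are printed rather than executed): one instance exports a directory, and the others mount it.

```shell
# Print (not run) the commands for a basic NFS share between EC2 instances.
nfs_example() {
  echo '/data 10.0.0.0/24(rw,sync,no_root_squash)   # add to /etc/exports on the server'
  echo 'sudo exportfs -ra                           # reload exports on the server'
  echo 'sudo mount -t nfs 10.0.0.5:/data /mnt/data  # run on each client instance'
}
nfs_example
```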

Wednesday, December 5, 2012

AWS Provisioned I/Os



Provisioned IOPS volumes are designed to deliver within 10% of the provisioned IOPS performance 99.9% of the time. Therefore, PIOPS volumes are a very good choice when running relational database systems such as Oracle on EC2 or RDS.
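Provisioning a PIOPS volume from the CLI can be sketched as follows. The zone, size, IOPS figure, and IDs are placeholders, and the commands are printed rather than executed; note that io1 volumes tie the allowed IOPS to volume size, so size and --iops must be chosen together.

```shell
# Print (not run) commands to create and attach a Provisioned IOPS volume.
piops_example() {
  echo 'aws ec2 create-volume --availability-zone us-east-1a --size 200 --volume-type io1 --iops 2000'
  echo 'aws ec2 attach-volume --volume-id vol-xxxxxxxx --instance-id i-xxxxxxxx --device /dev/sdf'
}
piops_example
```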

Sunday, November 18, 2012

Oracle Database on AWS - Striping disk

To achieve higher disk I/O when running Oracle on EC2, you can stripe across EBS volumes using Oracle Database ASM. More information can be found here:

https://blogs.oracle.com/simonthorpe/entry/configuring_oracle_asm_disks_i