Tuesday, July 23, 2013

Redshift schema migration


Indexes, foreign keys, primary keys, and arrays are not supported in Redshift. The distribution key, which determines how your data is distributed across the cluster, is a very important part of the schema definition. Check all queries for the table and choose the column that is joined most frequently as the distribution key to get the best performance. You can only specify one distribution key, and if you are joining against multiple columns at large scale, you might notice performance degradation. Also, specify the columns your range queries use the most as the sort key on your table (a sort key can include multiple columns), as it will help with performance.
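
To make this concrete, below is a minimal sketch of a table definition with a distribution key and a compound sort key. The table, columns, and connection details are hypothetical, and the DDL is issued from Python through the psycopg2 driver (an assumption; any PostgreSQL-compatible client works because Redshift speaks the PostgreSQL wire protocol on port 5439).

# Hypothetical example: a Redshift table with a distribution key and a compound sort key.
import psycopg2

ddl = """
CREATE TABLE order_facts (
    order_id     BIGINT        NOT NULL,
    customer_id  BIGINT        NOT NULL,   -- joined most frequently -> distribution key
    order_date   DATE          NOT NULL,   -- range-filtered most frequently -> leading sort key column
    order_total  DECIMAL(12,2)
)
DISTKEY (customer_id)
COMPOUND SORTKEY (order_date, customer_id);
"""

# Placeholder endpoint and credentials; Redshift is reached like a PostgreSQL database.
conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="admin", password="...")
with conn, conn.cursor() as cur:
    cur.execute(ddl)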

Redshift Query performance

The items that impact performance of the queries against Redshift are:
1. Node type : Redshift offers two node types: an extra large node (XL) or an eight extra large node (8XL).
2. Number of nodes : The number of nodes you choose depends on the size of your dataset and your desired query performance. Amazon Redshift distributes and executes queries in parallel across all nodes, so you can increase query performance by adding nodes to your cluster.  You can monitor query performance in the Amazon Redshift Console and with Amazon CloudWatch metrics.
3. Sort Key : Keep in mind that not all queries can be optimized by sort key.   There is only one sort key for each table.   The Redshift query optimizer uses sort order when it determines optimal query plans.  If you do frequent range or equality filtering on one column, make this column the sort key.  If you frequently join a table, specify the join column as both the sort key and the distribution key.  More details here : http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
4. Distribution key :  There is one distribution key per table.  If the table has a foreign key or another column that is frequently used as a join key, consider making that column the distribution key. In making this choice, take pairs of joined tables into account. You might get better results when you specify the joining columns as the distribution keys and the sort keys on both tables. This enables the query optimizer to select a faster merge join instead of a hash join when executing the query.  If the table is not joined frequently, consider using a column from the GROUP BY clause.  More on distribution keys here: http://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
Keep in mind you want an even distribution across nodes. You can query the svv_diskusage system view to find out the distribution (see the sketch after this list).
5. Column compression : This has an impact on query performance : http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html
6. Run queries in memory : Redshift supports the ability to run queries entirely from memory, which can significantly improve performance.  More details here: http://docs.aws.amazon.com/redshift/latest/dg/c_troubleshooting_query_performance.html
7. Look at the query plan : More details can be found here : http://docs.aws.amazon.com/redshift/latest/dg/c-query-planning.html
8. Look at disk space usage : More details can be found here : http://docs.aws.amazon.com/redshift/latest/dg/c_managing_disk_space.html
9.  Workload manager setting : By default, a cluster is configured with one queue that can run five queries concurrently.  Workload management (WLM) details can be found here: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html
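
As a follow-up to item 4 above, here is a minimal sketch of checking how evenly a table's rows are spread across slices using the svv_diskusage system view (col = 0 counts each row once). The cluster endpoint, credentials, and table name are placeholders; psycopg2 is assumed as the client driver.

# Hypothetical example: check row counts per slice to spot distribution skew.
import psycopg2

skew_sql = """
SELECT TRIM(name) AS table_name, slice, SUM(num_values) AS row_count
FROM   svv_diskusage
WHERE  name = %s AND col = 0
GROUP  BY name, slice
ORDER  BY slice;
"""

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="admin", password="...")
with conn, conn.cursor() as cur:
    cur.execute(skew_sql, ("order_facts",))
    for table_name, slice_id, row_count in cur.fetchall():
        print(table_name, slice_id, row_count)   # large differences between slices suggest a skewed distribution key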

Documentation: http://docs.aws.amazon.com/redshift/latest/dg/c_redshift_system_overview.html
Video: http://www.youtube.com/watch?v=6hk0Kvjrvfo
Blog: http://aws.typepad.com/aws/2012/11/amazon-redshift-the-new-aws-data-warehouse.html

Redshift loading data and compression


Redshift is column based, so by virtue of this the data compresses well. Redshift also runs on high-disk-density, instance-based storage for further compression. You can tweak the compression setting for columns once you know your data better. More on this here: http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html

Redshift is designed to load data quickly.  The best approach is using the COPY command to load large amounts of data; using individual INSERT statements to populate a table might be prohibitively slow. Your data needs to be in the proper format for loading into your Amazon Redshift table. The AWS documentation provides guidelines for preparing and verifying your data before the load and for validating a COPY statement before you execute it.

You should definitely break the input file into manageable chunks and load from gzipped micro-slices on S3.
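
Here is a minimal sketch of that approach: the input has been split into gzipped parts sharing an S3 prefix (part0000.gz, part0001.gz, ...), and a single COPY statement loads them in parallel across the cluster. The bucket, prefix, credentials string, and delimiter are placeholders; psycopg2 is again assumed as the client.

# Hypothetical example: bulk-load gzipped, pipe-delimited files from an S3 prefix.
import psycopg2

copy_sql = """
COPY order_facts
FROM 's3://my-bucket/loads/order_facts/part'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
GZIP
DELIMITER '|';
"""

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="admin", password="...")
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift reads all files that match the prefix in parallel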

Be careful when using ETL tools. Unless an ETL tool is integrated with Redshift/S3, it may use individual INSERT statements instead of the COPY command.


Monday, July 22, 2013

AWS CloudFront : points of interest when running Oracle

When using a content delivery network such as AWS CloudFront, one of the first questions is how to make sure the latest content is at the edge location.  The time-to-live (TTL) is set using the Cache-Control directive: the Cache-Control max-age directive lets you specify how long (in seconds) you want the object to remain in the cache before CloudFront gets the object again from the origin server.  The minimum expiration time CloudFront supports is 0 seconds and the maximum is in the year 2038.   The default, if not set, is 24 hours.   Setting it to 0 does not mean the web page will always go to the origin for its content.  It means that CloudFront delegates the authority for cache control to the origin, i.e. the origin server decides whether, and for how long, CloudFront caches the objects.
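
As a hedged illustration, here is a sketch of setting the Cache-Control max-age header on an object in an S3 origin bucket so that CloudFront re-fetches it from the origin after one hour. The bucket and key are hypothetical, and the boto3 Python SDK is assumed.

# Hypothetical example: give an origin object a one-hour TTL for CloudFront.
import boto3

s3 = boto3.client("s3")
with open("logo.png", "rb") as f:
    s3.put_object(
        Bucket="my-origin-bucket",       # placeholder origin bucket
        Key="static/logo.png",
        Body=f,
        ContentType="image/png",
        CacheControl="max-age=3600",     # Cache-Control header CloudFront honors
    )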


Using CloudFront with Oracle applications such as E-Business Suite, PeopleSoft, or Siebel would be an interesting exercise, as these products have very dynamic web pages.  There is the potential of pointing CloudFront to an AWS ELB which is in front of an Oracle Applications deployment.  It is not clear how much this would improve performance of the application.

Moving EBS volumes between Linux and Windows


EBS volumes are portable between instances running different operating systems, but this does not mean the underlying file system format will be compatible.  EBS is a block-level storage device.  The volume must be formatted with a file system, which may or may not be usable across different EC2 instances depending on the OS of the instance.

Moving EBS volumes between Linux and Windows is where most of the compatibility issues may arise. NTFS and EXT4 are some of the most common file system formats.

One option is to use NTFS as your 'master' file system.  Then use a utility like ntfsprogs (http://en.wikipedia.org/wiki/Ntfsprogs) to use the EBS volume for both Windows and Linux instances. If you don't need to write on Linux, you can simply mount the NTFS drive and read it on Linux; Linux can natively read NTFS, but it cannot write to NTFS drives natively.

The other option is to use EXT4 (or another Linux file system) as your 'master' file system. However, EXT4 is not natively supported on Windows.

Also, ext2fsd (http://www.ext2fsd.com) is a proven solution for reading EXT4 from Windows. For those that don't need NTFS, sharing file systems between Windows and Linux can be done using exFAT. Another potential solution that supports reading and writing to NTFS on Linux is: http://sourceforge.net/projects/ntfs-3g/

Thursday, July 18, 2013

SQS message retention and visibility

There are two important attributes (http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Query_QuerySetQueueAttributes.html) set when working with AWS SQS.  The first is the MessageRetentionPeriod and the other is the VisibilityTimeout.  These are two very different and distinct attributes.  The MessageRetentionPeriod is the length of time (in seconds) the message will stay in the queue (unless it is deleted).  The value can be 60 (1 minute) to 1209600 (14 days).  The VisibilityTimeout is the length of time during which no other application can see the message while one application is processing it.  This can be set from 0 to 43200 (12 hours). The longer this is set, the longer you expect a single process to work on the message.

SQS automatically deletes messages that have been in a queue for more than the maximum message retention period. The default message retention period is 4 days. However, you can set the message retention period to a value from 60 seconds to 1209600 seconds (14 days) with SetQueueAttributes.
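
Here is a minimal sketch of setting these two attributes on a queue and then processing a message; the queue name and values are placeholders, and the boto3 Python SDK is assumed.

# Hypothetical example: configure retention and visibility, then consume a message.
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(
    QueueName="orders",                        # placeholder queue name
    Attributes={
        "MessageRetentionPeriod": "1209600",   # keep undelivered messages up to 14 days
        "VisibilityTimeout": "300",            # hide a message for 5 minutes while it is being processed
    },
)["QueueUrl"]

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for msg in resp.get("Messages", []):
    # ... process msg["Body"] here ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])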

More on why messages are not deleted once read and the visibility of a message while it is being processed by an application:
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/AboutVT.html

Friday, July 12, 2013

Oracle Enterprise Manager Solutions on AWS

Listen to this webinar, presented by Amazon Web Services (AWS) and Apps Associates, an AWS Partner Network (APN) Advanced Consulting Partner, to hear how you can leverage the AWS platform to run a centralized OEM 12c environment and free up your administrative resources.

AWS Security 101 for Oracle DBAs, Developers and Architects

Oracle DBAs understand securing data in transit and at rest, but they typically don't have to deal with file-level encryption, security outside of the database, firewalls, denial-of-service attacks, SQL injection attacks, and other OS-level security.

General infrastructure security concepts:
Some networking concepts such as VPC, VPN, and IPSec also apply to the security realm.  More on these concepts can be found here: http://cloudconclave.blogspot.com/2013/07/aws-network-101-for-oracle-dbas.html
1. SSL : The Secure Sockets Layer (SSL) is a commonly-used protocol for managing the security of a message transmission on the Internet. SSL uses the public-and-private key encryption system from RSA, which also includes the use of a digital certificate.  
2. ACLs : Access Control Lists (ACLs) specify which users or system processes are granted access to objects, as well as what operations are allowed on given objects.
3. MFA : Multifactor authentication (MFA) is a security system in which more than one form of authentication is implemented to verify the legitimacy of a transaction. The goal of MFA is to create a layered defense and make it more difficult for an unauthorized person to access a computer system or network.  An MFA device can be a Gemalto token (http://onlinenoram.gemalto.com/) or even an iPhone.  http://cloudconclave.blogspot.com/2013/06/mfa-made-easy.html
4. Bastion Host : A bastion host is a special purpose computer on a network specifically designed and configured to withstand attacks. Information on bastion hosts on AWS with Oracle can be found in these two posts: http://cloudconclave.blogspot.com/2013/05/aws-bastion-host-as-single-point-of.html and http://cloudconclave.blogspot.com/2013/05/dba-and-developer-access-to-oracle.html
5. iptables : iptables are the tables provided by the Linux kernel firewall.  These firewall rules make it possible for administrators to control what hosts can connect to the system, and limit risk exposure by limiting the hosts that can connect to a system.  Information on iptables for security on AWS here: http://cloudconclave.blogspot.com/2013/06/aws-security-with-iptables.html
6. IDS : An intrusion detection system (IDS) is a device or software application that monitors network or system activities for malicious activities or policy violations and produces reports to a management station. Some systems may attempt to stop an intrusion attempt but this is neither required nor expected of a monitoring system.
7. IPS : Intrusion prevention systems (IPS), also known as intrusion detection and prevention systems (IDPS), are network security appliances that monitor network and/or system activities for malicious activity. The main functions of intrusion prevention systems are to identify malicious activity, log information about this activity, attempt to block/stop it, and report it.   Intrusion prevention systems are considered extensions of intrusion detection systems because they both monitor network traffic and/or system activities for malicious activity.

8. DoS :  A denial-of-service attack (DoS attack) or distributed denial-of-service attack (DDoS attack) is an attempt to make a machine or network resource unavailable to its intended users.  IPS, iptables, AWS security groups, NACLs, and bastion hosts are all ways to help mitigate DoS attacks.
9. Penetration testing : A penetration test, occasionally called a pentest, is a method of evaluating computer and network security by simulating an attack on a computer system or network from external and internal threats.

AWS specifics. You must be familiar with all of these concepts in order to perform basic actions on AWS and EC2:
1. Access key and secret key : The access key is used to access AWS using the CLI and REST API.  The REST and Query APIs use your access keys as the credential. You might be using a third-party product such as S3Fox or ElasticWolf that requires your access keys (because the product itself makes AWS requests for you). Although access keys are primarily used for REST or Query APIs, Amazon S3 and Amazon Mechanical Turk also use access keys with their SOAP APIs. Your Access Key ID identifies you as the party responsible for service requests. You include it in each request, so it's not a secret. The secret key provides anyone who possesses it incredible power to perform delete, terminate, start, etc. actions on your AWS resources (EC2, ELB, S3, etc.), so be very careful with it. Don't e-mail it to anyone, include it in any AWS requests, or post it on the AWS Discussion Forums. No authorized person from AWS will ever ask for your Secret Access Key.
2. x509 : X.509 certificates are based on the idea of public key cryptography. They are used for making requests to AWS product SOAP APIs (except for Amazon S3 and Amazon Mechanical Turk, which use access keys for their SOAP APIs).  SOAP services are being de-emphasized, so x509 will not be used as much moving forward.
3. Key pair file (SSH pem file) : You use an Amazon EC2 key pair (aka PEM file) each time you launch an EC2 Linux/UNIX or Windows instance. The key pair ensures that only you have access to the instance. Each EC2 key pair includes a key pair name, a private key, and a public key.  PEM is a file format that may consist of a certificate (aka public key), a private key, or indeed both concatenated together. Don't pay too much attention to the file extension; it stands for Privacy Enhanced Mail, a use that didn't see much adoption, but the file format stuck around. More on using PEM with EC2 here: http://cloudconclave.blogspot.com/2012/09/connecting-to-aws-ec2-using-ssh-and-sftp.html
4. Security Groups : A security group acts as a firewall that controls the traffic allowed to reach one or more instances. When you launch an instance, you assign it one or more security groups. You add rules to each security group that control traffic for the instance. You can modify the rules for a security group at any time; the new rules are automatically applied to all instances to which the security group is assigned.
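
As an illustration of the security group item above, here is a sketch that creates a group and allows inbound traffic on the Oracle listener port only from a private CIDR range. The group name, VPC ID, and CIDR are placeholders; the boto3 Python SDK is assumed.

# Hypothetical example: allow the Oracle listener port (1521) from an app-tier range only.
import boto3

ec2 = boto3.client("ec2")
sg = ec2.create_security_group(
    GroupName="oracle-db-sg",                 # placeholder name
    Description="Oracle listener access",
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC ID
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 1521,
        "ToPort": 1521,
        "IpRanges": [{"CidrIp": "10.0.0.0/16"}],   # placeholder app-tier CIDR
    }],
)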

These AWS security concepts are not necessary at first, but once you get beyond the 'playing around' phase of working with AWS, these security components are key to working with AWS:
1. ARNs : Amazon Resource Names (ARNs) uniquely identify AWS resources. AWS requires an ARN when you need to specify a resource unambiguously across all of AWS, such as in IAM policies, Amazon Relational Database Service (Amazon RDS) tags, and API calls.  Here is an example ARN:
<!-- Amazon RDS tag -->
arn:aws:rds:eu-west-1:001234567890:db:mysql-db
ARNs are used extensively with IAM to place security/access policies on AWS services.
2. IAM : AWS Identity and Access Management (IAM) enables you to securely control access to AWS services and resources for your users. Using IAM you can create and manage AWS users and groups and use permissions to allow and deny their access to AWS resources. More details here: http://cloudconclave.blogspot.com/2012/10/aws-iam-service.html and http://cloudconclave.blogspot.com/2013/05/aws-getting-started-with-groups-and.html
3. NACLs : Network ACLs operate at the subnet level and evaluate traffic entering and exiting a subnet. Network ACLs can be used to set both Allow and Deny rules. Network ACLs do not filter traffic between instances in the same subnet. In addition, network ACLs perform stateless filtering while security groups perform stateful filtering.
4. S3 SSE : Server-side encryption is about data encryption at rest; that is, Amazon S3 encrypts your data as it writes it to disks in its data centers and decrypts it for you when you access it. As long as you authenticate your request and you have access permissions, there is no difference in the way you access encrypted or unencrypted objects. Amazon S3 manages encryption and decryption for you. For example, if you share your objects using a pre-signed URL, the pre-signed URL works the same way for both encrypted and unencrypted objects. More here: http://cloudconclave.blogspot.com/2013/07/s3-sse-without-request-header.html

5. Data Encryption : AWS does not provide encryption of EBS (Elastic Block Store). More details on a couple of vendors that provide solutions here: http://cloudconclave.blogspot.com/2013/04/ebs-volume-encryption.html


Thursday, July 11, 2013

Oracle Database data masking on AWS

A popular first use case for migrating Oracle workloads is to move development, test and QA Oracle Databases to AWS.   This data needs to be protected using data masking.  Axis Technologies provides a technology that can mask data when an Oracle Database is hosted on AWS EC2.

More can be found here:
http://www.axistechnologyllc.com/solutions/cloudmigration

An introduction to running Oracle on AWS can be found here:
http://cloudconclave.blogspot.com/2013/06/oracle-enterprise-applications-on-aws.html

AWS Networking 101 for Oracle DBAs, Developers and Architects

Oracle DBAs understand TCP/IP and ports, as this is how they connect to and manage an Oracle database.  However, they typically have not needed to understand other networking constructs such as routing tables, network address translation, VPN tunnels, or even a network mask.  This blog post will cover networking terminology, AWS networking services and features, and specifics around DNS.

Below are some general network terms and constructs you need to understand when you move to AWS:
1. CIDRs (Classless Inter-Domain Routing) : CIDR is also known as supernetting, as it effectively allows multiple subnets to be grouped together for network routing.  CIDR specifies an IP address range using a combination of an IP address and its associated network mask. An example is 192.168.1.0/24.  This means that the first three octets (192, 168, and 1) are fixed and the last octet is available to use.  Therefore, there are 256 IP addresses available to use: 192.168.1.0 - 192.168.1.255.  CIDRs are used in AWS VPC and security groups (see the short sketch after this list).
2. VPN (Virtual Private Network) : Extends a private network across a public network.  This allows AWS to be an extension of your corporate network.  It also provides security, encryption, and management across your Internet-based connection to AWS.
3. IPsec : A protocol suite for securing IP communications.  When you establish a VPN connection to AWS VPC, you create an IPsec tunnel for secure communication over the Internet. More here: http://cloudconclave.blogspot.com/2013/03/getting-started-with-aws-vpc.html
4. Layer 2 and Layer 3 networks : The Internet Protocol (IP) address is a layer 3 address.  Layer 3 networks do routing at the IP level.  Layer 2 networks operate at the data link layer of the network; therefore, they use the Media Access Control (MAC) address to determine where to direct the message.  AWS only exposes a layer 3 network (there is no layer 2 access), and this could impact some of the third-party solutions that can run on AWS.
5. Multicast and unicast : Multicast is a true broadcast. The multicast source relies on multicast-enabled routers to forward the packets to all client subnets that have clients listening. Unicast is a one-to-one connection between the client and the server. Unicast uses IP delivery methods such as Transmission Control Protocol (TCP) and User Datagram Protocol (UDP), which are session-based protocols.  AWS only supports unicast.  Some software products (such as Oracle RAC) use multicast, so they cannot be run on AWS infrastructure.
6. VLAN : A single layer-2 network may be partitioned to create multiple distinct broadcast domains.  When using AWS Direct Connect, you can provision virtual interface (VLAN) connections to the AWS cloud, Amazon VPC, or both.  You cannot extend your data center VLAN into the AWS cloud when using AWS Direct Connect.
7. NAT : Network Address Translation (NAT) is the process of modifying IP address information in IPv4 headers while in transit across a traffic routing device.  NAT EC2 instances are used to translate IP addresses in an AWS VPC when instances are in a private subnet and need to communicate with the outside world.
8. SDN : Software-defined networking (SDN) is an approach to computer networking that decouples the control plane from the data plane. SDN does for networking what virtual machines have done for compute virtualization: SDN is network virtualization.
9. iptables : iptables is essentially how an AWS NAT instance does the IP translation (it actually does port translation, so an AWS NAT instance is really doing PAT - Port Address Translation).
10. Overlay networks : An overlay network is a computer network which is built on top of another network.  For example, since the AWS network does not support multicast, you could place an overlay network on top of the base AWS network that supports multicast.  Blog post on overlay networks and SDN : http://cloudconclave.blogspot.com/2013/06/overlay-networks-on-aws.html
11. BGP : Border Gateway Protocol (BGP) is the protocol used to make core routing decisions on the Internet; it involves a table of IP networks or "prefixes" which designate network reachability among autonomous systems (AS).  BGP does dynamic routing, and AWS refers to a BGP device as the Customer Gateway when using a VPN connection to AWS VPC.
12. ASA : The Cisco ASA is an example of a device that uses static routing.  The Cisco ASA device is referred to as the Customer Gateway when using a VPN connection to AWS VPC.
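
Here is the short sketch referenced in the CIDR item above, using Python's standard ipaddress module to show what the 192.168.1.0/24 example covers.

# Illustration of the /24 CIDR example with the standard library.
import ipaddress

net = ipaddress.ip_network("192.168.1.0/24")
print(net.num_addresses)                              # 256
print(net[0], net[-1])                                # 192.168.1.0 192.168.1.255
print(ipaddress.ip_address("192.168.1.42") in net)    # True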

These are AWS specific services and components:
1. VPC : Amazon Virtual Private Cloud (Amazon VPC) lets you provision a logically isolated section of the Amazon Web Services (AWS) Cloud where you can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways. (A minimal provisioning sketch follows this list.)
2. Internet Gateway :  The Internet Gateway allows EC2 instances in a VPC to communicate with the Internet.  When you launch an AWS VPC with a public subnet, it comes with an Internet Gateway, and instances launched into a public subnet have a public IP address and communicate with the Internet using the Internet Gateway.
Instances that you launch into a private subnet do not receive a public IP address, and can't communicate with the Internet. You can enable Internet access for instances that you launch into a private subnet by using a NAT instance.
3. Customer Gateway : A customer gateway is a physical device or software application on your side of the VPN connection.  The Customer Gateway is used to create a secure IPsec VPN tunnel to AWS VPC.
4. Virtual Private Gateway : A virtual private gateway is the VPN concentrator on the Amazon side of the VPN connection.  The VPG is a service provided by AWS.
5. ENI : An elastic network interface (ENI) is a virtual network interface that you can attach to an instance in a VPC. ENIs allow an EC2 instance to have more than one IP address: a primary private IP address, one or more secondary private addresses, or an Elastic IP address. You can create a network interface, attach it to an instance, detach it from an instance, and attach it to another instance. The attributes of a network interface follow it as it is detached from one instance and reattached to another; when you move a network interface from one instance to another, network traffic is redirected to the new instance.  This feature is useful for creating a management network, dual-homed instances, or security appliances in your VPC.
6. ElasticIP : An Elastic IP address (EIP) is a static public IP address that can be assigned to an EC2 instance or an ENI.  A more appropriate name for an EIP may be a Public IP address. With an EIP, you can mask the failure of an instance by rapidly remapping the address to another instance. Your EIP is associated with your AWS account, not a particular instance, and it remains associated with your account until you choose to explicitly release it.
There's one pool of EIPs for use with the EC2-Classic platform and another for use with your VPC. You can't associate an EIP that you allocated for use with a VPC with an instance in EC2-Classic, and vice-versa.
7. Public and Private Subnet : A subnet is a range of IP addresses in your VPC. You can launch AWS resources into a subnet that you select. Use a public subnet for resources that must be connected to the Internet, and a private subnet for resources that won't be connected to the Internet. Instances in the public subnet can receive inbound traffic directly from the Internet, whereas the instances in the private subnet can't. The instances in the public subnet can send outbound traffic directly to the Internet, whereas the instances in the private subnet can't. More on public and private subnets can be found here: http://cloudconclave.blogspot.com/2013/05/aws-vpc-public-and-private-subnets.html
8. NAT Instances : Instances that you launch into a private subnet in a virtual private cloud (VPC) can't communicate with the Internet. You can optionally use a network address translation (NAT) instance in a public subnet in your VPC to enable instances in the private subnet to initiate outbound traffic to the Internet, but prevent the instances from receiving inbound traffic initiated by someone on the Internet.
9. Route 53 : Amazon Route 53 is a Domain Name System (DNS) web service.  Route 53 resolves a domain name to an IP address.  More on Route 53 can be found here: http://cloudconclave.blogspot.com/2013/05/routing-53-as-your-dns-service.html
10. Direct Connect : Direct Connect makes it easy to establish a dedicated network connection from your premises to AWS. Using AWS Direct Connect, you can establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.  Direct Connect has speeds of 1 Gbps or 10 Gbps.   When companies are extending their Oracle solutions into the cloud, they often choose Direct Connect because Internet speeds are not fast enough.  More on Direct Connect: http://cloudconclave.blogspot.com/2013/06/aws-direct-connect-active-active-with.html and http://cloudconclave.blogspot.com/2013/06/aws-vpn-connection-as-direct-connect.html.  Direct Connect also refers to a facility next to an AWS data center that can be used to host third-party hardware and software solutions such as Oracle RAC. More on this here: http://cloudconclave.blogspot.com/2013/06/oracle-rac-on-aws.html
11. CloudFront : CloudFront is an edge location content delivery service.  It is mostly used to deliver static content such as web sites, documents, videos, pictures etc.  However, it can also be used for dynamic content.
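
Here is the minimal provisioning sketch referenced in the VPC item above: a VPC with one public subnet, an Internet Gateway, and a default route out to the Internet. All CIDR blocks are placeholders, the boto3 Python SDK is assumed, and error handling is omitted.

# Hypothetical example: a VPC with one public subnet routed through an Internet Gateway.
import boto3

ec2 = boto3.client("ec2")

vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]
subnet = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.0.1.0/24")["Subnet"]

igw = ec2.create_internet_gateway()["InternetGateway"]
ec2.attach_internet_gateway(InternetGatewayId=igw["InternetGatewayId"], VpcId=vpc["VpcId"])

# Routing 0.0.0.0/0 through the Internet Gateway and associating the route table
# with the subnet is what makes the subnet "public".
rt = ec2.create_route_table(VpcId=vpc["VpcId"])["RouteTable"]
ec2.create_route(RouteTableId=rt["RouteTableId"],
                 DestinationCidrBlock="0.0.0.0/0",
                 GatewayId=igw["InternetGatewayId"])
ec2.associate_route_table(RouteTableId=rt["RouteTableId"], SubnetId=subnet["SubnetId"])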

Specific to Route 53 (the AWS DNS hosting service): http://cloudconclave.blogspot.com/2013/05/routing-53-as-your-dns-service.html
1. DNS hosting service : A DNS hosting service is a service that runs Domain Name System servers.  
2. A records : An A record (Address Record) points a domain or subdomain to an IP address.
3. Zone apex record : It is sometimes called the root domain or naked domain.  The apex record would be domainname.com without a www or any other prefix.
4. Cname : A CNAME (Canonical Name) points one domain or subdomain to another domain name, allowing you to update one A Record each time you make a change, regardless of how many Host Records need to resolve to that IP address.
5. Alias records : Route 53 offers ‘Alias’ records (a Route 53-specific virtual record). Alias records are used to map resource record sets in your hosted zone to Elastic Load Balancing load balancers, CloudFront distributions, or S3 buckets that are configured as websites. Alias records work like a CNAME record in that you can map one DNS name (example.com) to another ‘target’ DNS name (elb1234.elb.amazonaws.com). They differ from a CNAME record in that they are not visible to resolvers. Resolvers only see the A record and the resulting IP address of the target record.
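
As a hedged illustration of alias records, here is a sketch that creates an alias A record at the zone apex pointing to an ELB. The hosted zone IDs and DNS names are placeholders; the boto3 Python SDK is assumed.

# Hypothetical example: an alias A record at the zone apex that targets an ELB.
import boto3

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLEZONE",                      # placeholder hosted zone for example.com
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "example.com.",                    # the zone apex (no www prefix)
            "Type": "A",
            "AliasTarget": {
                "HostedZoneId": "Z2ELBZONEID",         # placeholder: the ELB's hosted zone ID
                "DNSName": "elb1234.us-east-1.elb.amazonaws.com.",
                "EvaluateTargetHealth": False,
            },
        },
    }]},
)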

Security also plays a key role when configuring a network on AWS. More on security can be found here: http://cloudconclave.blogspot.com/2013/07/aws-security-101-for-oracle-dbas.html

Wednesday, July 10, 2013

EMR : Common use cases

Here are a couple of common use cases for EMR:

1. Creating sessions from weblogs : The sequence of web pages through which a user navigated is an example of a session. Sessionization is one of the first steps in many types of log analysis and management, such as personalized website optimization, infrastructure operation optimization, and security analytics. (A minimal sessionization sketch follows this list.)


One study used 150 billion log entries (~24 TB) from 1 million users and produced 1.6 billion sessions. 

2. Recommendation engine : The EMR cluster reads a history of movie ratings from multiple users for multiple movies. Then it builds a co-occurrence matrix that scores the similarity of each pair of movies. Combining the matrix with each user’s movie-rating history, the engine predicts a given user’s preference for unrated movies.
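
Here is the minimal sessionization sketch referenced in item 1: pure-Python logic that splits each user's page views into sessions whenever 30 minutes pass between views. In practice this would run as an EMR (Hadoop/Hive/Pig) job over the raw weblogs; the function below only shows the core algorithm, and the 30-minute gap is an arbitrary choice.

# Hypothetical example: split (user_id, timestamp, url) events into per-user sessions.
from collections import defaultdict
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(events):
    """events: iterable of (user_id, timestamp, url) tuples."""
    by_user = defaultdict(list)
    for user_id, ts, url in events:
        by_user[user_id].append((ts, url))

    sessions = []
    for user_id, views in by_user.items():
        views.sort()                            # order each user's views by time
        current = [views[0]]
        for prev, cur in zip(views, views[1:]):
            if cur[0] - prev[0] > SESSION_GAP:  # a long gap closes the current session
                sessions.append((user_id, current))
                current = []
            current.append(cur)
        sessions.append((user_id, current))
    return sessions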

Tuesday, July 9, 2013

AWS usage, billing, and reserved instances

ICE is an open source utility from Netflix that allows you to track AWS usage and billing and manage efficient use of AWS:
http://techblog.netflix.com/2013/06/announcing-ice-cloud-spend-and-usage.html?m=1

S3 and HSM


It is possible to use CloudHSM to encrypt data stored in other AWS services, such as Amazon S3. However, the encryption operations must be handled by your application in conjunction with CloudHSM.

S3 client side encryption


You can build your own library that encrypts your object data on the client side before uploading it to Amazon S3, or you can use an AWS-provided SDK. Currently, only the AWS SDK for Java supports client-side encryption.
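
For those not using the Java SDK, here is a hedged do-it-yourself sketch of client-side encryption: the object is encrypted locally before upload, so S3 only ever stores ciphertext. It uses the third-party cryptography package and the boto3 SDK (both assumptions), and it leaves key management entirely to the application.

# Hypothetical example: encrypt on the client, upload ciphertext, and decrypt after download.
import boto3
from cryptography.fernet import Fernet

data_key = Fernet.generate_key()     # keep this key safe; losing it means losing the data
with open("report.csv", "rb") as f:
    ciphertext = Fernet(data_key).encrypt(f.read())

s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="reports/report.csv.enc", Body=ciphertext)

obj = s3.get_object(Bucket="my-bucket", Key="reports/report.csv.enc")
plaintext = Fernet(data_key).decrypt(obj["Body"].read())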

S3 SSE without request header


You must provide a request header, x-amz-server-side-encryption.  S3 SSE encrypts each object with a unique key.  Unfortunately, there is no way to enable SSE for a bucket so that SSE happens automatically; there is no such bucket policy or bucket setting.  Some third-party tools take care of automatically using SSE, such as S3 Browser: http://s3browser.com/amazon-s3-server-side-encryption.php. It can enable encryption for already uploaded files, and it can also be configured to automatically apply encryption during uploading.
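
Here is a minimal sketch of requesting SSE on upload; the bucket and key are placeholders, and the boto3 Python SDK is assumed (it sets the x-amz-server-side-encryption header for you).

# Hypothetical example: upload an object with server-side encryption and verify it.
import boto3

s3 = boto3.client("s3")
with open("export.csv", "rb") as f:
    s3.put_object(
        Bucket="my-bucket",
        Key="data/export.csv",
        Body=f,
        ServerSideEncryption="AES256",    # sends x-amz-server-side-encryption: AES256
    )

head = s3.head_object(Bucket="my-bucket", Key="data/export.csv")
print(head["ServerSideEncryption"])       # "AES256"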

Oracle OEL AMIs running on cluster compute instances

AWS cluster compute instance types (cc1.4xlarge, cg1.4xlarge, cr1.8xlarge, cc2.8xlarge) all must be run as HVM instances.  Currently, all of the AMIs from Oracle are PV only. Windows EC2 instances can only be HVM since PV does not support Windows, so Windows AMIs or HVM-based Linux AMIs can be used.

OVM-based Oracle AMIs also cannot be run on cluster compute instance types.

Wednesday, July 3, 2013

Redshift block size


Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution.

Redshift - New node and data distribution


What happens when a new node is added to a Redshift cluster?
A 2-node cluster will distribute data evenly between the two nodes based on a hash of the DISTKEY. If a 3rd node is added, the data needs to be rebalanced amongst the 3 nodes. It’s not just a matter of sending all new data to the 3rd node, because that would require lookups to figure out where data is stored. Rather, the cluster needs to be rebalanced by redistributing the data across the 3 nodes. Redshift takes care of this automatically: just add the nodes and the data moves.
Redshift redistributes the data as follows:
  • A ‘new’ set of nodes is created (in the above example, 3 nodes would be created)
  • Redshift moves the data from the 2-node cluster to the 3-node cluster, rebalancing the data during the copy
  • Users are then flicked across from the ‘old’ 2-node cluster to the ‘new’ 3-node cluster
This is an example of scalable cloud infrastructure — rather than having to ‘upgrade’ an existing system, it is much more efficient to provision a new system, copy data and then decommission the old system. This is a new way of looking at infrastructure that is quite different to the old way of thinking in terms of physical boxes.
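
As a hedged illustration of such a resize, here is a sketch that grows a cluster from 2 to 3 nodes with the ModifyCluster API. The cluster identifier is a placeholder and the boto3 Python SDK is assumed; the cluster is read-only while the resize and redistribution are in progress.

# Hypothetical example: add a node and poll until the cluster is available again.
import boto3

redshift = boto3.client("redshift")
redshift.modify_cluster(
    ClusterIdentifier="my-cluster",       # placeholder cluster name
    NumberOfNodes=3,
)

status = redshift.describe_clusters(ClusterIdentifier="my-cluster")["Clusters"][0]["ClusterStatus"]
print(status)                             # e.g. "resizing", then back to "available"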