Sunday, July 6, 2014

Cloud and security

Cloud computing conception

Depending upon who you talk to, cloud computing is either a new or an old computing paradigm. Here is an article discussing how cloud computing was pioneered in the 1960s:

Amazon EMR fact and information

Here is some information about EMR:
1. Job Flow : A Job Flow is Amazon's term for the end-to-end processing of data that occurs through a number of compute steps.  A Job Flow is defined by the MapReduce application and its input and output parameters.  An EMR cluster does not need a Job Flow, as data processing can be done interactively using Hive, Pig, Impala, or another language.
2. Task group : The task group is optional. The task group instances do not have HDFS storage, so data needs to be transferred to these nodes by the master node.  The task group can offload heavy computational work from the core group instances.
3. S3 : Amazon S3 is used for the input and output storage of the data sets to be processed and analyzed.
4. AMIs : The EMR cluster nodes are maintained by Amazon.  Amazon regularly updates the EC2 AMIs with newer releases of Hadoop, security patches, and more.
5. Map and reduce : The map procedure takes data as input and filters and sorts the data down to a set of key/value pairs that will be processed by the reduce procedure.  The reduce procedure performs a summary operation such as grouping, sorting, or counting the key/value pairs. For example, the map procedure parses out the date and time and treats this data element as a key; a reduce procedure can then determine a count for each date and time.
6. s3cmd : s3cmd is used at the OS command line to load data into S3.
7. Job Flow Scheduling : To schedule a Job Flow to run every hour, you can configure cron to execute a script that launches it.
8. EMR technologies supported as steps : EMR supports six technologies to be used in steps in the EMR cluster:
a. Hive : Open source data warehouse package. The Hive Query Language (HQL) is a lot like RDBMS SQL, so it is best for organizations with strong SQL skills.  Hive also has extensions that support direct access to DynamoDB, so you can load EMR directly from DynamoDB.
b. Custom Jar : Core Hadoop Java libraries preloaded into the EMR cluster.
c. Streaming : Allows you to write Amazon EMR Job Flows in Ruby, Perl, Python, PHP, R, Bash, or C++.   Streaming is a good way to convert an existing ETL job to run in EMR.
d. Pig : Pig is a data flow engine that is preloaded in the EMR cluster.  It is a good fit for organizations comfortable writing data flow scripts in Pig Latin.
e. Impala : Impala is similar to Hive but works faster in certain use cases. More here:
f. HBase : HBase is an efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because data is stored in-memory instead of on disk.
9. Filter statement : A Map custom JAR application uses the filter statement, which is like a WHERE clause in a SQL statement.
10. GROUP: A Reduce custom JAR application uses the GROUP statement, which is like a GROUP BY clause in a SQL statement.
11. Limited structure data and late binding : Unlike data warehousing solutions based upon OLAP or RDBMS, Amazon EMR clusters work with unstructured data and perform late binding of the schema.
12. Performance on small data sets : When running Hive queries against EMR, the run time will appear shockingly slow on small data sets when compared to a traditional RDBMS. The structured nature of the data sets and the indexing capabilities make the RDBMS faster.  EMR (MapReduce) is made for large, unstructured data sets.
13. Mahout : Mahout is supported in EMR.
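Items 6 and 7 above can be sketched at the command line. This is only illustrative: the bucket name, file name, and launcher script path are all placeholders, and the script itself is assumed to exist.

```shell
# Item 6: load an input data set into S3 with s3cmd
# (bucket and file names are examples)
s3cmd put weblog-2014-07-06.gz s3://my-emr-input/logs/

# Item 7: run a Job Flow launcher script at the top of every hour via cron
# (the script path is an assumption)
echo '0 * * * * /home/ec2-user/launch-jobflow.sh' | crontab -
```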

Geolocation information for IP addresses

MaxMind ( can be used to identify the location, organization, connection speed, and user type of your Internet visitors.

Thursday, June 26, 2014

Web pages in Amazon S3

      Each client page is an object in Amazon S3, addressable by a unique DNS CNAME. The CNAME resolves to the IP address of the S3 endpoint, and /foo/bar.html is the unique key given to the object in S3.

Amazon Linux updates

The Amazon Linux AMI repositories are available in S3, configured such that instances with EC2 IP addresses can access the repositories and download packages onto the Amazon Linux AMI instances.  Once a package has been downloaded from the Amazon Linux AMI repository to an instance, any further actions taken with that package are up to the customer who launched the instance.
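As a sketch, pulling the latest packages from those S3-hosted repositories on a running Amazon Linux instance looks like this:

```shell
# Apply all pending package updates from the Amazon Linux AMI repositories
sudo yum update -y

# Or limit the run to security patches only
sudo yum update -y --security
```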

Amazon EMR termination using Data Pipeline

Data Pipeline provides a “terminateAfter” functionality for all activities, including EmrActivity. It is possible to set terminateAfter relative to the start time.   It is also possible to wrap your existing EMR Job Flow in a Data Pipeline EmrActivity and then set terminateAfter on the EmrCluster object.

Determining country of origin for directing web traffic at the edge

The MaxMind API ( can be used as either an Nginx module or as a web service. The API is 99.98% accurate but does not detect proxies.

AWS AMI hardening

    AWS  AMI hardening procedures and industry standards can be found in this AMI hardening article:

The client is responsible for the initial security posture of the machine images distributed. Private AMIs need to be configured in a secure way that does not violate the AWS Acceptable Use Policy. Software referenced should be up to date with relevant security patches and adhere to the following:
All AMIs
Disable services and protocols that authenticate users in clear text. (e.g. telnet and ftp)
Do not start unnecessary network services on launch. Only administrative services (SSH/RDP) and the services required for your application should be started.
Securely delete (use Sysinternals SDelete or Eraser) all AWS credentials from disk and configuration files.
Securely delete any third-party credentials from disk and configuration files.
Securely delete any additional certificates or key material from the system.
Ensure that software installed on your AMI does not have default internal accounts and passwords (e.g. database servers with a default admin username and password)
Ensure that the system does not violate the Amazon Web Services Acceptable Use Policy. Examples include open SMTP relays or proxy servers.
Windows specific
Ensure that all enabled user accounts have new randomly generated passwords on instance creation. The EC2 Config Service can be set to do this for the Administrator account on next boot, but you must explicitly enable this before bundling the image.
Ensure that the guest account is disabled.
Clear the Windows event log.
Do not join the instance to a Windows domain.
Do not enable any file share points that are accessible by unauthenticated users. It is recommended to completely disable file sharing.

AWS encrypting data at rest

Here is a good white paper on encrypting data at rest on AWS:

Amazon now offers Amazon EBS native encryption:
S3 has SSE encryption, client-side encryption, and SSE with keys managed by you:
All data in Glacier and Redshift is automatically encrypted.
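As a quick sketch of two of these options with the AWS CLI (the bucket name, file name, volume size, and Availability Zone are examples):

```shell
# Upload a file to S3 with server-side encryption (Amazon-managed keys)
aws s3 cp backup.tar.gz s3://my-bucket/backups/ --sse

# Create a natively encrypted EBS volume
aws ec2 create-volume --size 100 --availability-zone us-east-1a --encrypted
```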

Amazon SNS : Mobile push and SMS messaging

Here are some specifics on how Amazon SNS works.  AWS SNS allows you to use one notification system regardless of the device.

Here is code that uses the AWS SDK for Java to publish a message to a GCM endpoint:
Here is an example using the REST/Query API:

There is no limit on how many messages can be sent through a single topic.  The only SNS limit is a default limit of 3,000 topics (a default limit means it can be raised; it is not a hard limit).   Have one topic, or at most one topic for each device type, because one topic can support deliveries to multiple endpoint types. For example, you can group together iOS, Android, and SMS recipients: when you publish once to a topic, SNS delivers appropriately formatted copies of your message to each subscriber.
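The one-topic fan-out pattern looks roughly like this with the AWS CLI (the topic name, account ID in the ARN, and phone number are placeholders):

```shell
# Create a single topic for all recipients
aws sns create-topic --name order-updates

# Subscribe an SMS recipient (other endpoint types subscribe the same way)
aws sns subscribe --topic-arn arn:aws:sns:us-east-1:123456789012:order-updates \
    --protocol sms --notification-endpoint +15551230000

# Publish once; SNS formats and delivers a copy to every subscriber
aws sns publish --topic-arn arn:aws:sns:us-east-1:123456789012:order-updates \
    --message "Your order has shipped"
```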

Cloud Foundry on AWS

      It is possible to run Cloud Foundry on AWS; here is a good blog post: BOSH is a way to deploy Cloud Foundry to AWS, with code on GitHub:

Amazon Redshift - What is new

Here are some new things for Redshift:

Monday, June 23, 2014

EC2 Instance create date and time

Sometimes you may want to retrieve the creation date and time of an EC2 instance.
From the docs, an EC2 instance has the launchTime property, so you can easily build a boto script that queries all instances and reports each launchTime.
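As an alternative to a boto script, the same information can be pulled with one AWS CLI call (the JMESPath query below is just one way to shape the output):

```shell
# List every instance ID with its launch time
aws ec2 describe-instances \
    --query 'Reservations[].Instances[].[InstanceId,LaunchTime]' \
    --output text
```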

Route 53 weight average and record sets returned

When processing a DNS query, Amazon Route 53 searches for a resource record set that matches the specified name and type. If a group of resource record sets have the same name and type, Amazon Route 53 selects one from that group. The probability of any one resource record set being selected depends on its weight as a proportion of the total weight for all resource record sets in the group:
For example, suppose you create three weighted resource record sets with the same name and type. The three A records have weights of 1, 1, and 3 (sum = 5). On average, Amazon Route 53 selects each of the first two resource record sets one-fifth of the time, and returns the third resource record set three-fifths of the time.

AWS CLI multiple profiles

Details on multiple AWS CLI profiles can be found here:

Here is a summary of the highlights:
1. Location and name of the configuration file: the directory is ~/.aws and the file name is config.

2. Define a profile by typing this at the command prompt: aws configure --profile tomlaszeast (tomlaszeast is the name of the profile)

3. Selecting the profile to use is done by passing the --profile option on any command, for example:
aws s3 ls --profile tomlaszeast
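For illustration, after defining a profile the config file looks roughly like the following (the key values shown are placeholders):

```shell
cat ~/.aws/config
# [profile tomlaszeast]
# aws_access_key_id = AKIA...
# aws_secret_access_key = ...
# region = us-east-1

# Any command can then select that profile explicitly
aws s3 ls --profile tomlaszeast
```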

AWS EC2 user data

You can perform any bootstrapping action you would like using user data.  Here is an example that installs Apache, PHP, and MySQL, starts Apache, and then installs a sample application under Apache.

#!/bin/bash
# Install Apache, PHP, and MySQL; start Apache now and on every boot
yum -y install httpd php mysql php-mysql
chkconfig httpd on
/etc/init.d/httpd start
# Move the sample application (assumed to be already downloaded to /tmp)
# into the Apache document root
cd /tmp
mv examplefiles-as/* /var/www/html
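Assuming the script above is saved locally as bootstrap.sh, it can be passed as user data at launch time (the AMI ID, instance type, and key name below are placeholders):

```shell
# Launch an instance that runs bootstrap.sh on first boot
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type t1.micro \
    --key-name mykey --user-data file://bootstrap.sh
```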

Sunday, June 22, 2014

AWS EC2 instance user name

When logging into an EC2 instance using SSH, you may receive an error. You check to make sure you have the right instance name, IP address, PEM key etc. but it still fails. You may be using the incorrect user name. The EC2 instance user names by OS are listed here:

For example, Ubuntu servers use “ubuntu@”, Amazon Linux uses “ec2-user@”, and others such as Debian use “root@”.
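In practice only the user name changes between OSes; the key and host stay the same (the hostname below is a placeholder):

```shell
ssh -i mykey.pem ec2-user@ec2-54-0-0-1.compute-1.amazonaws.com  # Amazon Linux
ssh -i mykey.pem ubuntu@ec2-54-0-0-1.compute-1.amazonaws.com    # Ubuntu
ssh -i mykey.pem root@ec2-54-0-0-1.compute-1.amazonaws.com      # Debian
```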

ELB Health Check return code

You can have TCP and HTTP health checks. A TCP health check simply verifies that a TCP connection can be established on the specified port, while an HTTP health check requires the target page to return a 200 to pass. More can be found here:
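As a sketch, switching an ELB from a TCP check to an HTTP check with the AWS CLI (the load balancer name, port, path, and threshold values are all examples):

```shell
# Require HTTP 200 from /index.html for an instance to stay in service
aws elb configure-health-check --load-balancer-name my-elb \
    --health-check Target=HTTP:80/index.html,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2
```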

AWS multiple PEM files for one EC2 instance

You will probably want to have a different PEM file for each developer that will be accessing EC2 instances. This is good practice: when a developer leaves the company, or you want to remove their privileges to SSH into an EC2 instance, you can revoke just their key. Here’s the link to create multiple PEM keys for EC2 instances:
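A minimal sketch of the per-developer key flow (the key name "alice" is a placeholder; in real use the append happens on the instance, not locally):

```shell
# Generate a dedicated key pair for one developer
ssh-keygen -t rsa -b 2048 -f alice -N '' -q

# On the instance, append the public half; revoking access later is just
# deleting this one line from authorized_keys
mkdir -p ~/.ssh
cat alice.pub >> ~/.ssh/authorized_keys
```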

AWS ELB instances tied to ELB and ip addresses

The Amazon Elastic Load Balancer takes care of scaling out the number of underlying EC2 instances that make up the software virtual load balancer that is the Amazon ELB. However, you may want to determine how many instances are servicing your requests, for example before a large promotion brings in many more users.  Determining the number of underlying instances is easy: simply use the host or dig commands with the ELB DNS name. There will be a minimum of one EC2 instance for each AZ that the ELB is servicing.

1. [ec2-user@ip-10-0-0-50 ~]$host has address has address

2. [ec2-user@ip-10-0-0-50 ~]$ dig

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.28.amzn1 <<>>
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 813
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

; IN A


;; Query time: 1 msec
;; WHEN: Fri Jun  6 12:31:34 2014
;; MSG SIZE  rcvd: 113

Amazon RDS using private IP to connect to database - not the right approach

You should always connect to your Amazon RDS instance using the RDS endpoint shown in the AWS console. However, some IT folks choose to use the private IP address of the RDS instance.  It is easy to determine the private IP address of your RDS instance by using the host or dig commands as follows. (Keep in mind this is not recommended, but it shows how easily IT personnel who don't want to use the RDS endpoint can find the IP.)

[ec2-user@ip-10-0-0-50 ~]$ host is an alias for has address
[ec2-user@ip-10-0-0-50 ~]$ ping
PING ( 56(84) bytes of data.
--- ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9792ms

[ec2-user@ip-10-0-0-50 ~]$ dig

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.28.amzn1 <<>>
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25864
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

; IN A


;; Query time: 19 msec
;; WHEN: Fri Jun  6 12:28:44 2014
;; MSG SIZE  rcvd: 132

Network interoperability lab

I found out about a place where networking interoperability is the main focus.  It happens to be where I live in New Hampshire.

S3 read consistency

Here is a great blog post on Amazon S3 read consistency. S3 read consistency has a big impact on how  other AWS services (such as EMR) and applications use S3.

AWS Web site on S3 and Pen testing

You're not allowed to pen test AWS API endpoints, only your own EC2/VPC instances and config. More information here on this blog post:

If you are hosting a static site on S3,  you should read the risk and security white papers ( They discuss how AWS regularly scans S3 for vulnerabilities and performs regular penetration testing. The ISO 27001 certification also validates that. 

AWS Penetration Testing without having to fill out the pen testing form

Penetration testing is something that customers like to do when running on AWS.  You have to be pre-approved to run a pen test on AWS unless you use an Amazon Marketplace AMI from Tenable.

You can read about the Tenable solution here:

Here is the Amazon Marketplace AMI:

Here is the form if you were not using the Tenable solution:

OpenSwan on AWS

A common use case for a third-party VPN solution such as OpenSwan is to connect two regions' VPCs through the use of an IPSec VPN server.
First, set up a VPC in each region; here is what I did:
Region 1 (US-West-2) - VPC with private subnet
Region 2 (Australia)- VPC with private subnet


Configure the VPN server software for the EC2 instances - Region 1


Step 1
sudo yum install openswan

Step 2
sudo vi /etc/ipsec.conf

Step 3
sudo vi /etc/ipsec.d/vpc1-to-vpc2.conf

Step 4
conn vpc1-to-vpc2
 leftsubnet=<VPC1 CIDR>
 rightsubnet=<VPC2 CIDR>

Step 5
sudo vi /etc/ipsec.d/vpc1-to-vpc2.secrets

Step 6


Configure the VPN server software for the EC2 instances - Region 2

Step 7
sudo vi /etc/ipsec.d/vpc2-to-vpc1.conf

Step 8
conn vpc2-to-vpc1
 leftsubnet=<VPC2 CIDR>
 rightsubnet=<VPC1 CIDR>

Note that the CIDR needs to include the block range. For example:

Step 9
sudo vi /etc/ipsec.d/vpc2-to-vpc1.secrets

Step 10


Configuration in each region


Step 11
sudo service ipsec start

sudo chkconfig ipsec on

sudo vi /etc/sysctl.conf

net.ipv4.ip_forward = 1

sudo service network restart


Test your connections


Step 1 - Region 1

Step 2 - Region 2
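The conn stanzas in Steps 4 and 8 are abbreviated. Here is a fuller sketch of the Region 1 files; every IP address, subnet, and secret below is a placeholder to substitute with your VPN instances' Elastic IPs and VPC CIDR blocks (on the instance the files belong under /etc/ipsec.d/, written with sudo; this sketch writes to the current directory):

```shell
# Sketch of the full tunnel definition for Region 1 (placeholder values)
cat > vpc1-to-vpc2.conf <<'EOF'
conn vpc1-to-vpc2
 type=tunnel
 authby=secret
 left=%defaultroute
 leftid=54.0.0.1            # Region 1 VPN server Elastic IP
 leftsubnet=10.0.0.0/16     # VPC1 CIDR, including the block range
 right=54.0.0.2             # Region 2 VPN server Elastic IP
 rightsubnet=10.1.0.0/16    # VPC2 CIDR, including the block range
 auto=start
EOF

# Matching pre-shared key file for Steps 6 and 10
cat > vpc1-to-vpc2.secrets <<'EOF'
54.0.0.1 54.0.0.2: PSK "replace-with-a-strong-shared-secret"
EOF
```

The Region 2 files mirror these with left and right swapped.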

AWS EBS Performance

Here is a very good presentation on Amazon EBS performance from re:Invent 2013:


Thursday, May 22, 2014

Oracle Database Huge Pages

Oracle DBAs use huge pages to increase Oracle database performance:

The only requirement for huge pages (2 MB pages) is that you run an HVM instance. However, all the Oracle AMIs are PVM.

For now, to use huge pages with Oracle on EC2, the best option is to go with a SUSE HVM AMI and then install the Oracle Database on the EC2 instance.
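On an HVM instance, a quick sketch of verifying and reserving huge pages (the page count here is an example; size it to your SGA):

```shell
# Confirm huge pages are available; Hugepagesize should read 2048 kB
grep Huge /proc/meminfo

# Reserve 2048 x 2 MB = 4 GB of huge pages for the Oracle SGA
sudo sysctl -w vm.nr_hugepages=2048
```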

Friday, May 16, 2014

MySQL horizontal scaling

ScaleBase is a distributed database built on MySQL and optimized for the cloud. It is a relational database cluster that dynamically optimizes workloads and availability by logically distributing data across multiple instances. ScaleBase automates the data lifecycle, including analysis, data migration, and node rebalancing, and provides the scalability and availability benefits of NoSQL databases while remaining relational. ScaleBase is based in Newton, MA. The AWS Marketplace offering can be found here:

Friday, May 2, 2014

Redshift SQL query tools

      Here are two tools that can be used with Amazon Redshift to issue query commands:      

1. SQL Workbench:
The information required regarding your Redshift cluster is:
A. JDBC or ODBC connection string: an example JDBC connection string looks like this: jdbc:postgresql://
B. User ID: master
C. Password: password

2. Aginity:
A. Server (endpoint): an example looks like this:
B. User ID: master
C. Password: password
D. Database: databasename
E. Port: 5439

The nice thing about Aginity is that there is no JDBC driver to install. With SQL Workbench, you will need to download and install the driver yourself.

More on where to find the JDBC driver and configuring it with SQL Workbench:
2. Add the downloaded driver JAR as the PostgreSQL driver entry in SQL Workbench's driver manager.
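Since Redshift speaks the PostgreSQL wire protocol, plain psql is a third option that, like Aginity, needs no JDBC driver (the endpoint placeholder, user, database, and port match the connection details above):

```shell
# <cluster-endpoint> is the server endpoint shown in the Redshift console
psql -h <cluster-endpoint> -p 5439 -U master -d databasename
```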

Thursday, April 24, 2014

AWS Import/Export file limits

AWS Import/Export can handle volumes larger than 1 TB, but there is an intermediate step using S3.  If your storage device's capacity is less than or equal to the maximum Amazon EBS volume size of 1 TB, its contents are loaded directly into an Amazon EBS snapshot. AWS does not mount the file system on your storage device, nor is a file system required to be present; AWS Import/Export performs a block-for-block copy from your device to an Amazon EBS snapshot. If your storage device's capacity exceeds 1 TB, a device image is stored within your specified Amazon S3 log bucket. You can then create a RAID of EBS volumes using software such as Logical Volume Manager and copy the image from Amazon S3 to this new volume. So, in theory, there is no size limit.

IAM users and billing information

    By default, IAM users do not have access to the Account Activity or Usage Reports pages. However, as account owner you can grant IAM users permission to see either or both. You can then activate access to the billing pages, and those IAM users will have access to the billing pages according to the permissions you grant. (You can deny them access to some billing information.)
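As a sketch, once the account owner has activated IAM access to the billing pages, a policy like the following can grant a user read access to them. The policy name and the choice of actions are assumptions; trim the aws-portal actions to restrict what billing information the user can see.

```shell
# Write a minimal read-only billing policy to a local file
cat > view-billing-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["aws-portal:ViewBilling", "aws-portal:ViewUsage"],
      "Resource": "*"
    }
  ]
}
EOF
```

It could then be attached with, for example, `aws iam put-user-policy --user-name tom --policy-name view-billing --policy-document file://view-billing-policy.json`.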