Wednesday, July 3, 2013

Redshift - New node and data distribution


What happens when a new node is added to a Redshift cluster?
A 2-node cluster distributes data evenly between its two nodes based on a hash of the DISTKEY. If a 3rd node is added, the data needs to be rebalanced across all 3 nodes. It’s not enough to send all new data to the 3rd node: the hash would no longer say where a given row lives, so every read would require a lookup to figure out where the data is stored. Rather, the cluster needs to be rebalanced by redistributing the data across the 3 nodes. Redshift takes care of this automatically. Just add the nodes and the data moves.
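To see why adding a node forces a rebalance, here is a minimal sketch of hash-based placement. Redshift's real hash function and internal layout are not described in this post, so the MD5-modulo scheme, the `node_for` helper and the `customer-N` keys are all assumptions made for illustration.

```python
import hashlib


def node_for(key: str, node_count: int) -> int:
    """Map a distribution key to a node index using a stable hash.

    Illustrative only: this is NOT Redshift's actual hash function.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def placement(keys, node_count):
    """Return a dict of key -> node index for a cluster of the given size."""
    return {key: node_for(key, node_count) for key in keys}


# Hypothetical distribution keys.
keys = [f"customer-{i}" for i in range(10_000)]

before = placement(keys, 2)  # 2-node cluster
after = placement(keys, 3)   # same keys on a 3-node cluster

# Count the keys whose home node changes when the node count changes.
moved = sum(1 for key in keys if before[key] != after[key])
print(f"{moved / len(keys):.0%} of keys change node")
```

With simple modulo placement, a key stays put only when its hash gives the same remainder for 2 and for 3, which is the case for about a third of hash values. Most of the data therefore has to move, not just the share destined for the new node.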
Redshift redistributes the data as follows:
  • A ‘new’ set of nodes is created (in the above example, 3 nodes would be created)
  • Redshift moves the data from the 2-node cluster to the 3-node cluster, rebalancing the data during the copy
  • Users are then flicked across from the ‘old’ 2-node cluster to the ‘new’ 3-node cluster
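The three steps above can be mimicked with a toy in-memory model. This is a sketch under stated assumptions, not how Redshift is implemented: the `build_cluster` and `resize` functions, the `endpoint` dict standing in for the address users connect to, and the MD5-modulo hash are all hypothetical.

```python
import hashlib


def node_for(key: str, node_count: int) -> int:
    """Stable hash placement (illustrative, not Redshift's real hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def build_cluster(rows, node_count):
    """Load (key, value) rows into a dict of node index -> list of rows."""
    nodes = {i: [] for i in range(node_count)}
    for key, value in rows:
        nodes[node_for(key, node_count)].append((key, value))
    return nodes


def resize(old_cluster, new_node_count):
    """Create a new cluster and copy every row across, rehashing as it goes."""
    # Step 1: provision a completely new set of empty nodes.
    new_cluster = {i: [] for i in range(new_node_count)}
    # Step 2: copy the data, choosing each row's node by the new node count.
    for rows in old_cluster.values():
        for key, value in rows:
            new_cluster[node_for(key, new_node_count)].append((key, value))
    return new_cluster


# Hypothetical rows, distributed on the first field.
rows = [(f"order-{i}", i) for i in range(9_000)]

# 'endpoint' stands in for whatever address users connect to.
endpoint = {"cluster": build_cluster(rows, 2)}
old_cluster = endpoint["cluster"]

new_cluster = resize(old_cluster, 3)

# Step 3: switch users over to the new cluster; the old one can then be dropped.
endpoint["cluster"] = new_cluster

for node, node_rows in sorted(endpoint["cluster"].items()):
    print(node, len(node_rows))
```

Note that the old cluster is never modified in this model. It stays intact and readable until the switch in step 3, which is the practical appeal of copying to a new system rather than rebalancing in place.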
This is an example of scalable cloud infrastructure: rather than having to ‘upgrade’ an existing system in place, it can be much simpler to provision a new system, copy the data across and then decommission the old system. This is a new way of looking at infrastructure that is quite different to the old way of thinking in terms of physical boxes.
