The challenges of scaling your data horizontally

There are many reasons why databases must be scaled. Most of the time, scaling is done to address performance issues as the product grows. Though NoSQL is making a lot of noise these days, it is no surprise that SQL is still extremely popular, and in general the same principles apply when scaling out any SQL product, be it MySQL, MSSQL, Oracle, or even DB2. When dealing with big data, scaling is also done to balance the data across multiple hardware nodes or clusters.

Most SQL products are scaled out in clusters called shards. Each shard contains one or two masters and several slaves: the master servers handle writes, while the slaves handle reads. NoSQL has become popular over the years because it doesn't require the complex infrastructure we see with SQL. To reduce the number of calls made to the databases, a caching layer is added; caches are usually easy to set up and cheap to run. As the product grows, the infrastructure can end up looking like the picture below.

Source: Infoq.com

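To make the caching layer in that diagram concrete, here is a minimal read-through sketch in Python. It assumes a Redis-style cache client (with `get`/`set` and an expiry) and a thin database wrapper; `fetch_one` and the `user:` key format are illustrative, not any specific library's API.

```python
import json

CACHE_TTL = 300  # seconds; tune to how much staleness reads can tolerate


def get_user(uid, cache, db):
    """Read-through cache: check the cache first, only hit the database on a miss."""
    key = "user:%s" % uid
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no database call at all

    # Cache miss: fall back to the database (fetch_one is a hypothetical wrapper).
    row = db.fetch_one("SELECT * FROM users WHERE id = %s", (uid,))
    if row is not None:
        # Store the row with an expiry so updates eventually become visible.
        cache.set(key, json.dumps(row), ex=CACHE_TTL)
    return row
```

Every hit served from the cache is one less query landing on the slaves, which is usually the cheapest capacity you can buy.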

I learned the lesson behind that diagram the hard way. Not expecting the product to be popular immediately after launch, we delayed scaling. To our surprise, we hit 60k users by the second day. Once we caught up, the game was scaled out and hit over 250k users a few days later. Foresight is a great thing to have, but it is often best simply to be ready to scale. You must determine how you expect to grow: is content going to be created constantly so that the database keeps growing, or will it remain stable over time?

A product whose design doesn't add much data to the databases will often benefit from a system that replicates the databases and makes use of caching. Write-heavy applications instead take the approach of growing their infrastructure horizontally, splitting content up between shards and adding more shards over time.
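For the read-heavy, replication-plus-caching case, a rough sketch of the read/write split looks something like this. It assumes DB-API style connections; the `ReplicatedPool` name and wiring are purely illustrative.

```python
import random


class ReplicatedPool:
    """Hypothetical helper that sends writes to the master and
    spreads reads across the slaves (replicas)."""

    def __init__(self, master, replicas):
        self.master = master      # single writable connection
        self.replicas = replicas  # list of read-only connections

    def write(self, sql, params=()):
        cur = self.master.cursor()
        cur.execute(sql, params)
        self.master.commit()

    def read(self, sql, params=()):
        # Any replica will do; replication lag means reads can be slightly stale.
        conn = random.choice(self.replicas) if self.replicas else self.master
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()
```

Round-robin or least-connections replica selection works just as well; the point is simply that reads never have to land on the master.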

When dealing with partitioning, you will need to determine the key on which the data will be partitioned. With that key, an algorithm can be written to decide which shard the data goes to on a read or write. For example, with user registrations and three shards you could go round-robin: the 1st user is saved on the 1st shard, the 2nd user on the 2nd shard, and so on. I generally recommend routing values to shards based on a hash of the key (such as the uid) instead. Keep in mind that you will be growing and will likely increase the number of shards, and you will need to rebalance the data when that happens.
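Here is a minimal sketch of that hash-based routing, assuming a numeric uid and the three shards from the example; `shard_for` is a hypothetical helper, and in practice the application would keep a connection pool per shard.

```python
import hashlib

NUM_SHARDS = 3  # matches the three-shard example above; this will grow over time


def shard_for(uid, num_shards=NUM_SHARDS):
    """Map a key (here the uid) to a shard using a stable hash.

    A stable hash (rather than Python's built-in hash(), which varies
    between processes) keeps every application server routing the same
    uid to the same shard.
    """
    digest = hashlib.md5(str(uid).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# The same uid always resolves to the same shard...
assert shard_for(42) == shard_for(42)
# ...but changing num_shards remaps most keys, which is why adding shards
# later means rebalancing data (or moving to a consistent-hashing scheme).
```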

There are a great number of alternatives when trying to scale out your databases. Some are complex and take a lot of time to plan, while others are as simple as setting up replication. I've been researching alternatives for scaling out SQL for years. While NoSQL has come up as a popular solution, it did not meet my goals. There is a reason relational databases are still popular today: joining tables, subqueries, stats, and various other functions are often required. NoSQL attempts to accomplish several of those features through map-reduce, but it isn't the same. This is why I have been working more closely with NewSQL solutions like VoltDB, which offer many SQL / relational database features while being built from the ground up to scale.

Keep an eye out. My next post will be on scaling large amounts of data and using warehouse databases like GreenPlum, Vertica, and Teradata.