Building a data warehouse

The databases of today are in many cases built for specific purposes.  Some of the more common ones we see every day are relational databases, document-oriented databases, Operational databases, Triplestore, and Column-oriented databases / c-store. Typically relational, document-oriented, operational, and triplestore databases are used to solve frontend database problems.  Then you have databases that are column-oriented or similar that focus on solving warehousing and backend database problems.  These products don’t need to solve those problems, though they are often best suited for them.

Warehouse data is used to power functions like analytics, business intelligence, or use to power recommendations.  With those goals in mind, it is important to ensure the data sent to and captured by the warehouse database is in its transactional form.  This means you don’t want the resulting value of a row, instead you want to track how a particular event is modified. Is the value of an object increase, decreasing, or changing value? As you can imagine the approach taken differs greatly from a frontend database where all you care about is the data you will be showing to the user in a simple query.

Say you have a table for points that you use for a leaderboard in your app.  Your table would look something like this:

Column Type
uid bigInt
points int
lastUpdate dateTime

Performing a query to retrieve and sort the leaderboard would end up looking like this:

uid points lastUpdate
5 6000 2014/06/16 11:11:11
1 4700 2014/06/16 06:46:40
100 4200 2014/06/16 12:14:05
45 3800 2014/06/15 07:46:34
11 2500 2014/06/16 10:10:08

When you look at it from a warehousing perspective, it varies significantly from the above. As I mentioned, you will be focusing a lot more on the transactions, less about the results. The goal is to keep your transactional data as abstract as possible to allow flexibility in the future. As a result your schema will end up looking something like this:

Column Type
uid bigInt
increase int
decrease int
sum int
time dateTime

Let me explain how I got to that and how you will populate the database. Whenever you will increase the points for a user, you will submit an insert query with the incremental value and the sum. If you decrease the user’s points, you will send that value in the decrease column and the sum will have that same value in negative. If you simple set the value you will only update the sum column. With that in mind, if we were to select all rows for uid 5, the rows would look like this:

uid increase decrease sum time
5 1000 0 1000 2014/06/13 20:20:11
5 500 0 500 2014/06/13 23:10:10
5 0 500 -500 2014/06/14 02:10:50
5 1500 0 1500 2014/06/14 04:11:04
5 2000 0 2000 2014/06/14 14:40:30
5 2500 0 2500 2014/06/15 20:20:11
5 0 700 -700 2014/06/16 05:20:14
5 0 500 -500 2014/06/16 07:41:24
5 200 0 200 2014/06/16 11:11:11

With this transactional data you can gather various types of results. For one you would easily be able to get final values by adding up the sum column.  Additionally, you are able to determine results including how much users earn or lose per day / per week / per month. The possibilities are endless as you have the raw transactional data which you can use in different ways as your need for intelligence shifts.

Of course building a warehouse database requires you have the right tools.  In my experience, c-store databases have been the best at accomplishing those tasks.  Many of these have community editions though enterprise editions could make quite the dent on your wallet. I’ve deployed warehouse databases on Greenplum, Teradata and Vertica. All are similar, though they each have their own quirks and benefits, I won’t get into them here.  The great thing about these databases is that the indexes are different than what many are used to in a typical relational database, which has indexes based on row’s values.  These databases store data tables as sections of columns. The benefits are in their ability to aggregate large amounts of data.  Some of the biggest users of c-store databases have amassed petabytes of data and are using that data to power their intelligence.

In my next post, I will cover some of the most efficient ways of having a frontend and backend/warehouse database running.  There are use cases for each database type. With the age of context and big data growing in popularity, it is useful to understand how warehousing could improve your product through ways of intelligence and recommendations.