Data warehouses work best by storing data by transaction. Storing individual transactions to a data warehouse, as seen in my last post, allows the data to be used in many different ways and in many cases allowing it to be future proof. One of the greater challenges I’ve come across and sure many have come across is finding the best and most efficient way of storing transactions.
Whether you are looking to build an end to end analytics solution with complete adhoc capability or an engine that can make contextual recommendations based on activity, the data stored in the warehouse will be very similar. Then comes the question of finding a way to best store that data without slowing requests or adding strain to servers. This is accomplished through various means including asynchronously pushing data to your warehouse, appending a csv file to import into the warehouse database, or if your database supports it to export data when it is most efficient.
Many will perform this process through more complicated methods known as Extract, Transform, and Load or ETL for short. Many have their reasons to prefer ETL over gathering data from transactions as they happen. In my experience, it leads to a lot less flexibility as the data that is being extracted for warehouse purposes may be too limited.
Of the 3, I’ve found synchronous / asynchronous pushing to be the slowest and quite possible the heaviest on resource. The reason for this is due to the fact that warehouse databases are often not made for this purpose. You have to create a connection, send your query and then commit the changes… for each and every write query. In my tests, this was often adding between 50-100ms to each request and added a significant amount of memory and CPU overhead to my processing servers. Unfortunately this method ran into some significant snags during load testing as these databases are often limited by their number of connections. This means that if you have an insanely popular app, you not only need to scale your frontend processing and database servers, but also your warehouse databases.
Scaling is what lead me to the other options for importing data to my warehouses. CSV import is a feature that each warehouse database supports. They can run thousands of rows in seconds and often without breaking much of a sweat. Regardless of load, the amount of connections were normalized since all the warehouse database did was load the csv file at the same specified interval. Where I found an issue here was keeping track of csv that were being imported, imports that failed, and the ensuring I wasn’t overwriting or duplicating data. Problems happen, minimizing issues were the goal. I could have solved all the issues with more time and patience.
The export solution is a unique one that my database, VoltDB, offered to enterprise customers. The enterprise edition is terrific, though many may prefer to stick with the free community edition. The way the export functionality worked is that I would write my stored procedure for writing / updating records on the frontend database. I would extrapolate the transaction into something that would make sense for the warehouse database and send the data to an export table. VoltDB was response for ensuring the data is sent to the warehouse database without errors. If the warehouse servers were to go down for any reason or there were errors, it would keep retrying until it worked. When it was unable to export the data, it would keep track of everything that needs to go out so you could rest easy no data was ever lost. This is a feature I did not know was there, but glad it was when I noticed my warehouse cluster went down due to a network error that lasted 16 hours.
In many cases, people would use the CSV import as it is the most commonly known method of sending / importing data into warehouse databases. If I had to, I would have solved my issues with CSV. However, with a better alternative, I took that. Many databases offer export capabilities. In fact you could create a database table that’s sole responsibility is to be used for export and run commands off of that. The idea is that you shouldn’t be sending data as it happens from your processing servers. Instead you should be running background operations to take care of your data and send it to the warehouse. If you want real time reports, this can still be easily achieved as long as you make the interval for importing at a second or less.
In my next post, I will cover various ETL processes for the sake of providing contrast and comparison. Many may not have the ability to create export tables or even export data as it happens due to limitations in their database. ETL is extremely popular and is commonly used today as a means to populating warehouse databases.