Two years ago, I wrote a post questioning the validity of electric cars. A lot has changed since then. Electric cars are starting to get exciting! Tesla has done a lot to raise awareness for electric cars. They’ve made them cool.
Several months ago, I had an employee review. I was relatively new at Avanti at the time, so there wasn’t a whole lot they could say in my review. I did, however, get one very valuable piece of advice, advice that I’ve taken to heart: I need to improve my communication.
I’ll come right out and say it: my manager was absolutely right. I’ve known all along that communication was one of my weaker qualities, and the fact that someone finally told me I should improve was welcome. For those of you who do not know me, I thrive on criticism. Constructive criticism, of course.
It should come as no surprise that NoSQL has become popular over the past few years. This popularity has been driven in large part by the app revolution. Many new apps are hitting millions of users in less than a week, some in a day. This presents a scaling problem for app developers who seek a large audience.
Scaling a typical RDBMS like MySQL or MS SQL from 0 to 1 million users has never been easy. You have to set up master and slave servers, shard and balance the data, and ensure you have resources in place for any unexpected events. NoSQL is being touted as the solution to those problems. It shouldn’t be.
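To make the master/slave part concrete, here is a minimal sketch of the read/write routing an app developer ends up writing (or pulling in a library for) once replicas are involved. The server names and the `RoutingPool` class are hypothetical, purely for illustration:

```python
class RoutingPool:
    """Send writes to the master; spread reads across replica servers."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas
        self._next = 0  # round-robin cursor for read queries

    def connection_for(self, query):
        # Writes must hit the master so replication stays consistent.
        if query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return self.master
        # Reads rotate across the replica pool to spread the load.
        conn = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return conn


pool = RoutingPool("master-db", ["replica-1", "replica-2"])
print(pool.connection_for("SELECT * FROM users"))       # a replica
print(pool.connection_for("INSERT INTO users VALUES")) # the master
```

And this is only the easy half; once the data no longer fits one master, you are into sharding and rebalancing on top of it.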
NoSQL’s use cases have mistakenly focused on scalability because of the complexity of standing up an RDBMS and scaling it. Application developers aren’t necessarily interested in becoming server-side wizards; they prefer to focus on building out their apps, not scaling the servers powering those apps. These developers are looking for something low cost that can keep up with their needs as they grow, and they have flocked to NoSQL for these very reasons.
However, when developers look to grow their apps and introduce more complex functionality, they sometimes hit roadblocks due to the way data is stored in NoSQL. One solution to those roadblocks is MapReduce, which brings index-like functionality to NoSQL.
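As a rough sketch of what “index-like functionality” means here: a map step emits a key per document, and a reduce step aggregates everything grouped under the same key, which is how document stores like CouchDB build view-style indexes. The documents and field names below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical user documents as they might sit in a document store.
docs = [
    {"user": "alice", "country": "CA", "spend": 40},
    {"user": "bob",   "country": "US", "spend": 25},
    {"user": "carol", "country": "CA", "spend": 10},
]

def map_phase(doc):
    # Emit (key, value) pairs -- effectively indexing documents by country.
    yield doc["country"], doc["spend"]

def reduce_phase(key, values):
    # Aggregate all values emitted under the same key.
    return sum(values)

# Group the emitted pairs by key, then reduce each group.
grouped = defaultdict(list)
for doc in docs:
    for key, value in map_phase(doc):
        grouped[key].append(value)

totals = {key: reduce_phase(key, values) for key, values in grouped.items()}
print(totals)  # {'CA': 50, 'US': 25}
```

The catch, of course, is that every new query shape needs a new map/reduce pair, whereas an RDBMS gives you this with one `GROUP BY`.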
My goal isn’t to dispute the importance of NoSQL, but to promote the reality that not all database systems are alike; many serve specific purposes. NoSQL’s strength is terrific key-value access with great performance. To some, the lack of a schema is a benefit that lets the application control how the data is stored, limiting the need to interface with and configure the database.
Over the years I’ve looked to build products that need RDBMS-like storage that scales. NoSQL just couldn’t do it for me. Many would agree, but don’t know of anything better. Lucky for me, I found VoltDB. To this day, salespeople continue to contact me to pitch their NoSQL solutions. I ask: how can NoSQL solve my problems? Are you ACID compliant? How can I merge data from multiple tables? How can I use my data to build out analytics? Most of the time, the sales teams can only sell me on one problem: scaling. They often forget that I would have to sacrifice functionality to solve my scaling problems. One should never have to compromise.
If NoSQL is known for scale, how well does VoltDB do it? A picture is worth a thousand, or in this case, a million words.
The chart above shows VoltDB achieving 1 million transactions per second (in a year-old benchmark). Place this benchmark next to the top NoSQL solutions and you will find yourself with an equal or better performing solution. Best of all, VoltDB does it without sacrificing the common features we need from relational database systems.
The switch from a traditional database like MySQL or MS SQL to VoltDB is simple and can often be measured in hours or days. A switch from a traditional RDBMS to NoSQL, on the other hand, is likely to be measured in weeks; days if you are lucky.
VoltDB is a NewSQL solution. NewSQL is a term coined to avoid the poor-scaling stigma attached to traditional RDBMS and SQL solutions.
NewSQL solves today’s data problems without creating new complexities. Ever heard of trying to fit a square peg in a round hole? NoSQL is that square peg, doing its best to get through that round hole by solving scaling problems with a different approach, and creating new complexities along the way: complexities that arise when pulling complex data sets for analytics, adding BI support, updating schemas, or normalizing data.
Many of today’s biggest companies, including Twitter, use NoSQL systems. Had NewSQL been around when they had issues scaling, would the problem still have been solved with NoSQL? Chances are NewSQL would also be their solution. NewSQL builds on the decades of research and innovation behind relational databases, which have matured and have been solving many of the world’s most complex data problems.
In case this argument for NewSQL doesn’t quite bring it home for you, I will be writing another article supported by detailed use cases. Until then, please let me know what you think here or on Twitter @francispelland.
If you are wondering why I haven’t gotten around to posting this week, this is why. Throughout the week, I’ve been preoccupied with a number of things, including my job hunt and working on a few things for my car.
Lately I’ve been having somewhat of an obsession with LEDs, for a number of reasons. To name a few: they provide a sharper light that doesn’t look as dull as incandescent, they use less energy, and you can get away with custom shapes. While my car and the LED obsession isn’t quite over, with the last piece being worked on later this week, I thought I would post an update. That, and for some reason I cannot sleep tonight.
It should come as no surprise that ETL is a process that should go out the door. If you read my two prior posts, you will see how newer databases that employ export functionality provide far better ways to capture data and send it to data warehouses.
The difference in data quality is stark. With an ETL process for capturing data and loading it into databases, you have to work through several sources, some of which may never have the data you need. Sometimes it feels like you have to be a magician to make data appear. Compare that to the export process, which sends all the data you choose, in as raw a form as possible, for the data gurus to play around with and mold into terrific stories.
The ETL process is common practice in nearly all companies that have data they want to analyse or repurpose. Often, products cannot be maintained due to resource constraints, the developers working on the product aren’t aware of the goals of those working with warehouse data, or the cost to update the product is too high to replace ETL with a far better process like the one I described in my last post.
I’m not in any way going to say that ETL is bad, because unfortunately for many, that process is required and there is no other way to get the data. Some of my greatest data challenges came out of ETL processes early on, before I developed a full platform capable of capturing and sending data to the warehouse in a straightforward, easy-to-use format.
Data warehouses work best when they store data by transaction. Storing individual transactions in a data warehouse, as seen in my last post, allows the data to be used in many different ways and, in many cases, makes it future-proof. One of the greater challenges I’ve come across, and I’m sure many others have too, is finding the best and most efficient way of storing transactions.
Whether you are looking to build an end-to-end analytics solution with complete ad hoc capability or an engine that can make contextual recommendations based on activity, the data stored in the warehouse will be very similar. Then comes the question of how best to store that data without slowing requests or adding strain to servers. This can be accomplished through various means, including asynchronously pushing data to your warehouse, appending to a CSV file for import into the warehouse database, or, if your database supports it, exporting data when it is most efficient.
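The CSV-staging option above is about as simple as it sounds: write each transaction as one raw row, then bulk-load the file into the warehouse on a schedule. A minimal sketch, with made-up field names and records purely for illustration:

```python
import csv
import io

# Hypothetical transaction records captured by the application.
transactions = [
    {"ts": "2013-05-01T12:00:00", "user_id": 7, "event": "purchase", "amount": 19.99},
    {"ts": "2013-05-01T12:00:05", "user_id": 9, "event": "refund",   "amount": -5.0},
]

FIELDS = ["ts", "user_id", "event", "amount"]

def append_transactions(fileobj, rows):
    """Append raw transaction rows to a CSV staged for bulk warehouse import."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    for row in rows:
        writer.writerow(row)

# In practice fileobj would be an open file the loader picks up later;
# a StringIO stands in for it here.
buf = io.StringIO()
append_transactions(buf, transactions)
print(buf.getvalue())
```

Because each row is the raw transaction rather than a pre-aggregated summary, the warehouse side stays free to slice the data however future questions demand.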
Many perform this process through a more complicated method known as Extract, Transform, and Load, or ETL for short. Many have their reasons to prefer ETL over gathering data from transactions as they happen. In my experience, though, ETL leads to a lot less flexibility, as the data being extracted for warehouse purposes may be too limited.
The databases of today are in many cases built for specific purposes. Some of the more common ones we see every day are relational databases, document-oriented databases, operational databases, triplestores, and column-oriented databases (c-stores). Typically, relational, document-oriented, operational, and triplestore databases are used to solve frontend database problems, while column-oriented databases and their kin focus on warehousing and backend database problems. These products aren’t limited to those problems, though that is where they are best suited.
There are many reasons databases must be scaled. The majority of the time, scaling is done to overcome performance issues as the product grows. Though NoSQL is making a lot of noise these days, it is to no one’s surprise that SQL is still extremely popular, and in general the same principles apply when scaling out any SQL product, be it MySQL, MS SQL, Oracle, or even DB2. When dealing with big data, however, scaling is often done to balance the data across multiple hardware nodes or clusters.
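Balancing data across nodes usually comes down to some form of sharding: hash a record’s key and let the hash pick the node, so the same key always lands on the same shard. A minimal sketch, with hypothetical shard names:

```python
import hashlib

# Hypothetical cluster of four shard nodes.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key):
    """Pick a shard by hashing the key, so routing is deterministic."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))
# The same key always routes to the same node:
print(shard_for("user:42") == shard_for("user:42"))  # True
```

Simple modulo hashing like this is the easy part; the pain the paragraph above alludes to shows up when you add or remove a node and most keys suddenly map to a different shard, which is why schemes like consistent hashing exist.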
Over the next few weeks I will be starting up a new series which I am hoping will have at least a dozen posts. I am looking to cover a lot of the high level concepts around big data. Scaling, data format, software, and using the data will be amongst the topics I will cover.
It is no surprise that I am looking for a job. Over the past few months I’ve had many interviews and have noticed that the companies all had similar problems: BIG DATA. Most of these companies had issues scaling their databases for either performance or storage reasons. Many others were simply unable to pull and use their data efficiently because even a fairly simple query would take seconds to return results.
I’ll be looking to keep the posts bite-sized, like this one. I look forward to getting feedback or suggestions on topics related to Big Data that I should cover.
Stay tuned, the first post coming shortly will be on scaling.