Finances: Marvin’s first skills

With Marvin’s core architecture designed, I started developing the framework.  The framework itself is intended to handle multiple use cases, from using personal information to provide contextual experiences, to controlling devices.  Marvin’s key differentiator is that it is powered by data, rather than by simple action-based triggers.  This is why the core of the framework contains APIs that handle data and extends into analytics.  Behind the scenes, an ETL process feeds various services, including machine learning.

I’ve been feeding the last 7 years of financial data, just over 9,000 transactions, into Marvin’s core databases. The transactions look something like this:

| Date | Description | Original Description | Amount | Transaction Type | Category | Account Name |
|------|-------------|----------------------|--------|------------------|----------|--------------|
| 6/22/2017 | Ooma | OOMA, INC 08887116662 CA | XX.XX | debit | Home Phone | Smart Cash Platinum Plus MasterCard |
| 6/22/2017 | Costco | COSTCO WHOLESALE W159 AJAX ON | XX.XX | debit | Home Supplies | Smart Cash Platinum Plus MasterCard |
| 6/21/2017 | Transfer to Chequing | TRANSFER OUT | XX.XX | debit | Transfer | General Savings |
| 6/21/2017 | Transfer from General Savings | TRANSFER IN | XX.XX | credit | Transfer | Chequing |
| 6/20/2017 | Costco | COSTCO WHOLESALE W1128 OSHAWA ON | XX.XX | debit | Groceries | Smart Cash Platinum Plus MasterCard |
| 6/20/2017 | Costco | WWW COSTCO CA 905-264-8337 ON | XX.XX | debit | Sporting Goods | Smart Cash Platinum Plus MasterCard |
| 6/20/2017 | Taunton Endo | TAUNTON ENDO OSHAWA ON | XX.XX | debit | Doctor | Smart Cash Platinum Plus MasterCard |

These transactions provide a very good base for generating some initial learning models. Including the transaction type (credit or debit), dates, retailers, amounts, and purchase categories makes it possible to identify various types of skills.  These skills help me understand my finances better while also helping me improve them.
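As a minimal sketch of what a first learning model might look like, the labelled categories can seed a simple retailer-to-category model. The field names and sample rows below are illustrative, not Marvin’s actual schema:

```python
from collections import Counter, defaultdict

# Sample transactions shaped like the table above (amounts omitted;
# the real data set has just over 9,000 rows).
transactions = [
    {"description": "Costco", "type": "debit", "category": "Groceries"},
    {"description": "Costco", "type": "debit", "category": "Groceries"},
    {"description": "Costco", "type": "debit", "category": "Sporting Goods"},
    {"description": "Ooma",   "type": "debit", "category": "Home Phone"},
]

def train_categorizer(rows):
    """Learn the most frequent category seen for each retailer description."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["description"]][row["category"]] += 1
    return {desc: cats.most_common(1)[0][0] for desc, cats in counts.items()}

model = train_categorizer(transactions)
print(model["Costco"])  # → Groceries, the most common label for Costco
```

A frequency table like this is only a baseline; with dates and amounts added as features, the same labels can train a proper classifier.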

Continue reading…

The rise of NoSQL is an opportunity for new RDBMS solutions

It should come as no surprise that NoSQL has become popular over the past few years. This popularity has been driven in large part by the app revolution. Many new apps are hitting millions of users in less than a week, some in a day. This presents a scaling problem for app developers who seek a large audience.

Scaling a typical RDBMS like MySQL or MsSQL from 0 to 1 million users has never been easy.  You have to set up master and slave servers, shard and balance the data, and ensure you have resources in place for any unexpected events. NoSQL is being touted as the solution to those problems. It shouldn’t be.
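To make the sharding step concrete, here is a minimal sketch of hash-based shard routing, the kind of plumbing an app developer ends up writing when scaling an RDBMS by hand. The shard names are hypothetical:

```python
import hashlib

# Hypothetical connection names for three database shards.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]

def shard_for(user_id: str) -> str:
    """Route a user's rows to a shard using a stable hash of the key."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always lands on the same shard:
assert shard_for("user-42") == shard_for("user-42")
print(shard_for("user-42"))
```

Even this simple scheme creates work: rebalancing when a shard is added, and cross-shard queries, are exactly the "unexpected events" you have to plan for.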

NoSQL’s use cases have been mistakenly focused on scalability because of the complexities of putting up an RDBMS and scaling it. Application developers aren’t necessarily interested in becoming server-side wizards.  They prefer to focus on building out their apps, not scaling the servers powering those apps.  These developers are looking for something low-cost that can keep up with their needs as they grow. Developers have flocked to NoSQL for these very reasons.

However, when developers look to grow their apps and introduce more complex functionality, they sometimes hit roadblocks due to the way data is stored in NoSQL.  A solution to one of those roadblocks is MapReduce, which brings index-like functionality to NoSQL.

My goal isn’t to dispute the importance of NoSQL, but to promote the reality that not all database systems are alike; many serve specific purposes. NoSQL’s benefit is terrific key-value access with great performance.  To some, the lack of a schema is itself a benefit, letting the application control how the data is stored and limiting the need to interface with and configure the database.

Over the years I’ve been looking to build products that need RDBMS-like storage that scales. NoSQL just couldn’t do it for me.  Many would agree, but don’t know of anything better. Lucky for me, I found VoltDB. To this day, salespeople continue to contact me to pitch their NoSQL solutions. I ask: How can NoSQL solve my problems? Are you ACID compliant? How can I merge data from multiple tables? How can I use my data to build out analytics? Most of the time, the sales teams can only sell me on one problem: scaling. They often forget that I have to sacrifice functionality to solve my scaling problems. One should never have to compromise.

If NoSQL is known for scale, how well does VoltDB do it? A picture is worth a thousand, or in this case, a million words.

VoltDB benchmarks from May 2013.

The chart above shows VoltDB achieving 1 million transactions per second (in a benchmark that is now a year old). Place this benchmark next to the top NoSQL solutions and you will find yourself with an equal or better performing solution.  Best of all, VoltDB does it without sacrificing the common features we need from relational database systems.

The switch from a traditional database like MySQL or MsSQL to VoltDB is simple and can often be measured in hours or days.  A switch from a traditional RDBMS to NoSQL on the other hand is likely to be measured in weeks, days if you are lucky.

VoltDB is a NewSQL solution. NewSQL is a term coined to avoid the poor-scaling stigma attached to RDBMS, or typical SQL, solutions.

NewSQL solves today’s data problems without creating new complexities.  Ever heard of trying to fit a square peg in a round hole? NoSQL is that square peg, doing its best to get through the round hole by solving scaling problems with a different approach, and creating new complexities in the process: complexities that arise when pulling complex data sets for analytics, adding BI support, updating schemas, or normalizing data.

Many of today’s biggest companies, including Twitter, use NoSQL systems. Had NewSQL been around when they had issues scaling, would they still have solved the problem with NoSQL? Chances are NewSQL would have been their solution instead. NewSQL builds on decades of research and innovation in relational databases, which have matured while solving many of the world’s most complex data problems.

In case this argument for NewSQL doesn’t quite bring it home for you, I will be writing another article supported by detailed use cases. Until then, please let me know what you think here or on Twitter @francispelland.

If your database has export capabilities, use it. Now!

It should come as no surprise that ETL is a process that should go out the door. If you read my two prior posts, you will see how newer databases that provide export functionality offer far better ways to capture data and send it to data warehouses.

The difference in data quality is stark.  With an ETL process for capturing data and loading it into databases, you have to work through several sources, some of which may never have the data you need.  Sometimes it feels like you are a magician making data appear.  Compare that to the export process, which sends all the data you choose, in as raw a form as possible, for the data gurus to play with and mold into terrific stories.

Continue reading…

Extracting, Transforming, Loading data into a warehouse database

The ETL process is common practice in nearly all companies that have data they want to analyse or repurpose. Often, products cannot be maintained due to resource constraints, the developers working on the product aren’t aware of the goals of those working with warehouse data, or the cost to update the product is too high to replace ETL with the far better process I described in my last post.

I’m not in any way going to say that ETL is bad, because unfortunately for many, the process is required and there is no other way to get the data.  Some of my greatest data challenges came out of ETL processes, before I developed a full platform capable of capturing and sending data to the warehouse in a straightforward, easy-to-use format.

Continue reading…

Sending transactional data to warehouse databases as it happens

Data warehouses work best by storing data by transaction.  Storing individual transactions in a data warehouse, as seen in my last post, allows the data to be used in many different ways and, in many cases, makes it future proof. One of the greater challenges I’ve come across, and I’m sure many others have too, is finding the best and most efficient way of storing transactions.

Whether you are looking to build an end-to-end analytics solution with complete ad hoc capability or an engine that can make contextual recommendations based on activity, the data stored in the warehouse will be very similar. Then comes the question of how best to store that data without slowing requests or adding strain to servers.  This can be accomplished through various means, including asynchronously pushing data to your warehouse, appending to a CSV file to import into the warehouse database, or, if your database supports it, exporting data when it is most efficient.
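The asynchronous-push option can be sketched in a few lines: a queue takes transactions off the request path, and a background worker appends them as CSV rows for later import into the warehouse. This is a simplified illustration (an in-memory buffer stands in for the import file), not a production pipeline:

```python
import csv
import io
import queue
import threading

events = queue.Queue()
buffer = io.StringIO()  # stands in for the warehouse import file

def writer():
    """Drain transactions off the request path and append them as CSV rows."""
    w = csv.writer(buffer)
    while True:
        row = events.get()
        if row is None:  # sentinel: shut the worker down
            break
        w.writerow(row)

worker = threading.Thread(target=writer)
worker.start()

# The request handler just enqueues and returns immediately.
events.put(["2017-06-22", "Costco", 25.00, "debit", "Groceries"])
events.put(None)
worker.join()
print(buffer.getvalue().strip())
```

The request never waits on the warehouse; the worker can batch, retry, or flush on its own schedule.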

Many perform this process through a more complicated method known as Extract, Transform, and Load, or ETL for short. Many have their reasons to prefer ETL over gathering data from transactions as they happen. In my experience, it leads to a lot less flexibility, as the data being extracted for warehouse purposes may be too limited.

Continue reading…

Building a data warehouse

The databases of today are in many cases built for specific purposes.  Some of the more common ones we see every day are relational databases, document-oriented databases, operational databases, triplestores, and column-oriented databases (c-stores). Typically, relational, document-oriented, operational, and triplestore databases are used to solve frontend database problems.  Then you have column-oriented and similar databases that focus on solving warehousing and backend database problems.  These products aren’t the only ones that can solve those problems, but they are often best suited for them.

Continue reading…

The challenges of scaling your data vertically

There are many reasons for which databases must be scaled.  The majority of the time, scaling is needed to overcome performance issues as the product grows.  Though NoSQL is making a lot of noise these days, it should come as no surprise that SQL is still extremely popular.  In general, the same principles apply when scaling out any SQL product, be it MySQL, MsSQL, Oracle or even DB2. With big data, however, scaling is often done to balance the data across multiple hardware nodes or clusters.

Continue reading…

Starting a new series. The BIG data series!

Over the next few weeks I will be starting up a new series which I am hoping will have at least a dozen posts.  I am looking to cover a lot of the high level concepts around big data.  Scaling, data format, software, and using the data will be amongst the topics I will cover.

It is no surprise that I am looking for a job. Over the past few months I’ve had many interviews and have noticed that all companies had similar problems. BIG DATA. Most of these companies had issues scaling their databases for either performance or storage reasons.  Many others were simply unable to pull and use their data efficiently because it would take seconds to get results from a fairly simple query.

I’ll be looking to keep the posts bite-sized, like this one. I look forward to feedback or suggestions on topics related to Big Data that I should cover.

Stay tuned, the first post coming shortly will be on scaling.

Great design and user experience for smiles

I’m taking a bit of a break from writing my contextual series to address a crucial problem that I’ve been seeing on a daily basis: the lack of good user experience. On any given day, I visit close to 50 different sites and use over a dozen apps.  I would say that 75% of them I would not visit or use if they had decent competitors.  The biggest problem with sites and apps these days is that they bombard users with information.  I get that you want to make money from your ads, but displaying them elegantly will yield better results.  Design is also crucial; these days the simple look works, and it looks great.

I’ve built my site around the same principles I talk about. Yes, my blog has ads, but I in no way try to distract my readers from the content. I push users to sign up for my newsletter in a box that appears in the top right, set to be seen only once a week.  As for the look and feel, I kept things simple with no fancy logos, gradients, etc. While I may not get a huge amount of traffic, I am sure this design and approach could be used with great success by more popular bloggers.

Continue reading…

Contextual Series: Gathering User Information

This is the third part of my contextual series, which will focus on a few technical details of how to gather user information.  I cannot stress this enough: be wary of the user’s privacy.  Make sure that anything you do is covered under your privacy policy and that the data you gather is collected with the user’s consent.  As soon as a user finds their experience with your product creepy, you have lost them.

Social networks like Facebook provide us with a wealth of data.  They are a great way to get a user to consent to sharing data without having them fill in forms.  The information these social networks provide is bound by terms of service, terms of use and, of course, a privacy policy. Make sure you keep these in mind while thinking of ways to use the data.

Before considering social networks as your primary source of data, keep in mind that with the modern web and native applications, your product may already have access to a lot of useful data, especially the user’s location. More specific information, like age and gender, will require input methods. If the user’s social connections are important to your product, or you don’t want to subject users to numerous input fields, social networks become important.

Continue reading…