The ETL process is common practice in nearly every company that has data it wants to analyse or repurpose. Often a product cannot be maintained due to resource constraints, the developers working on it aren't aware of the goals of those working with warehouse data, or the cost of updating the product to replace ETL with a far better process, as I described in my last post, is simply too high.
I'm not in any way saying that ETL is bad, because unfortunately for many, that process is required and there is no other way to get the data. Some of my greatest data challenges came out of ETL processes early on, before we developed a full platform capable of capturing and sending data to the warehouse in a straightforward, easy-to-use format.
A crucial first step in the ETL process is knowing everything that is available at hand. How many databases do you have? How many tables within each database? How is the data structured? Are any files available? You will be poring through a lot of data to start finding ways to answer complex business problems, and hopefully the data you need is readily available.
Assuming everything is available, your next step is identifying the problem: the question you are attempting to answer. You may get a few simple questions to solve, though chances are most will be quite difficult and will require some creativity to answer. Believe me, I know. I've gone through many discussions with my peers, letting them know that certain questions can be very complicated, or in some cases impossible, to answer. Solving them is not always as straightforward as having the answer at your fingertips, like trying to answer "how many users logged in in the past day?" with no last-login table.
Once you have the question, often from the stakeholders, you need to understand very well how the data is stored. In the case of counting logins per day without a last-login table, you could accomplish this in a number of ways. For instance, you could look through every table that records user actions, aggregate the data, group by user, and take each user's latest activity. That may be one of the easier examples, so let's take something like profile picture updates. If you don't have timestamps on user profile pictures but the files are stored on your servers, you can use the files' modification times to work out which users changed their pictures and when.
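To make the two tricks above concrete, here is a minimal sketch in Python against SQLite. The table and column names (`posts`, `comments`, `uploads`, `user_id`, `created_at`) are hypothetical stand-ins for whatever activity tables your own database happens to have:

```python
import datetime
import os
import sqlite3

# Hypothetical: any tables in your schema that record a user action
# along with a timestamp column.
ACTIVITY_TABLES = ["posts", "comments", "uploads"]

def latest_activity(conn):
    """Return a dict of user_id -> most recent activity timestamp,
    taken across all activity tables. This stands in for the
    missing last-login table."""
    latest = {}
    for table in ACTIVITY_TABLES:
        rows = conn.execute(
            f"SELECT user_id, MAX(created_at) FROM {table} GROUP BY user_id"
        )
        for user_id, ts in rows:
            # Keep whichever timestamp is newest across all tables.
            if ts is not None and (user_id not in latest or ts > latest[user_id]):
                latest[user_id] = ts
    return latest

def profile_picture_changed_at(path):
    """Infer when a profile picture was last changed from the file's
    modification time (assumes the file isn't rewritten for other reasons)."""
    return datetime.datetime.fromtimestamp(os.path.getmtime(path))
```

With ISO-formatted timestamp strings, the lexicographic comparison in `latest_activity` matches chronological order, which is why plain string `MAX` and `>` are enough here.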
As you can see, identifying the data to use can become quite laborious. Then there are times when timestamps aren't available in your frontend database at all, and suddenly you are struggling to make sense of any data. At that point, either things need to change or you won't be able to provide a historical snapshot. If you can't add timestamps to your frontend tables, for example, another alternative for counting active users per day is to start tracking, from that point on, all the changes happening in the database. You might copy the database, import it into the warehouse, and add the data you need, or compare snapshots on a daily basis.
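The daily-comparison idea can be sketched very simply: take yesterday's and today's snapshots of the users table and treat any changed or newly appearing row as a sign of activity. The row shapes below are made up for illustration; in practice they would be whatever columns your users table carries:

```python
def changed_users(yesterday, today):
    """Given two daily snapshots (dicts of user_id -> row tuple),
    return the ids of users whose row changed or who are new.
    A rough proxy for 'active that day' when no timestamps exist."""
    return {uid for uid, row in today.items() if yesterday.get(uid) != row}
```

Note that this only tells you *that* something changed between the two snapshots, not *when* within the day, and anything changed and changed back between snapshots is invisible, which is exactly why it is a fallback rather than a replacement for real timestamps.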
I would have liked to take a more specific approach to this topic of ETL; however, the problems faced by each company are quite different, and it would be difficult for me to find great use cases that would stand out. If there is an example you'd like me to cover in far more detail, I would be happy to look into it and write about it.
Based on the feedback I've received on my last post regarding export capabilities from some databases, I will be covering that feature in far more detail. It is certainly a feature that has benefited me a lot and, as you have likely noticed in this post, one that can lead to some of the best uses of your data.