For most companies, data is as valuable as gold. Yet it often isn’t treated as such. Why is that?
I’ve often compared data to the love-hate relationship many of us have had with school. Many of us felt we needed to do just enough to pass, whereas others felt they needed to go above and beyond to get top marks. Who’s the winner here?
Continuing with the school analogy, doing just enough to pass is the equivalent of spending just enough time on your data for features to work, whereas going above and beyond is the equivalent of over-engineering your data to handle everything but the kitchen sink. The reality is that neither extreme is likely where you want to be, although in my experience you want to edge closer to having it all to put your teams into overdrive.
First, let’s talk about doing just enough. This is often where many startups will be. The reason is simple: they need to show value and attract customers quickly. More often than not, it’s the features attracting customers, not the quality of the data. Depending on the stage of the company, they may stay focused on pumping out more features, in turn creating downstream complications that engineering teams dread taking on. Refactoring data is often more challenging and/or time-consuming than refactoring code.
Next, you have the companies with an immense love for data. They tend to think through every possible use case for the data between now and the next few years. The great thing is that you may be able to do anything from powering the front end of your apps to applying machine learning, producing powerful reports, and even creating mutations or subsets of your data. As you can imagine, this can add a lot of overhead and can come at the expense of agility, especially if you have smaller teams.
In the majority of my experiences and conversations, most are doing just enough because features are king (and I agree!). However, as these companies look at scaling out teams and adding more engineers, they hit a breaking point.
Symptoms that you’ve hit the limits of your data models include:
- Challenges maintaining API contracts and ensuring all consumers conform to the latest version.
- Building new features may result in a lot of duplicated code or data, sometimes leading to significant tech debt (who wants tech debt on a new feature?!).
- Performance issues with either reads (lack of indexes) or writes (too many indexes).
- Data requires an app or processing layer to make sense of it, so business logic has to be replicated for each consumer of the data.
- Creating tables or columns that only serve a subset of features, leaving many values null. This depends on the use case, but it’s often better to consider the intent of a data model rather than simply bolting on new characteristics (see the sketch after this list).
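To make that last symptom concrete, here’s a minimal sketch, assuming SQLAlchemy and an entirely hypothetical e-commerce schema, of a table that has quietly absorbed columns for several features:

```python
# Hypothetical illustration of the "mostly null" symptom: one table
# that has accumulated feature-specific columns over time.
from sqlalchemy import BigInteger, Column, Numeric, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Item(Base):
    __tablename__ = "items"

    id = Column(BigInteger, primary_key=True)
    name = Column(String(255), nullable=False)
    price = Column(Numeric(10, 2), nullable=False)

    # Electronics feature only -- null for every other item.
    voltage = Column(String(16))
    wattage = Column(String(16))

    # Food feature only -- null for every other item.
    calories = Column(Numeric(8, 2))
    allergens = Column(String(255))

    # Jewelry feature only -- null for every other item.
    model_3d_url = Column(String(512))
```

Every new feature adds a few more mostly-null columns, and the intent of the model gets buried; the standardized, extensible approach later in this piece is one way out.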
When the symptoms start to break down your velocity or limit what you can build, that’s when you need to rethink your data. Your code is likely to evolve over the months or years; so should your data models. This is where engineers need to sit with product to figure out where to take data models next. You may realize one of two things: either product has no need for better data, making it an engineering-only initiative, or the product roadmap has data-rich feature sets that should shape the approach.
Regardless of your approach, it’s good to understand your current challenges, the ones you may face in the short to medium term, and what you may want to enable. As you build out a plan to level up your data, consider creating standards or, depending on your needs, standardized models.
Creating standards should come as a given, but it is often neglected, especially as ORMs have grown in popularity over the years. ORMs have some great features when you are starting out, but shouldn’t be leaned on as heavily as you mature. ORMs can make migrations riskier and sometimes create a layer of abstraction that obscures the database from developers. Because adding a new column can be as easy as adding a key to your object, developers may be skipping crucial steps.
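To illustrate how little friction that can involve, here’s a sketch assuming SQLAlchemy with Alembic’s autogenerate feature; the model and names are hypothetical:

```python
# Hypothetical model showing how an ORM lowers the bar for schema
# changes. All names here are illustrative.
from sqlalchemy import BigInteger, Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"

    id = Column(BigInteger, primary_key=True)
    email = Column(String(255), nullable=False)

    # Uncommenting this one line and running
    # `alembic revision --autogenerate` yields a migration that adds
    # the column -- nothing forces a discussion of naming,
    # nullability, defaults, backfill, or indexing.
    # referral_code = Column(String(32))
```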
If you’re going to spend time coming up with working data models and working through use cases, you’ll want standards. These standards are not only going to define how everything is created, but also ensure long-term maintainability. You may want to define how new tables are created, what they are required to contain at a minimum, how columns should be named, guidelines for the data itself, and guidelines for creating indexes. These standards should be part of code reviews and reviewed by a DBA (should you have that luxury).
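Here’s what codifying those standards could look like, as a minimal sketch assuming SQLAlchemy; the required columns and naming conventions are examples, not prescriptions:

```python
# A sketch of team standards expressed in code: every table gets a
# consistent primary key and timestamps, and indexes/constraints are
# named by convention rather than by hand.
from datetime import datetime, timezone

from sqlalchemy import BigInteger, Column, DateTime, MetaData
from sqlalchemy.orm import declarative_base

# Standard: predictable names for indexes, foreign keys, and primary
# keys, so migrations and performance tuning are easier to reason about.
NAMING_CONVENTION = {
    "ix": "ix_%(table_name)s_%(column_0_name)s",
    "fk": "fk_%(table_name)s_%(column_0_name)s_%(referred_table_name)s",
    "pk": "pk_%(table_name)s",
}

Base = declarative_base(metadata=MetaData(naming_convention=NAMING_CONVENTION))

def utcnow() -> datetime:
    return datetime.now(timezone.utc)

class StandardModel(Base):
    """Standard: the minimum every table is required to contain."""

    __abstract__ = True

    id = Column(BigInteger, primary_key=True)
    created_at = Column(DateTime(timezone=True), default=utcnow, nullable=False)
    updated_at = Column(DateTime(timezone=True), default=utcnow,
                        onupdate=utcnow, nullable=False)
```

New models inherit from the base, so the minimums and naming rules come for free, and a code review only has to check what’s new.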
As you work through what you’ll want to do with your data, you may also realize that you can not only create standards for your data, but standardize it. The benefit of standardized data models is that they create a starting point that can be inherited and extended. A great example is e-commerce, where you have the concept of items. While all items may have a name, a description, some images, prices, weight, dimensions, and so on, some items need more. Electronics may need additional specs, food will have nutrition information, and jewelry might include 3D views. There are numerous ways to extend the originating data set, but the intent is for the base model to be extensible out of the box while, in theory, remaining capable of functioning without the supplemental data.
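Here’s a minimal sketch of that items example, again assuming SQLAlchemy and hypothetical names: a core table that stands on its own, with supplemental data in an optional one-to-one extension table:

```python
# Extensible model: `items` works by itself; food-specific data lives
# in a separate extension table instead of nullable columns.
from sqlalchemy import BigInteger, Column, ForeignKey, Numeric, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Item(Base):
    """Core model: every item functions with just this data."""

    __tablename__ = "items"

    id = Column(BigInteger, primary_key=True)
    name = Column(String(255), nullable=False)
    description = Column(String(2000))
    price = Column(Numeric(10, 2), nullable=False)
    weight_grams = Column(Numeric(10, 2))

    nutrition = relationship("ItemNutrition", uselist=False,
                             back_populates="item")

class ItemNutrition(Base):
    """Extension: food items only; an Item works without this row."""

    __tablename__ = "item_nutrition"

    id = Column(BigInteger, primary_key=True)
    item_id = Column(BigInteger, ForeignKey("items.id"),
                     unique=True, nullable=False)
    serving_size = Column(String(64))
    calories = Column(Numeric(8, 2))

    item = relationship("Item", back_populates="nutrition")
```

Electronics specs or 3D views would follow the same pattern as additional extension tables, replacing the mostly-null columns from the earlier sketch.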
Data standards are something many of us learned in school, but they are often seen as less practical in the real world because they create overhead. The code to support these standard data sets can sometimes be seen as a failure point, and teams may be tempted to update them as a shortcut. Discipline is required for them to be successful. As your teams and product evolve, standards become something that helps velocity and enables more robust features and capabilities throughout the pipeline. Not only that, they simplify performance tuning and can supercharge reporting.
In the end, a good data practice often becomes as important as the features of your app. This may not be something that product cares about or even needs to understand. As engineers, it’s important to communicate the benefits and keep a close eye on the tech debt caused by aging data models. At the very least, when considering an overhaul of your data models, put together some standards and add them to your code review process. Take the time to teach the org about good data practices and why they are important. To take it a step further, try to create more generalized data models that can be extended.
Just as microservices may not be suitable for everyone, there is no one-size-fits-all approach to data. However, think of data as more than storage for a feature. Data can power a feature, enable deep understanding of product usage (reporting), create catered experiences for users, and even help make decisions on their behalf through machine learning.
Treat data like the gold nugget that it is!