How a new data storage pattern can help to keep your data scientists happy!


In this article, we’ll try to shed some light on how we can evolve our application design to keep data scientists happy. More importantly, we’ll explain how to bridge the gap between a more traditional approach to designing software systems and software that creates a solid foundation for state-of-the-art self-learning solutions.

Before diving in, let’s set the scene for this article:

  • Data science is a very broad term, so we’ll first have a look at what a data scientist needs to succeed.
  • Afterwards, we define what we consider to be a more traditional pattern for storing data and what it implies.
  • Next, we introduce a new approach that puts data first.
  • Finally, we bring all of this together to explain how it can improve the lives of our data scientist colleagues.

Data Science

The role of a data scientist is quite vaguely defined these days. Let’s start with some certainties, i.e. a good old Wikipedia lookup:

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. […] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, domain knowledge and information science.

Hey, hold on: so if I want to be a data scientist, I need to major in math and statistics, hold a degree in computer science and have a deep understanding of the domain at hand. (I admire you if you do not feel overwhelmed yet.)

A great representation of the complexity of an AI project is visualized above.

The truth of it is: because of the broad skill set needed to fulfill this goal, when you ask data scientists what their role is, you can get very different answers depending on the company or even the team they work in.

In an attempt to split responsibilities and skill sets, the concept of data engineering was introduced.

The data engineer is the member of your team with the skill set to focus on the practical applications of data collection and analysis. This work forms the foundational infrastructure for data scientists to build upon. At the end of the day, for a data science application like a recommendation engine to actually deliver value, it needs to be hooked up to the real-world operational back office of, for example, your web shop.

Our pitch on creating applications that make data scientists happier sits exactly in the domain of data engineering and how these applications support the data science process. But before we go into detail, we need to look at this process of value creation by our friends, the data scientists.

Let’s recap the first part of the definition of data science:

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

Since the rise of mainstream data science platforms, developing an AI project has become much more accessible to developers. A representation of the funnel that is typically followed to bring value is visualized above.

Data is collected. Ordered and cataloged. Connected. Deduced. And finally interpreted.

This funnel is the realm of the data scientist. Without going into detail (since we are no experts in this domain) we can tell you this: the degree of success in automated learning (read: going from plain data to actionable insights and wisdom) typically depends on the amount of quality data that flows into this funnel. It’s the process of this decade and the reason why giants like Google, Facebook and others provide us with free-to-use applications where data collection in the background is king.

While the process above is the focus of data scientists, the inflow and outflow of this funnel is where a data engineer paves the way to a successful product.

All software engineering is about data, but what puts you in the category of data engineering is the context in which that data is used. Data engineers are no different from more traditional software engineers in the core sense of their skills. The only difference is their take on how important data is with regard to the funnel above. The data engineer is, in that sense, more aware of the process that follows data collection.

Let’s now have a look at what we mean by a more traditional approach to storing our precious data, and how a more modern design helps feed the funnel of the data scientist.

A traditional approach to storing data

The simplest, and therefore most widely used, strategy for handling data in a traditional system is in-place storage. The pattern is called CRUD, which stands for Create, Read, Update, Delete.

A CRUD application is a system in which a domain entity, for example the shopping cart in a web shop, is represented by only its current state. This is simple, and it’s all that is needed to bring the core value of a web shop to your customer: checking out the cart when they click ‘order now’.
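To make this concrete, here is a minimal sketch of CRUD-style in-place storage for such a cart. The class and method names are our own illustrative choices, not a prescribed API; the point is that each update or delete overwrites the previous state.

```python
class ShoppingCart:
    """A CRUD-style cart: only the current state is ever stored."""

    def __init__(self, cart_id):
        self.cart_id = cart_id
        self.items = {}  # product_id -> quantity

    def add_item(self, product_id, quantity):
        # Update in place: the previous quantity is overwritten.
        self.items[product_id] = self.items.get(product_id, 0) + quantity

    def remove_item(self, product_id):
        # Delete in place: no trace remains that the item was ever added.
        self.items.pop(product_id, None)


cart = ShoppingCart("cart-42")
cart.add_item("laptop", 1)
cart.add_item("mouse", 2)
cart.remove_item("mouse")
print(cart.items)  # only the end state survives: {'laptop': 1}
```

The customer who briefly considered the mouse is invisible in what we stored, which is exactly the loss we discuss next.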

An overview of the typical data flow in a CRUD application:

When designing an application like this, we value the core business of selling products directly, but this approach is not necessarily the best we can do when it comes to, say, optimizing a recommendation engine that promotes other products your users may like. In other words, this pattern for data storage is not optimal for the funnel we described above, the foundation of a successful AI product.

A famous quote about data science that is very applicable here:

 Every time you perform an update or delete, a data scientist cries.

The reasoning behind this statement is that each time we update the most current representation of our entity, we lose value in the form of historic data. As we have seen earlier, the amount of quality data points is crucial to the success of the AI funnel. That is where a more modern approach to data storage can help capture all our data while still preserving the core functionality of our application, like ordering products.

A new era for data-first applications

Meet event sourcing.

While CRUD applications rely on the last known state to represent the truth about an entity in your system, event sourced applications rely on all the increments that make up this end state. In that sense, CRUD only cares about the destination, while event sourcing cares about the road to the destination as well as the destination itself.

Because we are no longer updating the state, but persisting the changes instead, we capture all the valuable changes that were made to the entity, in this case the shopping cart. Reconstructing the end state is then simply a matter of replaying the history of events.
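The same cart, sketched as an event sourced design. Again, the event names and fields are illustrative assumptions rather than a fixed schema: every change is appended as an immutable event, and the current state is a fold over the full history.

```python
events = []  # the append-only event log

def record(event_type, **payload):
    """Persist a change as an immutable event instead of mutating state."""
    events.append({"type": event_type, **payload})

def replay(event_log):
    """Reconstruct the current cart state by replaying the event history."""
    items = {}
    for event in event_log:
        if event["type"] == "ItemAdded":
            items[event["product_id"]] = (
                items.get(event["product_id"], 0) + event["quantity"]
            )
        elif event["type"] == "ItemRemoved":
            items.pop(event["product_id"], None)
    return items


record("ItemAdded", product_id="laptop", quantity=1)
record("ItemAdded", product_id="mouse", quantity=2)
record("ItemRemoved", product_id="mouse")

print(replay(events))  # same end state as CRUD: {'laptop': 1}
print(len(events))     # but the road there is preserved: 3 events
```

The end state is identical to the CRUD version, yet nothing was thrown away: the full road to that state is still in the log.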

Now let’s get back to our sobbing data scientists.

Applications that make data scientists happy

Event sourced applications offer a huge advantage when traceability and data collection are key. As we have seen earlier, the more quality data points that can be fed into the data scientist’s funnel, the happier our data scientists will be. It makes their algorithms more successful in finding the crucial patterns that can create the competitive advantage you were looking for.
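As a hedged illustration of why this history matters to a data scientist: from the same append-only log we can derive signals that the CRUD end state has already destroyed, such as products that were added to the cart and then removed (considered but not bought), a plausible input for a recommendation engine. Event names and fields below are our own assumptions.

```python
# A hypothetical event history for one cart session.
events = [
    {"type": "ItemAdded", "product_id": "laptop", "quantity": 1},
    {"type": "ItemAdded", "product_id": "mouse", "quantity": 2},
    {"type": "ItemRemoved", "product_id": "mouse"},
]

def considered_but_removed(event_log):
    """Products the user showed interest in but did not keep.

    Invisible in the final cart state, but recoverable from the events.
    """
    added = {e["product_id"] for e in event_log if e["type"] == "ItemAdded"}
    removed = {e["product_id"] for e in event_log if e["type"] == "ItemRemoved"}
    return added & removed


print(considered_but_removed(events))  # {'mouse'}
```

In a CRUD system this signal simply does not exist; with event sourcing it is one query away.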

It’s not trivial to convert a traditional CRUD application to an event sourced application without a strong set of tools, frameworks and expertise.

Want to learn about implementing event sourced applications? Have a look at some of our intro blogs about event sourcing.

Need some help? At Reaktika, we are focusing on event sourcing as a key pattern in building robust applications that are a solid foundation for AI projects to build upon. Let’s get in touch!