Data Mesh: Rethinking Our Data Infrastructure
The enterprise data warehouse (EDW) has been around for a long time. The story begins more than forty years ago, when Oracle released the first commercially available SQL-based relational database in 1979. Databases of that era offered fragmented analytics at best. This problem was addressed at the end of the 1980s, when EDWs emerged to consolidate data gathered from multiple nodes on one server. EDWs provided their users with unified analytics and served them well for decades, while the bulk of data was structured.
EDWs: Pros and Cons
EDWs make things easier for users because they keep the data pre-aggregated and sorted in a predefined format, which speeds up access. The data can easily be converted into business reports that give decision-makers visibility into where a company stands at a certain point in time. Another advantage of EDWs concerns historical data: one can dive into the data stored in an EDW and run a time series analysis.
Despite these advantages, EDWs saw their popularity fade over the years, especially with the explosion of data formats and the emergence of cloud technology. Data grows three times as fast as data storage does in this new era. Most of the data today is unstructured, and it keeps piling up, giving users no chance to pre-aggregate it. The most one can do about this problem is to stash the data somewhere until it is needed. The data warehouse lacks the scalability to keep up with this dramatic growth in the amount of data produced. It can only scale by adding new nodes to the central data warehouse, which gets more challenging as the number of nodes increases.
Another area where EDWs come up short is providing users with real-time data. With e-commerce booming, inventory management and logistics have become more important than ever. These operations rely on real-time data: three-month-old data won’t help you much when you are in the middle of a supply chain crisis.
Finally, there is the data ownership issue. The data stored in a data warehouse is static, waiting to be discovered by a user in need. Modifying that data in a way that can serve the needs of different users is a serious hurdle as it means reconfiguring the entire ETL/ELT process. More frequent data transformations mean more changes to the data pipeline. Treating data as a product and curating it as close to the source as possible can solve this problem.
Enter data mesh
One of the concepts proposed to revolutionize data integration is the “data mesh.” Coined by Zhamak Dehghani, the concept responds to the problems of the monolithic data infrastructure built up over decades and describes how we should rethink that infrastructure.
Dehghani argues in a long essay that the first generation of data architecture platforms was built upon EDWs centralizing data storage. In the second generation, this architecture was complemented with data lakes that were capable of integrating with the big data ecosystem. According to Dehghani, this path was meant to avoid monoliths but led to an even bigger monolith, which made it difficult for people to bring together and use their data. The monolithic infrastructure suffered from scalability problems, which Dehghani hopes to solve by decomposing it into business domains. Her data mesh concept aims to tackle this structure. Three of its key principles are covered here:
1 - Domain focus
In the traditional monolithic approach, one central data warehouse handles the ingestion, transformation, and storage of data. The data mesh approach replaces this outdated view with a more distributed model that transfers the ownership of data from a central IT department to those who create it.
Conventionally, the central IT department manages the data pipelines of the various departments, which can cause bottlenecks. The resultant delays are unacceptable in today’s dynamic business environment, where some domains can’t do without real-time data. By contrast, a distributed data mesh approach empowers domain experts to handle their own data pipelines.
In addition to creating backlogs, the conventional approach is doomed to fail because IT teams know very little about the respective contexts in which each department generates its data. These teams are disconnected from business operations and are not well-versed in what business units use the data for. In the data mesh model, domain experts are responsible for generating, curating, and serving the data for consumption by others while ensuring the quality of data. In a sense, this shift resembles the emergence of citizen developers in the enterprise segment, who are expected to leverage no-code/low-code tools to solve their own problems at a time when IT talent is scarce.
2 - Data-as-a-Product
Enterprises have finally woken up to the value of data and started to treat it as their most valuable asset. Although this is a step in the right direction, it is still not good enough. The data mesh concept replaces this approach with a data-as-a-product perspective.
According to this perspective, domain experts are to produce and curate data, keeping in mind that it will be consumed by other people across the organization. Data owners are expected to treat their data as a product and potential users of that data as customers. Therefore, the needs of these people should be a priority, and data owners must strive to make the user experience as frictionless as possible, with an emphasis on delighting their customers. That’s why data owners should focus on ensuring that the data is accurate, timely, discoverable, and accessible to people who need it.
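To make the "accurate, timely, discoverable, and accessible" criteria a little more concrete, here is a minimal sketch of how a domain team might describe a data product and check it before serving it. Everything in it is a hypothetical illustration rather than part of any data mesh standard or tool: the DataProduct class, its field names, and the sample "orders_daily" product are assumptions, and the completeness check is only a crude stand-in for accuracy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes with its data."""
    name: str                         # how consumers find the product
    owner_domain: str                 # the team accountable for it
    description: str                  # what the data means, in business terms
    last_updated: datetime            # when the data was last refreshed
    max_staleness: timedelta          # freshness promise made to consumers
    required_fields: tuple[str, ...]  # fields that must never be missing
    rows: list[dict] = field(default_factory=list)

    def is_timely(self, now: datetime) -> bool:
        """True if the data is within the promised freshness window."""
        return now - self.last_updated <= self.max_staleness

    def incomplete_rows(self) -> list[dict]:
        """Rows that are missing at least one required field."""
        return [
            row for row in self.rows
            if any(row.get(name) is None for name in self.required_fields)
        ]

    def is_complete(self) -> bool:
        """Crude proxy for accuracy: no required field is missing."""
        return not self.incomplete_rows()


# Sample product with one good row and one row missing its amount.
orders = DataProduct(
    name="orders_daily",
    owner_domain="sales",
    description="One row per confirmed order",
    last_updated=datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc),
    max_staleness=timedelta(hours=24),
    required_fields=("order_id", "amount"),
    rows=[
        {"order_id": 1, "amount": 19.9},
        {"order_id": 2, "amount": None},
    ],
)

check_time = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(orders.is_timely(check_time))   # True: refreshed six hours earlier
print(orders.is_complete())           # False: one row has no amount
```

The point of the sketch is that the owning domain, not a central team, states the freshness promise and runs the checks, so consumers can trust what they pick up without asking anyone.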
3 - Establishing a self-serve infrastructure
The third principle of the data mesh approach follows immediately from the product thinking in the second principle. The aim of data-as-a-product is to help consumers of data self-serve. The data generated should be cataloged and published in such a way that other users can easily find it and use it to achieve their own ends. Making the data more discoverable means less hand-holding is needed from the IT department.
Empowering consumers of data to self-serve takes some of the burden off the IT department and removes bottlenecks. Moreover, it has the potential to unlock the creativity of data users, who no longer have to engage in a back-and-forth with the IT people at every step.
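The cataloging idea behind self-serve can be sketched in a few lines. The example below is an assumption-laden illustration, not a description of any real catalog product: DataCatalog, CatalogEntry, the tag scheme, and the "placeholder://" locations are all made up for the sketch. Domain teams publish entries, and consumers search by keyword without filing a request with IT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record for one published data product."""
    name: str
    owner_domain: str
    description: str
    tags: frozenset[str]
    location: str  # where consumers read the data (placeholder values below)


class DataCatalog:
    """Minimal in-memory catalog: domains publish, consumers search."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def publish(self, entry: CatalogEntry) -> None:
        """Register a product; names must be unique across domains."""
        if entry.name in self._entries:
            raise ValueError(f"{entry.name!r} is already published")
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogEntry]:
        """Case-insensitive match on name, description, or exact tag."""
        needle = keyword.lower()
        return [
            entry for entry in self._entries.values()
            if needle in entry.name.lower()
            or needle in entry.description.lower()
            or needle in {tag.lower() for tag in entry.tags}
        ]


catalog = DataCatalog()
catalog.publish(CatalogEntry(
    name="stock_levels",
    owner_domain="logistics",
    description="Current inventory per warehouse",
    tags=frozenset({"inventory", "warehouse"}),
    location="placeholder://logistics/stock_levels",
))
catalog.publish(CatalogEntry(
    name="orders_daily",
    owner_domain="sales",
    description="One row per confirmed order",
    tags=frozenset({"orders"}),
    location="placeholder://sales/orders_daily",
))

# A consumer looks for inventory data without asking the IT department.
for hit in catalog.search("inventory"):
    print(hit.name, "owned by", hit.owner_domain)
```

A real platform would add access control, schemas, and lineage on top of this, but the division of labor stays the same: the platform provides the catalog, and the domains fill it.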
Final thoughts
Data mesh is a useful framework, a mental exercise offering us a fresh look at the data infrastructure, its problems, and what needs to be done about them. It is not a tool we can buy off the shelf and deploy, though. It is more of a vision of how we should treat data. Data mesh involves a set of technologies we can leverage to make the most of data and tap into its potential. It is not the last word on data infrastructure, but it shows how our thinking needs to evolve to tackle today’s data challenges.