Important Trends in Data Management
As we creep deeper into 2023, it bears remembering that less than a decade ago, the enterprise market looked upon data solutions, mostly data warehousing, as an unavoidable fixture of its solution architecture: a monolith designed to defray cost and performance impacts to core business operations in order to satisfy rigid reporting and analytics needs. More often than not, these solutions were seen as dependent cost centers requiring significant maintenance budgets for limited innovation return. It was just the cost of doing business.
Since then, concepts relating to infinite computing, elastic infrastructure, and managed services in the cloud have revitalized an entire demographic’s perspective on the value of data. This newly imbued value features prominently in most C-suites’ five- and 10-year strategies as a source of revenue generation, with data now being assigned extrinsic value.
This pivot to data-as-a-strategy has hardly happened in a vacuum. The emergence of commoditized computing domains, including artificial intelligence, machine learning, IIoT, and graph-led product categories, has pulled data architectures into the future, while the pace of data architecture innovation has in turn driven advances in those same product markets.
This has given rise to a Cambrian explosion of greenfield technologies and start-ups, fresh solution verticals, and re-invented processing architectures that saw north of $5 billion invested into the space in 2021 alone – the majority of this being thrown into the analytics and storage ring.
Given this abbreviated review, we’re interested in diving deeper into where the data space is headed. We’ve marked that outlook with five key trends we suspect will be core to the evolution of enterprise data management over the next half-decade.
1. Ubiquitous Cloud Data Infrastructure
There is no better place to start than with the infrastructure that enabled much of the growth in this space. Moving beyond legacy on-premises systems to the cloud, and specifically to the public cloud, unlocked resources otherwise tied up in infrastructure maintenance, reliability, and availability, and leveled the playing field for innovative practices. This enticing low-floor, high-ceiling paradigm for technology adoption is poised to gain more traction, with Gartner forecasting public cloud services spending to approach US$500 billion by 2022.
With five nines of availability (99.999%) and a staggering 11 nines of durability (99.999999999%) achieved by AWS (the public cloud incumbent servicing a third of the market), less time and fewer resources need to be spent on managing on-premises systems. This benefit is realized both in hardware capital expenses and in the arguably more costly army of specialists tending to networking, administration, data management, security, reliability, and maintenance.
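To make the "nines" concrete, here is a rough back-of-the-envelope sketch in Python. The function names are our own, and the durability calculation assumes the figure can be read as an annual per-object probability of survival, which is a simplification rather than a statement of any provider's guarantee.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Maximum yearly downtime implied by an availability of N nines.

    Five nines (99.999%) means the service may be unavailable for a
    fraction 10**-5 of the year.
    """
    return 10 ** -nines * MINUTES_PER_YEAR


def expected_objects_lost_per_year(object_count: int, nines: int) -> float:
    """Expected yearly object loss for a durability of N nines.

    Assumption: durability is an independent, annual, per-object figure.
    """
    return object_count * 10 ** -nines


if __name__ == "__main__":
    print(f"Five nines of availability: "
          f"{downtime_minutes_per_year(5):.2f} minutes of downtime per year")
    print(f"Eleven nines of durability, 10 million objects: "
          f"{expected_objects_lost_per_year(10_000_000, 11):.4f} objects lost per year")
```

Under these assumptions, five nines leaves room for a little over five minutes of downtime per year, and 10 million stored objects at 11 nines works out to an expected loss of roughly one object every 10,000 years.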
From both a data management and storage perspective, cloud-native storage platforms built on new and emerging architectures, such as cloud data warehouses, cloud data lakes and the new but familiar-feeling data lakehouses, provide performant and easily scalable solutions.
At the same time, the abundance of infinitely scalable cloud computing, serverless cloud services, and turnkey cloud-native integration tools fosters a healthy and rich ecosystem to address enterprise data management needs.
2. Active and Augmented Metadata Management
The data that helps describe your data, metadata, is the key to gaining leverage over the astronomical volumes of data an organization captures. As a pillar of the data cataloging space, an Enterprise Metadata Management (EMM) strategy drives timely and efficient indexing to help address common questions, including:
What data am I gathering/generating?
How is it structured?
Where is it coming from and where is it stored?
Where do I find the data that I need?
How does my data relate to my business processes?
How is my data connected?
Where is my data being used and by whom?
A basic implementation of EMM is the operational data catalog, which represents an indexed collection of the enterprise data sources. Going a step further is the concept of the augmented data catalog, a term coined by Gartner and defined as a machine learning-driven automation layer on top of the traditional data catalog.
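As an illustration of what an operational data catalog indexes, the following is a minimal in-memory sketch. The class names, fields, and sample entries are all hypothetical and chosen only to mirror the questions listed above; a production catalog would persist this index and populate it automatically.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one data source (all fields are illustrative)."""
    name: str                      # what data am I gathering?
    location: str                  # where is it stored?
    source: str                    # where is it coming from?
    schema: dict[str, str]         # how is it structured? (column -> type)
    business_process: str          # how does it relate to the business?
    consumers: list[str] = field(default_factory=list)  # who uses it?


class DataCatalog:
    """An indexed collection of catalog entries, keyed by dataset name."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def find_by_column(self, column: str) -> list[str]:
        """Where do I find the data I need? Shared columns also hint at how data is connected."""
        return sorted(n for n, e in self._entries.items() if column in e.schema)

    def used_by(self, consumer: str) -> list[str]:
        """Where is my data being used and by whom?"""
        return sorted(n for n, e in self._entries.items() if consumer in e.consumers)


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="sales.orders", location="warehouse/sales/orders", source="order-service",
    schema={"order_id": "int", "customer_id": "int", "amount": "float"},
    business_process="order fulfilment", consumers=["finance-dashboard"],
))
catalog.register(CatalogEntry(
    name="crm.customers", location="lake/crm/customers", source="crm-export",
    schema={"customer_id": "int", "email": "str"},
    business_process="customer onboarding", consumers=["marketing-team"],
))

print(catalog.find_by_column("customer_id"))
print(catalog.used_by("finance-dashboard"))
```

An augmented catalog, in this framing, would fill in and maintain such entries through automation rather than manual registration.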
The automation in augmented data catalogs streamlines data discovery, connectivity, metadata enrichment, organization, and governance. Building on this automated architecture, Active Metadata Management (AMM) is a leap in the same direction, enabling the continuous analysis of the various dimensions of enterprise metadata to determine “the alignment and exceptions between data as designed versus operational experience,” as defined by Gartner.
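One narrow reading of "data as designed versus operational experience" is a comparison between a documented schema and the records actually observed in production. The sketch below takes that reading; the function name, the schema representation, and the sample records are our own assumptions, not part of any Gartner definition or vendor product.

```python
def find_exceptions(designed: dict[str, type], observed_rows: list[dict]) -> list[str]:
    """Report where observed records deviate from the designed schema.

    `designed` maps each documented column to the Python type it should hold.
    """
    exceptions: set[str] = set()
    for row in observed_rows:
        for column in designed.keys() - row.keys():
            exceptions.add(f"missing column: {column}")
        for column in row.keys() - designed.keys():
            exceptions.add(f"undocumented column: {column}")
        for column in designed.keys() & row.keys():
            if not isinstance(row[column], designed[column]):
                exceptions.add(
                    f"type mismatch: {column} expected {designed[column].__name__}, "
                    f"got {type(row[column]).__name__}"
                )
    return sorted(exceptions)


designed_schema = {"order_id": int, "amount": float}
observed = [
    {"order_id": 1, "amount": 9.5},                    # matches the design
    {"order_id": "2", "amount": 3.0, "coupon": "X"},   # drifted in production
]

for problem in find_exceptions(designed_schema, observed):
    print(problem)
```

An active metadata layer would run checks of this kind continuously and feed the results back into the catalog, rather than on demand.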
3. Data Lakehouses – Best of Both Paradigms
While the data lake helped address the storage and flexibility pieces of the data management puzzle, enterprises find themselves resorting to external ETL processing for performant business intelligence insights and reporting, something that can typically be managed out of the box with a data warehouse. To streamline this process and help keep the data infrastructure unified and self-contained, the concept of data lakehouses emerged. As the name suggests, it is a hybrid data management solution combining advantages from both data lakes and data warehouses into a single platform, thereby reducing complexity and maintenance while also leveraging the economy of scale. The first documented use of the term “Data Lakehouse” dates back to 2017, when Jellyvision Lab, a Snowflake customer, used it to describe the Snowflake platform.
As with data lakes, mixed-structure data can be ingested into the lakehouse; the differentiating aspect is the ability to add a layer of warehousing on top of the lake. This allows for leveraging the rigidity and organized structure of a warehouse for traditional reporting needs while still maintaining an underlying flexible lake and a versatile architecture for a wider range of other applications.
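The layering idea can be illustrated with a toy sketch: raw, mixed-structure records stay untouched in a "lake", while a typed table is projected on top of them for reporting. Here a Python list of JSON strings stands in for the lake and an in-memory SQLite database stands in for the warehouse layer; both are stand-ins we chose for illustration, and the record shapes are hypothetical.

```python
import json
import sqlite3

# "Lake" layer: raw events of differing shapes, stored exactly as ingested.
raw_events = [
    '{"type": "order", "order_id": 1, "amount": 25.0, "region": "EU"}',
    '{"type": "order", "order_id": 2, "amount": 40.0, "region": "US"}',
    '{"type": "click", "page": "/pricing", "session": "abc"}',
    '{"type": "order", "order_id": 3, "amount": 10.0, "region": "EU", "coupon": "SPRING"}',
]


def build_orders_table(conn: sqlite3.Connection, events: list[str]) -> None:
    """Project order events from the lake into a rigid, typed table."""
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL NOT NULL, region TEXT NOT NULL)"
    )
    for line in events:
        record = json.loads(line)
        if record.get("type") == "order":
            # Extra fields such as "coupon" remain available in the lake only.
            conn.execute(
                "INSERT INTO orders VALUES (?, ?, ?)",
                (record["order_id"], record["amount"], record["region"]),
            )


conn = sqlite3.connect(":memory:")
build_orders_table(conn, raw_events)

# "Warehouse" layer: traditional reporting runs as plain SQL.
revenue_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
print(revenue_by_region)
```

The click event and the coupon field never enter the reporting table, yet they remain in the raw records for other applications, which is the flexibility the lake side is meant to preserve.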
4. Data Quality Management Through Observability
As the technical data infrastructure continues to be commoditized, the modern data production system is becoming increasingly complex with multiple potential points of check (or failure). Consequently, the answer to the seemingly simple question of “what went wrong?”, or in the preventative sense, “how can we make sure nothing goes wrong?”, in a data pipeline becomes harder to address. Fortunately, the wheel of quality management in such complex settings did not have to be reinvented. Lessons learned from applying lean and agile methodologies to software development, which gave rise to the DevOps revolution that continues to evolve and mature, are now also being applied to enterprise data management. One of the key pillars of ensuring total and continuous data quality management is data observability.
Observability itself is not a new concept; it was first introduced in 1960 by Rudolf E. Kalman in the context of linear dynamic systems. In control theory, observability is defined as the degree to which the internal state of a given system can be inferred from its outputs. Simply put, it answers the question, “what can we tell about how a system is performing based on its output?”
In the context of data management, the generally accepted definition of data observability involves the ability to understand the health and state of data in your system, allowing for data quality assurance and data life-cycle monitoring and control. While software observability rests on three pillars (logs, metrics, and traces), data observability is commonly described in terms of five: freshness, distribution, volume, schema, and lineage.
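To give a feel for what such checks look like in practice, here is a minimal sketch covering three of the five pillars: freshness, volume, and distribution. The function names, thresholds, and sample values are assumptions made for illustration; schema and lineage checks are left out for brevity.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_loaded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: was the dataset updated recently enough?"""
    return now - last_loaded_at <= max_age


def check_volume(row_count: int, expected: int, tolerance: float = 0.5) -> bool:
    """Volume: is the row count within a relative tolerance of what we expect?"""
    return abs(row_count - expected) <= tolerance * expected


def null_rate(rows: list[dict], column: str) -> float:
    """Distribution: share of rows where the column is missing or null."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


now = datetime(2023, 1, 2, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2023, 1, 2, 6, 0, tzinfo=timezone.utc)
rows = [{"amount": 1.0}, {"amount": None}, {"amount": 2.0}, {}]

report = {
    "fresh": check_freshness(last_load, now, max_age=timedelta(hours=24)),
    "volume_ok": check_volume(row_count=400, expected=1000),
    "amount_null_rate": null_rate(rows, "amount"),
}
print(report)
```

In this made-up run the data is fresh, but the volume check fails and half of the sampled rows lack an amount, the kind of signals that would prompt a closer look at the upstream pipeline.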
5. Data Fabric as a Multimodal Data Framework
It’s clear that a central monolithic data management solution is no longer an option for modern enterprises. The myriad of data producers, consumers, and applications and services in between require a modern and comprehensive data management framework capable of sustaining their growth in complexity and scale.
Data fabric lays the foundation for a multimodal data management platform architecture that elevates data management design and practices. Data fabric is rooted in three key principles:
Cohesiveness: ensuring that the enterprise data management architecture is curated and orchestrated in a way that breaks down organizational and technical silos and unifies data management under a single platform.
Composability: supporting component flexibility, scalability, and extensibility.
Versatility: on the level of applications as well as users and interfaces.
It bears mentioning that this overview of emerging trends in enterprise data management pertains primarily to the technical and architectural aspects of enterprise data management. But as we have historically observed with other spaces and industries, the explosive growth in technical capabilities is only one piece of realizing the business potential in the space. Sustainable growth and adoption of these trends in the enterprise space is contingent on adopting and implementing the right organizational change management strategies and having the right technical and organizational resources to support them.