What is Data Virtualization?
Data virtualization enables businesses to access, manage, integrate, and aggregate data from disparate sources in real time, independently of its physical location or format.
According to The Data Management Association International (DAMA) Data Management Body of Knowledge (DMBOK), data virtualization is defined as the following:
“Data Virtualization enables distributed databases, as well as multiple heterogeneous data stores, to be accessed and viewed as a single database. Rather than physically performing ETL on data with transformation engines, Data Virtualization servers perform data extract, transform and integrate virtually.”
Why Data Virtualization Evolved
In today's fast-changing business world, information has become a genuine production factor, and data-driven decision-making an essential tool for withstanding the growing competition across global industries and markets. Exploiting the power of BI/analytics and automating workflows is one way for companies to open new revenue streams while reducing costs by improving the efficiency of their daily processes.
And here lies the challenge. Nowadays, enterprise data is stored in different locations and comes in various, rapidly evolving forms such as:
- Relational and non-relational databases like MySQL, Amazon Redshift or MongoDB
- Flat files like XML, CSV or JSON
- Social Media or Website data like Facebook, Twitter or Google Analytics
- CRM/ERP data like SAP, Oracle or Microsoft Dynamics
- Cloud/Software-as-a-Service applications like Netsuite, Salesforce or Mailchimp
- Data lakes and Enterprise Data Warehouses
- Big Data
Businesses are faced with increasing volumes of data accompanied by growing data variety and velocity. This ultimately leads to further challenges such as achieving trustworthy data quality, time-efficient data management, and self-service capabilities for data users. Overcoming these challenges efficiently and effectively has become crucial to modern enterprises' success.
Forrester and Gartner confirm that data virtualization has become a critical asset to any enterprise looking to cope with the increasing data challenges.
“Through 2022, 60% of all organizations will implement data virtualization as one key delivery style in their data integration architecture."
Gartner Market Guide for Data Virtualization, November 16, 2018
How Data Virtualization Works
The centerpiece of a data virtualization application is the so-called virtual or semantic layer. It enables data or business users to manipulate, join, and calculate data independently of its format, source, and physical location, whether on-premises or in the cloud. While all connected data sources and their associated metadata appear in one single user interface, the virtual layer also allows users to organize data in different virtual schemas and virtual views. Users can easily enrich the raw data from the source systems with their business logic and prepare the data for analytics, reporting, or automation processes.
Ideally, this virtual layer also covers data governance and metadata exploration capabilities, although not every tool includes this functionality. With sophisticated user-based permission management, for example, the virtual layer creates a single source of truth across the whole organization in a fully compliant and secure manner. Authorized users can then access all relevant data from one single point in one tool, which avoids the creation of data silos.
Data virtualization normally does not persist data from the source systems (in contrast to data replication approaches such as traditional ETL tools; more about this in a later chapter). It simply stores metadata to feed the virtual views and enables the creation of individual integration logic. One key aspect of data virtualization is the ability to deliver the integrated data in real time to any front end or application, such as business intelligence (BI) tools, microservices, or custom programs. This works by fetching the data in real time from the underlying source systems.
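To make the idea concrete, here is a minimal conceptual sketch in Python of what a virtual layer does: it persists only view definitions (metadata) and fetches data from the underlying sources at query time, so consumers always see the current state. All class, source, and field names here are illustrative, not any vendor's API.

```python
# Minimal conceptual sketch of a virtual layer (illustrative only).
# Sources are plain callables returning live data; the layer stores
# only view definitions (metadata), never the data itself.

class VirtualLayer:
    def __init__(self):
        self.sources = {}   # name -> callable returning current rows
        self.views = {}     # name -> integration logic over sources

    def register_source(self, name, fetch):
        self.sources[name] = fetch

    def create_view(self, name, logic):
        # Only the logic (metadata) is persisted, not the data.
        self.views[name] = logic

    def query(self, view_name):
        # Data is fetched from the source systems at query time.
        live = {name: fetch() for name, fetch in self.sources.items()}
        return self.views[view_name](live)

# Two hypothetical source systems:
crm = [{"customer_id": 1, "name": "Acme"}]
erp = [{"customer_id": 1, "revenue": 1200}]

layer = VirtualLayer()
layer.register_source("crm", lambda: crm)
layer.register_source("erp", lambda: erp)

# A virtual view joining CRM and ERP data on customer_id:
layer.create_view(
    "customer_360",
    lambda s: [
        {**c, **e}
        for c in s["crm"] for e in s["erp"]
        if c["customer_id"] == e["customer_id"]
    ],
)

print(layer.query("customer_360"))
# A change in a source is visible immediately, without any reload:
erp[0]["revenue"] = 1500
print(layer.query("customer_360"))
```

The second query reflects the changed source value straight away, which is the real-time behavior described above.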
Main Advantages of Data Virtualization
Using data virtualization for integrating business data from disparate sources brings a number of benefits:
Real-time data access
- Through immediate data access, all data can be integrated in no time, without much technical knowledge or manual coding effort.
- All desired information is readily available for all kinds of reporting or analysis tools, greatly accelerating and improving decision making.
- Real-time accessibility differentiates data virtualization from other integration approaches and allows rapid prototyping.
Flexibility and simplicity
- Rapid prototyping makes it possible to quickly test processes before implementing them in production environments.
- Since all data sources appear unified in one interface, data virtualization hides the underlying complexity of a heterogeneous data landscape.
- The virtual layer allows business logic to be deployed and easily adjusted to changing needs.
- In contrast to conventional data warehouses, no comprehensive infrastructure is needed, as data can simply be kept in its source systems. This approach is usually cheaper than the traditional ETL way, where data has to be transformed into certain formats before it can be physically moved to storage.
- A change in data sources or front-end solutions does not result in an expensive and complex restructuring but can be accomplished without major efforts.
- With data virtualization, existing (legacy) infrastructures can be integrated and combined with new applications effortlessly. There is no need for costly replacements; instead, it virtually breaks silos by acting as middleware between all systems.
Consistent and secure data governance
- Having just one central access point to all data, instead of multiple access points to each department's systems, enables better user and permission management and full GDPR compliance.
- KPIs and rules can be defined centrally to ensure a company-wide understanding and usage of the most important metrics.
- Global metadata information helps to ensure high data quality and provides a better understanding of enterprise data through data lineage (if provided by the tool) and metadata catalogs. Mistakes can be detected and resolved more quickly than with other data integration approaches.
Shortcomings and Doubts About Data Virtualization
Besides the benefits, there are several shortcomings and doubts:
- There are claims about performance issues. Some even consider the performance of data virtualization poor by definition, because the technology is built to access the source production systems directly rather than a data warehouse or another database designed and optimized for reporting.
- Data virtualization alone is not capable of historizing data. A data warehouse or analytical database is generally needed for this, which is not part of the original concept of data virtualization.
- Data cleansing and/or data transformation is still a complex task in the virtual layer.
- Changes to the virtual data model are associated with increased efforts. Before a change can be fully applied, it has to be accepted by all consuming applications and users.
- The original promise of data virtualization is to retrieve data using a single query language, get speedy query responses, and quickly assemble different data models or views of the data to meet specific needs. However, this promise remains widely unfulfilled.
Data Virtualization Vendors
Data Virtuality is a data integration platform for instant data access, easy data centralization and data governance. The Data Virtuality Logical Data Warehouse solution combines the two distinct technologies, data virtualization and data replication, for a high-performance architecture.
IBM Cloud Pak for Data, formerly known as IBM Cloud Private for Data, is a data and AI platform that helps to collect, organize and analyze data, while utilizing data virtualization.
Denodo offers a data virtualization engine with an associated data catalog feature, thus enabling users to not just combine but also identify and structure existing data.
Informatica PowerCenter is an enterprise data integration platform with features like archiving data out of older applications or an impact analysis to examine structural changes before implementation.
TIBCO’s data virtualization product contains a business data directory to help users with analyses and a built-in transformation engine for unstructured data sources.
Data Virtualization vs. ETL
From a data integration perspective, there are more options out there. Extract, Transform, and Load (ETL) is a very common one. With ETL technologies, data is normalized and replicated in the target storage. This type of data integration is well suited for bulk movement of data and helps businesses migrate data from legacy systems to a new application, or simply populate their data warehouse. A classic enterprise data architecture often uses ETL to move the data from different sources into the data warehouse.
Considering the speed and flexibility demanded in today's business world, this approach can't meet the requirements. Preparing and moving the data into the data warehouse before it can be used slows down the whole process, and the data may already be outdated by the time it is used for analysis and reporting. Real-time access to data isn't possible either.
Data virtualization is distinct from ETL technologies. As data can be prepared and used in real time and doesn't have to be moved, this technology is much faster and requires fewer resources. New data sources can be flexibly added to the virtual layer without significant work. However, data virtualization alone does not scale well. Many data virtualization providers work with caching features to compensate for this issue, but this is only a temporary relief and cannot support all use cases, e.g. where data historization is needed.
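The core difference between the two approaches can be illustrated in a few lines of Python. This is a deliberately simplified sketch: a dictionary stands in for a source system, a copied dictionary for an ETL-loaded warehouse, and a function for a virtual view.

```python
# Illustrative contrast between ETL-style replication and virtual access.
source = {"orders": 100}        # stand-in for a source system

# ETL: extract and persist a copy at load time.
warehouse_copy = dict(source)

# Virtualization: resolve against the source at query time.
def virtual_view():
    return dict(source)

source["orders"] = 150          # the source system changes

print(warehouse_copy["orders"]) # stale until the next ETL run
print(virtual_view()["orders"]) # always reflects the current source
```

The replicated copy serves queries without touching the source (which is why ETL scales well for heavy reporting) but goes stale between loads; the virtual view is always fresh but hits the source on every query.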
The ideal solution is the combination of all three technologies, realized in the Logical Data Warehouse.
The Logical Data Warehouse
The Logical Data Warehouse combines data virtualization, caching, and materialization, thereby enabling breakaway flexibility and performance. It represents an entirely new paradigm in the way we think about, manage, and work with data. Typically, the Logical Data Warehouse can be used with a single query language such as SQL, enabling speedy query responses and the quick assembly of different data models or views of the data to meet specific needs. Physical data integration is also a robust feature of the Logical Data Warehouse: it ensures fast query responses while decoupling performance from the source data stores and moving it to the logical data warehouse repository. In this manner, the effort-intensive physical transfer of the data is minimized and simplified, effectively removing lengthy data movement delays from the critical path of data integration projects. The final result is easy data access without fundamentally changing the existing environment.
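The caching and materialization side of this combination can be sketched as follows: a view's result is persisted in the LDW repository and refreshed only when it expires, so repeated queries no longer hit the source systems. This is a minimal sketch of the general technique, not any product's implementation; the TTL policy and names are assumptions.

```python
import time

# Illustrative sketch of a materialized (cached) view with a TTL,
# as an LDW might use to decouple query load from the source systems.

class MaterializedView:
    def __init__(self, compute, ttl_seconds):
        self.compute = compute      # integration logic over the sources
        self.ttl = ttl_seconds
        self.data = None            # persisted result (the materialization)
        self.loaded_at = 0.0

    def query(self):
        # Serve from the local copy; recompute only when it has expired.
        if self.data is None or time.time() - self.loaded_at > self.ttl:
            self.data = self.compute()
            self.loaded_at = time.time()
        return self.data

calls = {"n": 0}
def expensive_source_query():
    # Stand-in for a slow federated query against the sources.
    calls["n"] += 1
    return [("2024-01", 42)]

view = MaterializedView(expensive_source_query, ttl_seconds=3600)
view.query()
view.query()        # served from the materialization; source not hit again
print(calls["n"])   # the source was queried only once
```

Queries within the TTL are decoupled from the sources entirely, which is the performance decoupling described above; a purely virtual view would instead set the TTL to zero.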
The renowned analyst firm Gartner states:
"Technical professionals should adopt the Logical Data Warehouse as the modern, 'next-gen' data warehouse that uses a multiengine approach to fulfill conflicting demands." (Adopt the LDW Architecture to Meet Modern Analytical Needs, April 2018, Gartner)
The range of features offered by the different providers varies quite a bit and constantly evolves. Today's innovative features include query pushdown, query optimization, caching, job automation, data lineage, metadata catalogs, and AI capabilities, to name a few.
How the Logical Data Warehouse works
1. CONNECT YOUR DATA SOURCES
The Logical Data Warehouse (LDW) connects to multiple data sources and allows data to be queried from there using SQL. Data sources can be relational or non-relational; the source format does not matter.
2. CREATE A CENTRAL DATA LOGIC
The LDW enables you to integrate your data and create a central data logic that covers the business logic and the logical connections between the different systems. This layer can easily be implemented by using SQL.
3. MAKE YOUR DATA ACCESSIBLE
Finally, the LDW supports the standard interfaces (JDBC, ODBC, REST) to deliver data to the data consumers. This could be reporting tools, advanced analytics tools, or custom programs in various programming languages.
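The three steps above can be imitated in miniature with Python's built-in sqlite3 module, where `ATTACH DATABASE` stands in for connecting multiple sources and a single SQL statement joins across them. This is a toy stand-in for a real LDW, not an actual federation engine; the table and database names are invented.

```python
import sqlite3

# Toy illustration of the three LDW steps using Python's built-in sqlite3:
# two "source systems" are attached and joined with one SQL statement.
conn = sqlite3.connect(":memory:")                   # the logical layer
conn.execute("ATTACH DATABASE ':memory:' AS crm")    # step 1: source 1
conn.execute("ATTACH DATABASE ':memory:' AS erp")    # step 1: source 2

conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE erp.orders (customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO crm.customers VALUES (1, 'Acme')")
conn.execute("INSERT INTO erp.orders VALUES (1, 99.5), (1, 0.5)")

# Step 2: central data logic as one SQL statement spanning both sources.
row = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM crm.customers c
    JOIN erp.orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchone()

# Step 3: the result is delivered through a standard database API.
print(row)  # ('Acme', 100.0)
```

In a real LDW, the attached schemas would be live external systems (databases, APIs, files) rather than SQLite databases, but the consumer experience is the same: one endpoint, one query language.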
Logical Data Warehouse Use Cases
In general, the business value of the combined platform is greater than the mere sum of the individual technologies. That is ultimately why the Logical Data Warehouse is a perfect fit for a vast number of use cases.
Big Data, Predictive Analytics and Machine Learning (ML)
In order to apply machine learning models to transactional and operational data, companies have to deal with huge amounts of data. Furthermore, this data usually lies in different sources, which makes the challenge even more complex. Since data is the main ingredient on which machine learning algorithms are trained, traditional data warehousing systems are inefficient and difficult to scale: they would get stuck copying data from CSV files into a central place and formatting it even before the building of the predictive models could start. The Logical Data Warehouse helps to integrate the needed data faster and more efficiently. To achieve real-time functionality, companies must combine the traditional data warehouse with modern big data tools, often combining multiple technologies. Unifying these data sources into one common view provides instant access to a 360-degree view of the organization.
Recent changes led by digital transformation and increasing regulatory requirements put the financial industry under pressure to reduce costs and become more efficient in an era where data is the world's most valuable resource. For data architects and data management offices, the Logical Data Warehouse can help to break data silos and create a future-proof single source of truth. In the past, enterprises tried to build a single source of truth in Hadoop/data lakes, but they created even more data silos and data fragmentation. With the Logical Data Warehouse, different operative as well as analytical data silos can be joined. The joined data is then accessible in a central data access layer to various data consumers on the business side, such as data analysts and data scientists, all while data governance, data lineage, and data security remain in place.

For more digitally driven financial services, the real-time connection can be used to enable real-time loan approval processes. Speed is an important factor here, since providing quicker loan approvals gives financial institutions a competitive advantage. In the Logical Data Warehouse, credit scores and (pre-)approval decisions can be calculated with the help of procedural SQL, and during the customer meeting, loan applications can be directly processed online and evaluated. Further use cases for financial services can be found here: Logical Data Warehouse enables a flexible data supply chain for financial services institutions.
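A pre-approval calculation of the kind mentioned above might look as follows. This is a hypothetical, rule-based sketch in Python (standing in for the procedural SQL a Logical Data Warehouse would run over joined CRM and transaction data); all fields, thresholds, and weights are invented for illustration.

```python
# Hypothetical rule-based credit pre-approval check (illustrative only).
# Fields, thresholds, and weights are invented, not a real scoring model.

def pre_approval(applicant):
    score = 0
    if applicant["annual_income"] >= 40_000:
        score += 40
    if applicant["years_as_customer"] >= 2:
        score += 30
    if not applicant["has_defaults"]:
        score += 30
    return {"score": score, "approved": score >= 70}

print(pre_approval({
    "annual_income": 55_000,
    "years_as_customer": 3,
    "has_defaults": False,
}))
# {'score': 100, 'approved': True}
```

Because the LDW joins the underlying sources in real time, such a check can run on current data during the customer meeting rather than against a nightly snapshot.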
Risk Data Management
In the aftermath of the financial crisis, a stricter regulatory approach to capital adequacy and risk management has created an increasingly intricate reporting environment for the financial services industry around the globe. Several new regulations such as BCBS 239 and Solvency II were introduced. Faced with numerous challenges, financial institutions need a reporting architecture that is both scalable and flexible and provides a transparent control mechanism in a strong data quality framework. To realize all these requirements, the Logical Data Warehouse can be the architectural foundation in modern approaches.
Customer Data Platform (CDP)
Understanding customers’ and prospects’ behavior as individual entities is the key idea of a comprehensive data-driven marketing concept, called Customer Data Platform. Built well, these platforms will integrate all customer-related data on an individual level in one single place. In this way, marketers have the data foundation to understand each customer, improve audience segmentation and campaign planning, streamline cross-channel marketing orchestration and optimize analytics efforts. The Logical Data Warehouse can be the foundation of such a state-of-the-art CDP.
E-Commerce

E-commerce is one of the most competitive industries in existence. In order to stay ahead of the competition, effective and trusted decision-making is required. Data virtualization enables businesses to generate a 360° view of all data and processes by centralizing all product, customer, and marketing data in one single source of truth. It helps to gain deeper insights quickly, without limitations, and can improve cross-channel attribution through more insightful data. For example, a typical e-commerce business has an ERP system, a CRM, web and mobile apps, email analytics programs, online marketing, social media marketing, and other tools. With a Logical Data Warehouse, all these data sources can be joined quickly and flexibly to provide comprehensive views of any data related to customers, products, and more.
Virtual Data Mart
A Logical Data Warehouse makes it easy to create a virtual data mart for expediency. By combining an organization's primary data infrastructure with auxiliary data sources relevant to specific data-driven business units, initiatives can move forward more quickly than if the data had to be onboarded to a traditional data warehouse first.
Modern agile businesses like to experiment with new business ideas and models, mostly backed by data both to implement the initiative and to measure its success. A flexible system is therefore needed to test, adjust, and implement new ideas. With the Logical Data Warehouse in use, the data virtualization component enables quick setup and easy iteration, while the data materialization capabilities make it easy to move data to production as needed. A built-in recommendation engine (where available) analyzes the usage of the prototypical data and makes suggestions on how to optimally store the data for productive use, including automatic database index creation and other optimizations.