While most data analysts were busy exploring the progression from relational databases to Cubes, analytic databases, and data lakes, another camp was looking into using data federation to integrate data for analysis.
A New Approach
Data federation allows analysts to instantly run queries that join multiple disparate databases without the need to copy or move data from the original operational sources to a central analytical repository. This approach is clearly a significant improvement on all of its predecessors in terms of the immediacy with which data can be analyzed.
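The idea can be sketched with SQLite's `ATTACH` feature, which lets one connection query across separate database files in place. This is a minimal illustration, not a production federation engine; the table names and data below are hypothetical.

```python
import os
import sqlite3
import tempfile

# Two "operational" databases that stay where they are (hypothetical data).
workdir = tempfile.mkdtemp()
sales_path = os.path.join(workdir, "sales.db")
crm_path = os.path.join(workdir, "crm.db")

sales = sqlite3.connect(sales_path)
sales.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
sales.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
sales.commit(); sales.close()

crm = sqlite3.connect(crm_path)
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
crm.commit(); crm.close()

# Federated query: ATTACH exposes the second database under an alias, so a
# single query can join both sources without copying data into a central store.
fed = sqlite3.connect(sales_path)
fed.execute(f"ATTACH DATABASE '{crm_path}' AS crm")
rows = fed.execute(
    "SELECT c.name, SUM(o.amount) "
    "FROM orders o JOIN crm.customers c ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(rows)  # [('Acme', 100.0), ('Globex', 50.0)]
```

The join runs the moment the query is issued; no extract-and-load step sits between the operational sources and the analyst.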
While the idea is sound and the value is self-evident, data federation alone isn't scalable for large amounts of data, nor for large numbers of simultaneous users. In addition, because it relies heavily on the speed and stability of the source systems and the network, its performance commonly suffers for both data analysis and production operations. So, while data federation is quick and flexible, on its own it is neither scalable nor particularly dependable. But it was an important step in the right direction.
The next stage of evolution was to combine data federation with caching repositories to address these issues. This hybrid approach used big data solutions to complement data warehousing. The result is a combination of repositories, virtualization, and distributed processes for data management that delivers the best capabilities of several technologies but still falls short of the expectation of a robust, agile, and performant data warehouse. Caching can be problematic for two reasons: cache loads must be scheduled around the performance concerns of the source systems, and the cache is loaded into a single repository that may or may not be optimized for different data sets and/or data types.
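The trade-off described here can be sketched as a cache sitting in front of a federated query function. All names and interfaces below are hypothetical: the point is that a cached answer spares the source systems but can go stale until the next refresh.

```python
import time

class CachingFederator:
    """Minimal sketch: a TTL cache in front of a federated query engine."""

    def __init__(self, run_federated_query, ttl_seconds=3600):
        self.run = run_federated_query   # callable that hits the live sources
        self.ttl = ttl_seconds           # refresh interval (the "schedule")
        self.cache = {}                  # query text -> (timestamp, rows)

    def query(self, sql):
        hit = self.cache.get(sql)
        if hit and time.time() - hit[0] < self.ttl:
            # Served from the cache: no load on source systems,
            # but the data may be stale until the TTL expires.
            return hit[1]
        rows = self.run(sql)             # falls through to the sources
        self.cache[sql] = (time.time(), rows)
        return rows

# Demonstration with a stand-in for the real source systems.
calls = []
def fake_sources(sql):
    calls.append(sql)
    return [("Acme", 100.0)]

fed = CachingFederator(fake_sources, ttl_seconds=60)
first = fed.query("SELECT name, total FROM sales")   # hits the sources
second = fed.query("SELECT name, total FROM sales")  # answered from cache
print(len(calls))  # 1 -- the sources were touched only once
```

Tuning `ttl_seconds` is exactly the scheduling problem the text describes: a short TTL keeps data fresh but hammers the sources; a long one protects them but serves stale results.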
Still, in moving closer to modern data warehouses, virtual data technology is essential. Data federation provided an important stepping stone towards data virtualization in the sense that it popularized the notion of virtual views, indices, and semantics. It also introduced the then-radical idea that data need not be physically copied or relocated before it is accessed. In addition, virtual views can be altered without the need to transform and reload data, as in earlier data warehouse integration approaches, meaning that changes can be presented immediately, without waiting for the data to populate through an overnight process. It is the virtualization of data integration that enables extreme agility in analytical development and significantly reduces build times and costs, all of which leads us to the next breakthrough in data warehousing.
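The "alter the view, not the data" point can be shown in a few lines of SQL. This is a toy sketch with a hypothetical schema; the underlying table is never transformed or reloaded, yet the redefined view is visible immediately.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (region TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("EU", 10.0), ("US", 20.0), ("EU", 5.0)])

# Initial virtual view: a simple pass-through over the operational table.
db.execute("CREATE VIEW sales_v AS SELECT region, amount FROM orders")
detail = db.execute("SELECT COUNT(*) FROM sales_v").fetchone()[0]  # 3 rows

# A business rule changes: redefine the view in place. No ETL job runs,
# no data moves -- the new shape is queryable the moment it is created.
db.execute("DROP VIEW sales_v")
db.execute("CREATE VIEW sales_v AS "
           "SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
summary = db.execute("SELECT * FROM sales_v ORDER BY region").fetchall()
print(summary)  # [('EU', 15.0), ('US', 20.0)]
```

Contrast this with the earlier warehouse approaches described above, where the same change would mean rewriting a transformation and waiting for an overnight load.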