
A short guide on data replication types and use cases: Batch Replication


We’re continuing our series on data replication types. In this blog post, we’re talking about batch replication and its use cases.

What is batch replication?

The purpose of batch replication is to append all data from the source view/table to the target table, with the option of preventing duplicate records. Batch replication is easier to implement than incremental replication (but less flexible) and more powerful than complete replication. It also differs from both of them in several ways:

  • Old rows in the target table can be kept even if they were deleted in the source table
  • Updates can be performed on current rows, and inserted rows will be added to the target table
  • No timestamp columns are needed to get new rows because all rows will be retrieved from the source
  • You can decide whether or not to allow duplicates in case there are updates on the rows
  • You can also decide how to name the target table; there are no separate stages to distinguish, just one table containing the data
Figure: Batch replication with an identity field
Figure: Batch replication without an identity field

Use Case for Batch Replication

The option to delete duplicate records comes in handy if you work with an identity field and don’t want to track the changes. For instance, say you have a list of customers with information such as name, age, CustomerID, and date of birth, and you want to update all information except the age column because it is not important to you. With batch replication you can do exactly that, as the sketch below illustrates.
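As a rough illustration, the following PostgreSQL-style SQL sketches this customer scenario. The schema and column names (src.customers, dwh.customers, CustomerID, and so on) are hypothetical, and this is a conceptual sketch rather than Data Virtuality’s actual job syntax:

    -- Hypothetical sketch: refresh every column except age for
    -- customers that already exist in the target, and append new ones.

    -- 1) Update existing rows, leaving the age column untouched.
    UPDATE dwh.customers AS t
       SET name          = s.name,
           date_of_birth = s.date_of_birth
      FROM src.customers AS s
     WHERE t.CustomerID = s.CustomerID;

    -- 2) Insert rows whose CustomerID is not yet in the target.
    INSERT INTO dwh.customers (CustomerID, name, age, date_of_birth)
    SELECT s.CustomerID, s.name, s.age, s.date_of_birth
      FROM src.customers AS s
     WHERE NOT EXISTS (
             SELECT 1
               FROM dwh.customers AS t
              WHERE t.CustomerID = s.CustomerID
           );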

If, on the other hand, you are working with financial transactions, you might not want to delete any old information when you copy new information to the database, so that you can track the history of a particular account. This, again, is something you can do through batch replication.

Batch Replication on the Data Virtuality Platform 

On the Data Virtuality Platform, you can do batch replication in two ways.

Without the use of an Identity Field:

  • When you don’t use an identity field, all data from the source view/table is appended to whatever already exists in the target table, and this happens on every run of the job.
  • This method is only useful if the source table or view always delivers new or updated information (for example, today’s Google Analytics data). Otherwise, the target table fills up with duplicated rows and becomes hard to use in a meaningful manner. A sketch of this mode follows below.
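For illustration only, this append-only mode boils down to a plain insert in generic SQL; the names src.ga_sessions_today and dwh.ga_sessions are made up for the example and do not reflect the platform’s actual job definition:

    -- Hypothetical sketch: every run appends the full source extract
    -- to the target, duplicates and all.
    INSERT INTO dwh.ga_sessions
    SELECT *
      FROM src.ga_sessions_today;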

With the use of an Identity Field:

  • To avoid the duplicates mentioned above in the first case, a column with a unique identifier (primary key) needs to be designated. This helps you keep a clean record of the changes while still allowing new rows to be added.
  • With an identity field, the process works as follows (see the sketch after this list):
    • On the first run, the whole table is retrieved from the source with its complete data.
    • On subsequent runs, target rows whose identity values reappear in the newly retrieved data are deleted from the target table, never from the source table; otherwise the data in the analytical storage would become stale.
    • This ensures that all retrieved data can be added to the target table without creating duplicates.
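A minimal generic-SQL sketch of the identity field variant, again with hypothetical names (dwh.orders, src.orders, order_id) rather than Data Virtuality’s actual syntax:

    -- Hypothetical sketch: first delete target rows whose identity
    -- values are about to be re-delivered, then append the full
    -- source extract, so each order_id ends up in the target once.
    DELETE FROM dwh.orders
     WHERE order_id IN (SELECT order_id FROM src.orders);

    INSERT INTO dwh.orders
    SELECT *
      FROM src.orders;

Deleting before inserting keeps only the latest version of each row; skipping the delete step gives the history-preserving behavior described in the financial transactions use case above.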

How can Data Virtuality help you?

Our platform uniquely combines data virtualization and data replication, giving data teams the flexibility to always choose the right method for their specific requirements. It provides high-performance real-time data integration through smarter query optimization, as well as data historization and master data management through advanced ETL, in a single tool.
