We’re continuing our series on data replication types. In this blog post, we’re talking about batch replication and its use cases.
What is batch replication?
The purpose of batch replication is to append all data from the source view or table to the target table, with the option of preventing duplicate records. Batch replication is easier to implement than incremental replication (though less flexible) and more powerful than complete replication. It also differs from both of them in a number of ways:
- Old rows in the target table can be kept even if they were deleted in the source table
- Existing rows can be updated, and newly inserted source rows are added to the target table
- No timestamp columns are needed to get new rows because all rows will be retrieved from the source
- You can decide whether or not to allow duplicates in case there are updates on the rows
- You can also decide how to name the target table; there are no separate stages to distinguish, just a single table containing the data
Use Case for Batch Replication
The option to delete duplicate records comes in handy if you work with an identity field and don’t want to track changes. For instance, suppose you have a list of customers with information such as name, age, CustomerID, and date of birth. You want to update all information except for the age column because it is not important to you. With batch replication, you can do exactly that.
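As a rough illustration (plain Python, not the platform’s own API; the field names and values are hypothetical), keying the target on CustomerID means a re-sent record replaces the old one rather than piling up as history:

```python
# Hypothetical customer table keyed on the identity field CustomerID:
# re-running the replication replaces the old record instead of
# accumulating multiple versions of it.

customers = {}  # target table, keyed by the identity field

def load_batch(batch):
    """Load one replication batch; matching keys overwrite old rows."""
    for row in batch:
        customers[row["CustomerID"]] = row  # overwrite, don't version

# First run delivers the original record.
load_batch([{"CustomerID": 7, "name": "Eve", "dob": "1990-05-01", "age": 33}])
# A later run delivers an updated record for the same customer.
load_batch([{"CustomerID": 7, "name": "Eve Adams", "dob": "1990-05-01", "age": 33}])

# Only the latest version is kept -- no change history is tracked.
print(len(customers))        # 1
print(customers[7]["name"])  # Eve Adams
```

Because the identity key deduplicates on every run, the target always holds exactly one current row per customer.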
If, on the other hand, you are working with financial transactions, you might not want to delete any old information when you copy new information to the database, so that you can track the history of a particular account. This, again, is something you can do with batch replication.
Batch Replication on the Data Virtuality Platform
On the Data Virtuality Platform, you can do batch replication in two ways.
Without the use of an Identity Field
- When you don’t use an identity field, all data from the source view or table is appended to whatever already exists in the target table, and this happens on every job run.
- This method is only useful if the source table or view always contains updated or new information (for example, today’s Google Analytics data). Otherwise, the target table will accumulate many duplicated rows, making the data hard to use in a meaningful way.
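The behavior described above can be sketched in a few lines of plain Python (not the platform’s own API; names are illustrative). Without an identity field, every run simply concatenates the full source snapshot onto the target:

```python
# Minimal sketch of batch replication WITHOUT an identity field:
# every run appends the full source snapshot to the target table,
# so rows that were already copied reappear as duplicates.

def replicate_append(source_rows, target_rows):
    """Append all source rows to the target (duplicates allowed)."""
    target_rows.extend(dict(row) for row in source_rows)
    return target_rows

target = []

# Run 1: source holds one row.
day1 = [{"id": 1, "clicks": 10}]
replicate_append(day1, target)

# Run 2: source still holds row id=1, plus one new row.
day2 = [{"id": 1, "clicks": 10}, {"id": 2, "clicks": 5}]
replicate_append(day2, target)

# Row id=1 now appears twice -- the duplication the text warns about.
print(len(target))  # 3
```

This is why the method only suits sources whose content is always fresh on each run.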
With the use of an Identity Field
- To avoid the duplicates described in the first case, a column with a unique identifier (a primary key) needs to be chosen. This helps you keep a clear record of changes, whilst also allowing new rows to be added.
- With an identity field, the process works differently:
- On the first run, the whole table is retrieved from the source with the complete data.
- On subsequent runs, target rows whose identity values reappear in the incoming data are deleted from the target table (never from the source table); otherwise, the analytical storage would keep stale copies of updated rows.
- This ensures that all retrieved data can be added to the target table without creating duplicates.
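The steps above can be sketched as a delete-then-insert pass (again plain Python, not the platform’s own API; field names are hypothetical):

```python
# Minimal sketch of batch replication WITH an identity field:
# before inserting, delete target rows whose identity value also
# appears in the incoming batch, then append everything. Updated
# rows replace their stale copies, new rows are added, and rows
# deleted in the source survive in the target.

def replicate_with_identity(source_rows, target_rows, id_field):
    incoming_ids = {row[id_field] for row in source_rows}
    # Keep only target rows that are NOT about to be re-inserted.
    kept = [r for r in target_rows if r[id_field] not in incoming_ids]
    kept.extend(dict(row) for row in source_rows)
    return kept

target = [
    {"CustomerID": 1, "name": "Ada", "age": 36},
    {"CustomerID": 2, "name": "Bob", "age": 41},
]
# Customer 1 was updated, customer 3 is new; customer 2 is absent
# from the batch but is kept in the target.
batch = [
    {"CustomerID": 1, "name": "Ada Lovelace", "age": 36},
    {"CustomerID": 3, "name": "Cleo", "age": 29},
]
target = replicate_with_identity(batch, target, "CustomerID")

print(sorted(r["CustomerID"] for r in target))  # [1, 2, 3]
```

Note how each identity value occurs exactly once afterwards: no duplicates, yet no source-side deletions are propagated.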
How can Data Virtuality help you?
Our platform uniquely combines data virtualization and data replication, giving data teams the flexibility to always choose the right method for each specific requirement. It provides high-performance, real-time data integration through smarter query optimization, as well as data historization and master data management through advanced ETL, all in a single tool.