Ad hoc data changes are often a necessary evil in modern enterprise applications. Right-thinking development professionals hate them: they are dangerous, they are often implemented by people with only a partial understanding of the system, and they are addictive. A business can become addicted to resolving data issues with “quick” scripts. More often than not, hastily constructed, ad hoc data fix scripts introduce entirely new side effects.
In an Event Driven Architecture, data changes are distributed to interested parties via durable messaging. We like the service bus pattern for this. For any data entity, there still remains a single source of record, but there may be many consumers of that data, some of which will replicate the data locally in order to reduce coupling between the components.
I know many of you consider data replication bad; this is mainly because developers are still trained to be efficient with data storage. However, with the advent of cheaper resilient data storage and greater decoupling concerns, the pattern often becomes very attractive. Right now, the most common use of data replication is in reporting with data warehouses and data marts. With an Event Driven Architecture we find many opportunities for using it within our applications as well. For example, an order processing subsystem can “listen” to customer update messages in order to maintain a local copy of customer names and shipping addresses. Decoupling like this allows the customer system to shut down at any time without affecting the processing of orders. However, the customer maintenance subsystem is still the source of record for all customer-related data.
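The order-processing example above can be sketched in a few lines. This is a minimal, in-memory illustration of a downstream consumer maintaining its local replica; the message shape and handler name are assumptions, not the API of any particular service bus.

```python
# Local replica held by the order processing subsystem, keyed by customer id.
local_customers = {}

def handle_customer_updated(message):
    """Apply a customer update message to the local replica."""
    local_customers[message["customer_id"]] = {
        "name": message["name"],
        "shipping_address": message["shipping_address"],
    }

# A message arriving from the durable queue (illustrative data):
handle_customer_updated({
    "customer_id": 42,
    "name": "Acme Ltd",
    "shipping_address": "1 High Street",
})

# Orders can now use the local copy even if the customer system is offline.
print(local_customers[42]["name"])  # Acme Ltd
```

Because the replica is populated entirely from messages, the order subsystem never queries the customer system directly, which is exactly the decoupling described above.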
Ad hoc data changes are not very compatible with a distributed data model like this. Using a SQL script to change data in the source of record can introduce synchronization issues for the downstream subsystems, which only see changes made through the application. Their locally stored data can drift out of sync with the source of record.
Traditionally, when a replicated data store becomes out of sync with the source of record, it is deleted and rebuilt from scratch. This is common with many ETL patterns. However, with very large or slowly changing data this can be very inefficient and time consuming. A more efficient pattern is for an application to explicitly identify records that have been manually modified by SQL script, so that when synchronization becomes necessary it is easy to identify what data needs to be redistributed downstream. A very simple implementation might be to add a true/false manual-update column to every source of record table in a data store. Adding metadata columns to source of record tables is already a common pattern in auditing and change tracking, the most common example being a “last modified” column for tracking when a record was last updated.

This works for most data changes, except for deletes. Some applications never completely delete records, merely marking them as deleted or migrating them to archive storage, but tracking manually deleted entities for the purposes of downstream synchronization is challenging. Whatever solution is adopted to track ad hoc data changes, firm discipline is required from implementers to ensure manual modifications conform to the pattern. This implies that very strong data governance is required for this mechanism to work; on the other hand, that is largely true for any metadata columns in a table.
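The manual-update flag might look something like the following sketch, using SQLite for illustration; the table and column names are assumptions. The key point is that the ad hoc fix script must set the flag alongside the data change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        id            INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        last_modified TEXT NOT NULL,        -- common audit column
        manual_update INTEGER NOT NULL DEFAULT 0  -- true/false flag
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd', '2024-01-01', 0)")

# An ad hoc fix script must set the flag as well as the data:
conn.execute("""
    UPDATE customer
    SET name = 'Acme Limited',
        last_modified = '2024-02-01',
        manual_update = 1
    WHERE id = 1
""")

# Later, synchronization can select exactly the manually changed rows:
rows = conn.execute(
    "SELECT id, name FROM customer WHERE manual_update = 1").fetchall()
print(rows)  # [(1, 'Acme Limited')]
```

This is where the discipline mentioned above bites: a fix script that forgets to set `manual_update` leaves an invisible drift between the source of record and its downstream replicas.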
An alternative and possibly safer approach is to implement a change logging mechanism. Every action taken against the local data store is logged as a synchronization record. When synchronization is required, the synchronization records are played back to reproduce the changes.
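A minimal sketch of such a change log follows; the record shape and function names are assumptions. Note that, unlike the flag-column approach, a log can also capture deletes, since the record of the delete survives the row itself.

```python
change_log = []

def log_change(action, entity, entity_id, data=None):
    """Record an ad hoc modification as a synchronization record."""
    change_log.append({"action": action, "entity": entity,
                       "id": entity_id, "data": data})

# An ad hoc fix and a delete, logged as they happen:
log_change("update", "customer", 1, {"name": "Acme Limited"})
log_change("delete", "customer", 7)

def replay(log):
    """Play the log back as downstream synchronization messages."""
    return [{"type": f"{r['entity']}.{r['action']}",
             "id": r["id"], "data": r["data"]} for r in log]

for message in replay(change_log):
    print(message["type"])  # customer.update, then customer.delete
```

In a real system the log would live in a durable table and playback would publish to the bus, but the shape of the mechanism is the same.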
Once it’s possible to identify manually modified data, it’s relatively simple to redistribute it. My own preference is for the sync-request pattern. In this pattern, downstream data consumers periodically request from the source of record a digest of all manually changed data entities since some point in time (typically the last such request). The source of record queries and composes the data for all the manually changed entities from the requested time and then “publishes” it back via messages. This pattern can be implemented synchronously (request/response) or asynchronously. Implementing the mechanism synchronously is probably easier, but implementing it asynchronously preserves more of the decoupling that motivated the architecture in the first place.
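The request-handling side of the sync-request pattern can be sketched as follows. The in-memory table, field names, and handler are illustrative assumptions; the transport (bus, HTTP, etc.) is deliberately left out.

```python
from datetime import datetime

# Stand-in for the source-of-record table, with the metadata columns
# discussed earlier (names are assumptions).
source_of_record = [
    {"id": 1, "name": "Acme Limited", "manual_update": True,
     "last_modified": datetime(2024, 2, 1)},
    {"id": 2, "name": "Globex", "manual_update": False,
     "last_modified": datetime(2024, 1, 5)},
]

def handle_sync_request(since):
    """Compose all manually changed entities since `since` for publishing."""
    return [row for row in source_of_record
            if row["manual_update"] and row["last_modified"] >= since]

# A downstream consumer asks for changes since its last request:
changes = handle_sync_request(since=datetime(2024, 1, 15))
print([row["id"] for row in changes])  # [1]
```

The returned entities would then be “published” back to the requester as ordinary messages, so the consumer corrects its replica through the same channel it uses for normal updates.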
An alternative to the sync-request pattern is the sync-publish pattern. Periodically, the source of record queries for entities manually modified since the last such query, publishing the resulting entity data via messages that any interested downstream subsystem can consume. This has the benefit of being less complex, but it relies on the source of record to publish changes on a schedule; with the sync-request pattern, the source of record can be induced to synchronize its data on demand. For flexibility, it might be wise to construct a hybrid implementation supporting both patterns.
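A hybrid is cheap to build because both patterns share the same core routine, as this sketch shows; the names and message shapes are assumptions. Only the trigger differs: a schedule for sync-publish, an incoming request for sync-request.

```python
from datetime import datetime

bus = []  # stand-in for the durable message bus

records = [
    {"id": 1, "manual_update": True,  "last_modified": datetime(2024, 2, 1)},
    {"id": 2, "manual_update": False, "last_modified": datetime(2024, 2, 2)},
]

def publish_manual_changes(since):
    """Query and publish entities manually modified after `since`."""
    for row in records:
        if row["manual_update"] and row["last_modified"] > since:
            bus.append({"type": "customer.sync", "id": row["id"]})

# Sync-publish: the source of record runs this on a schedule.
publish_manual_changes(since=datetime(2024, 1, 1))

# Sync-request: the same routine serves an on-demand request from downstream.
def handle_sync_request(request):
    publish_manual_changes(since=request["since"])

print(len(bus))  # 1
```

Record 2 changed recently but was not an ad hoc edit, so nothing is published for it; only the manually modified entity is redistributed.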
The more I consider these problems, the more I believe a message-based synchronization correction mechanism becomes a necessity. There are additional uses for this pattern beyond resolving ad hoc data changes.