Occasionally I’ll get questions from people who have been going down the CQRS path about why I’m so against data duplication. Aren’t the performance benefits of a denormalized view model justified, they ask. This is even more pronounced in geographically distributed systems where the “round-trip” may involve going outside your datacenter over a relatively slow link to another site.
CQRS
As has been said several times before by many others, it’s not the denormalized view model that defines CQRS.
One of the things that sometimes surprises people after going through my course is that in most cases you don’t need a denormalized view model, or at least, not the kind you think. Yes, that’s right: MOST cases.
But I don’t want to get too deep into the CQRS thing in this post – that can wait.
SOA
The big thing I’m against is raw business data being duplicated between services.
Data that can be expected to be accessible in multiple services includes things like identifiers, status information, and date-times. These date-times are used to anchor the status changes in time so that our system will behave correctly even if data/messages are processed out of order. Not all status information necessarily needs to be anchored in time explicitly – sometimes this can be implicit to the context of a given flow through the system.
For example, the Amazon.com checkout workflow.
In that flow, if you provide a shipping address that is in the US, you are presented with one set of options for shipping speed, whereas an international address will lead you to a different set of options.
Assuming that the address information of the customer and the shipping speed options are in different services, we need to propagate the status InternationalAddress(true/false) between these services in that same flow. In this case, there isn’t a need to explicitly anchor that status in time.
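To make the idea of sharing only identifiers, statuses, and date-times a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the message name AddressStatusChanged, the ShippingOptions class, and the shipping speed names are hypothetical and not taken from any real system. It shows the explicit, time-anchored variant, where a late-arriving older message is ignored; in the checkout flow above the timestamp could be dropped because the flow itself provides the ordering.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AddressStatusChanged:
    """Hypothetical message: an identifier, a status, and a date-time.

    Note that no raw address fields cross the service boundary.
    """
    order_id: str
    international_address: bool
    occurred_at: datetime


class ShippingOptions:
    """Hypothetical shipping-side handler keeping the latest status per order."""

    def __init__(self) -> None:
        self._latest = {}

    def handle(self, message: AddressStatusChanged) -> None:
        current = self._latest.get(message.order_id)
        # Only accept the message if it is newer than what we already know,
        # so out-of-order delivery cannot overwrite a later status.
        if current is None or message.occurred_at > current.occurred_at:
            self._latest[message.order_id] = message

    def speeds_for(self, order_id: str) -> list:
        # The option names below are made up for the example.
        if self._latest[order_id].international_address:
            return ["International Standard", "International Priority"]
        return ["Standard", "Two-Day", "One-Day"]


shipping = ShippingOptions()
newer = AddressStatusChanged(
    "order-1", True, datetime(2010, 1, 1, 12, 5, tzinfo=timezone.utc))
older = AddressStatusChanged(
    "order-1", False, datetime(2010, 1, 1, 12, 0, tzinfo=timezone.utc))

shipping.handle(newer)
shipping.handle(older)  # arrives late and is ignored
options = shipping.speeds_for("order-1")
```

The shipping side never sees a street or a postal code, so there is nothing there for address-related logic to attach itself to later on.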
But what’s so bad about duplication of data between services?
The danger is that functionality ultimately follows raw business data.
You start with something small like having product prices in the catalog service, the order service, and the invoice service. Then, when you get requirements around supporting multiple currencies, you now need to implement that logic in multiple places, or create a shared library that all the services depend on.
These dependencies creep up on you slowly, tying your shoelaces together, gradually slowing down the pace of development, undermining the stability of your codebase where changes to one part of the system break other parts. It’s a slow death by a thousand cuts, and as a result nobody is exactly sure what big decision we made that caused everything to go so bad.
That’s the thing, it wasn’t viewed as a “big decision” but rather as just one “pragmatic choice” for that specific case. The first one excuses the second, which paves the way for the third, and from that point on, it’s a “pattern” – how we do things around here; the proverbial slippery slope.
So what’s with the word “Replication” in the title of this post?
While data duplication between services is very dangerous, replication of business data WITHIN a service is perfectly alright.
Let’s get back into multi-site scenarios, like a retail chain that has a headquarters (HQ) and many stores. Prices are pushed out from the HQ and orders are pushed back from the stores according to some schedule.
We know that we can’t guarantee a perfect connection between all stores and the HQ at all times, therefore we copy the prices published from the HQ and store them locally in the store. Also, since we want to perform top-level analytics on the orders made at the various stores, that would be best done by having all of those orders copied locally at the HQ as well.
We should not view this movement of data from one physical location to another as duplication, but rather as replication done for performance reasons. If there were some magical always-on zero-latency network that existed, we wouldn’t need to do any of this replication.
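As a rough illustration of replication within a single service, the sketch below shows the store-side part of a pricing service keeping a local copy of the prices published by its HQ-side part. The names PricePublished and StorePriceReplica, and the use of a version number to discard stale updates, are assumptions made for this example and not a prescribed design.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PricePublished:
    """Hypothetical message pushed from the HQ to every store."""
    product_id: str
    amount: Decimal
    version: int  # assumed to be a counter that increases at the HQ


class StorePriceReplica:
    """Store-side copy of prices, owned by the same service that runs at the HQ."""

    def __init__(self) -> None:
        self._prices = {}

    def apply(self, message: PricePublished) -> None:
        known = self._prices.get(message.product_id)
        # Keep the highest version seen so a delayed message cannot roll back a price.
        if known is None or message.version > known.version:
            self._prices[message.product_id] = message

    def price_of(self, product_id: str) -> Decimal:
        # Answered locally, so it keeps working when the link to the HQ is down.
        return self._prices[product_id].amount


replica = StorePriceReplica()
replica.apply(PricePublished("sku-1", Decimal("9.99"), version=2))
replica.apply(PricePublished("sku-1", Decimal("8.99"), version=1))  # stale, ignored
local_price = replica.price_of("sku-1")
```

Both ends of this exchange belong to the same logical service, which is why the copy in the store counts as replication and not duplication.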
And that’s just the thing – logical boundaries should not be impacted by these types of physical infrastructure choices (generally speaking). Since services are aligned with logical boundaries, we should expect to see them cross physical boundaries – this includes SYSTEM boundaries (since a system is really nothing more than a unit of deployment).
I know that you might be reading that and thinking “What!?” but there isn’t enough time to get into this in any more depth here. You can read some of my previous posts on the topic of SOA for more info here.
Cross-site integration without replication
There are some domains where sensitive data cannot be allowed to “rest” just anywhere. Let’s look at a healthcare environment where we’re integrating data from multiple hospitals and care providers. While all of these partners are interested in working together to make sure that patients get the best care, which means that they need to share their data with each other, they don’t want any of THEIR data to remain at any partner sites afterwards (and are quite adamant about this).
In these cases, the decision was made that performance is less important than data ownership. Personally, I don’t agree with this mindset. The fact that data is “at rest” in a location as opposed to “in flight” does not change ownership. It could be stored in an encrypted manner so that only a certain application could use it, resulting in the same overall effect, but this is an argument that I’ve never won.
People (as physical beings) put a great deal of emphasis on the physical locations of things. It’s understandable but quite counterproductive when dealing with the more abstract domain of software.
In closing
Because we don’t duplicate raw business data between services, the regular data structures inside a service already look very different from what they would have looked like in a traditional layered architecture with an ORM-persisted entity model.
In fact, you probably wouldn’t see very many relationships between entities at all.
Going beyond that, you probably wouldn’t see the same entities you had before. An Order wouldn’t exist the way you expect; addresses (billing and shipping) would be stored (indexed by OrderId) in one service whereas the shipping speed (also indexed by OrderId) would be in another, and the prices may well be in yet another.
It is in this manner that data does not end up being duplicated between services, but rather is composed by many services, whether that is in the UI of one system, the print-outs done by a second system, or in the integration with 3rd parties done by a third system.
If performance needs to be improved, look at having these services replicate their data from one physical system to another – in-memory caching is one way of doing this, denormalized view models might be thought of as another (until you realize there isn’t very much normalization within a service to begin with).
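The composition described above might look something like the following sketch. The three service classes, their field names, and the sample values are all hypothetical; the only thing taken from the text is that each service holds its own slice of what used to be an Order, indexed by OrderId, and that a consumer such as a UI stitches the slices together.

```python
from decimal import Decimal


class OrderDataService:
    """Hypothetical base: each service owns one slice of data keyed by OrderId."""

    def __init__(self, data_by_order_id: dict) -> None:
        self._data = data_by_order_id

    def fragment(self, order_id: str) -> dict:
        # Return a copy so the composer cannot mutate the service's own data.
        return dict(self._data.get(order_id, {}))


class AddressService(OrderDataService):
    """Owns billing and shipping addresses."""


class ShippingService(OrderDataService):
    """Owns the chosen shipping speed."""


class PricingService(OrderDataService):
    """Owns the prices."""


def compose_order_view(order_id: str, services: list) -> dict:
    """Build a view by asking every service for its own fragment."""
    view = {"order_id": order_id}
    for service in services:
        view.update(service.fragment(order_id))
    return view


services = [
    AddressService({"order-1": {"shipping_address": "1 Example St",
                                "billing_address": "1 Example St"}}),
    ShippingService({"order-1": {"shipping_speed": "Two-Day"}}),
    PricingService({"order-1": {"total": Decimal("42.50")}}),
]
order_view = compose_order_view("order-1", services)
```

No service holds another service's fields here; the only thing they share is the identifier.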
And a word from our sponsor :-)
For those of you on “rewrite that big-ball-of-mud” projects looking to use these principles, I strongly suggest coming on one of my courses. The next one is in San Francisco and I’ve just opened up the registration for Miami.
For those of you on the other side of the Atlantic, the next courses will be in Stockholm in October and in London this December.
The schedule for next year is also coming together and it will include South Africa and Australia too.
Anyway, here’s what one attendee had to say after taking the course earlier this month:
I wanted to thank you for the excellent workshop in Toronto last week. I spent the better part of the weekend reflecting over what was presented, the insights we learned through the group exercises, and how my preconceptions of SOA have changed. By the end of the course, all the tidbits of (usually) rather ambiguous information that I’ve collected from various blogs, books, and other sources, finally coalesced into something more intelligible – one big A-HA moment if you will. Overall, I found the content of the workshop to be incredibly enlightening and it left me feeling invigorated and excited to learn more.
– Joel from Canada
Hope you’ll be able to make it.
If travel is out of the question for you, you can also look at getting a recording of the course here.
One final thing
If your employer won’t foot the bill for these, please get in touch with me.
I wouldn’t want you not to be able to come just because you’re paying out of pocket.
There are very substantial discounts available.