TL;DR
In a CQRS architecture, writes go to a Write DB and reads come from Read Replicas. If a read arrives before replication has caught up, the user may see missing data. Solutions include:
Adding a small delay between writes and reads
Writing to cache first, then falling back to DB
Reading from the Write DB for immediate reads (anti-pattern but practical)
Using synchronous calls (rarely used due to latency)
Additionally, moving DB awareness out of the API server and into a Proxy SQL Server simplifies the API's responsibilities, but it adds network hops, latency, cost, and operational complexity. System design is always a trade-off: pick the solution that best balances consistency, performance, and maintainability.
DETAILS
Generally, database access in production is designed with one write DB and N read DBs. Once a write happens, it gets replicated to the N read DBs, and all reads happen from those N read DBs.
Figure: Sample of a simple CQRS Read/Write Paths
Some nomenclatures used
CQRS: Command Query Responsibility Segregation Pattern
DML: Data Manipulation Language (such as: INSERT/UPDATE/DELETE etc.)
DDL: Data Definition Language (such as: CREATE TABLE/ALTER TABLE etc.)
Query: a read statement (such as: SELECT * etc.)
Write DB: Also called a Master DB in a master-slave architecture
Quick summary of the figure
When a user makes a query or takes any action that causes any DML or DDL operations by the APIs, then:
If it’s a query then the API server will query the Read replicas.
If it’s a DML or DDL operation, then the API server will fire this against Write DB.
Now, this is a well-known pattern called “CQRS” or “Command Query Responsibility Segregation” Pattern. Important parts of CQRS pattern:
The Write path is different from the Read path
Well, what we have here is a very simple path (for clarity I haven’t included other components like Kafka etc. that are generally used in each path)
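The routing described above can be sketched in a few lines. This is a minimal, illustrative sketch only; the host names, the keyword list, and the `route` helper are my own assumptions, not anything from a real driver.

```python
# Minimal sketch of CQRS routing inside an API server.
# Host names and the keyword list are illustrative assumptions.
import itertools

WRITE_DB = "write-db:5432"
READ_REPLICAS = ["replica-1:5432", "replica-2:5432", "replica-3:5432"]

# DML/DDL statements go to the Write DB; everything else is treated as a query.
_WRITE_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "CREATE", "ALTER", "DROP"}
_replica_cycle = itertools.cycle(READ_REPLICAS)

def route(sql: str) -> str:
    """Return the host this statement should be sent to."""
    first_word = sql.lstrip().split(None, 1)[0].upper()
    if first_word in _WRITE_KEYWORDS:
        return WRITE_DB
    return next(_replica_cycle)  # round-robin across the N read replicas
```

A real API server would do this via two connection pools rather than string matching, but the split is the same: one path for commands, one for queries.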
Now, what will happen if the user is trying to access the data that they have not written to the DB yet!?
Well, in normal cases:
This will throw a “Record Not Found” exception or 404 or .. you get the idea. If the data was never “written” to the DB, then no worries – this is expected behavior.
But the fun part is, what if the user has already fired a “write” statement and hence they are expecting to read it and then work with the data. That’s when things break!
Before we move ahead, can you think of any reasons why this might be happening?
What I’ve seen happen in real life is that, sometimes, the write may fail temporarily: a lost connection to the Write DB, the Write DB hitting MAX_CONNECTIONS (if no connection pool was used), the machine losing internet access just before firing the ‘write’ query, or something else along those lines.
Of course, retry mechanisms will kick-in in such cases but the point is, there’s a delay in getting the data to be written to the Write DB.
As a result of that delay, the replication to the Read Replicas is delayed.
As a result of that, when the API queries the Read Replica, no data is found!
Depending on the type of the application, different solutions may be used:
Use Delay
Have a delay in-between the writes and the reads
Now, this helps only if both the write and the read happen from the same code path (usually the same function)
And if that isn’t the case, this approach will mostly fail
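The same-function case can be sketched as below. The `db_write`/`db_read` helpers are hypothetical stand-ins, and the 200 ms figure is an arbitrary guess at the replication lag, not a recommendation.

```python
# Hedged sketch of the "add a small delay" workaround. Only works when the
# write and the follow-up read live in the same function, as noted above.
import time

REPLICATION_GRACE_SECONDS = 0.2  # arbitrary guess at replication lag

def create_and_fetch(db_write, db_read, record):
    db_write(record)                       # goes to the Write DB
    time.sleep(REPLICATION_GRACE_SECONDS)  # hope replication catches up
    return db_read(record["id"])           # goes to a Read Replica
```

The obvious weakness: if replication takes longer than the chosen delay, the read still misses.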
Use Caching
Write to cache first and let read happen from cache first
If a cache miss happens during a read, then fall back to the Read Replica DB. The cache check itself introduces a small delay, so if the DB write + replication complete quickly, the Read DB will have the data before the API queries it.
But this is not foolproof either. If the DB write + replication take longer than the Read DB query, the data will still not be available
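Here’s a sketch of that cache-first flow. The dict-backed cache stands in for something like Redis; all the names here are assumptions for illustration.

```python
# Sketch of solution 2: write to a cache alongside the DB write, read from
# the cache first, and fall back to a Read Replica on a cache miss.
cache = {}

def write(record, write_db):
    cache[record["id"]] = record    # populate cache alongside the DB write
    write_db(record)                # normal write path to the Write DB

def read(record_id, read_replica):
    if record_id in cache:          # cache hit: replication lag is invisible
        return cache[record_id]
    return read_replica(record_id)  # cache miss: may still race replication
```

Note the honest gap in the last line: a miss plus slow replication still returns nothing, which is exactly the "not foolproof" caveat above.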
Use Synchronous calls
This will surely work, but it’s almost never used
Because this is a blocking call. And in a global distributed architecture, a blocking call is something you look to avoid as much as possible.
This is because it halts the app and this doesn’t provide a good user experience.
And again, what if the blocking call fails, you’ll have to handle those gracefully – will you retry? If yes, again the user will be blocked! If not, then the user's data is lost!
You get the idea why this isn’t the go-to approach
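For completeness, a fully synchronous write looks roughly like this. The `replicas` list of apply-callbacks is a hypothetical simplification; real systems would lean on semi-synchronous replication settings in the database itself (e.g. PostgreSQL’s synchronous_commit) rather than hand-rolling this.

```python
# Sketch of a fully synchronous write: block until every replica confirms
# the row before returning to the user. Everything here is illustrative.
def synchronous_write(record, write_db, replicas):
    write_db(record)                   # write to the Write DB first
    for apply_to_replica in replicas:  # each call blocks until applied
        apply_to_replica(record)
    return True                        # only now is the user unblocked
```

Every line the loop waits on is time the user spends staring at a spinner, which is why this is rarely the chosen path.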
Use Write DB to Read as well
Here, as soon as you write, if you need to query the data immediately, then a useful approach is to query the write DB itself.
Yes, it’s an anti-pattern for CQRS but in such cases this works beautifully!
Of course, all subsequent reads should still be done from the read replicas. This trick applies only when a write is immediately followed by a read.
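One way to scope the anti-pattern to just the write-then-read window is to track who wrote recently. The in-memory map and the 5-second window below are illustrative assumptions, not a prescription.

```python
# Sketch of solution 4: route a read to the Write DB only when the same
# user just wrote, so all other reads still hit the replicas.
import time

RECENCY_WINDOW_SECONDS = 5.0     # assumed upper bound on replication lag
recent_writers = {}              # user_id -> timestamp of their last write

def record_write(user_id):
    recent_writers[user_id] = time.monotonic()

def pick_read_target(user_id, write_db, read_replica):
    last = recent_writers.get(user_id)
    if last is not None and time.monotonic() - last < RECENCY_WINDOW_SECONDS:
        return write_db          # read-your-writes: immediate read after a write
    return read_replica          # everyone else reads from the replicas
```

In a multi-instance API fleet, the `recent_writers` map would need to live somewhere shared (a cache, or a session cookie carrying the last-write timestamp).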
For me, I’ve used both solution-2 (Use Caching) and solution-4 (Use Write DB to Read as well). The use case and the scenario will guide you to choose the correct solution for that situation.
Fun Fact about the above CQRS strategy
In the diagram above, you see that the API server knows which server is the write DB and which servers are the Replica DBs
This is surely one of the ways to setup or configure your API server
The problem with this design is that your API server always needs to be aware of this database configuration – and that isn’t really the responsibility of an API server
Hence, this design violates the “Single Responsibility Principle” for API servers. And whenever a DB is added or replaced, your API server needs to be updated too (meaning the connection strings in the API server need to be updated).
So, how to fix this?
Well, this is where the Proxy SQL Server comes in. A proxy SQL server, as the name suggests, acts like a database server for the APIs. APIs know only these proxy servers and don’t actually know which is the write DB or which are the replica DBs.
This way, your API server isn’t worried about any changes happening in the database layer and hence it adheres to the “Single Responsibility Principle” too.
Any changes in the database layer are known to the Proxy SQL server, which then identifies and routes queries to the read replicas and DDL/DML statements to the write servers.
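The shift is easy to see in code: the API configures a single proxy address, and the topology plus routing logic move behind it. The class, hosts, and port below are illustrative assumptions (real proxies like ProxySQL do this with configurable query rules).

```python
# Sketch of the proxy layer: the API only ever knows PROXY_ADDRESS;
# the proxy owns the real topology and the statement classification.
PROXY_ADDRESS = "db-proxy:6033"  # the only host the API server configures

class DbProxy:
    """Holds the real topology so the API server doesn't have to."""
    def __init__(self, write_db, replicas):
        self.write_db = write_db
        self.replicas = replicas

    def route(self, sql: str) -> str:
        head = sql.lstrip().split(None, 1)[0].upper()
        if head in {"INSERT", "UPDATE", "DELETE", "CREATE", "ALTER", "DROP"}:
            return self.write_db   # DDL/DML -> Write DB
        return self.replicas[0]    # queries -> a replica (load-balanced in reality)
```

When a replica is added or the write DB fails over, only the proxy’s constructor arguments change; the API server’s connection string stays `db-proxy:6033`.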
Here’s how it looks in reality:
Figure: CQRS with Proxy SQL Server
Note: though it’s called “Proxy SQL Server”, it doesn’t mean Microsoft’s SQL server in any way. It simply means that this server acts as a proxy to the underlying database layer. If you want, you can call it “Proxy DB Node” as well.
So why doesn’t everyone just use a Proxy SQL Server? Well, the reason is the increased network hops and their potential impact on latency!
Whenever you add a new architectural component, it adds another network node to traverse, which can increase latency and the API’s response time, possibly violating the API’s SLAs.
Of course, another important point is that whenever you add a new component, you’ll have to monitor its health and heartbeats and design for failover if it goes down – all of which increases the workload of the DevOps or Platform team.
One other point is cost – the more components you use, the more you’ll pay the cloud provider. And not just for the component itself: your network usage (ingress + egress) increases too.
Hence, many times we knowingly violate the Single Responsibility Principle – in order to provide lower latency and stay within the API SLA limits.
Final thoughts
Ultimately, the right approach depends on your system’s consistency needs, performance goals, cost, and tolerance for complexity – CQRS (or any other pattern for that matter) is about balancing these trade-offs, not chasing a one-size-fits-all solution.