Shard from Day 1

There is a database migration every backend engineer quietly fears.

The one where a system that was happily running on one database suddenly becomes too big for it.

At that point, the problem is no longer just "move data into multiple databases". The real problem is that the whole application was written with one hidden assumption: there is only one database.

Queries assume it. Jobs assume it. Admin dashboards assume it. Other services quietly start depending on it. Someone adds a direct read because it was faster. Someone else caches data owned by another service because it reduced one API call. Nothing looks wrong in the beginning.

Then scale arrives.

And now sharding is not a database project. It is a company-wide rewrite.

This is why PhonePe's "shard from Day 1" rule is interesting.

Not because everyone should create ten database shards before they even have users. That would be theatre. The useful idea is simpler: design the service as if it will need to shard. The application should already know what its shard key is. The APIs should already carry that key. The service should already route a request to one logical shard. The code should not be full of queries that only work when all data lives in one place.

Take an orders service.

A weak API is:


GET /orders

It looks clean, but in a sharded system it is unclear where this request should go. Which database has the orders? One shard? All shards? Should we fan out to every shard and merge results?

A better API is:


GET /users/{user_id}/orders

Now the request has a shard key.

The service can do something boring and predictable:


shard = hash(user_id) % number_of_shards

Then it talks to exactly one shard.

That boring detail is the architecture.

A good sharded system is not one where every query is magically distributed. It is one where most important user-facing queries are not distributed at all. They are routed directly to the place where the data lives.

This is also why "no scatter-gather in the user path" is such an important rule.

Scatter-gather means asking many shards for data, collecting the results, and merging them. It is useful for reports, analytics, admin tools, reconciliation, and background jobs. It is a bad default for user-facing requests.

If one payment request has to wait for twenty shards, the request is now only as fast as the slowest shard. If one shard is overloaded, the whole API becomes slow. If one shard is unavailable, the user path becomes fragile.

So the design should separate the two worlds.

The user path should be single-shard, small, predictable.

The reporting path can be async, slower, and allowed to fan out.

This single separation prevents a lot of pain.

The next rule is ownership.

If Orders owns the orders database, other services should not query that database directly. They should call Orders or consume events from it.

Direct database access feels harmless at first. It saves an API call. It is quick. It works.

But it creates a hidden contract.

Now another service depends on your table names, columns, indexes, and sometimes even your accidental behaviour. Later, when you want to change schema or shard layout, you discover that your "private" database was never private.

The same problem appears with caching.

If another service caches data owned by Orders, it is now responsible for knowing when that data changes. Most teams underestimate how quickly this becomes messy. Cache invalidation stops being a local problem and becomes a distributed coordination problem.

So the rule is simple: the owner service owns the data and the cache for that data.

Other services can ask. They cannot quietly become second owners.

The common sharding library is the last important piece.

You do not want every team inventing its own sharding logic. One team hashes user IDs. Another uses ranges. Another hardcodes shard mappings. Another adds special cases. Six months later, nobody knows why user 123 goes to shard 7 in one service and shard 2 in another.

A common sharding library keeps this boring and consistent.

It knows how to pick a shard. It knows how to return the right connection. It hides the routing details from the business logic.

The service should not be thinking about database topology in every handler. It should ask the sharding layer one question:


Where does this user's data live?

That is enough.

There are tradeoffs, of course.

Sharding makes operations harder. Schema migrations now run across shards. Backups are per shard. Monitoring has to show hot shards, slow shards, connection pressure, data imbalance, and replication lag. Picking the wrong shard key can hurt badly.

This is why "shard from Day 1" should not be read as "make everything complex from Day 1".

It should be read as "do not bake single-database assumptions into the product".

The difference is important.

A small team can still run one physical database. But the application can be written as if there is a routing layer. Today the routing layer can return shard 0. Tomorrow it may return shard 0, 1, 2, or 3.

That is a much easier evolution than discovering years later that half the product assumes global queries and cross-service database reads.

The real lesson is not MySQL. It is not even sharding.

The lesson is discipline.

Every request should know where its data lives. Every service should own its own data. Every user-facing query should avoid unnecessary fan-out. Every shortcut that crosses ownership boundaries should be treated as debt.

At small scale, these rules feel strict.

At large scale, they feel obvious.

The best architecture is often like that. Boring in the beginning, priceless when the system grows.