How PhonePe Built Clockwork: A Distributed Scheduler for Billions of Delayed Jobs

Most backend engineers start with cron.

Need to send a reminder after 30 minutes? Add a cron.

Need to expire coupons at midnight? Add a cron.

Need to retry failed callbacks? Add another cron.

This works when the system is small. But at PhonePe scale, the problem changes completely.

PhonePe is not scheduling a few background jobs. It has workloads like merchant settlements, coupon expiry, retries, reminders, and callback-based workflows. These jobs need to run at the right time, survive service crashes, scale horizontally, and avoid duplicate execution as much as possible.

That is why PhonePe built Clockwork: a distributed, durable, fault-tolerant task scheduler.

The simplest way to think about Clockwork is this:

A product service should only say, "Run this callback at this time." Clockwork should handle persistence, scanning, queueing, retrying, and delivery.

The Problem With Cron At Scale

Cron is simple because one machine owns the schedule.

But modern backend systems do not run on one machine.

A service may run across 100 containers. If every container runs the same cron, the same job can fire 100 times. Then you need distributed locks. Then you need leader election. Then you need persistent state. Then you need retry tracking. Then you need failure recovery.

At that point, every service is slowly becoming its own scheduler.

That is the real problem.

Not cron itself.

The problem is every product team building scheduling logic inside their own service.

PhonePe solved this by pulling scheduling out of product services and moving it into a centralized platform: Clockwork.

What Clockwork Does

Clockwork is not a random background worker.

It is a platform that accepts delayed jobs and executes HTTP callbacks at the scheduled time.

A client service sends something like:


```json
{
  "callback_url": "https://service.phonepe.com/callback",
  "payload": {
    "merchant_id": "m_123",
    "type": "settlement_reminder"
  },
  "execute_at": "2026-05-30T12:30:00Z",
  "retry_policy": {
    "max_retries": 5,
    "strategy": "exponential_backoff"
  }
}
```

Clockwork stores this job durably, waits until it becomes due, pushes it to a queue, and then workers execute the callback.

The product service does not need to run cron.

The product service does not need to scan future jobs.

The product service does not need to manage distributed worker coordination.

It only needs to expose a callback endpoint and make that endpoint idempotent.

Component 1: Job Acceptor

The first component is the Job Acceptor.

This is the client-facing API.

Its job is simple:

Accept the scheduling request.
Validate the callback URL, payload, execution time, and retry policy.
Assign a partition ID.
Persist the job in HBase.
Return an acknowledgement to the client.
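
The steps above can be sketched in a few lines. This is a minimal illustration, not Clockwork's actual code: the function names `accept_job` and `assign_partition`, the hash-based partitioning, the 64-partition count, and the in-memory dict standing in for HBase are all assumptions made for the example.

```python
import hashlib
import uuid
from datetime import datetime, timezone
from urllib.parse import urlparse

NUM_PARTITIONS = 64  # assumed partition count, not a documented Clockwork value


def assign_partition(job_id, num_partitions=NUM_PARTITIONS):
    """Map a job ID to a stable partition via a hash (an assumed strategy)."""
    digest = hashlib.sha256(job_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def accept_job(request, store, now):
    """Validate, assign a partition, persist, and only then acknowledge."""
    url = urlparse(request.get("callback_url", ""))
    if url.scheme not in ("http", "https") or not url.netloc:
        raise ValueError("callback_url must be an absolute http(s) URL")
    if not isinstance(request.get("payload"), dict):
        raise ValueError("payload must be a JSON object")
    execute_at = datetime.fromisoformat(request["execute_at"].replace("Z", "+00:00"))
    if execute_at <= now:
        raise ValueError("execute_at must be in the future")
    if request.get("retry_policy", {}).get("max_retries", 0) < 0:
        raise ValueError("max_retries must not be negative")

    job_id = uuid.uuid4().hex
    partition_id = assign_partition(job_id)
    # Persist before acknowledging; the dict is a stand-in for the durable store.
    store[job_id] = {**request, "partition_id": partition_id, "status": "scheduled"}
    return {"job_id": job_id, "partition_id": partition_id, "status": "accepted"}
```

The ordering inside `accept_job` is the point: the acknowledgement is built only after the write to the store has returned.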

The important part is that the job is persisted before the client gets success.

If the Clockwork process crashes after returning success, the job should not disappear. The durable store becomes the source of truth.

This is the first place where Clockwork becomes different from simple in-memory schedulers.

In-memory scheduling is easy until the process dies.

Durable scheduling is what makes this design production-grade.

Component 2: HBase As The Job Store

Clockwork stores scheduled jobs in HBase.

Why HBase?

Because this workload is naturally time-indexed.

The scheduler repeatedly asks one question:

Give me all jobs whose scheduled execution time is less than or equal to now.

That is a scan-heavy access pattern.

HBase stores rows sorted by row key, which makes start-key and stop-key scans efficient when the row key is designed around execution time and partitioning.

A simplified row key can be imagined like this:


client_id | partition_id | execution_timestamp | job_id

This lets Clockwork scan a bounded time window instead of randomly searching the entire dataset.

The design is not "run a query over all jobs."

The design is:

Partition jobs carefully, then scan only the due slice.

That difference matters at scale.
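
To make the bounded scan concrete, here is a toy sketch that uses a sorted Python list as a stand-in for HBase's sorted row keys. The string encoding, the zero-padding widths, and the names `row_key` and `scan_due` are assumptions for illustration; the real key layout is only described at a high level above.

```python
import bisect


def row_key(client_id, partition_id, execute_at_ms, job_id):
    """Build a key whose lexicographic order matches time order within a partition."""
    # Zero-padding keeps string comparison consistent with numeric comparison.
    return f"{client_id}|{partition_id:04d}|{execute_at_ms:013d}|{job_id}"


def scan_due(sorted_keys, client_id, partition_id, now_ms):
    """Return keys for one client partition whose execution time is <= now_ms."""
    prefix = f"{client_id}|{partition_id:04d}|"
    start_key = prefix                        # inclusive lower bound
    stop_key = f"{prefix}{now_ms + 1:013d}"   # exclusive upper bound
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


keys = sorted([
    row_key("client_a", 3, 1000, "j1"),
    row_key("client_a", 3, 2000, "j2"),
    row_key("client_a", 3, 3000, "j3"),
    row_key("client_a", 4, 500, "j4"),
    row_key("client_b", 3, 500, "j5"),
])
due = scan_due(keys, "client_a", 3, now_ms=2000)  # only j1 and j2 fall in the slice
```

The two binary searches touch only the boundaries of the due slice, which mirrors why a start-key and stop-key scan avoids reading future jobs or other partitions.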

Component 3: Job Extractor

The Job Extractor is the part that continuously finds jobs that are ready to run.

It scans HBase for jobs where:


scheduled_time <= current_time

Once it finds eligible jobs, it does not execute the callback directly.

That is a very important design choice.

The extractor only extracts due jobs and publishes them to RabbitMQ.

Why?

Because HTTP callbacks can be slow.

A downstream service may be overloaded. A network call may time out. A client may return 500. If the scanner itself waits for callback execution, scanning throughput collapses.

So Clockwork separates the system into two paths:


Extraction path: find due jobs quickly
Execution path: call callbacks separately

This keeps scanning fast and predictable.
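
The split between the two paths can be sketched with an in-process queue and one worker thread. This is only an analogy for the real system: `queue.Queue` stands in for RabbitMQ, `slow_callback` stands in for an HTTP call, and all function names are hypothetical.

```python
import queue
import threading
import time


def extract(due_job_ids, out_queue):
    """Extraction path: publish due job IDs and return without calling anything."""
    for job_id in due_job_ids:
        out_queue.put(job_id)


def execute(in_queue, results, callback):
    """Execution path: consume jobs and run the (possibly slow) callback."""
    while True:
        job_id = in_queue.get()
        if job_id is None:  # sentinel value tells the worker to stop
            break
        results.append(callback(job_id))


def slow_callback(job_id):
    time.sleep(0.01)  # stands in for a slow downstream HTTP call
    return f"done:{job_id}"


jobs = queue.Queue()
results = []
worker = threading.Thread(target=execute, args=(jobs, results, slow_callback))
worker.start()

# The extractor returns as soon as the jobs are queued, not when callbacks finish.
extract(["job-1", "job-2", "job-3"], jobs)

jobs.put(None)
worker.join()
```

In this sketch `extract` never waits on `slow_callback`, so a slow downstream only delays the worker, not the scan loop.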

Component 4: ZooKeeper And Partition Ownership

If multiple Clockwork instances are running, they cannot all scan the same partition.

Otherwise, the same job may be picked by multiple instances and executed more than once.

This is where ZooKeeper comes in.

Clockwork instances register themselves with ZooKeeper. For every client, a leader is elected. That leader assigns the client's partitions across the available Clockwork instances.

Example:


Client A has 64 partitions
Clockwork has 4 active instances
Each instance gets 16 partitions

Now each partition has one owner at a time.

This gives two benefits:

The system can scan many partitions concurrently.
The same partition is not scanned by multiple instances at the same time.
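
The 64-partition example can be sketched as a simple round-robin assignment. The source does not say which assignment strategy the elected leader uses, so round-robin, the function name `assign_partitions`, and the instance names are assumptions.

```python
def assign_partitions(num_partitions, instances):
    """Spread partition IDs across instances so each partition has one owner."""
    if not instances:
        raise ValueError("at least one instance is required")
    ordered = sorted(instances)  # deterministic order, whoever computes it
    assignment = {instance: [] for instance in ordered}
    for partition_id in range(num_partitions):
        owner = ordered[partition_id % len(ordered)]
        assignment[owner].append(partition_id)
    return assignment


# 64 partitions over 4 instances gives 16 partitions each.
four = assign_partitions(64, ["cw-1", "cw-2", "cw-3", "cw-4"])

# If one instance leaves, the leader recomputes and the rest absorb its share.
three = assign_partitions(64, ["cw-1", "cw-2", "cw-3"])
```

A production assignment would likely also try to minimise how many partitions move when membership changes; this sketch reassigns freely for the sake of brevity.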

This is the core coordination layer.

The goal is not to put a heavy distributed lock around every job.

The better design is to partition the work so that each scanner owns a clear slice of the job space.

Component 5: RabbitMQ As The Execution Buffer

After jobs are extracted, they are pushed to RabbitMQ.

RabbitMQ acts as the buffer between due-job discovery and callback execution.

This matters because scanning and execution have different performance profiles.

Scanning is storage-heavy.

Execution is network-heavy.

If you couple them tightly, slow callbacks slow down the scheduler.

If you decouple them with a queue, the extractor can keep scanning while workers process callbacks independently.

The queue also provides a natural backpressure signal.

If workers cannot keep up, queue depth increases. Clockwork can slow down publishing or pause scans when queue size crosses a threshold.

That is a much safer design than blindly dumping millions of due callbacks into workers.

Component 6: Stateless Callback Workers

Workers consume jobs from RabbitMQ and execute HTTP callbacks.

They do not own scheduling state.

They do not decide which jobs are due.

They only receive executable work and call the client URL.

This makes workers easy to scale horizontally.

If callback traffic increases, add more workers.

If a worker crashes before acknowledging a message, RabbitMQ can redeliver that message to another worker.

But this also means callback handlers must be idempotent.

A callback may be retried.

A message may be redelivered.

A downstream service may process the request but fail before returning success.

So the receiving service should use a job ID or idempotency key to make duplicate callbacks safe.

In financial systems, this is non-negotiable.

The scheduler can reduce duplicate execution, but the business service must still protect itself.
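
On the receiving side, an idempotent handler might look roughly like this. It is a sketch under assumptions: the class name, the in-memory dict, and the `side_effects` counter are illustrative, and a real service would keep the processed-job record in a durable store.

```python
class SettlementReminderHandler:
    """Callback handler that makes duplicate deliveries of the same job harmless."""

    def __init__(self):
        # job_id -> response already returned; in-memory only for this sketch.
        self._processed = {}
        self.side_effects = 0  # counts how often the real work actually ran

    def handle(self, job_id, payload):
        # A duplicate delivery returns the stored response and does no new work.
        if job_id in self._processed:
            return self._processed[job_id]
        self.side_effects += 1  # stands in for sending the reminder
        response = {"status": "ok", "merchant_id": payload["merchant_id"]}
        self._processed[job_id] = response
        return response


handler = SettlementReminderHandler()
first = handler.handle("job-42", {"merchant_id": "m_123"})
second = handler.handle("job-42", {"merchant_id": "m_123"})  # redelivery
```

In a real service the check and the side effect need to be atomic, for example by writing the job ID under a unique constraint in the same transaction as the business change.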

Why This Is Not Exactly Once

A lot of engineers casually say "exactly once."

In distributed systems, that is dangerous wording.

Clockwork can be designed to reduce duplicate execution using partition ownership, acknowledgements, retry policies, and idempotency.

But once queues, HTTP calls, crashes, and retries are involved, the safer mental model is:

At-least-once delivery with strong duplicate protection and idempotent callback handling.

That is the practical design most reliable systems use.

The scheduler should try very hard not to duplicate work.

The client service should still assume duplicates are possible.

This is how you build correctness in layers.

Retry Strategy

Callbacks fail.

That is normal.

A downstream service may be temporarily down. A network call may time out. A client may return a non-success response.

Clockwork handles this using client-specific retry policies.

A retry policy may define:


max retries
backoff strategy
jitter
callback timeout
relevancy window
DLQ behavior

A common retry flow looks like this:

Worker calls callback.
Callback fails.
Retry count is incremented.
Next execution time is calculated using exponential backoff.
Job is scheduled again.
If retries are exhausted, job moves to DLQ.
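
The retry flow above can be sketched as a single state transition. The `Job` fields, the status strings, and the one-second base delay are assumptions chosen for the example, not Clockwork's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    retry_count: int = 0
    status: str = "queued"
    next_run_at: float = 0.0  # epoch seconds of the next attempt


def on_callback_failure(job, now, max_retries, base_delay=1.0):
    """Increment the retry count, then either reschedule or dead-letter the job."""
    job.retry_count += 1
    if job.retry_count > max_retries:
        job.status = "dead_lettered"  # retries exhausted: hand over to the DLQ
        return job
    # Exponential backoff: the delay doubles with every failed attempt.
    job.next_run_at = now + base_delay * 2 ** (job.retry_count - 1)
    job.status = "scheduled"
    return job
```

With `max_retries=2` and `base_delay=1.0`, the first failure reschedules the job one second out, the second two seconds out, and the third moves it to the DLQ.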

The important thing is that retries should not hammer the downstream service.

Exponential backoff and jitter help prevent retry storms.

Without jitter, thousands of failed jobs may retry at the same exact second and overload the recovering service again.
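
One common way to add jitter is the "full jitter" variant, where each retry waits a random time between zero and the exponential ceiling. The source does not say which variant Clockwork uses, so treat the function below, including its name and default values, as an assumed illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0, ceiling)


# Two jobs failing on the same attempt will usually get different delays,
# so they do not all hit the recovering service in the same second.
rng = random.Random(7)  # seeded only to make the sketch repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(12)]
```

The `cap` keeps late attempts from waiting unboundedly long once the exponential term grows large.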

Dead Letter Queue

A DLQ is not just a failed-jobs dustbin.

It is an operational safety net.

When a job fails too many times, the system should stop retrying blindly and move it to a place where engineers can inspect it.

A DLQ helps answer:


Which jobs are failing repeatedly?
Which client is affected?
What error did the callback return?
Can this job be replayed safely?
Should it be dropped?

For payment or settlement-related workflows, silent failure is dangerous.

A DLQ makes failure visible.

Relevancy Window

Not every delayed job is useful forever.

Imagine a notification that was supposed to go out at 10:00 AM.

If the system is overloaded and the callback becomes ready at 4:00 PM, sending it may be useless or even harmful.

That is why Clockwork supports the idea of a relevancy window.

A client can define how long a callback remains meaningful.

If the callback expires, Clockwork can skip it instead of sending stale work.
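
A relevancy check can be as small as one comparison made just before the callback is sent. The function name and the one-hour window below are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_still_relevant(scheduled_at, now, relevancy_window):
    """True while the job is within its relevancy window after the scheduled time."""
    return now <= scheduled_at + relevancy_window


scheduled = datetime(2026, 5, 30, 10, 0, tzinfo=timezone.utc)
window = timedelta(hours=1)  # assumed client-defined window

on_time = is_still_relevant(scheduled, scheduled + timedelta(minutes=30), window)
stale = is_still_relevant(scheduled, scheduled + timedelta(hours=6), window)
```

Here the 10:30 delivery is still within the window, while the 4:00 PM delivery from the example above is not and would be skipped.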

This is important for user-facing systems.

Reliability is not only about eventually doing the work.

Sometimes reliability means knowing when not to do stale work.

Backpressure

One of the easiest ways to break a queue-based system is to publish faster than consumers can process.

At small scale, this looks fine.

At large scale, the queue grows endlessly, memory pressure increases, latency goes up, and recovery becomes painful.

Clockwork avoids this by monitoring queue depth and using rate limiting.

If RabbitMQ has too much backlog, scans can be paused or slowed down.

When consumers catch up, publishing resumes.
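
The pause-and-resume behaviour can be sketched as a gate with two thresholds. Using separate high and low watermarks is an assumption added here so the scanner does not flap on and off around a single value; the class name and the numbers are hypothetical.

```python
class ScanGate:
    """Pause scanning above a high watermark, resume below a low watermark."""

    def __init__(self, high_watermark, low_watermark):
        if low_watermark >= high_watermark:
            raise ValueError("low_watermark must be below high_watermark")
        self.high = high_watermark
        self.low = low_watermark
        self.paused = False

    def should_scan(self, queue_depth):
        if self.paused and queue_depth <= self.low:
            self.paused = False   # consumers caught up: resume publishing
        elif not self.paused and queue_depth >= self.high:
            self.paused = True    # backlog too deep: stop adding to it
        return not self.paused


gate = ScanGate(high_watermark=1000, low_watermark=200)
decisions = [gate.should_scan(depth) for depth in (100, 1000, 500, 200, 500)]
```

The extractor would call `should_scan` with the current queue depth before each scan cycle and skip the cycle when it returns `False`.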

This is a key production lesson:

A scheduler should not only ask "what is due now?" It should also ask "can the execution layer handle this now?"

Why This Design Scales

Clockwork scales because every major responsibility is separated.


Job Acceptor accepts and persists jobs
HBase stores scheduled state durably
Job Extractor scans due jobs
ZooKeeper coordinates partition ownership
RabbitMQ buffers executable jobs
Workers execute callbacks
DLQ stores exhausted failures
Metrics track lag and reliability

Each layer can scale independently.

If scheduling requests increase, scale Job Acceptors.

If due-job scanning increases, add partitions and extractor capacity.

If callback traffic increases, scale workers.

If downstream clients slow down, queue depth and rate limits protect the system.

This is good architecture because the bottleneck is not hidden.

The system exposes pressure points clearly.

What We Can Learn From This Design

The interesting part of Clockwork is not that PhonePe used HBase or RabbitMQ.

The interesting part is the separation of responsibilities.

Most teams start with this:


Product service owns job creation
Product service owns scheduling
Product service owns execution
Product service owns retries
Product service owns failure handling

That becomes messy very quickly.

Clockwork changes it to:


Product service owns business logic
Clockwork owns scheduling infrastructure
Queue owns buffering
Workers own callback execution
Client callback owns idempotency
Observability owns failure visibility

This is the real architecture lesson.

At scale, you do not want every service reinventing delayed jobs.

You want one reliable scheduling platform that every team can use.

Prototype Version

For a local prototype, we do not need HBase or ZooKeeper immediately.

We can build a smaller version with:


Go services
PostgreSQL as durable job store
RabbitMQ as queue
Scheduler using SELECT ... FOR UPDATE SKIP LOCKED
Worker service for callbacks
Callback demo service
Docker Compose for local setup

The prototype flow becomes:

API receives job.
Job is stored in PostgreSQL.
Scheduler scans due jobs.
Scheduler claims jobs using database locking.
Scheduler publishes to RabbitMQ.
Worker consumes from RabbitMQ.
Worker calls callback service.
Success marks job completed.
Failure schedules retry.
Repeated failure moves job to DLQ.

This gives the same mental model without needing PhonePe-scale infrastructure.
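
The claiming step is the part most worth seeing in code. The sketch below is written in Python for brevity even though the prototype services are in Go. It assumes a `jobs` table with `id`, `status`, `execute_at`, `callback_url`, and `payload` columns, and a DB-API cursor that uses `%s` placeholders, as PostgreSQL drivers such as psycopg do. The table layout, the status values, and the function name are all hypothetical.

```python
# Rows locked by another scheduler are skipped rather than waited on,
# so concurrent schedulers claim disjoint batches of due jobs.
CLAIM_DUE_JOBS_SQL = """
UPDATE jobs
SET status = 'queued'
WHERE id IN (
    SELECT id
    FROM jobs
    WHERE status = 'scheduled' AND execute_at <= %s
    ORDER BY execute_at
    LIMIT %s
    FOR UPDATE SKIP LOCKED
)
RETURNING id, callback_url, payload
"""


def claim_due_jobs(cursor, now, batch_size=100):
    """Claim up to batch_size due jobs and return the claimed rows."""
    cursor.execute(CLAIM_DUE_JOBS_SQL, (now, batch_size))
    return cursor.fetchall()
```

One design question this leaves open is what happens if the scheduler crashes after claiming but before publishing. A prototype could commit the transaction only after the publish succeeds, or run a periodic sweep that resets jobs stuck in the `queued` state.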

Final Mental Model

A distributed scheduler is not just cron with more machines.

It is a coordination problem.

It needs durable storage, partition ownership, queue-based decoupling, retry policy, idempotency, backpressure, and failure visibility.

That is why Clockwork is an interesting system.

It turns delayed job execution into a platform.

And once scheduling becomes a platform, product teams can stop thinking about cron and start thinking only about business logic.

That is the real win.