Introduction
Most product teams understand A/B testing. Ship two variants, randomly split users, compare a key metric, and keep the winning version.
But a company like Dream11 cannot answer every important business question with a clean A/B test.
You can A/B test a new contest card. You can A/B test an onboarding flow. You can A/B test payment copy or ranking logic.
But you cannot randomly assign users to deposit failures just to measure what happens next.
That is where a causal inference platform becomes useful. It gives product, growth, data, and engineering teams a way to estimate cause and effect from observational data, especially when controlled randomization is not possible, not ethical, or too risky.
This blog explains how a Dream11-style backend and infra architecture could work at a high level.
It is not a claim about Dream11's exact internal implementation. It is a theoretical reconstruction of the kind of architecture needed to support feature flags, experimentation, payment optimization, and causal inference at fantasy sports scale.
The Core Problem
Dream11 operates in a high-frequency, high-stakes consumer environment.
Users join contests, deposit money, pick teams, react to match schedules, receive notifications, and make decisions under time pressure. A small backend decision can change user behavior immediately.
Examples:
- Which payment provider should handle this deposit?
- Should this user see a retry flow after a failed payment?
- Does a deposit failure reduce future contest participation?
- Does a contest recommendation model increase wallet spend?
- Does a feature work differently in Mumbai compared to rural Rajasthan?
Some of these questions can be answered through A/B tests. Others cannot.
The architecture therefore needs two tracks.
The first track is real-time experimentation. This is used when the platform can safely randomize users.
The second track is causal inference. This is used when the platform needs to reason from real user behavior that already happened.
High-Level Architecture
At a high level, the system has six major parts:

- real-time product services behind an API gateway (feature flags, payments, contests)
- a Kafka event bus
- a data warehouse
- a feature store
- a causal inference platform
- a dashboard where results become decisions
The important idea is that the real-time product system and the offline inference system are connected through events.
Every decision, exposure, payment attempt, contest join, wallet spend, and failure becomes data. That data is later used to understand what actually caused business impact.
API Gateway
The API Gateway is the entry point for user-facing requests.
It receives traffic from mobile apps and web clients, then forwards requests to backend services such as payment, contest, wallet, recommendation, and user services.
In this architecture, the gateway does not own experimentation logic. It simply carries request context forward.
Typical context includes:
- user ID
- region
- device type
- app version
- language
- signup age
- wallet state
- traffic source
This context matters because experimentation and causal inference are not only about averages. A feature may help one user segment and hurt another.
For example, a payment retry flow may improve conversion for urban Android users but confuse low-bandwidth users in smaller towns. If the platform only looks at the global average, it may miss that difference.
Feature Flag Service
The Feature Flag Service is the runtime control plane.
When a backend service needs to make a decision, it asks the Feature Flag Service:
```
For this user, in this context, what should happen?
```
The answer may be simple:
```
upi_retry_enabled = true
```
Or it may be experimental:
```
contest_pricing_experiment = reduced_entry_fee
```
Or it may be adaptive:
```
payment_route_provider = provider_b
```
This service is important because product decisions should not be hardcoded across many services. A central flag service lets teams change behavior safely without redeploying every backend.
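As a sketch, a flag lookup from a backend service might look like the following. The endpoint URL, flag keys, and helper name are illustrative assumptions, not a real SDK:

```python
import requests

FLAG_SERVICE_URL = "http://feature-flags.internal/v1/decide"  # hypothetical endpoint

def get_decision(flag_key: str, context: dict) -> str:
    """Ask the central Feature Flag Service what should happen for this user."""
    resp = requests.post(
        FLAG_SERVICE_URL,
        json={"flag_key": flag_key, "context": context},
        timeout=0.05,  # flag lookups sit on the request hot path, so keep them fast
    )
    resp.raise_for_status()
    return resp.json()["variant"]

# Usage: route a deposit based on the decision.
ctx = {"user_id": "u_123", "region": "MH", "device": "android", "app_version": "6.2"}
provider = get_decision("payment_route_provider", ctx)  # e.g. "provider_b"
```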
Decision Layer
The Decision Layer sits behind the Feature Flag Service.
It decides which kind of logic should be used for a request.
There are usually three types of decisions.
1. Static Config Flags
Static config flags are simple operational switches.
Examples:
- enable a retry flow
- disable a risky provider
- show a new contest format
- change a threshold
These are useful for rollout, rollback, and operational safety.
They are not enough for experimentation because they do not automatically create clean treatment and control groups.
2. A/B Experiment Engine
The A/B Experiment Engine handles controlled randomization.
For each experiment, users are deterministically assigned to variants.
For example:
```
50% -> control
50% -> new contest card
```
The assignment must be stable. If a user is placed in the treatment group today, they should not randomly move to control tomorrow.
The platform also records an exposure event whenever the user actually experiences the experiment.
This distinction matters.
Being eligible for an experiment is not the same as seeing it. A user may be assigned to a variant but never open the relevant screen. Good experimentation systems measure exposure, not just assignment.
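One common way to get stable assignment is to hash the experiment key together with the user ID. The sketch below is illustrative; the function name and equal split are assumptions, not a specific implementation:

```python
import hashlib

def assign_variant(experiment_key: str, user_id: str,
                   variants=("control", "new_contest_card")) -> str:
    """Deterministic bucketing: the same user always gets the same variant."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100               # stable bucket in [0, 100)
    split = 100 // len(variants)                 # equal split for this sketch
    return variants[min(bucket // split, len(variants) - 1)]
```

Exposure is then recorded as a separate event at the moment the user actually sees the variant, so analysis can filter to exposed users rather than merely assigned ones.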
3. Bandit Engine
A bandit engine is used when the system should learn and adapt while traffic is running.
Payment routing is a good example.
Suppose Dream11 has multiple payment providers. Provider A may perform better during normal traffic. Provider B may perform better during match spikes. Provider C may work better for a specific bank or UPI app.
A static A/B test is often too slow for this kind of operational decision.
Instead, a non-stationary multi-armed bandit can continuously shift traffic toward the provider with better real-time success rates.
The bandit loop looks like this:
```
choose provider -> observe success/failure -> update weights -> choose again
```
This is not the same as a normal A/B test.
An A/B test tries to measure the truth cleanly. A bandit tries to maximize outcomes while learning.
That tradeoff is powerful, but it also means bandit data needs careful interpretation. Because traffic allocation changes over time, it can introduce bias if treated like a simple randomized experiment.
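As a rough sketch, a non-stationary router can be built from an epsilon-greedy policy over exponentially decayed success rates; the decay lets old observations fade, so the router can follow providers whose performance shifts during match spikes. Class, parameter, and provider names here are hypothetical:

```python
import random

class PaymentRouter:
    def __init__(self, providers, epsilon=0.05, decay=0.99):
        self.rates = {p: 0.5 for p in providers}  # neutral starting estimate
        self.epsilon = epsilon                     # exploration rate
        self.decay = decay                         # weight on historical observations

    def choose(self) -> str:
        if random.random() < self.epsilon:          # explore occasionally
            return random.choice(list(self.rates))
        return max(self.rates, key=self.rates.get)  # otherwise exploit the best

    def update(self, provider: str, success: bool):
        old = self.rates[provider]
        self.rates[provider] = self.decay * old + (1 - self.decay) * float(success)

# The loop from above, in code:
router = PaymentRouter(["provider_a", "provider_b", "provider_c"])
p = router.choose()             # choose provider
# ... attempt the deposit with p ...
router.update(p, success=True)  # observe success/failure, update weights
```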
Payment Service
The Payment Service is one of the most important backend services in this architecture because payments directly affect revenue and user trust.
When a user tries to deposit money, the Payment Service may ask the Feature Flag Service which route to use.
Example:
```
user starts deposit
payment service asks for provider decision
feature flag service returns provider_b
payment service sends request to provider_b
payment service records success or failure
event is published to Kafka
```
The critical events are:
- deposit attempted
- provider selected
- deposit succeeded
- deposit failed
- retry shown
- retry succeeded
- wallet credited
These events become the raw material for both experimentation and causal inference.
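A sketch of how the Payment Service might emit one of these events, here using the kafka-python client. The broker address, topic name, and field values are illustrative assumptions:

```python
import json
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",          # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_payment_event(event_type: str, user_id: str, properties: dict):
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,                     # e.g. "deposit_failed"
        "user_id": user_id,
        "timestamp": int(time.time() * 1000),
        "service_name": "payment-service",
        "properties": properties,
    }
    # Key by user_id so one user's events stay ordered within a partition.
    producer.send("payment-events", key=user_id.encode(), value=event)

publish_payment_event("deposit_failed", "u_123",
                      {"provider": "provider_b", "amount": 200, "method": "upi"})
```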
Game and Contest Service
The Game or Contest Service owns contest participation.
It records events such as:
- contest viewed
- contest joined
- team created
- wallet spent
- match entered
- contest abandoned
This service provides downstream outcome data.
For a causal question like:
```
What is the impact of deposit failure on future spending?
```
the treatment event comes from the Payment Service, but the outcome may come from the Game, Contest, or Wallet services.
That is why event consistency matters. If services emit events with inconsistent user IDs, timestamps, or metadata, causal analysis becomes unreliable.
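As an illustration, a treatment-to-outcome join might look like the pandas sketch below. It assumes consistent `user_id` and timestamp columns across the two event streams and, for simplicity, one failure per user:

```python
import pandas as pd

# failures: one row per treated user  (user_id, failed_at)
# spend:    one row per spend event   (user_id, spent_at, amount)
def outcome_after_treatment(failures: pd.DataFrame, spend: pd.DataFrame,
                            window_days: int = 7) -> pd.DataFrame:
    joined = failures.merge(spend, on="user_id", how="left")
    in_window = (joined["spent_at"] > joined["failed_at"]) & (
        joined["spent_at"] <= joined["failed_at"] + pd.Timedelta(days=window_days)
    )
    joined.loc[~in_window, "amount"] = 0.0   # spend outside the window doesn't count
    out = joined.groupby("user_id", as_index=False)["amount"].sum()
    return out.rename(columns={"amount": "wallet_spent_7d"})
```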
Kafka Event Bus
The Kafka Event Bus is the nervous system of the architecture.
Every important backend action is published as an event.
The event bus decouples real-time services from analytics systems.
The Payment Service does not need to know how causal inference works. It only needs to emit clean events.
The Causal Inference Platform does not need to call the Payment Service directly. It reads from the data platform.
A typical event contains:
```
event_id
event_type
user_id
timestamp
service_name
experiment_key
variant
properties
```
Good event design is what makes the rest of the system possible.
If the platform does not know who received which treatment, when it happened, and what happened afterward, it cannot estimate impact.
Data Warehouse
The Data Warehouse stores historical product behavior.
Kafka events flow into the warehouse through ingestion jobs or streaming pipelines.
The warehouse keeps data such as:
- payment attempts
- payment failures
- contest joins
- wallet spend
- experiment exposures
- feature flag decisions
- user sessions
- device and region metadata
This is where analysts and causal jobs can query large windows of user behavior.
For experimentation, the warehouse helps answer:
```
Did treatment outperform control?
```
For causal inference, the warehouse helps answer:
```
Among similar users, what changed after one group experienced a treatment?
```
Those are related questions, but they are not the same.
Feature Store
The Feature Store converts raw events into user-level features.
Raw events are too granular for causal matching. The platform needs summarized covariates that describe users before the treatment happened.
For the deposit failure example, useful features may include:
- deposit attempts in the last 30 days
- deposit success rate before treatment
- wallet spend before treatment
- contest joins before treatment
- preferred payment method
- device type
- region
- app version
- account age
- language
These features are used to compare treated users with similar untreated users.
This is the heart of observational causal inference.
The system is trying to answer:
```
If this user had not experienced a deposit failure, what would likely have happened?
```
Since we cannot replay reality, we build a credible comparison group.
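A minimal sketch of that feature computation in pandas, assuming a raw `deposits` table with `user_id`, `ts`, and `succeeded` columns and, for simplicity, a single global treatment cutoff rather than per-user treatment times:

```python
import pandas as pd

def pre_treatment_features(deposits: pd.DataFrame,
                           cutoff: pd.Timestamp,
                           lookback_days: int = 30) -> pd.DataFrame:
    """Summarize only behavior that happened strictly before the cutoff,
    so that post-treatment behavior cannot leak into the covariates."""
    start = cutoff - pd.Timedelta(days=lookback_days)
    window = deposits[(deposits["ts"] >= start) & (deposits["ts"] < cutoff)]
    return window.groupby("user_id").agg(
        deposit_attempts_30d=("ts", "count"),
        deposit_success_rate=("succeeded", "mean"),
    ).reset_index()
```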
Causal Inference Platform
The Causal Inference Platform is the offline analysis system.
It lets a PM or analyst define a causal question through a low-code interface or API.
Example:
```
Treatment: deposit_failed
Outcome: wallet_spent_7d
Population: users who attempted a deposit
Time window: last 30 days
Covariates: region, device, prior spend, prior joins, account age
Method: propensity score matching
```
The platform then builds treatment and control groups.
Treatment group:
```
Users who experienced a deposit failure
```
Control group:
```
Similar users who attempted a deposit but did not experience a failure
```
The hard part is similarity.
Users who experience payment failures may not be random. They may be from different regions, banks, devices, app versions, or network conditions. They may also behave differently before the failure.
If the platform simply compares failed users against all successful users, the estimate may be biased.
That is why the platform uses matching or adjustment.
Propensity Score Matching
Propensity score matching is one common way to build a better control group.
The platform estimates each user's probability of receiving the treatment based on pre-treatment features.
In this example:
```
probability of experiencing deposit failure
```
Then it matches treated users with untreated users who had similar propensity scores.
Simplified flow:
```
collect pre-treatment features
estimate probability of treatment
split users into treated and untreated groups
match similar users
compare downstream outcomes
estimate treatment effect
```
The final result may look like:
```
Users with deposit failures spent Rs 18.70 less over the next 7 days compared with matched users.
```
This is not magic. It is still an estimate.
The quality depends on whether the platform captured the important confounding variables. If a hidden factor affects both deposit failure and wallet spend, the result can still be biased.
But it is much better than comparing raw averages.
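A minimal propensity score matching sketch with scikit-learn, following the flow above: fit a propensity model, match each treated user to the nearest untreated user, and average the outcome differences. It is deliberately simplified; a production version would add calipers, overlap checks, and variance estimates:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_effect(X: np.ndarray, treated: np.ndarray, y: np.ndarray) -> float:
    """Estimate the average effect of treatment on the treated via 1-NN matching."""
    # Estimate each user's probability of treatment from pre-treatment features.
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    t = np.where(treated == 1)[0]
    c = np.where(treated == 0)[0]
    # Match each treated user to the untreated user with the closest score
    # (matching with replacement, no caliper, to keep the sketch short).
    nn = NearestNeighbors(n_neighbors=1).fit(ps[c].reshape(-1, 1))
    _, idx = nn.kneighbors(ps[t].reshape(-1, 1))
    matched = c[idx.ravel()]
    # Compare downstream outcomes between treated users and their matches.
    return float(np.mean(y[t] - y[matched]))
```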
Counterfactual Estimation
Causal inference is really about counterfactuals.
The platform is asking:
```
What would have happened to these same users if the deposit failure had not occurred?
```
We cannot observe that alternate reality directly.
So the system constructs a proxy using similar users who did not receive the treatment.
This is why the control group is so important.
Bad control group:
```
All users without deposit failures
```
Better control group:
```
Users with similar prior behavior, region, device, payment intent, and account history who attempted a deposit but succeeded
```
The better the control group, the more credible the causal estimate.
Causal Impact Report
After matching and estimation, the platform produces a report.
A useful report should include:
- treatment definition
- outcome definition
- number of treated users
- number of matched control users
- estimated impact
- confidence interval
- segment-level breakdown
- data quality warnings
- method used
- assumptions
For example:
```
Question:
What is the impact of deposit failure on 7-day wallet spend?

Finding:
Matched users who experienced deposit failure spent Rs 18.70 less over the next 7 days.

Interpretation:
Deposit reliability likely has a measurable downstream revenue impact.
```
The report should not only give a number. It should explain how credible the number is.
Dashboard and PM Decision
The dashboard is where the causal result becomes a product decision.
A PM or analyst may use the report to decide:
- prioritize payment retry flows
- shift traffic away from weak payment providers
- improve observability for failed deposits
- create region-specific fallback flows
- build a user recovery campaign after failed deposits
This is where causal inference becomes operational.
The goal is not to produce academic analysis. The goal is to make better product and business decisions when A/B testing is not enough.
Why This Is Different From Only A/B Testing
A/B testing answers questions like:
```
What happened when we randomly assigned users to variant A or variant B?
```
Causal inference answers questions like:
```
What was the likely impact of an event or condition that we could not randomly assign?
```
Both are valuable.
A mature experimentation platform uses both.
A/B testing is stronger when clean randomization is possible.
Causal inference is useful when the system must reason from observational data.
Bandits are useful when the system needs to optimize continuously while learning.
Together, they create a dual-track experimentation system.
Why Heterogeneous Treatment Effects Matter
India is not one uniform user base.
Users vary by:
- state
- language
- city tier
- network quality
- device quality
- payment method
- sports preference
- income pattern
- digital comfort
A feature that works in Mumbai may fail in rural Rajasthan.
A payment provider that performs well for one bank may perform poorly for another.
A contest recommendation that improves engagement for power users may confuse new users.
This is why average treatment effect is often not enough.
The platform needs segment-level analysis.
Instead of only asking:
```
Did this work overall?
```
the better question is:
```
For whom did this work, where did it fail, and why?
```
That is where causal inference and feature-rich data infrastructure become powerful.
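Concretely, a segment-level breakdown can reuse the matched estimate within each segment instead of running it only once overall. In this hypothetical sketch, `psm_effect` is the matching function sketched earlier, and the column names are assumptions:

```python
import pandas as pd

def effects_by_segment(df: pd.DataFrame, covariate_cols, segment_col="region"):
    """Run the matched estimate once per segment instead of only overall."""
    results = {}
    for seg, g in df.groupby(segment_col):
        results[seg] = psm_effect(
            g[covariate_cols].to_numpy(dtype=float),
            g["treated"].to_numpy(),
            g["wallet_spent_7d"].to_numpy(dtype=float),
        )
    return results  # maps segment -> estimated effect on the treated
```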
The Big Architectural Lesson
The interesting part of this architecture is not one specific algorithm.
The interesting part is the separation of responsibilities.
Real-time services make product decisions quickly.
The event bus records what happened.
The warehouse stores historical behavior.
The feature store turns raw events into comparable user profiles.
The causal inference platform estimates impact when randomization is not available.
The dashboard turns analysis into product action.
This separation lets the company move fast without losing the ability to learn.
Final Takeaway
Dream11-style inference architecture is not just an A/B testing system.
It is a decision intelligence system.
Feature flags control behavior.
Experiments measure randomized changes.
Bandits optimize live operational choices.
Causal inference estimates the impact of things that cannot be randomized.
For a high-scale fantasy sports platform, that combination matters because product, payments, and game behavior are deeply connected. A payment failure is not just a failed transaction. It can change trust, contest participation, wallet spend, and long-term retention.
The architecture exists to answer one hard question again and again:
```
What actually caused the user or business outcome we care about?
```
That is the real job of a causal inference platform.
