Introduction
Most product teams understand A/B testing. Ship two variants, randomly split users, compare a key metric, and keep the winning version.
But a company like Dream11 cannot answer every important business question with a clean A/B test.
You can A/B test a new contest card. You can A/B test an onboarding flow. You can A/B test payment copy or ranking logic.
But you cannot randomly assign users to deposit failures just to measure what happens next.
That is where a causal inference platform becomes useful. It gives product, growth, data, and engineering teams a way to estimate cause and effect from observational data, especially when controlled randomization is not possible, not ethical, or too risky.
This blog explains how a Dream11-style backend and infra architecture could work at a high level.
It is not a claim about Dream11's exact internal implementation. It is a theoretical reconstruction of the kind of architecture needed to support feature flags, experimentation, payment optimization, and causal inference at fantasy sports scale.
The Core Problem
Dream11 operates in a high-frequency, high-stakes consumer environment.
Users join contests, deposit money, pick teams, react to match schedules, receive notifications, and make decisions under time pressure. A small backend decision can change user behavior immediately.
Examples:
- Which payment provider should handle this deposit?
- Should this user see a retry flow after a failed payment?
- Does a deposit failure reduce future contest participation?
- Does a contest recommendation model increase wallet spend?
- Does a feature work differently in Mumbai compared to rural Rajasthan?
Some of these questions can be answered through A/B tests. Others cannot.
The architecture therefore needs two tracks.
The first track is real-time experimentation. This is used when the platform can safely randomize users.
The second track is causal inference. This is used when the platform needs to reason from real user behavior that already happened.
High-Level Architecture
At a high level, the system has six major parts:

- real-time product services behind an API gateway (feature flags, payments, contests)
- a Kafka event bus
- a data warehouse
- a feature store
- a causal inference platform
- a dashboard where results become decisions
The important idea is that the real-time product system and the offline inference system are connected through events.
Every decision, exposure, payment attempt, contest join, wallet spend, and failure becomes data. That data is later used to understand what actually caused business impact.
API Gateway
The API Gateway is the entry point for user-facing requests.
It receives traffic from mobile apps and web clients, then forwards requests to backend services such as payment, contest, wallet, recommendation, and user services.
In this architecture, the gateway does not own experimentation logic. It simply carries request context forward.
Typical context includes:
- user ID
- region
- device type
- app version
- language
- signup age
- wallet state
- traffic source
This context matters because experimentation and causal inference are not only about averages. A feature may help one user segment and hurt another.
For example, a payment retry flow may improve conversion for urban Android users but confuse low-bandwidth users in smaller towns. If the platform only looks at the global average, it may miss that difference.
Feature Flag Service
The Feature Flag Service is the runtime control plane.
When a backend service needs to make a decision, it asks the Feature Flag Service:
```
For this user, in this context, what should happen?
```
The answer may be simple:
```
upi_retry_enabled = true
```
Or it may be experimental:
```
contest_pricing_experiment = reduced_entry_fee
```
Or it may be adaptive:
```
payment_route_provider = provider_b
```
This service is important because product decisions should not be hardcoded across many services. A central flag service lets teams change behavior safely without redeploying every backend.
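As a sketch, a flag lookup from a backend service might look like the following. The endpoint URL, flag keys, and helper name are illustrative assumptions, not a real SDK:

```python
import requests

FLAG_SERVICE_URL = "http://feature-flags.internal/v1/decide"  # hypothetical endpoint

def get_decision(flag_key: str, context: dict) -> str:
    """Ask the central Feature Flag Service what should happen for this user."""
    resp = requests.post(
        FLAG_SERVICE_URL,
        json={"flag_key": flag_key, "context": context},
        timeout=0.05,  # flag lookups sit on the request hot path, so keep them fast
    )
    resp.raise_for_status()
    return resp.json()["variant"]

# Usage: route a deposit based on the decision.
ctx = {"user_id": "u_123", "region": "MH", "device": "android", "app_version": "6.2"}
provider = get_decision("payment_route_provider", ctx)  # e.g. "provider_b"
```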
Decision Layer
The Decision Layer sits behind the Feature Flag Service.
It decides which kind of logic should be used for a request.
There are usually three types of decisions.
1. Static Config Flags
Static config flags are simple operational switches.
Examples:
- enable a retry flow
- disable a risky provider
- show a new contest format
- change a threshold
These are useful for rollout, rollback, and operational safety.
They are not enough for experimentation because they do not automatically create clean treatment and control groups.
2. A/B Experiment Engine
The A/B Experiment Engine handles controlled randomization.
For each experiment, users are deterministically assigned to variants.
For example:
```
50% -> control
50% -> new contest card
```
The assignment must be stable. If a user is placed in the treatment group today, they should not randomly move to control tomorrow.
The platform also records an exposure event whenever the user actually experiences the experiment.
This distinction matters.
Being eligible for an experiment is not the same as seeing it. A user may be assigned to a variant but never open the relevant screen. Good experimentation systems measure exposure, not just assignment.
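One common way to get stable assignment is to hash the experiment key together with the user ID. The sketch below is illustrative; the function name and equal split are assumptions, not a specific implementation:

```python
import hashlib

def assign_variant(experiment_key: str, user_id: str,
                   variants=("control", "new_contest_card")) -> str:
    """Deterministic bucketing: the same user always gets the same variant."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100               # stable bucket in [0, 100)
    split = 100 // len(variants)                 # equal split for this sketch
    return variants[min(bucket // split, len(variants) - 1)]
```

Exposure is then recorded as a separate event at the moment the user actually sees the variant, so analysis can filter to exposed users rather than merely assigned ones.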
3. Bandit Engine
A bandit engine is used when the system should learn and adapt while traffic is running.
Payment routing is a good example.
Suppose Dream11 has multiple payment providers. Provider A may perform better during normal traffic. Provider B may perform better during match spikes. Provider C may work better for a specific bank or UPI app.
A static A/B test is often too slow for this kind of operational decision.
Instead, a non-stationary multi-armed bandit can continuously shift traffic toward the provider with better real-time success rates.
The bandit loop looks like this:
```
choose provider -> observe success/failure -> update weights -> choose again
```
This is not the same as a normal A/B test.
An A/B test tries to measure the truth cleanly. A bandit tries to maximize outcomes while learning.
That tradeoff is powerful, but it also means bandit data needs careful interpretation. Because traffic allocation changes over time, it can introduce bias if treated like a simple randomized experiment.
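As a rough sketch, a non-stationary router can be built from an epsilon-greedy policy over exponentially decayed success rates; the decay lets old observations fade, so the router can follow providers whose performance shifts during match spikes. Class, parameter, and provider names here are hypothetical:

```python
import random

class PaymentRouter:
    def __init__(self, providers, epsilon=0.05, decay=0.99):
        self.rates = {p: 0.5 for p in providers}  # neutral starting estimate
        self.epsilon = epsilon                     # exploration rate
        self.decay = decay                         # weight on historical observations

    def choose(self) -> str:
        if random.random() < self.epsilon:          # explore occasionally
            return random.choice(list(self.rates))
        return max(self.rates, key=self.rates.get)  # otherwise exploit the best

    def update(self, provider: str, success: bool):
        old = self.rates[provider]
        self.rates[provider] = self.decay * old + (1 - self.decay) * float(success)

# The loop from above, in code:
router = PaymentRouter(["provider_a", "provider_b", "provider_c"])
p = router.choose()             # choose provider
# ... attempt the deposit with p ...
router.update(p, success=True)  # observe success/failure, update weights
```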
Payment Service
The Payment Service is one of the most important backend services in this architecture because payments directly affect revenue and user trust.
When a user tries to deposit money, the Payment Service may ask the Feature Flag Service which route to use.
Example:
```
user starts deposit
payment service asks for provider decision
feature flag service returns provider_b
payment service sends request to provider_b
payment service records success or failure
event is published to Kafka
```
The critical events are:
- deposit attempted
- provider selected
- deposit succeeded
- deposit failed
- retry shown
- retry succeeded
- wallet credited
These events become the raw material for both experimentation and causal inference.
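A sketch of how the Payment Service might emit one of these events, here using the kafka-python client. The broker address, topic name, and field values are illustrative assumptions:

```python
import json
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",          # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_payment_event(event_type: str, user_id: str, properties: dict):
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,                     # e.g. "deposit_failed"
        "user_id": user_id,
        "timestamp": int(time.time() * 1000),
        "service_name": "payment-service",
        "properties": properties,
    }
    # Key by user_id so one user's events stay ordered within a partition.
    producer.send("payment-events", key=user_id.encode(), value=event)

publish_payment_event("deposit_failed", "u_123",
                      {"provider": "provider_b", "amount": 200, "method": "upi"})
```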
Game and Contest Service
The Game or Contest Service owns contest participation.
It records events such as:
- contest viewed
- contest joined
- team created
- wallet spent
- match entered
- contest abandoned
This service provides downstream outcome data.
For a causal question like:
```
What is the impact of deposit failure on future spending?
```
the treatment event comes from the Payment Service, but the outcome may come from the Game, Contest, or Wallet services.
That is why event consistency matters. If services emit events with inconsistent user IDs, timestamps, or metadata, causal analysis becomes unreliable.
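As an illustration, a treatment-to-outcome join might look like the pandas sketch below. It assumes consistent `user_id` and timestamp columns across the two event streams and, for simplicity, one failure per user:

```python
import pandas as pd

# failures: one row per treated user  (user_id, failed_at)
# spend:    one row per spend event   (user_id, spent_at, amount)
def outcome_after_treatment(failures: pd.DataFrame, spend: pd.DataFrame,
                            window_days: int = 7) -> pd.DataFrame:
    joined = failures.merge(spend, on="user_id", how="left")
    in_window = (joined["spent_at"] > joined["failed_at"]) & (
        joined["spent_at"] <= joined["failed_at"] + pd.Timedelta(days=window_days)
    )
    joined.loc[~in_window, "amount"] = 0.0   # spend outside the window doesn't count
    out = joined.groupby("user_id", as_index=False)["amount"].sum()
    return out.rename(columns={"amount": "wallet_spent_7d"})
```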
Kafka Event Bus
The Kafka Event Bus is the nervous system of the architecture.
Every important backend action is published as an event.
The event bus decouples real-time services from analytics systems.
The Payment Service does not need to know how causal inference works. It only needs to emit clean events.
The Causal Inference Platform does not need to call the Payment Service directly. It reads from the data platform.
A typical event contains:
```
event_id
event_type
user_id
timestamp
service_name
experiment_key
variant
properties
```
Good event design is what makes the rest of the system possible.
If the platform does not know who received which treatment, when it happened, and what happened afterward, it cannot estimate impact.
Data Warehouse
The Data Warehouse stores historical product behavior.
Kafka events flow into the warehouse through ingestion jobs or streaming pipelines.
The warehouse keeps data such as:
- payment attempts
- payment failures
- contest joins
- wallet spend
- experiment exposures
- feature flag decisions
- user sessions
- device and region metadata
This is where analysts and causal jobs can query large windows of user behavior.
For experimentation, the warehouse helps answer:
```
Did treatment outperform control?
```
For causal inference, the warehouse helps answer:
```
Among similar users, what changed after one group experienced a treatment?
```
Those are related questions, but they are not the same.
Feature Store
The Feature Store converts raw events into user-level features.
Raw events are too granular for causal matching. The platform needs summarized covariates that describe users before the treatment happened.
For the deposit failure example, useful features may include:
- deposit attempts in the last 30 days
- deposit success rate before treatment
- wallet spend before treatment
- contest joins before treatment
- preferred payment method
- device type
- region
- app version
- account age
- language
These features are used to compare treated users with similar untreated users.
This is the heart of observational causal inference.
The system is trying to answer:
```
If this user had not experienced a deposit failure, what would likely have happened?
```
Since we cannot replay reality, we build a credible comparison group.
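A minimal sketch of that feature computation in pandas, assuming a raw `deposits` table with `user_id`, `ts`, and `succeeded` columns and, for simplicity, a single global treatment cutoff rather than per-user treatment times:

```python
import pandas as pd

def pre_treatment_features(deposits: pd.DataFrame,
                           cutoff: pd.Timestamp,
                           lookback_days: int = 30) -> pd.DataFrame:
    """Summarize only behavior that happened strictly before the cutoff,
    so that post-treatment behavior cannot leak into the covariates."""
    start = cutoff - pd.Timedelta(days=lookback_days)
    window = deposits[(deposits["ts"] >= start) & (deposits["ts"] < cutoff)]
    return window.groupby("user_id").agg(
        deposit_attempts_30d=("ts", "count"),
        deposit_success_rate=("succeeded", "mean"),
    ).reset_index()
```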
Causal Inference Platform
The Causal Inference Platform is the offline analysis system.
It lets a PM or analyst define a causal question through a low-code interface or API.
Example:
```
Treatment: deposit_failed
Outcome: wallet_spent_7d
Population: users who attempted a deposit
Time window: last 30 days
Covariates: region, device, prior spend, prior joins, account age
Method: propensity score matching
```
The platform then builds treatment and control groups.
Treatment group:
```
Users who experienced a deposit failure
```
Control group:
```
Similar users who attempted a deposit but did not experience a failure
```
The hard part is similarity.
Users who experience payment failures may not be random. They may be from different regions, banks, devices, app versions, or network conditions. They may also behave differently before the failure.
If the platform simply compares failed users against all successful users, the estimate may be biased.
That is why the platform uses matching or adjustment.
Propensity Score Matching
Propensity score matching is one common way to build a better control group.
The platform estimates each user's probability of receiving the treatment based on pre-treatment features.
In this example:
```
probability of experiencing deposit failure
```
Then it matches treated users with untreated users who had similar propensity scores.
Simplified flow:
```
collect pre-treatment features
estimate probability of treatment
split users into treated and untreated groups
match similar users
compare downstream outcomes
estimate treatment effect
```
The final result may look like:
```
Users with deposit failures spent Rs 18.70 less over the next 7 days compared with matched users.
```
This is not magic. It is still an estimate.
The quality depends on whether the platform captured the important confounding variables. If a hidden factor affects both deposit failure and wallet spend, the result can still be biased.
But it is much better than comparing raw averages.
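A minimal propensity score matching sketch with scikit-learn, following the flow above: fit a propensity model, match each treated user to the nearest untreated user, and average the outcome differences. It is deliberately simplified; a production version would add calipers, overlap checks, and variance estimates:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_effect(X: np.ndarray, treated: np.ndarray, y: np.ndarray) -> float:
    """Estimate the average effect of treatment on the treated via 1-NN matching."""
    # Estimate each user's probability of treatment from pre-treatment features.
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    t = np.where(treated == 1)[0]
    c = np.where(treated == 0)[0]
    # Match each treated user to the untreated user with the closest score
    # (matching with replacement, no caliper, to keep the sketch short).
    nn = NearestNeighbors(n_neighbors=1).fit(ps[c].reshape(-1, 1))
    _, idx = nn.kneighbors(ps[t].reshape(-1, 1))
    matched = c[idx.ravel()]
    # Compare downstream outcomes between treated users and their matches.
    return float(np.mean(y[t] - y[matched]))
```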
Counterfactual Estimation
Causal inference is really about counterfactuals.
The platform is asking:
```
What would have happened to these same users if the deposit failure had not occurred?
```
We cannot observe that alternate reality directly.
So the system constructs a proxy using similar users who did not receive the treatment.
This is why the control group is so important.
Bad control group:
```
All users without deposit failures
```
Better control group:
```
Users with similar prior behavior, region, device, payment intent, and account history who attempted a deposit but succeeded
```
The better the control group, the more credible the causal estimate.
Causal Impact Report
After matching and estimation, the platform produces a report.
A useful report should include:
- treatment definition
- outcome definition
- number of treated users
- number of matched control users
- estimated impact
- confidence interval
- segment-level breakdown
- data quality warnings
- method used
- assumptions
For example:
```
Question:
What is the impact of deposit failure on 7-day wallet spend?

Finding:
Matched users who experienced deposit failure spent Rs 18.70 less over the next 7 days.

Interpretation:
Deposit reliability likely has a measurable downstream revenue impact.
```
The report should not only give a number. It should explain how credible the number is.
Dashboard and PM Decision
The dashboard is where the causal result becomes a product decision.
A PM or analyst may use the report to decide:
- prioritize payment retry flows
- shift traffic away from weak payment providers
- improve observability for failed deposits
- create region-specific fallback flows
- build a user recovery campaign after failed deposits
This is where causal inference becomes operational.
The goal is not to produce academic analysis. The goal is to make better product and business decisions when A/B testing is not enough.
Why This Is Different From Only A/B Testing
A/B testing answers questions like:
```
What happened when we randomly assigned users to variant A or variant B?
```
Causal inference answers questions like:
```
What was the likely impact of an event or condition that we could not randomly assign?
```
Both are valuable.
A mature experimentation platform uses both.
A/B testing is stronger when clean randomization is possible.
Causal inference is useful when the system must reason from observational data.
Bandits are useful when the system needs to optimize continuously while learning.
Together, they create a dual-track experimentation system.
Why Heterogeneous Treatment Effects Matter
India is not one uniform user base.
Users vary by:
- state
- language
- city tier
- network quality
- device quality
- payment method
- sports preference
- income pattern
- digital comfort
A feature that works in Mumbai may fail in rural Rajasthan.
A payment provider that performs well for one bank may perform poorly for another.
A contest recommendation that improves engagement for power users may confuse new users.
This is why average treatment effect is often not enough.
The platform needs segment-level analysis.
Instead of only asking:
```
Did this work overall?
```
the better question is:
```
For whom did this work, where did it fail, and why?
```
That is where causal inference and feature-rich data infrastructure become powerful.
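Concretely, a segment-level breakdown can reuse the matched estimate within each segment instead of running it only once overall. In this hypothetical sketch, `psm_effect` is the matching function sketched earlier, and the column names are assumptions:

```python
import pandas as pd

def effects_by_segment(df: pd.DataFrame, covariate_cols, segment_col="region"):
    """Run the matched estimate once per segment instead of only overall."""
    results = {}
    for seg, g in df.groupby(segment_col):
        results[seg] = psm_effect(
            g[covariate_cols].to_numpy(dtype=float),
            g["treated"].to_numpy(),
            g["wallet_spent_7d"].to_numpy(dtype=float),
        )
    return results  # maps segment -> estimated effect on the treated
```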
The Big Architectural Lesson
The interesting part of this architecture is not one specific algorithm.
The interesting part is the separation of responsibilities.
Real-time services make product decisions quickly.
The event bus records what happened.
The warehouse stores historical behavior.
The feature store turns raw events into comparable user profiles.
The causal inference platform estimates impact when randomization is not available.
The dashboard turns analysis into product action.
This separation lets the company move fast without losing the ability to learn.
Final Takeaway
Dream11-style inference architecture is not just an A/B testing system.
It is a decision intelligence system.
Feature flags control behavior.
Experiments measure randomized changes.
Bandits optimize live operational choices.
Causal inference estimates the impact of things that cannot be randomized.
For a high-scale fantasy sports platform, that combination matters because product, payments, and game behavior are deeply connected. A payment failure is not just a failed transaction. It can change trust, contest participation, wallet spend, and long-term retention.
The architecture exists to answer one hard question again and again:
```
What actually caused the user or business outcome we care about?
```
That is the real job of a causal inference platform.
