Anthropic · Claude · LLM Infrastructure · System Design · Inference · MLOps

How Anthropic Actually Serves Claude: Not Just vLLM, Not Just GPUs

A public-evidence reconstruction of Claude's inference infrastructure: Dystro routing, prompt caching, service tiers, heterogeneous accelerators, long-context serving, and production evaluation.

May 1, 2026 · 11 min read

The Assumption Everyone Makes

When people talk about Claude infrastructure, they usually simplify it into one of three ideas:

  • Anthropic probably just runs Claude on AWS.
  • Claude is probably served through Bedrock and Vertex AI.
  • They are probably using vLLM or Triton like everyone else.

All three are partly reasonable. AWS is deeply involved. Bedrock and Vertex AI are real distribution channels. vLLM and Triton are excellent pieces of inference infrastructure.

But the public evidence points to something more interesting: Claude looks less like a model sitting behind a normal API gateway and more like a warehouse-scale, multi-cloud, heterogeneous inference platform with its own routing plane, prompt-cache strategy, accelerator-aware scheduling, service-tier enforcement, production evaluation systems, and hardware/software co-design.

The important caveat: Anthropic has not disclosed the exact Claude model topology, parameter counts, serving engine, batching algorithm, shard layout, or internal runtime. This is not a leak. It is a reconstruction from public evidence.

The Core Claim

Anthropic is not simply running Claude on GPUs.

Claude is served across AWS Trainium, NVIDIA GPUs, and Google TPUs. Anthropic has publicly discussed the challenge of serving across different hardware platforms while maintaining strict output-quality equivalence. Claude is also exposed through multiple surfaces: Anthropic's own API, Claude apps, Amazon Bedrock, Google Vertex AI, and Microsoft Foundry/Azure integrations.

That already tells us something important. A normal single-runtime setup is not enough. If the same model family has to run across Trainium, TPUs, NVIDIA GPUs, first-party APIs, Bedrock, Vertex, and Azure, then the hard problem is no longer just running matrix multiplication fast.

The hard problem is routing each request to the right hardware, with the right cache state, under the right service tier, while preserving output quality across different compiler stacks and accelerator families.

The Most Important Clue: Dystro

The strongest public signal comes from Anthropic's inference routing job descriptions. One Anthropic listing describes every Claude request as passing through a routing decision. Not generic round-robin. Not just picking any healthy server. The router considers what is already cached where, which accelerator the request runs best on, and what else is in flight across the fleet.

The same listing names Dystro as the coordination layer between the API surface and the inference engines.

That one detail changes the whole picture. This is not just Kubernetes plus autoscaling. Dystro appears to be a cluster-level inference coordination plane. Its job is to make fleet-wide decisions in real time: where to send a request, whether a cache hit is available, which accelerator is best suited for the workload, how much traffic is already in flight, whether the request belongs to short-context or long-context serving, and how to preserve quality and latency under load.

In other words, Claude's production serving problem is model execution plus distributed scheduling.
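
To make that concrete, here is a minimal sketch of a fleet-wide placement decision under the assumptions above. Every name, field, and weight in it is hypothetical; Anthropic has not published Dystro's actual inputs or scoring.

```python
# Hypothetical sketch of a Dystro-like placement decision. Not Anthropic's code.
from dataclasses import dataclass

@dataclass
class Replica:
    backend: str          # e.g. "trainium", "tpu", "gpu" (illustrative labels)
    cached_prefixes: set  # prefix hashes this replica already holds
    in_flight: int        # requests currently executing
    capacity: int         # rough concurrency budget

def place(prefix_hash: str, preferred_backends: list, replicas: list) -> Replica:
    """Pick a replica by weighing cache locality, accelerator fit, and load."""
    def score(r: Replica) -> float:
        cache_bonus = 1.0 if prefix_hash in r.cached_prefixes else 0.0
        fit_bonus = 0.5 if r.backend in preferred_backends else 0.0
        load_penalty = r.in_flight / max(r.capacity, 1)
        return cache_bonus + fit_bonus - load_penalty
    return max(replicas, key=score)

fleet = [
    Replica("trainium", {"prefix-a"}, in_flight=40, capacity=64),
    Replica("gpu", set(), in_flight=10, capacity=64),
]
print(place("prefix-a", ["trainium"], fleet).backend)  # favors the cached replica
```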

The Architecture in Words

Instead of drawing a diagram, the stack can be explained as a set of decisions that happen on every request.

First, a user, app, or enterprise customer reaches one of Claude's surfaces: claude.ai, the Anthropic API, Amazon Bedrock, Google Vertex AI, or Microsoft Foundry/Azure.

Then the API edge handles TLS termination, authentication, workspace and organization lookup, rate limits, quota checks, data residency policy, service-tier selection, and model/version selection.

After that, Dystro makes the placement decision. It likely considers cache locality, accelerator fit, fleet load, sticky routing or affinity, short-context versus long-context pool selection, and service-tier priority.

The request then lands on an inference pool, which may be backed by AWS Trainium, NVIDIA GPUs, Google TPUs, or dedicated Anthropic data-center capacity. The model runtime handles sharded model execution, KV cache management, batching, sampling, token streaming, and compiler/runtime-specific kernels.

Finally, telemetry, billing, and evaluation systems account for usage, service tiers, cache reads and writes, latency, throughput, quality evals, canaries, and incident detection.

Layer 1: API Edge and Product Surfaces

Claude is not exposed through one product surface. It is available through Anthropic's own API and apps, partner clouds, and enterprise integrations.

That means the API edge likely attaches metadata such as organization, workspace, API key or project, model version, service tier, context length, cache eligibility, partner platform, data residency constraints, quota state, rate-limit state, and request type.

This metadata matters because downstream scheduling cannot be only a machine-health decision. The router needs to know whether a request is synchronous, batch, streaming, tool-use heavy, long-running, long-context, enterprise priority, or partner-platform constrained.
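
A rough illustration of what that per-request envelope could look like as a data structure. The field names are illustrative, not Anthropic's actual schema.

```python
# Sketch of the metadata an API edge might attach before routing (assumed fields).
from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestEnvelope:
    org_id: str
    workspace_id: str
    model_version: str
    service_tier: str            # "standard" | "priority" | "batch"
    context_tokens: int          # drives short- vs long-context pool choice
    cache_eligible: bool         # can this prompt reuse or write a cached prefix?
    partner_surface: str         # "api" | "bedrock" | "vertex" | "azure"
    data_residency: Optional[str] = None   # e.g. a required region, if any
    streaming: bool = True
```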

Layer 2: Service Tiers Are Not Just Pricing

Anthropic publicly exposes Standard, Priority, and Batch service tiers. The safe claim is that these tiers are product and capacity-management mechanisms, not just billing labels.

Priority traffic needs stronger latency and availability behavior. Batch traffic can tolerate asynchronous execution. Standard traffic sits between them. Once a platform exposes differentiated service tiers, the serving stack needs queueing, admission control, placement, and capacity management that understand those tiers.

The unsafe claim would be pretending we know the exact internal queues or scheduling algorithm. We do not. But the product behavior strongly implies that service-tier metadata is part of routing and capacity decisions.
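
As a hedged sketch, this is roughly what tier-aware admission control looks like once the scheduler can see the tier: three logical queues and a drain order. The real queueing discipline is not public; this only illustrates why tier metadata has to reach the scheduler.

```python
# Toy tier-aware admission control. Queue names and thresholds are assumptions.
import collections

QUEUES = {"priority": collections.deque(), "standard": collections.deque(), "batch": collections.deque()}
DRAIN_ORDER = ["priority", "standard", "batch"]

def admit(request_id: str, tier: str, fleet_utilization: float) -> bool:
    """Shed or defer lower-priority work before it reaches the accelerators."""
    if tier == "batch":
        QUEUES["batch"].append(request_id)      # batch is always deferred work
        return True
    if tier == "standard" and fleet_utilization > 0.95:
        return False                            # overloaded: reject standard traffic early
    QUEUES[tier].append(request_id)
    return True

def next_request():
    """Drain higher tiers first; batch only runs when spare capacity exists."""
    for tier in DRAIN_ORDER:
        if QUEUES[tier]:
            return QUEUES[tier].popleft()
    return None
```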

Layer 3: Prompt Caching as a First-Class Primitive

Prompt caching changes the scheduler's question.

It is not enough to ask which server is least loaded. The better question is: which server already has the expensive prefix state for this request?

That means Claude's inference platform likely cares about prompt prefix identity, cache placement, cache lifetime, cache eviction, cache hit rate, cache write cost, cache read cost, routing affinity, and multi-replica cache coordination.

For long prompts, a cache hit can avoid repeating expensive prefill work. But cache locality competes with load balancing. If every request for a hot prefix goes to the same server, cache hit rate improves but load can concentrate dangerously. If traffic is spread too widely, load improves but cache effectiveness falls. Dystro likely exists partly to manage that tradeoff.
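
One common way to manage that tradeoff, sketched below, is bounded affinity: prefer the replica that already holds the prefix, but spill over once its load passes a threshold. The 1.25 bound here is an arbitrary illustrative value, not a known Anthropic parameter.

```python
# Sketch of the cache-locality vs. load-balancing tradeoff (hypothetical policy).
from collections import namedtuple

Replica = namedtuple("Replica", "name cached_prefixes in_flight")

def pick_replica(prefix_hash, replicas):
    mean_load = sum(r.in_flight for r in replicas) / len(replicas)
    for r in replicas:
        # Affinity wins only while the cache-holding replica is not too hot.
        if prefix_hash in r.cached_prefixes and r.in_flight <= 1.25 * mean_load + 1:
            return r, "cache-hit"
    # Otherwise accept the cache miss (re-prefill) on the least-loaded replica.
    return min(replicas, key=lambda r: r.in_flight), "cache-miss"

pool = [Replica("a", {"p1"}, 12), Replica("b", set(), 3), Replica("c", set(), 4)]
print(pick_replica("p1", pool))   # spills to "b" because "a" is too hot
```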

Layer 4: Context Windows Create Different Server Pools

Claude supports normal context, long context, and 1M-token serving paths. These are not just bigger input limits. They change the economics of inference.

Long-context requests increase prefill compute, KV cache memory, routing affinity value, cache-miss cost, batching difficulty, tail-latency risk, memory fragmentation, and eviction pressure.

So Claude likely has differentiated serving pools for short-context serving, long-context serving, different Claude models, different hardware backends, different partner surfaces, and different customer tiers.

The routing layer has to balance at least three goals: keep latency predictable, keep accelerators utilized, and avoid letting expensive long-context workloads degrade ordinary short-context traffic.
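
A minimal sketch of pool selection by context length, assuming such separate pools exist. The thresholds, pool names, and model name are hypothetical.

```python
# Hypothetical pool selection by context length; boundaries are illustrative only.
def select_pool(context_tokens: int, model: str) -> str:
    if context_tokens > 200_000:
        return f"{model}/long-context-1m"   # scarce, memory-heavy capacity
    if context_tokens > 32_000:
        return f"{model}/long-context"
    return f"{model}/short-context"         # the default, latency-optimized pool

print(select_pool(4_000, "claude-x"))       # -> claude-x/short-context
print(select_pool(600_000, "claude-x"))     # -> claude-x/long-context-1m
```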

Layer 5: Heterogeneous Hardware Is the Real Story

Different accelerators have different memory systems, compiler stacks, interconnects, collective communication libraries, kernel libraries, numerical behavior, precision tradeoffs, batching characteristics, and failure modes.

This is why the Claude story is bigger than AWS alone. Anthropic has deep AWS Trainium collaboration, Google TPU capacity, NVIDIA GPU paths, Microsoft/Azure enterprise distribution, and dedicated US infrastructure investment.

The more accurate story is that Anthropic is building a multi-provider compute portfolio: AWS for large-scale Trainium capacity, Google for TPU capacity, NVIDIA GPUs for the serving paths that still run best on that stack, Microsoft/Azure for enterprise distribution, and dedicated infrastructure for long-term capacity control.

That makes the scheduler more complicated, but it also gives Anthropic more leverage over cost, availability, and supply constraints.
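
If a scheduler has to reason about that portfolio, it needs some machine-readable description of each backend. The sketch below is one illustrative way to express it; the groupings and the candidate-pool logic are assumptions, and only the compiler-stack names (Neuron, XLA, CUDA) are real.

```python
# Hypothetical scheduler-facing descriptors for heterogeneous backends.
BACKENDS = {
    "trainium-pool": {"compiler_stack": "neuron", "good_for": {"batch", "long-context"}},
    "tpu-pool":      {"compiler_stack": "xla",    "good_for": {"short-context", "priority"}},
    "gpu-pool":      {"compiler_stack": "cuda",   "good_for": {"short-context", "priority"}},
}

def candidate_pools(workload_tag: str):
    """Return pools whose declared strengths match the workload tag."""
    return [name for name, spec in BACKENDS.items() if workload_tag in spec["good_for"]]

print(candidate_pools("long-context"))   # -> ['trainium-pool'] in this toy fleet
```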

Layer 6: The Runtime Is Probably Custom

There is no strong public evidence that Claude production is just vLLM.

vLLM is excellent. PagedAttention changed how the industry thinks about KV cache memory management. Triton, SGLang, and related serving stacks are important reference points.

But Claude has to decide whether a request should go to Trainium, TPU, or NVIDIA GPU; whether the prefix is already cached somewhere; whether the customer is using Standard, Priority, or Batch; whether the request is short-context, long-context, or 1M-context; whether it came from Anthropic API, Bedrock, Vertex, or Azure; whether there is a regional or data-residency constraint; whether the right model version is available on that hardware; and whether the decision preserves equivalent output quality.

That is more than a single open-source runtime. The safer and stronger claim is that Anthropic likely owns a custom routing and control plane around model runtimes, with hardware-specific execution paths underneath.

Layer 7: Prefill, Decode, and Batching

Serving LLMs is unusual because the workload has two very different phases.

Prefill processes the input prompt. It is compute-heavy, parallelizable, and becomes much more expensive as context grows. Decode generates tokens one at a time. It is latency-sensitive and often limited by memory bandwidth.

Anthropic has not publicly confirmed the exact prefill/decode split for Claude. But public hiring signals and industry constraints make prefill/decode-aware serving likely. A serious serving platform needs to optimize these phases differently, especially when long-context traffic and prompt caching are involved.

Continuous batching is also likely, but the exact implementation is unknown. The right wording is that Claude almost certainly uses advanced batching and request scheduling, while the exact batching algorithm remains private.
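
For intuition, here is a toy continuous-batching loop that treats prefill and decode as distinct steps. It is not Anthropic's scheduler; it only illustrates why the two phases have to be handled differently.

```python
# Toy continuous batching loop: admit work into the decode batch as slots free up.
import collections

waiting = collections.deque()   # requests that still need prefill
decoding = []                   # requests currently generating tokens

def prefill(req):                # placeholder for running the whole prompt through the model
    req["pos"] = 0

def decode_one(req):             # placeholder for a single, bandwidth-bound decode step
    req["pos"] += 1
    return "<eos>" if req["pos"] >= req["max_tokens"] else "tok"

def step(max_batch: int = 8):
    """One scheduler iteration: top up the decode batch, then decode one token each."""
    # Admit new requests whenever decode slots free up (continuous batching),
    # instead of waiting for the whole batch to finish.
    while waiting and len(decoding) < max_batch:
        req = waiting.popleft()
        prefill(req)             # compute-heavy phase, done once per request
        decoding.append(req)
    for req in list(decoding):
        if decode_one(req) == "<eos>":
            decoding.remove(req)

waiting.append({"max_tokens": 3})
for _ in range(5):
    step()
print("done" if not decoding and not waiting else "still running")
```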

Layer 8: Compiler Bugs Prove How Deep Serving Goes

Anthropic's public postmortems are unusually valuable because they show that production inference quality depends on much more than sending a request to a model and streaming tokens back.

Serving quality can be affected by compiler behavior, numerical precision, distributed sorting, sampling implementation, batch size, model configuration, accelerator-specific behavior, and server configuration.

That is why production LLM serving needs evaluation in the serving path. Traditional backend monitoring asks whether the request returned 200, whether latency was acceptable, whether error rate increased, and whether machines were overloaded.

LLM production monitoring has to ask more: did model quality regress, did sampling behavior change, did one hardware backend produce different outputs, did a compiler optimization alter behavior, did long-context routing degrade short-context outputs, and did a rollout affect only one model/platform/customer tier?
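
A sketch of what an eval-in-the-serving-path check could look like: replay a fixed probe set against two backends and flag divergence. The probes, the comparison, and the threshold are all hypothetical stand-ins.

```python
# Hypothetical cross-backend parity check; real quality evals are far richer.
def run_parity_check(probes, backend_a, backend_b, max_divergence=0.02):
    """Compare outputs of the same model served on two hardware backends."""
    mismatches = sum(1 for p in probes if backend_a(p) != backend_b(p))
    rate = mismatches / len(probes)
    return {"divergence": rate, "ok": rate <= max_divergence}

# Toy backends standing in for, say, a GPU path and a Trainium path.
gpu_path = lambda p: p.upper()
trn_path = lambda p: p.upper() if p != "probe-7" else "DRIFTED"
print(run_parity_check([f"probe-{i}" for i in range(10)], gpu_path, trn_path))
# -> {'divergence': 0.1, 'ok': False}: a 10% mismatch fails the gate
```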

Layer 9: Sticky Routing and Fault Domains

Cache-aware serving naturally pushes a system toward sticky routing. If a server already has a useful prompt prefix in memory, sending related requests back there can improve latency and reduce compute waste.

But affinity has a cost. It can concentrate traffic, worsen hot spots, and increase blast radius if routing is wrong. Low-confidence routing decisions can create bad load patterns, while overly aggressive failover can destroy cache locality.

The tradeoff is simple: affinity improves cache locality and efficiency, but affinity can worsen blast radius. A production router has to balance both.
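
One hedged way to cap that blast radius is to pin a hot prefix to a small affinity set spread across fault domains, as sketched below. The hashing scheme and set size are illustrative choices, not a known design.

```python
# Hypothetical bounded-affinity selection across fault domains.
import hashlib

def affinity_set(prefix_hash: str, replicas_by_domain: dict, k: int = 2):
    """Pick k replicas for this prefix, each from a different fault domain."""
    chosen = []
    for domain, replicas in sorted(replicas_by_domain.items()):
        if len(chosen) == k:
            break
        seed = int(hashlib.sha256(f"{prefix_hash}:{domain}".encode()).hexdigest(), 16)
        chosen.append((domain, replicas[seed % len(replicas)]))
    return chosen

fleet = {"zone-a": ["a1", "a2"], "zone-b": ["b1", "b2"], "zone-c": ["c1"]}
print(affinity_set("prefix-42", fleet))   # two replicas, two different zones
```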

Layer 10: Canarying and Rollouts Are Harder for LLMs

A Claude rollout has to validate more dimensions than a normal backend deploy: latency, throughput, cost per token, overloaded errors, output quality, safety behavior, sampling correctness, long-context behavior, platform parity, cache behavior, and regression on internal evals.

This is why observability and evaluation are part of serving, not separate research concerns. A model rollout can look healthy at the HTTP layer while being wrong at the model-behavior layer.

For a system spanning multiple clouds and accelerator families, canarying also has to answer whether a change is safe across hardware backends, compiler versions, partner surfaces, and customer tiers.
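
A simple sketch of such a gate, with model-behavior checks sitting next to the usual HTTP-level ones. Metric names and thresholds are hypothetical.

```python
# Hypothetical canary gate mixing serving metrics with model-behavior metrics.
GATES = {
    "p99_latency_ms": lambda base, canary: canary <= base * 1.10,
    "error_rate":     lambda base, canary: canary <= base + 0.001,
    "eval_score":     lambda base, canary: canary >= base - 0.005,  # quality must not regress
    "backend_parity": lambda base, canary: canary >= 0.99,          # agreement across hardware
}

def canary_ok(baseline: dict, canary: dict) -> bool:
    return all(check(baseline[m], canary[m]) for m, check in GATES.items())

baseline = {"p99_latency_ms": 900, "error_rate": 0.002, "eval_score": 0.81, "backend_parity": 1.0}
canary   = {"p99_latency_ms": 950, "error_rate": 0.002, "eval_score": 0.79, "backend_parity": 0.995}
print(canary_ok(baseline, canary))   # False: the eval score regressed too far
```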

Layer 11: Compliance and Enterprise Isolation

Claude's infrastructure also has to support enterprise expectations around organization isolation, workspace boundaries, API key and project separation, auditability, rate limits, data handling policy, regional availability, partner-platform controls, priority capacity, and usage reporting.

Those requirements affect infrastructure design. They influence where requests can run, which logs and metrics are retained, how quotas are enforced, how partner platforms integrate, and how incident response is scoped.

This is another reason a stock runtime alone is not the full story. Enterprise AI infrastructure is model serving plus routing plus tenancy plus policy plus billing plus evaluation.
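
As a small sketch, policy can be modeled as a hard filter that runs before any performance scoring. The region names and policy fields here are illustrative.

```python
# Hypothetical policy-aware candidate filtering: constraints first, scoring later.
def allowed_pools(request, pools):
    out = []
    for pool in pools:
        if request.get("data_residency") and pool["region"] != request["data_residency"]:
            continue                                  # hard constraint: wrong region
        if request["surface"] not in pool["surfaces"]:
            continue                                  # pool not wired to this partner surface
        out.append(pool["name"])
    return out

pools = [
    {"name": "us-pool", "region": "us", "surfaces": {"api", "bedrock"}},
    {"name": "eu-pool", "region": "eu", "surfaces": {"api"}},
]
print(allowed_pools({"data_residency": "eu", "surface": "api"}, pools))  # ['eu-pool']
```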

What Claude Likely Does Internally

High-confidence claims from public Anthropic and cloud-provider sources:

  • Claude is served across multiple product surfaces.
  • Anthropic works deeply with AWS Trainium and Neuron.
  • Claude is also distributed through Google Vertex AI and Microsoft Foundry/Azure.
  • Anthropic exposes service tiers such as Standard, Priority, and Batch.
  • Prompt caching is a first-class product and infrastructure feature.
  • Production serving quality can be affected by compiler behavior, sampling implementation, accelerator behavior, and configuration.

Medium-confidence inferences from public evidence and industry constraints:

  • Dystro functions as a routing and coordination plane.
  • Claude uses cache-aware and accelerator-aware placement.
  • Short-context and long-context workloads are likely separated or heavily differentiated.
  • Prefill/decode-aware optimization is likely.
  • Continuous batching or advanced request scheduling is likely.

Low-confidence or unknown details:

  • Exact Claude parameter counts.
  • Dense versus MoE topology.
  • Exact sharding strategy.
  • Exact serving runtime.
  • Exact batching algorithm.
  • Exact prefill/decode split.
  • Cache eviction policy.
  • Internal RPC protocol.
  • Exact hardware mapping per model.

The Claude Infrastructure Thesis

The public evidence points to a compute-agnostic architecture in the hard systems sense: one model family, multiple clouds, multiple accelerators, multiple product surfaces, multiple service tiers, multiple context windows, different compiler stacks, different routing constraints, and the same expected output quality.

That is the actual infrastructure challenge.

The right framing is not whether Anthropic uses vLLM somewhere. The better framing is that Claude requires a custom coordination layer around heterogeneous compute. Open-source runtimes may exist in parts of the stack, but the hard production problem is global routing, caching, service tiers, hardware selection, evals, canaries, and enterprise controls.

Why This Matters

If you are building serious LLM infrastructure, Claude's public signals raise the right questions:

  • Do you route based on cache locality, or only load?
  • Do you separate short-context and long-context traffic?
  • Do you treat prompt caching as a product feature or just a runtime optimization?
  • Can you maintain quality parity across different hardware?
  • Do you monitor model behavior, not just server health?
  • Can your scheduler understand service tiers?
  • Can you move workloads across hardware types without breaking output quality?
  • Do you know when a compiler optimization changes model behavior?
  • Can your system survive a bad routing change without expanding the blast radius?

These are the problems that separate a demo inference server from a production LLM platform.

Final Summary

Claude is not just a model on AWS. Claude is not just vLLM at scale.

Claude looks like a custom, heterogeneous inference platform built around:

  • Dystro as a routing and coordination layer
  • prompt-cache-aware scheduling and accelerator-aware placement
  • Standard, Priority, and Batch service lanes
  • short-context and long-context serving pools
  • AWS Trainium, NVIDIA GPU, and Google TPU deployments
  • deep Trainium/Neuron collaboration with AWS
  • Google/Broadcom TPU capacity expansion
  • Microsoft Foundry/Azure enterprise distribution
  • dedicated US infrastructure investment
  • production quality evaluation, canary deployment, and platform-specific compiler and sampling validation

Anthropic has not published every internal detail. But the public evidence is enough to say this: the hard part of Claude infrastructure is not only running the model. It is coordinating model execution across caches, accelerators, clouds, service tiers, long-context workloads, enterprise boundaries, and quality guarantees.

Source Notes

This reconstruction is based on public Anthropic and cloud-provider material, including Anthropic's postmortem on Claude serving issues, Anthropic API service-tier documentation, Anthropic's prompt caching announcement, Anthropic's Amazon compute announcement, Anthropic's Google/Broadcom partnership announcement, Anthropic's Microsoft Foundry announcement, Anthropic's US infrastructure announcement, AWS Trainium/Neuron documentation, Amazon Bedrock Claude documentation, and open-source inference references such as vLLM, Triton, and SGLang.

Bhupesh Kumar

Backend engineer building scalable APIs and distributed systems with Node.js, TypeScript, and Go.