# System design for LLM services: scaling, cost, latency
This article connects the earlier parts of the course into a production system design view.
From LLM theory, we will reuse the key mechanics that drive performance: tokenization cost, prefill vs decode, and KV cache.
From fine-tuning and adaptation, we will treat the model as one interchangeable component (base model + adapters) that must be deployed, versioned, and rolled back.
From RAG, we will treat retrieval as a separate latency/cost block that must be measured and scaled independently.

The goal for interviews: demonstrate that you can design an LLM service under constraints and explain trade-offs in SLO/SLA terms (latency p95, throughput, cost per successful task, reliability).
## What interviewers mean by “LLM system design”
A typical prompt in an interview sounds like: “Design a customer support assistant for enterprise docs” or “Design an agent that calls internal tools”. The hidden checklist is usually:
How you decompose the system into components with clear contracts
How you estimate and control latency and cost drivers
How you scale safely (traffic spikes, multi-tenancy, GPU limits)
How you handle failures (fallbacks, timeouts, circuit breakers)
How you measure quality and regressions in production

You are rarely expected to know one specific stack, but it is a strong signal if you can discuss realistic serving engines and orchestration.
References you can safely mention:
vLLM (high-throughput serving with paged attention)
NVIDIA Triton Inference Server (general inference serving)
Hugging Face Text Generation Inference (LLM serving)
Kubernetes Documentation (deployment and autoscaling concepts)

## A mental model of an LLM service
A robust design treats the LLM as one stage in a request pipeline.
*Figure: High-level components and boundaries of an LLM service*
### Core components and their responsibilities
API gateway
- Authentication and authorization
- Rate limiting and quotas (per user, per tenant)
- Input validation (size limits, allowed tools)
Router and policy layer
- Model selection (small vs large, base vs fine-tuned adapter)
- Traffic shaping (canary, A/B tests)
- Safety policies (moderation, tool allowlists)
Retrieval layer (optional, for RAG)
- Query embedding
- Candidate retrieval (sparse/dense/hybrid)
- Reranking
Context builder
- Token budgeting (how many tokens to allocate to retrieved context vs user prompt)
- Deduplication and formatting of citations
- Injection hardening (treat retrieved text as data, not instructions)
Inference layer
- Prefill (encode prompt)
- Decode (generate tokens)
- Batching, KV cache management, streaming
Post-processing
- Schema validation (JSON parse, required fields)
- Citation checks and guardrails
- Retry policy and fallbacks
Observability and feedback
- Metrics, logs, traces
- Sampling for human review
- Online quality signals (task success, escalations)
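These boundaries become concrete when each stage is a function with an explicit contract over a shared request context. A minimal Python sketch (all names and stage bodies are illustrative stubs, not a real framework; in production each stage is a separately scaled, monitored component):

```python
from dataclasses import dataclass, field

@dataclass
class RequestContext:
    """Carries one request through the pipeline; each stage fills in its part."""
    user_id: str
    query: str
    model: str = "small-default"              # set by the router
    chunks: list = field(default_factory=list)  # set by retrieval
    prompt: str = ""                          # built by the context builder
    completion: str = ""

# Stub stages: each touches only its own fields of the context.
def route(ctx):      ctx.model = "large" if len(ctx.query) > 200 else "small-default"
def retrieve(ctx):   ctx.chunks = ["[doc1] refund policy ..."]  # vector search stand-in
def build(ctx):      ctx.prompt = "\n".join(ctx.chunks) + "\nUser: " + ctx.query
def generate(ctx):   ctx.completion = f"({ctx.model}) answer grounded in {len(ctx.chunks)} chunk(s)"
def postprocess(ctx):
    if not ctx.completion:
        ctx.completion = "insufficient information"  # planned fallback, not an error

def handle(ctx: RequestContext) -> RequestContext:
    for stage in (route, retrieve, build, generate, postprocess):
        stage(ctx)
    return ctx

print(handle(RequestContext("u1", "How do refunds work?")).completion)
```

The value of the pattern is that every contract (which fields a stage reads and writes) is explicit, so stages can be swapped, versioned, and load-tested independently.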
## Latency: what it is made of and how to reduce it
### A useful latency decomposition
A simple way to discuss end-to-end latency is:

$$T_{\text{total}} = T_{\text{gateway}} + T_{\text{retrieval}} + T_{\text{prefill}} + T_{\text{decode}} + T_{\text{post}}$$

Where:
$T_{\text{total}}$ is the end-to-end time the user experiences
$T_{\text{gateway}}$ is auth, routing, and request validation time
$T_{\text{retrieval}}$ is embedding + vector search + reranking (if you use RAG)
$T_{\text{prefill}}$ is the time to process the input prompt tokens through the model
$T_{\text{decode}}$ is the time to generate output tokens
$T_{\text{post}}$ is validation, formatting, and any downstream calls

Interviewers like this because it forces you to identify the dominant term instead of “optimizing everything”.
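To identify the dominant term, plug in rough numbers. A back-of-the-envelope sketch (the throughput and fixed-overhead figures are illustrative assumptions, not benchmarks):

```python
def e2e_latency_s(n_in, n_out,
                  gateway_s=0.02, retrieval_s=0.15, post_s=0.03,
                  prefill_tok_per_s=10_000, decode_tok_per_s=50):
    """Sum the five stages. Prefill is parallel over input tokens,
    so it is cheap per token; decode is sequential over output tokens."""
    prefill_s = n_in / prefill_tok_per_s
    decode_s = n_out / decode_tok_per_s
    return gateway_s + retrieval_s + prefill_s + decode_s + post_s

# A RAG-heavy request: 3000 prompt+context tokens, 300 generated tokens
print(round(e2e_latency_s(3000, 300), 2))  # 6.5 s total, of which decode is 6.0 s
```

With these numbers decode dominates, so streaming and output caps matter far more than shaving gateway overhead.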
### The LLM-specific driver: prefill vs decode
From the theory article:
Prefill scales mostly with input tokens (prompt + retrieved context).
Decode scales mostly with output tokens and is typically sequential token-by-token.
KV cache makes decode efficient, but it consumes GPU memory and constrains concurrency.

Practical implications:
Long RAG contexts can blow up $T_{\text{prefill}}$.
Unlimited output lengths can blow up $T_{\text{decode}}$.
Tight token limits and good context budgeting often improve both latency and cost.

### Techniques to reduce latency (and what to say in interviews)
#### Reduce $T_{\text{retrieval}}$
Cache query embeddings for frequent queries.
Use a two-stage retrieval design: cheap retriever, selective reranker.
Degrade gracefully: if the reranker times out, return top-k from the retriever.

#### Reduce $T_{\text{prefill}}$
Reduce prompt size: shorter system prompts, structured policies, remove redundancy.
Reduce RAG context: fewer chunks, better reranking, summarization of retrieved chunks.
Use smaller context window models when the task allows it.

#### Reduce $T_{\text{decode}}$
Stream tokens to the client so time-to-first-token is small even if completion is long.
Limit max output tokens; prefer concise responses by default.
Use constrained decoding and schema validation to reduce retries.

#### Reduce $T_{\text{gateway}}$ and $T_{\text{post}}$
Keep the gateway lightweight; do not run heavy NLP inside it.
Make tool calls explicit and schema-validated.
Use timeouts and circuit breakers for downstream dependencies.

## Throughput and scaling: how to serve many users on limited GPUs
### Why LLM serving is not “just add replicas”
LLM inference on GPUs is constrained by:
GPU memory (model weights + KV cache)
Compute (matrix multiplications)
Scheduling efficiency (batching multiple requests)

The same GPU can be either:
low-latency (small batches, faster per-user response)
high-throughput (large batches, better GPU utilization)

You must choose based on product SLOs.
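To make the memory constraint concrete, estimate the KV cache footprint per sequence. The sketch below assumes a 7B-class decoder in fp16 without grouped-query attention (32 layers, 32 KV heads, head dimension 128); real models vary, and GQA shrinks this substantially:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Two tensors per layer (K and V), each of shape [seq_len, n_kv_heads, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

per_seq = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(per_seq / 2**30)  # 2.0 GiB for one 4096-token sequence
# On an 80 GB GPU holding ~14 GB of fp16 weights, that naively caps concurrency
# at roughly (80 - 14) / 2 ≈ 33 full-length sequences, before paging tricks.
```

This is why "add replicas" is only half the story: every concurrent user buys memory as well as compute.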
### Continuous batching and why it matters
Traditional batching waits for a batch to fill. In LLM decode, requests have different lengths and arrive continuously.
Modern LLM servers often use continuous batching (also called in-flight or iteration-level batching): they continuously add new requests into the execution schedule to maximize GPU utilization while decoding.
This is a strong interview point because it connects directly to cost and throughput.
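A toy scheduler shows the core idea: new requests are admitted at every decode step as slots free up, instead of waiting for the whole batch to finish. This sketches only the scheduling logic; production engines also manage KV cache memory:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of (req_id, n_output_tokens). Returns decode steps used."""
    waiting = deque(requests)
    running = {}          # req_id -> tokens still to generate
    steps = 0
    while waiting or running:
        # Admit new requests whenever a slot is free (no waiting for a full batch)
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode step: every running sequence emits one token
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # retire; the slot is reusable next step
        steps += 1
    return steps

# Mixed lengths: short requests finish and free slots for queued ones mid-flight
print(continuous_batching([("a", 2), ("b", 8), ("c", 2), ("d", 2), ("e", 2)]))  # 8
```

Static batching of the same five requests would take 10 steps (8 for the first batch, gated by its longest member, then 2 for the queued request); continuous batching finishes in 8.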
You can mention serving engines that implement these ideas:
vLLM
Hugging Face Text Generation Inference

### KV cache: the hidden scaling limiter
KV cache stores attention keys and values for previously processed tokens.
Pro: decode becomes much cheaper per new token.
Con: KV cache grows with the number of active sequences and their length, consuming GPU memory.

Scaling implications:
Higher concurrency means more KV cache, which can force smaller batch sizes or fewer parallel users.
Longer prompts (RAG) increase the KV cache footprint, reducing maximum concurrency.

A strong candidate explicitly asks about:
expected concurrency
average input tokens and output tokens
time-to-first-token SLO and completion SLO

### Horizontal scaling patterns
#### Separate control plane from data plane
Control plane: routing, policies, experiments, quotas.
Data plane: high-throughput inference workers.

This separation prevents policy logic from becoming a bottleneck and simplifies GPU worker design.
#### Multi-model routing
Common routing strategies:
By tenant: enterprise tenants get a larger model.
By complexity: short/simple requests go to a smaller model.
By load: overload triggers fallback to a cheaper model.

Key requirement: consistent evaluation so routing does not silently degrade user outcomes.
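These strategies compose into one small, ordered policy function. A hedged sketch (thresholds, tier names, and model names are illustrative):

```python
def pick_model(tenant_tier, prompt_tokens, gpu_load):
    """Order matters: load shedding first, then tenant policy, then complexity."""
    if gpu_load > 0.9:
        return "small"            # overload fallback: shed cost, keep availability
    if tenant_tier == "enterprise":
        return "large"            # contractual quality floor for paying tenants
    if prompt_tokens < 500:
        return "small"            # short/simple requests stay on the cheap model
    return "large"

print(pick_model("free", 1200, gpu_load=0.4))        # large: complex, capacity available
print(pick_model("enterprise", 100, gpu_load=0.95))  # small: overload overrides tenant policy
```

Placing load shedding first is a deliberate choice: availability under overload usually outranks per-request quality, but that ordering should be validated against your SLAs.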
#### Queueing and backpressure
You need a queue when demand exceeds supply.
With no queue: you fail fast (many errors) but keep latency stable for admitted traffic.
With a queue: you admit more requests but risk violating the latency SLO.

In interviews, describe explicit policies:
max queue time
max queue size
priority tiers (interactive vs batch jobs)

## Cost: how to reason about dollars per request
### The main cost drivers
You can model cost per request as:
GPU time (dominant for self-hosted)
Retrieval compute (embeddings + reranker)
Vector DB and storage
Network and logging
Human labeling and evaluation (dominant for alignment and high-stakes domains)

A simple cost proxy for self-hosted inference is:

$$C \approx c_{\text{in}} \cdot N_{\text{in}} + c_{\text{out}} \cdot N_{\text{out}}$$

Where:
$C$ is relative cost
$N_{\text{in}}$ is the number of input tokens (prompt + retrieved context)
$N_{\text{out}}$ is the number of output tokens
$c_{\text{in}}$ and $c_{\text{out}}$ are per-token cost weights (output tokens typically cost more per token than input tokens)

This is not a perfect formula, but it is useful to communicate the key lever: tokens drive cost.
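The proxy is easy to turn into a comparison tool. In the sketch below the per-token weights are illustrative; the output weight is higher because decode is sequential and occupies the GPU longer per token:

```python
def relative_cost(n_in, n_out, c_in=1.0, c_out=4.0):
    # c_out > c_in reflects sequential decode vs highly parallel prefill;
    # the 1:4 ratio is an illustrative assumption, not a measured figure.
    return c_in * n_in + c_out * n_out

rag_heavy = relative_cost(n_in=6000, n_out=400)   # untrimmed retrieved context
trimmed   = relative_cost(n_in=2000, n_out=400)   # reranked, budgeted context
print(rag_heavy, trimmed, round(rag_heavy / trimmed, 2))
```

Context trimming alone roughly halves the bill in this example, which is why token budgeting usually beats hardware tuning as the first cost lever.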
### Practical cost levers
Reduce tokens
- tighter system prompts
- better chunk selection (reranking)
- summarization of context
Reduce model size
- model routing (small model by default)
- distillation for narrow tasks
Increase utilization
- continuous batching
- right-sized GPU instances
- avoid underloaded always-on replicas
Reduce retries
- schema validation + constrained decoding
- better refusal behavior and safer tool calling
## Reliability and failure handling
LLM services fail in more ways than typical CRUD APIs.
### Failure modes you should mention
Model overload: queue growth, timeouts, rising p95
Downstream dependency failures: vector DB or reranker unavailable
Format failures: invalid JSON, missing fields
Safety failures: policy violations, prompt injection through RAG
Data leakage: incorrect ACL filtering in retrieval

### A robust fallback ladder
A strong answer includes a planned degradation path, for example:
If reranker fails, use retriever top-k.
If retrieval fails, answer with “insufficient information” and ask clarifying questions.
If main model is overloaded, route to a smaller model.
If tool calls fail, return a partial result with explicit status.

The key is to keep behavior predictable and measurable.
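The ladder above can be encoded directly, so degradation is a tested code path rather than an incident response. A sketch with illustrative callables standing in for real dependencies:

```python
def answer(query, reranker, retriever, main_model, small_model):
    """Each dependency has an explicit, cheaper fallback behind it.
    All arguments are callables; any may raise on failure or overload."""
    try:
        chunks = reranker(query)
    except Exception:
        try:
            chunks = retriever(query)          # rung 1: skip reranking
        except Exception:                      # rung 2: retrieval is down entirely
            return {"status": "degraded", "text": "insufficient information"}
    try:
        text = main_model(query, chunks)
    except Exception:
        text = small_model(query, chunks)      # rung 3: cheaper model on overload
    return {"status": "ok", "text": text}

def boom(*args):
    raise RuntimeError("dependency down")

result = answer("q", boom, lambda q: ["chunk"],
                boom, lambda q, c: "fallback answer")
print(result)  # {'status': 'ok', 'text': 'fallback answer'}
```

Returning an explicit `status` field keeps degradation measurable: you can alert on the degraded-answer rate instead of discovering it through user complaints.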
## Observability: what to measure in production
### System metrics (engineering health)
Latency: p50, p95, p99 for $T_{\text{total}}$ and per-stage latencies
Throughput: requests per second, tokens per second
Error rates: timeouts, 5xx, validation failures
GPU metrics: utilization, memory, KV cache usage

### Quality metrics (product health)
From the RAG and fine-tuning articles, you should carry over the idea of component-level evals:
Retrieval quality: recall@k on a labeled set
Groundedness: citation coverage, unsupported-claim rate
Task success: completion rate, escalation rate
Safety: disallowed content rate, jailbreak rate

### Logging without leaking data
An interview-ready position:
Log minimal necessary data.
Redact PII.
Store prompts and completions behind strict access control.
Sample logs for review instead of storing everything.

You can mention OWASP guidance for prompt injection as a security anchor:
OWASP Prompt Injection

## RAG-specific system design trade-offs that impact latency and cost
### Token budgeting as a first-class feature
The context builder should explicitly budget tokens, for example:
max retrieved tokens per request
max number of chunks
per-source caps (avoid one huge document dominating)

A good interview answer ties this to prefill cost and to KV cache memory.
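A context builder can enforce these budgets greedily over score-ranked chunks. A sketch (the 4-characters-per-token estimate and all limits are illustrative; use the real tokenizer in production):

```python
def budget_context(chunks, max_tokens=1500, max_chunks=6, per_source_cap=600):
    """chunks: list of (source, text, score), sorted by score descending.
    Greedily packs chunks under a total token budget with per-source caps."""
    est = lambda text: len(text) // 4   # rough chars-per-token heuristic
    picked, total, per_source = [], 0, {}
    for source, text, _score in chunks:
        t = est(text)
        if len(picked) >= max_chunks or total + t > max_tokens:
            break  # or `continue`, to try smaller chunks further down the ranking
        if per_source.get(source, 0) + t > per_source_cap:
            continue                    # stop one document from dominating
        picked.append(text)
        total += t
        per_source[source] = per_source.get(source, 0) + t
    return picked, total

chunks = [("doc_a", "x" * 2000, 0.9), ("doc_a", "y" * 2000, 0.8), ("doc_b", "z" * 1200, 0.7)]
ctx, used = budget_context(chunks)
print(len(ctx), used)  # doc_a's second chunk is skipped by the per-source cap
```

Because `max_tokens` directly bounds prefill work and KV cache growth, this one function is where latency, cost, and concurrency trade-offs meet.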
### Freshness and indexing pipeline
Production RAG requires:
ingestion schedule
document versioning
deletion and invalidation
tenant and ACL enforcement

A common design pattern:
store chunks with metadata (source, version, ACL)
apply ACL filtering before or immediately after retrieval

## Deployment and iteration: prompts, adapters, and safe rollouts
This is where you connect to the fine-tuning article.
### Version everything
Base model version
Adapter version (LoRA/QLoRA) if used
Prompt templates (system/developer prompts)
Retrieval configuration (chunking, embedding model, reranker)
Evaluation datasets and thresholds

### Release strategies
Canary releases: small % traffic, compare metrics
A/B tests: compare end-to-end task success and cost
Rollback: immediate switch to previous model/prompt/retrieval config

*Figure: Safe rollout and rollback for model/prompt/RAG changes*
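One way to make rollback immediate is to pin every versioned artifact in a single release manifest and switch the active pointer atomically. A sketch with hypothetical version identifiers:

```python
# A release pins every component that affects behavior; rollback is a single
# pointer swap to the previous manifest, not five independent reverts.
RELEASES = {
    "v41": {"base_model": "llm-7b-2024-01", "adapter": "support-lora-v3",
            "prompt_template": "sys-prompt-v12", "embedder": "emb-v2",
            "reranker": "rr-v1", "eval_threshold": 0.82},
    "v42": {"base_model": "llm-7b-2024-01", "adapter": "support-lora-v4",
            "prompt_template": "sys-prompt-v13", "embedder": "emb-v2",
            "reranker": "rr-v1", "eval_threshold": 0.82},
}
ACTIVE, PREVIOUS = "v42", "v41"

def rollback():
    """Atomic: all pinned versions change together, so the system never runs
    a new prompt template against an old adapter (or vice versa)."""
    global ACTIVE
    ACTIVE = PREVIOUS
    return RELEASES[ACTIVE]

print(rollback()["adapter"])  # support-lora-v3
```

The same manifest is what a canary compares against: route a small traffic slice to "v42" while "v41" stays active, and promote only if metrics hold.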
## How to answer a system design prompt in an interview
A practical structure that consistently works:
Clarify requirements
- users, traffic, concurrency
- latency SLO (time-to-first-token and total)
- budget and cost constraints
- safety and compliance requirements
Propose a baseline architecture
- draw components and define boundaries
Identify top 2 latency and cost drivers
- usually retrieval and prefill/decode
Discuss scaling plan
- batching strategy
- routing and fallbacks
- queueing and backpressure
Discuss observability and evaluation
- what metrics you will watch
- how you will catch regressions
Discuss security and data governance
- ACL, prompt injection, logging policy
This approach demonstrates that you can build a production service, not just “call an LLM”.
*Figure: A structured whiteboard template for answering LLM system design questions*