If you're still writing system prompts in a single text file and pasting them into an API call, you're operating the way we built websites in 1998 — hand-editing HTML and uploading it via FTP. The production frontier has moved. System prompts are programs now, assembled at runtime from dozens of conditional components, and the teams getting the best results treat them that way.
Claude Code's 40-Component Prompt Machine
Drew Breunig recently published a teardown of how Claude Code constructs its system prompt, and the architecture is genuinely wild. The prompt isn't a document. It's an assembly pipeline with roughly 40 distinct components, each governed by its own inclusion logic.
Three categories of components feed into the final context:
Always-on blocks — foundational behavior rules, coding philosophy, security constraints. These never change.
Conditional blocks — sections that appear or vanish based on session state. REPL mode active? You get a stripped-down tool guidance section. Running in CI? Shell shortcut instructions disappear entirely. Anthropic internal user? The "Output Efficiency" section reads completely differently from the version external users see.
Variable blocks — content that adapts. Custom output styles override the intro section. Language preferences permeate the entire prompt. The verification requirements section only fires when the agent is about to touch three or more files.
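The three-category pattern above can be sketched in a few lines. This is a hypothetical illustration, not Claude Code's actual internals — all names, session fields, and block texts here are invented:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Session:
    """Illustrative session state; every field name is invented for this sketch."""
    repl_mode: bool = False
    ci: bool = False
    files_to_touch: int = 0
    language: str = "English"

# Always-on blocks: included unconditionally, in a fixed order.
ALWAYS_ON = [
    "You are a careful coding assistant.",
    "Never exfiltrate secrets or credentials.",
]

# Conditional blocks: (predicate, text) pairs evaluated against session state.
CONDITIONAL: list[tuple[Callable[[Session], bool], str]] = [
    (lambda s: s.repl_mode, "Tool guidance: REPL mode, reduced tool set."),
    (lambda s: not s.ci, "Shell shortcuts: you may suggest interactive aliases."),
    (lambda s: s.files_to_touch >= 3, "Verification: re-check every file after editing."),
]

# Variable blocks: templates filled from session state.
def variable_blocks(s: Session) -> list[str]:
    return [f"Respond in {s.language}."]

def assemble_prompt(s: Session) -> str:
    parts = list(ALWAYS_ON)
    parts += [text for pred, text in CONDITIONAL if pred(s)]
    parts += variable_blocks(s)
    return "\n\n".join(parts)
```

The point is that the prompt is the *output* of this function, recomputed every session, never a file anyone edits directly.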
The kicker: beyond the prompt itself, Claude Code manages around 50 tool definitions with their own conditional descriptions, plus user-provided CLAUDE.md files, git status snapshots, environment metadata, and conversation history compressed through roughly a dozen different compaction strategies. A cache boundary marker called SYSTEM_PROMPT_DYNAMIC_BOUNDARY splits the stable prefix from the volatile suffix so that Anthropic's prompt caching can cut token costs by up to 90% on the repeated portions.
This isn't prompt engineering. It's prompt software engineering.
Vercel's v0 Does Something Similar
Vercel's coding agent takes a different but philosophically aligned approach. Their team wrote openly about using intent detection — embeddings plus keyword matching — to decide what gets injected into the prompt for each turn.
When the system detects you're working with the AI SDK, it injects version-specific documentation. When it identifies a routing question, it pulls from hand-curated code sample directories designed specifically for LLM consumption. Nothing gets included by default. Everything is earned through relevance detection.
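A minimal sketch of that "earned through relevance" gate, using keyword matching only — v0's real pipeline also uses embedding similarity, which is omitted here to keep the example dependency-free, and the intent names and doc strings are invented:

```python
# Hypothetical intent-to-documentation routing table.
INTENT_DOCS = {
    "ai_sdk": {
        "keywords": {"streamtext", "generatetext", "ai sdk", "usechat"},
        "doc": "AI SDK version-specific documentation goes here.",
    },
    "routing": {
        "keywords": {"route", "middleware", "redirect", "app router"},
        "doc": "Hand-curated routing code samples go here.",
    },
}

def detect_and_inject(user_turn: str) -> list[str]:
    """Return only the doc blocks whose intent was detected this turn."""
    lowered = user_turn.lower()
    injected = []
    for intent in INTENT_DOCS.values():
        if any(kw in lowered for kw in intent["keywords"]):
            injected.append(intent["doc"])
    return injected  # empty by default: nothing is included without a match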
Their design philosophy is blunt: "Your product's moat cannot be your system prompt." The prompt is a steering mechanism that works alongside other pipeline components — LLM Suspense for streaming reliability, autofixers for post-generation correction. No single piece carries the whole load.
The pattern that jumps out across both systems: maximize cache hits by keeping the stable portion large and the dynamic portion small. Vercel explicitly notes they keep injected knowledge consistent to improve prompt-cache utilization and reduce token costs. Anthropic splits their prompt at a cache boundary. Both teams independently converged on the same optimization.
The SPEAR Framework Makes It Academic
A CIDR 2026 paper introduced SPEAR — a framework that promotes prompts to "first-class entities" with an executable prompt algebra supporting reuse and adaptive runtime refinement. The authors' complaint is pointed: most frameworks treat prompts as static strings with zero support for structured management, introspection, or optimization.
SPEAR proposes composable prompt operations. Think of it like SQL for prompt assembly — you define transformations, combine components, and the runtime figures out the optimal execution plan. It's early-stage research, but it signals where the field is heading.
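To make "composable prompt operations" concrete, here is a toy combinator algebra in the spirit of that idea. The operator names (`lit`, `seq`, `when`, `refine`) are invented for this sketch and do not match the paper's actual algebra:

```python
from typing import Callable

# A prompt is modeled as a function of runtime context, not a string.
Prompt = Callable[[dict], str]

def lit(text: str) -> Prompt:
    """A constant prompt fragment."""
    return lambda ctx: text

def seq(*parts: Prompt) -> Prompt:
    """Compose fragments in order."""
    return lambda ctx: "\n".join(p(ctx) for p in parts)

def when(pred: Callable[[dict], bool], part: Prompt) -> Prompt:
    """Include a fragment only when the predicate holds at runtime."""
    return lambda ctx: part(ctx) if pred(ctx) else ""

def refine(part: Prompt, fn: Callable[[str], str]) -> Prompt:
    """Adaptive runtime refinement: post-process an assembled fragment."""
    return lambda ctx: fn(part(ctx))

prompt = seq(
    lit("You are a data assistant."),
    when(lambda c: c.get("sql"), lit("Emit ANSI SQL only.")),
    refine(lit("be concise."), str.capitalize),
)
```

Because prompts are values, a runtime could in principle inspect, rewrite, or reorder this tree before execution — which is the SQL-like "execution plan" analogy.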
Five Layers, Not One String
DextraLabs published a framework that captures the emerging consensus. Production prompt systems decompose into five layers:
System-level intent — behavioral boundaries, what the model may and may not do
Task-level instructions — what it should do right now
Contextual knowledge — retrieved documents, policies, database results
User input — the actual request
Output constraints — format requirements, safety rails, schema enforcement
Each layer evolves independently. Compliance teams own layer one. Product teams own layer two. The RAG pipeline owns layer three. Nobody touches each other's components. Version control, code review, and systematic testing apply to each layer separately.
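A sketch of the five-layer assembly, where each layer could live in its own versioned file owned by a different team. The layer contents and function names are illustrative, not from DextraLabs:

```python
# Static layers, each ownable and versionable independently.
LAYERS = {
    "system_intent": "You may answer billing questions. You may not give legal advice.",
    "task_instructions": "Summarize the customer's open invoices.",
    "contextual_knowledge": "",   # filled by the RAG pipeline at runtime
    "user_input": "",             # the actual request
    "output_constraints": "Respond as JSON matching the invoice-summary schema.",
}

LAYER_ORDER = ["system_intent", "task_instructions", "contextual_knowledge",
               "user_input", "output_constraints"]

def build_prompt(retrieved: str, request: str) -> str:
    """Merge runtime content into the fixed layer skeleton."""
    layers = dict(LAYERS, contextual_knowledge=retrieved, user_input=request)
    return "\n\n".join(f"[{name}]\n{layers[name]}" for name in LAYER_ORDER)
```

The tagged sections make it trivial to diff, test, and review one layer without reading the others.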
This is the "prompt systems" argument in a nutshell: you don't need better prompts, you need prompt architecture.
What This Actually Means for You
If you're building anything beyond a chatbot demo, here's the practical takeaway: stop thinking about your system prompt as text you write and start thinking about it as code you compile.
Concretely:
Audit your current prompt for conditional logic. If you're using the same prompt regardless of user state, session context, or task type, you're leaving performance on the table. Even basic branching — different instructions for different user roles — can meaningfully improve output quality.
Split stable and dynamic sections. Put your behavioral rules and identity at the top where they can be cached. Put context-dependent material after a clear boundary. Your API bill will thank you.
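The stable/dynamic split can be this simple. The marker string below is invented, analogous in spirit to Claude Code's SYSTEM_PROMPT_DYNAMIC_BOUNDARY but not its actual format:

```python
# Everything before the boundary is byte-identical across requests,
# so provider-side prompt caching can reuse the prefix.
STABLE_PREFIX = "\n\n".join([
    "You are a careful coding assistant.",                    # identity
    "Never run destructive commands without confirmation.",   # behavioral rules
])

BOUNDARY = "\n\n--- DYNAMIC ---\n\n"  # hypothetical cache boundary marker

def build(dynamic_context: str) -> str:
    # Volatile material (git status, timestamps, retrieved docs) goes
    # strictly after the boundary so the prefix stays cache-hit friendly.
    return STABLE_PREFIX + BOUNDARY + dynamic_context
```

The discipline is ordering: one timestamp accidentally placed before the boundary invalidates the cached prefix on every request.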
Version your prompt components independently. When the compliance team needs to update safety guardrails, they shouldn't have to touch the same file that contains your few-shot examples.
The irony is that "prompt engineering" as a phrase implies craft — careful wordsmithing, artisanal token selection. What's actually winning in production is engineering in the boring, traditional sense. Modularity. Separation of concerns. Cache optimization. Conditional compilation. The same patterns that made software reliable decades ago, applied to a new substrate.
Nobody hand-edits their Kubernetes manifests in production anymore. Give it another year, and hand-editing system prompts will feel just as quaint.