G360 Technologies

The Prompt Is No Longer the Unit of Design

The Architecture Question

During Google’s Agent Bake-Off, a team let their AI agent calculate compound interest directly. The model hallucinated the math. Google describes what followed as “massive validation errors,” and the root cause was not a bad prompt. The cause was that a probabilistic model was performing a task that required deterministic execution.

The team that won the same challenge used the model to extract parameters and orchestrate the workflow, but routed every calculation to conventional code. The difference was not better prompting. It was a different system architecture.

That distinction, between what the model should reason about and what code should execute, is the engineering question now running through Google’s and Anthropic’s recent agent guidance. The answer is reshaping how production agent systems get designed.

The Short Version

Between January and April 2026, Google published a series of engineering guidance documents that reframe agent development as a systems architecture discipline. Google Research’s multi-agent scaling study quantifies when coordination helps and when it hurts. The Agent Bake-Off distills five patterns from live competition. The Agent Development Kit provides eight canonical design patterns. And the Gemini CLI subagents feature turns agent topology into declarative configuration files.

Anthropic’s “Building Effective Agents” guidance reaches a similar conclusion from the opposite direction: start with the simplest architecture that works and add complexity only when the task requires it.

The convergent argument: production reliability comes from system decomposition, deterministic execution boundaries, and protocol-based integration, not from better prompts for a single monolithic agent.

What the Bake-Off Found and How the Mechanism Works

Google’s Agent Bake-Off (April 14, 2026) distilled five engineering patterns from teams competing on production-style challenges. Google’s framing is direct: prompting a single large agent to handle intent extraction, retrieval, and reasoning all at once is “a fast track to hallucinations and latency spikes.” Google documents “instruction dilution” as the primary failure mode, where accumulated context degrades the model’s ability to follow strict formatting or logic.

The core mechanism is decomposition plus bounded coordination.

Decompose into specialist micro-agents. A supervisor handles intent and planning. Specialists handle bounded execution. Each specialist operates in its own context with its own tools, and returns a consolidated result to the orchestrator. The orchestrator never sees the full execution trace, only the output. This keeps the primary context lean and prevents one specialist’s intermediate work from degrading the next interaction.
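The supervisor/specialist split can be sketched in a few lines. This is a hypothetical skeleton, not Google's ADK API: `Specialist.run` stands in for an LLM call made in a fresh, bounded context, and only its consolidated result ever reaches the orchestrator.

```python
# Hypothetical sketch of supervisor/specialist decomposition: each
# specialist works in its own context with its own tools, and the
# orchestrator sees only consolidated outputs, never execution traces.
from dataclasses import dataclass, field


@dataclass
class Specialist:
    name: str
    tools: list[str]

    def run(self, subtask: str) -> str:
        # Stand-in for an LLM call in an isolated context. Intermediate
        # reasoning would stay here; only the final result escapes.
        return f"[{self.name}] result for: {subtask}"


@dataclass
class Orchestrator:
    specialists: dict[str, Specialist] = field(default_factory=dict)

    def handle(self, plan: list[tuple[str, str]]) -> list[str]:
        # The orchestrator routes subtasks and collects results, keeping
        # its own context lean (the anti-"instruction dilution" property).
        return [self.specialists[name].run(subtask) for name, subtask in plan]


orch = Orchestrator({"retrieval": Specialist("retrieval", ["search"]),
                     "analysis": Specialist("analysis", ["sql"])})
results = orch.handle([("retrieval", "fetch Q3 figures"),
                       ("analysis", "summarize trends")])
```

The key property is structural: no specialist's intermediate work can leak into another specialist's context or into the orchestrator's planning window.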

Route precision tasks to deterministic code paths. The banking challenge failure in the scenario above is the pattern in miniature. The agent’s role is to extract parameters and orchestrate. Conventional code or SQL performs the final computation. This applies to any task where exactness matters: financial calculations, data validation, schema enforcement, unit conversions. It is a systems boundary between probabilistic and deterministic execution.
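The banking example above maps to a two-function boundary. In this sketch, `extract_params` is a hypothetical stand-in for the model's extraction step (hard-coded here); the compound-interest formula itself runs as ordinary code, so the number can never be hallucinated.

```python
# Hedged sketch of the deterministic/probabilistic boundary: the model
# extracts structured parameters, plain code does the arithmetic.

def extract_params(user_request: str) -> dict:
    # Stand-in for an LLM extraction step returning structured fields.
    return {"principal": 10_000.0, "annual_rate": 0.05,
            "compounds_per_year": 12, "years": 10}


def compound_interest(principal: float, annual_rate: float,
                      compounds_per_year: int, years: float) -> float:
    # Deterministic code path: A = P * (1 + r/n) ** (n * t).
    n = compounds_per_year
    return principal * (1 + annual_rate / n) ** (n * years)


params = extract_params(
    "What will $10,000 grow to at 5% APR, compounded monthly, in 10 years?")
final_balance = compound_interest(**params)  # exact, repeatable result
```

The same boundary applies to validation, schema enforcement, and unit conversions: the model decides *which* computation to run and with *what* inputs; code decides the answer.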

Integrate open protocols over custom glue. Google explicitly recommends MCP for tool integration and A2A for agent-to-agent coordination rather than bespoke wrappers for every integration.

Treat multimodality as a native architectural feature. Teams that bolted image processing onto text-only architectures produced worse results than those that integrated multimodal models as a core design element.

Test against real-world failure modes. Move beyond demo-quality evaluation to adversarial inputs and failure recovery.

Google’s ADK translates these principles into eight reusable design patterns: sequential pipeline, coordinator/dispatcher, parallel fan-out, evaluator-optimizer loop, group chat, hierarchical delegation, custom orchestration, and human-in-the-loop. Each maps to a coordination topology rather than a prompt structure.

Gemini CLI’s subagents feature (April 15, 2026) makes these patterns configurable through declarative files. Each subagent is defined as a Markdown file with YAML frontmatter specifying name, description, tools, model, temperature, max_turns, and timeout. Tool access is scoped per subagent. Different subagents can connect to different MCP servers without sharing state. The specialist is no longer a section of a larger prompt. It is a deployable, versionable artifact that can be code-reviewed, committed to a repository, and shared across teams.
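A definition file might look like the following. The frontmatter field names are those listed above; the exact syntax, model identifier, and values here are illustrative assumptions, not copied from Google's documentation.

```markdown
---
name: refund-specialist
description: Handles refund lookups and calculations only.
tools: [refund_api]          # tool access scoped to this subagent
model: gemini-flash          # hypothetical model id
temperature: 0.1
max_turns: 6
timeout: 60
---
You are a refund specialist. Extract the order parameters, call the
refund API, and return a single consolidated result to the orchestrator.
```

Because the file is plain text, the agent topology itself becomes reviewable in a pull request like any other configuration change.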

What the Scaling Study Quantified

Google Research’s “Towards a Science of Scaling Agent Systems” (December 2025) tested the decomposition argument empirically. The study evaluated 180 configurations across five architectures, three LLM families, and four benchmarks, with standardized tools and token budgets to isolate architectural effects.

Error amplification is topology-dependent. Independent agents operating without validation amplified errors up to 17.2x. Centralized coordination, where an orchestrator validates outputs before passing them along, contained amplification to 4.4x.
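The difference between the two topologies can be made concrete with a toy simulation. This is not the study's methodology, just an illustrative sketch: a stand-in agent is occasionally wildly wrong, and the centralized variant runs a cheap deterministic check before letting any output propagate.

```python
# Toy contrast between the two topologies: independent agents pass raw
# outputs downstream; a centralized orchestrator validates each output
# first. Agents and the validity check are hypothetical stand-ins.
import random


def noisy_agent(x: float) -> float:
    # Stand-in agent: usually increments, occasionally catastrophically wrong.
    return x * 100 if random.random() < 0.1 else x + 1


def valid(y: float, x: float) -> bool:
    # Cheap deterministic check the orchestrator can run on each output.
    return abs(y - x) < 10


def independent_chain(x: float, steps: int) -> float:
    for _ in range(steps):
        x = noisy_agent(x)       # errors compound unchecked
    return x


def centralized_chain(x: float, steps: int) -> float:
    for _ in range(steps):
        y = noisy_agent(x)
        if not valid(y, x):      # orchestrator rejects the bad output
            y = x + 1            # and falls back to a safe default
        x = y
    return x
```

In the independent chain, one bad step multiplies every step after it; in the centralized chain, the validation gate bounds how far any single error can travel, which is the mechanism behind the 17.2x versus 4.4x gap.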

Benefits are task-contingent. Centralized coordination improved performance by 80.9% on parallelizable tasks like financial data aggregation. On sequential reasoning tasks, every multi-agent variant degraded performance by 39 to 70%. The agents spent their token budget on coordination overhead rather than problem-solving.

Capability saturation sets a ceiling. Adding coordination overhead produces negative returns when a single agent already performs above approximately 45% on a task.

Google Research also built a predictive model using task properties (sequential dependencies, tool density, decomposability) that identifies the optimal architecture for 87% of unseen configurations. Architecture selection can be a principled engineering decision based on task analysis, not a guess.
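The findings above suggest a decision procedure, even without the paper's fitted model. The thresholds and rules in this sketch are illustrative simplifications drawn from the numbers quoted in this section, not the study's actual predictor.

```python
# Rule-of-thumb architecture selector inspired by (not reproducing) the
# scaling study's findings. Thresholds are illustrative assumptions.

def select_architecture(single_agent_score: float,
                        decomposable: bool,
                        sequential_dependencies: bool) -> str:
    if single_agent_score > 0.45:
        # Capability saturation: coordination overhead yields negative returns.
        return "single-agent"
    if sequential_dependencies:
        # Multi-agent variants degraded sequential reasoning tasks 39-70%.
        return "single-agent"
    if decomposable:
        # Centralized coordination gained 80.9% on parallelizable tasks.
        return "centralized-multi-agent"
    return "single-agent"
```

The point is the shape of the decision, not the exact cutoffs: architecture follows from measurable task properties rather than intuition.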

Anthropic’s “Building Effective Agents” guidance reinforces the caution embedded in these findings. Anthropic distinguishes workflows (LLMs orchestrated through predefined code paths) from agents (LLMs dynamically directing their own process) and recommends starting with workflows wherever possible. Anthropic explicitly warns against framework abstraction: “Incorrect assumptions about what’s under the hood are a common source of customer error.” The recommendation is to increase complexity only when the task demonstrably requires it.

Why This Matters for Engineering Teams

The shift changes what skills agent engineering requires. Prompt engineering remains relevant for individual agent behavior, but the higher-order decisions are now systems decisions: decomposition strategy, coordination topology, deterministic/probabilistic boundaries, tool-access scoping, and protocol integration.

The deterministic/probabilistic boundary is the most underappreciated part of this shift. Google’s Bake-Off results make clear that allowing a model to perform calculations, validation, or data lookups that could be handled by code is an engineering failure, not a prompt failure. Identifying which parts of a workflow should be deterministic and routing them to code paths is a systems design skill.

Cost management becomes architectural. Early adopters report that swarm-style architectures can generate 20 to 100 or more LLM calls per user task, driving teams toward smaller, cheaper models (Gemini Flash, GPT-4o mini) for sub-agent roles while reserving larger models for orchestration. Shopify’s engineering position reflects this: sub-agents with narrow toolsets (a refund agent with access only to the refund API) are reported to be significantly cheaper and more reliable than generalist swarms.
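A quick back-of-the-envelope calculator shows why tiering matters when sub-agent call counts climb. Model names and per-token prices here are hypothetical placeholders, not vendor pricing.

```python
# Illustrative cost-tiering sketch: cheap model for sub-agent roles,
# larger model reserved for orchestration. Prices are made-up placeholders.
TIERS = {
    "orchestrator": {"model": "large-model", "usd_per_1k_tokens": 0.010},
    "subagent":     {"model": "small-model", "usd_per_1k_tokens": 0.001},
}


def estimate_task_cost(subagent_calls: int, tokens_per_call: int,
                       orchestrator_tokens: int) -> float:
    # Sub-agent work is billed at the cheap tier, orchestration at the
    # expensive tier; the split dominates total cost at high call counts.
    sub = (subagent_calls * tokens_per_call / 1000
           * TIERS["subagent"]["usd_per_1k_tokens"])
    orch = (orchestrator_tokens / 1000
            * TIERS["orchestrator"]["usd_per_1k_tokens"])
    return sub + orch


# A swarm-style task: 50 sub-agent calls of ~2k tokens each, plus a
# 5k-token orchestration budget.
cost = estimate_task_cost(subagent_calls=50, tokens_per_call=2000,
                          orchestrator_tokens=5000)
```

Under these placeholder prices, two thirds of the spend is the 100k tokens of sub-agent work; putting that volume on the expensive tier instead would multiply the bill by roughly 7x, which is the economic pressure behind narrow, cheap specialists.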

What Remains Uncertain

Google Research’s predictive model explains about half of the performance variance (R² = 0.513). Tool-heavy tasks cause coordination inefficiencies the model does not fully capture, and the researchers note the need for specialized coordination protocols for tool-intensive tasks.

Google’s Bake-Off patterns come from competition, not production. Live competitive environments surface certain failure modes clearly, but production deployments may reveal others around long-running state management, failure recovery, and cost control at scale.

There is also an unresolved tension between simplicity and capability. Anthropic’s guidance recommends conservative complexity growth. Google’s ADK and Gemini CLI documentation show increasingly sophisticated topologies and coordination surfaces. Neither vendor treats this as a contradiction, but the practical question for engineering teams is clear: how much modularity and coordination can a given task support before the overhead outweighs the benefit? Google Research’s 39 to 70% degradation finding on sequential tasks is a concrete reminder that the answer is not always “more.”

Further Reading

  • Google Developers Blog: “Build Better AI Agents: 5 Developer Tips from the Agent Bake-Off” (April 14, 2026)
  • Google Research: “Towards a Science of Scaling Agent Systems” (December 2025)
  • Google Developers Blog: “Subagents have arrived in Gemini CLI” (April 15, 2026)
  • Google ADK “Eight Essential Multi-Agent Design Patterns”
  • Anthropic: “Building Effective Agents”