G360 Technologies

Every Token Has a Price: Why LLM Cost Telemetry Is Now Production Infrastructure

A team ships an internal assistant that “just summarizes docs.” Usage triples after rollout. Two weeks later, finance flags a spike in LLM spend. Engineering cannot answer basic questions: Which app caused it? Which prompts? Which users? Which model? Which retries or agent loops? The system is working. The bill is not explainable.

This is not a failure of the model. It is a failure of visibility.

Between 2023 and 2025, AI observability and FinOps moved from optional tooling to core production infrastructure for LLM applications. The driver is straightforward: LLM costs are variable per request, difficult to attribute after the fact, and can scale faster than traditional cloud cost controls.

Unlike traditional compute, where costs correlate roughly with traffic, LLM costs can spike without any change in user volume. A longer prompt, a retrieval payload that grew, an agent loop that ran one extra step: each of these changes the bill, and none of them are visible without instrumentation built for this purpose.

Context: A Three-Year Shift

Research shows a clear timeline in how this capability matured:

2023: Early, purpose-built LLM observability tools emerge (Helicone, LangChain’s early LangSmith development). The core problem was visibility into prompts, models, and cost drivers across providers. At this stage, most teams had no way to answer “why did that request cost what it cost.”

2024: LLM systems move from pilot to production more broadly. This is the point where cost management becomes operational, not experimental. LangSmith’s general availability signals that observability workflows are becoming standard expectations, not optional add-ons.

2025: Standardization accelerates. OpenTelemetry LLM semantic conventions enter the OpenTelemetry spec in January 2025. Enterprise LLM API spend grows rapidly. The question shifts from “should we instrument” to “how fast can we instrument.”

Across these phases, “observability” expands from latency and error rates into token usage, per-request cost, prompt versions, and evaluation signals.

How the Mechanism Works

This section describes the technical pattern that research indicates is becoming standard; interpretation and implications follow in the Analysis section.

1. The AI Gateway Pattern as the Control Point

The dominant production architecture for LLM observability and cost tracking is the “AI gateway” (or proxy).

What it does:

  • Sits between applications and model providers (or self-hosted models)
  • Centralizes authentication, routing, rate limiting, failover, and policy enforcement
  • Captures request metadata consistently, instead of relying on each application team to instrument perfectly

Why it matters mechanically:

Because LLM usage is metered at the request level (tokens), the gateway becomes the most reliable place to measure tokens, compute cost, and attach organizational metadata. Without a gateway, instrumentation depends on every team doing it correctly. With a gateway, instrumentation happens once.

Typical request flow:

User request → Gateway (metadata capture) → Guardrails/policy checks → Model invocation → Response → Observability pipeline → Analytics
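As a rough illustration of "instrumentation happens once," the sketch below shows a gateway-style wrapper that attaches organizational metadata and token counts to every request before handing the record to the observability pipeline. The handler shape, field names, fake provider, and print-based exporter are illustrative assumptions, not any particular gateway's API (Python is used for all sketches in this piece).

    import time
    import uuid
    from typing import Callable

    # Hypothetical provider interface: (model, prompt) -> (text, input_tokens, output_tokens)
    ProviderCall = Callable[[str, str], tuple[str, int, int]]

    def emit_to_observability_pipeline(record: dict) -> None:
        print(record)  # stand-in for a real exporter (log, queue, OTLP, ...)

    def gateway_handle(prompt: str, model: str, call_provider: ProviderCall,
                       user_id: str, app_id: str) -> dict:
        """Minimal gateway sketch: capture metadata once, for every request."""
        record = {
            "request_id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "user_id": user_id,
            "app_id": app_id,
            "model": model,
        }
        start = time.perf_counter()
        text, input_tokens, output_tokens = call_provider(model, prompt)
        record.update({
            "latency_s": round(time.perf_counter() - start, 3),
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "success": True,
        })
        emit_to_observability_pipeline(record)
        return {"text": text, "telemetry": record}

    def fake_provider(model: str, prompt: str) -> tuple[str, int, int]:
        return ("summary...", 1_200, 150)  # stand-in for a real model call

    gateway_handle("Summarize the doc", "example-model", fake_provider,
                   user_id="u-42", app_id="doc-assistant")

Because every application routes through this one code path, routing, caching, and policy enforcement can attach to the same record without per-team instrumentation work.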

2. Token-Based Cost Telemetry

Token counts are the base unit for cost attribution.

Typical per-request capture fields:

  • Timestamp, request or trace ID
  • User ID, workspace ID, project/app identifiers
  • Model and provider
  • Input tokens, output tokens (and cache-related token fields where available)
  • Calculated cost fields (input, output, total)
  • Prompt hash or prompt version tag
  • Success or failure flags
  • Agent step identifiers for multi-step workflows

Research emphasizes that the drivers of cost complexity become visible only when costs are measured at this granularity: input versus output token price asymmetry, caching discounts, long-context tier pricing, retries, and fallback routing. None of these show up in aggregate metrics.
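A minimal sketch of per-request cost calculation over these fields, assuming a hypothetical pricing table. Real prices, cache discount mechanics, and long-context tiers vary by provider and change over time, so the numbers here are placeholders.

    from dataclasses import dataclass

    # Hypothetical prices per million tokens; real tables are per provider/model and change.
    PRICING = {
        "example-model": {"input": 3.00, "output": 15.00, "cached_input": 0.30},
    }

    @dataclass
    class LLMRequestRecord:
        model: str
        input_tokens: int
        output_tokens: int
        cached_input_tokens: int = 0  # tokens billed at the cache-discounted rate
        retries: int = 0              # each retry is a separate billable attempt

    def request_cost(rec: LLMRequestRecord) -> float:
        """Compute one request's cost, keeping the input/output price asymmetry explicit."""
        p = PRICING[rec.model]
        fresh_input = rec.input_tokens - rec.cached_input_tokens
        per_attempt = (
            fresh_input * p["input"]
            + rec.cached_input_tokens * p["cached_input"]
            + rec.output_tokens * p["output"]
        ) / 1_000_000
        # Simplification: assumes each retry resends roughly the same payload.
        return per_attempt * (1 + rec.retries)

    rec = LLMRequestRecord("example-model", input_tokens=8_000, output_tokens=500,
                           cached_input_tokens=2_000, retries=1)
    print(f"${request_cost(rec):.4f}")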

3. OpenTelemetry Tracing and LLM Semantic Conventions

Distributed tracing is the backbone for stitching together an LLM request across multiple services.

  • A trace represents the end-to-end request
  • Spans represent individual operations (retrieval, prompt construction, LLM call, tool execution)

OpenTelemetry introduced standardized LLM semantic conventions (attributes) for capturing:

  • Model identifier
  • Prompt token count, completion token count, total token count
  • Invocation parameters
  • Cache-related usage attributes

This matters because it makes telemetry portable across backends (Jaeger, Datadog, New Relic, Honeycomb, vendor-specific systems) and reduces re-instrumentation work when teams change tools.
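A minimal sketch of recording an LLM call as a span using OpenTelemetry's Python API. The gen_ai.* attribute names follow the GenAI semantic conventions at the time of writing and should be checked against the current spec; the token counts and model name are placeholder values.

    # pip install opentelemetry-api opentelemetry-sdk
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

    # Console exporter for illustration; production setups export via OTLP to a backend.
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("llm-app")

    with tracer.start_as_current_span("rag_request"):
        with tracer.start_as_current_span("retrieval"):
            pass  # fetch documents, build context

        with tracer.start_as_current_span("llm_call") as span:
            # Attribute names per the OpenTelemetry GenAI semantic conventions.
            span.set_attribute("gen_ai.request.model", "example-model")
            span.set_attribute("gen_ai.usage.input_tokens", 8000)   # placeholder
            span.set_attribute("gen_ai.usage.output_tokens", 500)   # placeholder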

4. Cost Attribution and Showback Models

Research describes three allocation approaches:

  • Direct attribution to a user/team/app when metadata is available
  • Proportional allocation for shared infrastructure based on usage
  • Heuristic allocation when cost correlates with usage but is not directly metered per request

Operationally, “showback” is the minimum viable step: make cost visible to the teams generating it, even without enforcing chargeback. Visibility alone changes behavior.
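A minimal sketch of a showback calculation combining the first two approaches: direct attribution where team metadata exists, plus proportional allocation of shared costs by token share. Field names and the "unattributed" bucket are illustrative assumptions.

    from collections import defaultdict

    def showback(records: list[dict], shared_cost: float = 0.0) -> dict[str, float]:
        """Direct attribution by team, plus proportional allocation of shared
        infrastructure cost (e.g. a self-hosted gateway) based on token share."""
        direct = defaultdict(float)
        tokens = defaultdict(int)
        for r in records:
            team = r.get("team", "unattributed")  # attribution gaps surface explicitly
            direct[team] += r["cost"]
            tokens[team] += r["input_tokens"] + r["output_tokens"]

        total_tokens = sum(tokens.values()) or 1
        return {
            team: round(direct[team] + shared_cost * tokens[team] / total_tokens, 2)
            for team in direct
        }

    records = [
        {"team": "search", "cost": 120.0, "input_tokens": 9_000_000, "output_tokens": 1_000_000},
        {"team": "assistant", "cost": 60.0, "input_tokens": 4_000_000, "output_tokens": 1_000_000},
    ]
    print(showback(records, shared_cost=50.0))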

What Happens Without This Infrastructure

Consider a second scenario. A product team launches an AI-powered search feature. It uses retrieval-augmented generation: fetch documents, build context, call the model. Performance is good. Users are happy.

Three months later, the retrieval index has grown. Average context length has increased from 2,000 tokens to 8,000 tokens. The model is now hitting long-context pricing tiers. Costs have quadrupled, but traffic has only doubled.

Without token-level telemetry, this looks like “AI costs are growing with usage.” With token-level telemetry, this is diagnosable: context length per request increased, triggering a pricing tier change. The fix might be retrieval tuning, context compression, or a model swap. But without the data, there is no diagnosis, only a budget conversation with no actionable next step.
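A minimal sketch of the kind of check that makes this scenario diagnosable: track average input tokens per request over time and flag when it crosses a long-context pricing threshold. The threshold value, the daily grouping, and the sample numbers are illustrative assumptions.

    from statistics import mean

    # Hypothetical context length above which a long-context pricing tier applies.
    LONG_CONTEXT_THRESHOLD = 4_000

    def context_drift_alerts(daily_input_tokens: dict[str, list[int]]) -> list[str]:
        """Flag days where average context length crossed the long-context tier."""
        alerts = []
        for day, token_counts in sorted(daily_input_tokens.items()):
            avg = mean(token_counts)
            if avg > LONG_CONTEXT_THRESHOLD:
                alerts.append(f"{day}: avg input tokens {avg:.0f} exceeds tier threshold")
        return alerts

    usage = {
        "2025-01-01": [1_900, 2_100, 2_000],
        "2025-04-01": [7_800, 8_200, 8_100],  # retrieval payloads have grown
    }
    print(context_drift_alerts(usage))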

Analysis

Why This Matters Now

Three factors explain the timing:

LLM costs scale with usage variability, not just traffic. Serving a “similar number of users” can become dramatically more expensive if prompts grow, retrieval payloads expand, or agent workflows loop. Traditional capacity planning does not account for this.

LLM application success is not binary. Traditional telemetry answers “did the request succeed.” LLM telemetry needs to answer “was it good, how expensive was it, and what changed.” A 200 OK response tells you almost nothing about whether the interaction was worth its cost.

The cost surface is now architectural. Cost is a design constraint that affects routing, caching, evaluation workflows, and prompt or context construction. In this framing, cost management becomes something engineering owns at the system layer, not something finance reconciles after the invoice arrives.

Implications for Enterprises

Operational implications:

  • Budget and anomaly controls move into runtime. Alerting and thresholds must tie to usage patterns, not just monthly invoices.
  • Governance requires attribution. Without the ability to tie spend to app, team, user segment, and interaction type, it becomes difficult to prioritize fixes or defend ROI decisions.
  • Cross-functional reporting becomes unavoidable. Engineering, platform, and FinOps need a shared scorecard, because cost and performance trade off continuously.

Technical implications:

  • Gateway-first instrumentation becomes a default pattern. It reduces inconsistent logging across teams and makes routing, caching, and policy enforcement implementable once.
  • Tracing becomes a requirement, not a nice-to-have. Without trace stitching, diagnosing whether cost spikes came from retrieval, retries, agent steps, or a provider-level change is guesswork.
  • Metric definitions need standardization. Token counts, cost-per-interaction, and cost-per-outcome only work if teams compute them consistently across environments and products.
  • Evaluation joins observability. Research repeatedly connects observability tooling with continuous evaluation workflows, because “working” does not mean “acceptable.”

The Quiet Risk: Agent Loops

One pattern deserves particular attention. Agentic workflows, where models call tools, evaluate results, and decide next steps, introduce recursive cost exposure.

A simple example: an agent is asked to research a topic. It searches, reads, decides it needs more context, searches again, reads again, summarizes, decides the summary is incomplete, and loops. Each step incurs tokens. Without step-level telemetry and loop limits, a single user request can generate dozens of billable model calls.

Research flags this as an open problem. The guardrails are not yet standardized. Teams are implementing their own loop limits, step budgets, and circuit breakers. But without visibility into agent step counts and per-step costs, even well-intentioned guardrails cannot be tuned effectively.
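A minimal sketch of the homegrown guardrails described here: a per-request step budget and token budget that stop an agent loop before it becomes runaway spend. The limits, the budget class, and the loop shape are illustrative assumptions, not a standardized pattern.

    class BudgetExceeded(Exception):
        pass

    class AgentBudget:
        """Per-request circuit breaker: caps agent steps and total billable tokens."""
        def __init__(self, max_steps: int = 10, max_tokens: int = 50_000):
            self.max_steps = max_steps
            self.max_tokens = max_tokens
            self.steps = 0
            self.tokens = 0

        def charge(self, step_tokens: int) -> None:
            self.steps += 1
            self.tokens += step_tokens
            if self.steps > self.max_steps:
                raise BudgetExceeded(f"step limit {self.max_steps} reached")
            if self.tokens > self.max_tokens:
                raise BudgetExceeded(f"token budget {self.max_tokens} exceeded")

    # Usage inside a hypothetical agent loop:
    budget = AgentBudget(max_steps=5, max_tokens=20_000)
    try:
        while True:
            step_tokens = 6_000  # would come from the provider's usage metadata
            budget.charge(step_tokens)
            # ... call tool or model, decide next step ...
    except BudgetExceeded as e:
        print(f"Agent stopped: {e}")  # step counts and per-step cost also go to telemetry

The budget is only tunable if step-level telemetry exists: without knowing typical step counts and per-step token costs, any limit is a guess.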

Risks and Open Questions

These are open questions that research raises directly, not predictions.

  • Attribution gaps: How reliably can org metadata (team, product, tenant) attach to every request across heterogeneous apps and providers?
  • Provider pricing complexity: How do teams keep pricing tables, tiers, caching discounts, and tokenization differences up to date in cost calculations?
  • Retry and fallback invisibility: How many “successful” responses are actually multiple billable attempts underneath, and are teams measuring that?
  • Agent workflow cost containment: What guardrails prevent recursive tool calls or long multi-step trajectories from becoming runaway spend?
  • Reconciliation: Can internal token ledgers reconcile to provider invoices closely enough to be trusted for chargeback or executive reporting?
  • Tooling portability: Even with OpenTelemetry conventions, how much lock-in remains in evaluation workflows, proprietary metrics, and data schemas?

Further Reading

  • FinOps Foundation: FinOps for AI Working Group materials (including cost tracking guidance)
  • OpenTelemetry: Semantic Conventions specification (LLM-related attributes)
  • LangChain: LangSmith materials and tracing guidance
  • Helicone: Open-source LLM observability approach
  • Portkey: LLM gateway and observability guidance
  • TrueFoundry: AI gateway cost observability and attribution
  • Traceloop: OpenTelemetry-native LLM tracing tooling
  • Braintrust: Observability and evaluation workflows
  • LangWatch: Distributed tracing for LLM workflows
  • InfluxData: AI monitoring and observability concepts
  • Evidently AI: Model monitoring and production patterns
  • LiteLLM documentation: OpenTelemetry integration guidance