G360 Technologies

The Operations Room

Retrieval Is the New Control Plane

A team ships a RAG assistant that nails the demo. Two weeks into production, answers start drifting. The policy document exists, but retrieval misses it. Permissions filter out sensitive content, but only after it briefly appeared in a prompt. The index lags three days behind a critical source update. A table gets flattened into gibberish. The system is up. Metrics look fine. But users stop trusting it, and humans quietly rebuild the manual checks the system was supposed to replace.

This is the norm, not the exception. Enterprise RAG has an awkward secret: most pilots work, and most production deployments underperform. The gap is not model quality. It is everything around the model: retrieval precision, access enforcement, index freshness, and the ability to explain why an answer happened. RAG is no longer a feature bolted onto a chatbot. It is knowledge infrastructure, and it fails like infrastructure fails: silently, gradually, and expensively.

The Maturity Gap

Between 2024 and 2026, enterprise RAG has followed a predictable arc. Early adopters treated it as a hallucination fix: point a model at documents, get grounded answers. That worked in demos. It broke in production.

One pattern keeps recurring among the inflection points that followed: organizations report high generative AI usage but struggle to attribute material business impact. The gap is not adoption. It is production discipline.

The operational takeaway: each of those inflection points is a failure mode that prototypes ignore and production systems must solve. Hybrid retrieval, reranking, evaluation, observability, and freshness are not enhancements. They are the difference between a demo and a system you can defend in an incident review.

How Production RAG Actually Works

A mature RAG pipeline has five stages. Each one can fail independently, and failures compound. Naive RAG skips most of this: embed documents, retrieve by similarity, generate. Production RAG treats every stage as a control point with its own failure modes, observability, and operational requirements.

1. Ingestion and preprocessing

Documents flow in from collaboration tools, code repositories, and knowledge bases. They get cleaned, normalized, and chunked into retrievable units. If chunking is wrong, everything downstream is wrong.

2. Embedding and indexing

Chunks become vectors. Metadata gets attached: owner, sensitivity level, org, retention policy, version. This metadata is not decoration. It is the enforcement layer for every access decision that follows.

3. Hybrid retrieval and reranking

Vector search finds semantically similar content. Keyword search (BM25) finds exact matches. Reranking sorts the combined results by actual relevance. Skip any of these steps in a precision domain, and you get answers that feel right but are not.
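To make this stage concrete, here is a minimal sketch of hybrid retrieval, assuming a vector index and a keyword index that each return ranked document ids and a reranker that scores (query, document) pairs; the interfaces are illustrative, not any particular library's API. Reciprocal rank fusion is used to merge the two candidate lists because it sidesteps calibrating BM25 scores against cosine similarities.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc ids into one fused ranking.

    RRF combines vector and keyword results without having to
    normalize their score scales against each other.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query, vector_index, keyword_index, reranker, top_k=5):
    """Hybrid retrieval: semantic + exact-match candidates, then rerank.

    vector_index.search and keyword_index.search are assumed to return
    doc ids ordered by their own notion of relevance; reranker.score is
    assumed to return a relevance score for a (query, doc_id) pair.
    """
    vector_hits = vector_index.search(query, limit=50)    # semantic similarity
    keyword_hits = keyword_index.search(query, limit=50)  # BM25 / exact terms
    candidates = reciprocal_rank_fusion([vector_hits, keyword_hits])[:20]
    # Rerank the fused candidate set by actual relevance to the query.
    reranked = sorted(candidates, key=lambda d: reranker.score(query, d), reverse=True)
    return reranked[:top_k]
```

In a precision domain, the property that matters is that an exact-match hit from the keyword side cannot be drowned out by a wall of merely similar vector hits before the reranker ever sees it.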
4. Retrieval-time access enforcement

RBAC, ABAC, relationship-based access: the specific model matters less than the timing. Permissions must be enforced before content enters the prompt. Post-generation filtering is too late. The model already saw it.

5. Generation with attribution and logging

The model produces an answer. Mature systems capture everything: who asked, what was retrieved, what model version ran, which policies were checked, what was returned. Without this, debugging is guesswork.

Where Latency Budgets Get Spent

Users tolerate low-single-digit seconds for a response. That budget gets split across embedding lookup, retrieval, reranking, and generation. A common constraint: if reranking adds 200ms and you are already at 2.5 seconds, you either cut candidate count, add caching, or accept that reranking is a luxury you cannot afford. Caching, candidate reduction, and infrastructure acceleration are not optimizations. They are tradeoffs with direct quality implications.

A Hypothetical: The Compliance Answer That Wasn’t

A financial services firm deploys a RAG assistant for internal policy questions. An analyst asks: “What’s our current position limit for emerging market equities?” The system retrieves a document from 2022. The correct policy, updated six months ago, exists in the index but ranks lower because the old document has more keyword overlap with the query. The assistant answers confidently with outdated limits.

No alarm fires. The answer is well-formed and cited. The analyst follows it. The error surfaces three weeks later during an audit. This is not a model failure. It is a retrieval failure, compounded by a freshness failure, invisible because the system had no evaluation pipeline checking for policy currency.

Why This Is Urgent Now

Three forces are converging.

Precision is colliding with semantic fuzziness. Vector search finds “similar” content. In legal, financial, and compliance contexts, “similar” can be dangerously wrong. Hybrid retrieval exists because pure semantic search cannot reliably distinguish “the policy that applies” from “a policy that sounds related.”

Security assumptions do not survive semantic search. Traditional IAM controls what users can access. Semantic search surfaces content by relevance, not permission. If sensitive chunks are indexed without enforceable metadata boundaries, retrieval can leak them into prompts regardless of user entitlement. Access filtering at retrieval time is not a nice-to-have. It is a control requirement.

Trust is measurable, and it decays. Evaluation frameworks like RAGAS treat answer quality like an SLO: set thresholds, detect regressions, block releases that degrade. Organizations that skip this step are running production systems with no quality signal until users complain.

A Hypothetical: The Permission That Filtered Too Late

A healthcare organization builds a RAG assistant for clinicians. Access controls exist: nurses see nursing documentation, physicians see physician notes, administrators see neither. The system implements post-generation filtering. It retrieves all relevant content, generates an answer, then redacts anything the user should not see.

A nurse asks about medication protocols. The system retrieves a physician note containing a sensitive diagnosis, uses it to generate context, then redacts the note from the citation list. The diagnosis language leaks into the answer anyway. The nurse sees information they were never entitled to access. The retrieval was correct. The generation was correct. The filtering was correctly applied. The architecture was wrong.

What Production Readiness Actually Requires

Operational requirements:

Technical requirements:

Five Questions to Ask Before You Ship

If any answer is “I don’t know,” the system is not production-ready. It is a demo running in production.

Risks and Open Questions

Authorization failure modes. Post-filtering is risky if sensitive content reaches the model before the filter runs: once it is in the prompt, redaction cannot guarantee it stays out of the answer.
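To make the architectural point concrete, here is a hedged sketch of retrieval-time enforcement, assuming each chunk carries the access metadata attached at indexing time. The Chunk fields and function names are illustrative, not a specific product's schema.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    allowed_roles: frozenset  # attached at indexing time, e.g. {"nurse", "physician"}
    sensitivity: str          # e.g. "general", "restricted"

def authorized_chunks(chunks, user_roles, allowed_sensitivities=frozenset({"general"})):
    """Drop anything the user is not entitled to BEFORE prompt assembly.

    Filtering here, rather than after generation, means the model never
    sees content the user cannot see, so it cannot leak into the answer.
    """
    return [
        c for c in chunks
        if c.allowed_roles & user_roles and c.sensitivity in allowed_sensitivities
    ]

def build_prompt(question, retrieved_chunks, user_roles):
    # Enforcement happens on the retrieval results, before the model is called.
    permitted = authorized_chunks(retrieved_chunks, user_roles)
    context = "\n\n".join(c.text for c in permitted)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```

The contrast with the healthcare hypothetical is the ordering: the filter runs before prompt assembly, so nothing the user cannot see ever reaches the model, and there is nothing left to redact after generation.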

The Operations Room

Why Your LLM Traffic Needs a Control Room

A team deploys an internal assistant by calling a single LLM provider API directly from the application. Usage grows quickly. One power user discovers that pasting entire documents into the chat gets better answers. A single conversation runs up 80,000 tokens. Then a regional slowdown hits, streaming responses stall mid-interaction, and support tickets spike. There is no central place to control usage, reroute traffic, or explain what happened.

As enterprises move LLM workloads from pilots into production, many are inserting an LLM gateway or proxy layer between applications and model providers. This layer addresses operational realities that traditional API gateways were not designed for: token-based economics, provider volatility, streaming behavior, and centralized governance.

There is a clear evolution. Early LLM integrations after 2022 were largely direct API calls optimized for speed of experimentation. By late 2023 through 2025, production guidance converged across open source and vendor platforms on a common architectural pattern: an AI-aware gateway that sits on the inference path and enforces usage, cost, routing, and observability controls. This pattern appears independently across open source projects (Apache APISIX, LiteLLM Proxy, Envoy AI Gateway) and commercial platforms (Kong, Azure API Management), which suggests the requirements are structural rather than vendor-driven. While implementations differ, the underlying mechanisms and tradeoffs are increasingly similar.

When It Goes Wrong

A prompt change ships on Friday afternoon. No code deploys, just a configuration update. By Monday, token consumption has tripled. The new prompt adds a “think step by step” instruction that inflates completion length across every request. There is no rollback history, no baseline to compare against, and no clear owner.

In another case, a provider’s regional endpoint starts returning 429 errors under load. The application has no fallback configured. Users see spinning loaders, then timeouts. The team learns about the outage from a customer tweet.

A third team enables a new model for internal testing. No one notices that the model’s per-token price is four times higher than the previous default. The invoice arrives three weeks later.

These are not exotic edge cases. They are the default failure modes when LLM traffic runs without centralized control.

How the Mechanism Works

Token-aware rate limiting

LLM workloads are consumption-bound rather than request-bound. A gateway extracts token usage metadata from model responses and enforces limits on tokens, not calls. Limits can be applied hierarchically across dimensions such as API key, user, model, organization, route, or business tag. Common implementations use sliding window algorithms backed by shared state stores such as Redis to support distributed enforcement. Some gateways allow choosing which token category is counted, such as total tokens versus prompt or completion tokens. This replaces flat per-request throttles that are ineffective for LLM traffic.
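As a rough sketch of the mechanism, the class below implements a sliding-window limit on tokens rather than requests, with a pre-flight check based on an estimate and a post-hoc adjustment once actual usage is known. It is illustrative only; a production gateway would shard this state per API key, user, or team and keep it in a shared store such as Redis for distributed enforcement.

```python
import time
from collections import deque

class TokenRateLimiter:
    """Sliding-window limit on tokens consumed, not requests made.

    Tracks (timestamp, tokens) events and rejects work once the window's
    running total would exceed the budget.
    """

    def __init__(self, max_tokens_per_window, window_seconds=60):
        self.max_tokens = max_tokens_per_window
        self.window = window_seconds
        self.events = deque()  # (timestamp, token_count)
        self.total = 0

    def _evict_expired(self, now):
        # Slide the window: drop events older than window_seconds.
        while self.events and now - self.events[0][0] > self.window:
            _, tokens = self.events.popleft()
            self.total -= tokens

    def allow(self, estimated_tokens, now=None):
        """Pre-flight check using an estimate (actual usage is known only later)."""
        now = time.monotonic() if now is None else now
        self._evict_expired(now)
        return self.total + estimated_tokens <= self.max_tokens

    def record(self, actual_tokens, now=None):
        """Post-hoc adjustment once the provider reports real token usage."""
        now = time.monotonic() if now is None else now
        self._evict_expired(now)
        self.events.append((now, actual_tokens))
        self.total += actual_tokens
```

The allow/record split mirrors the predict-then-reconcile pattern described under streaming preservation below: admission happens on an estimate, and the window is corrected once the provider reports real token counts.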
Multi-provider routing and fallback

Gateways decouple applications from individual model providers. A single logical model name can map to multiple upstream providers or deployments, each with weights, priorities, and retry policies. If a provider fails, slows down, or returns rate-limit errors, the gateway can route traffic to the next configured option. This enables cost optimization, redundancy, and resilience without changing application code.

Cost tracking and budget enforcement

The gateway acts as the system of record for AI spend. After each request completes, token counts are multiplied by configured per-token prices and attributed across hierarchical budgets, commonly organization, team, user, and API key. Budgets can be enforced by provider, model, or tag. When a budget is exceeded, gateways can block requests or redirect traffic according to policy. This converts LLM usage from an opaque expense into a governable operational resource.

Streaming preservation

Many LLM responses are streamed using Server-Sent Events or chunked transfer encoding. Gateways must proxy these streams transparently while still applying governance. A core challenge: token counts may only be finalized after a response completes, while enforcement decisions may need to happen earlier. Gateways address this through predictive limits based on request parameters and post-hoc adjustment when actual usage is known. A documented limitation is that fallback behavior is difficult to trigger once a streaming response is already in progress.

Request and response transformation

Providers expose incompatible APIs, schemas, and authentication patterns. Gateways normalize these differences and present a unified interface, often aligned with an OpenAI-compatible schema for client simplicity. Some gateways also perform request or response transformations, such as masking sensitive fields before forwarding a request or normalizing responses into a common structure for downstream consumers.

Observability and telemetry

Production gateways emit structured telemetry for token usage, latency, model selection, errors, and cost. Alignment with OpenTelemetry and OpenInference conventions enables correlation across prompts, retrievals, and model calls. This allows platform and operations teams to treat LLM inference like any other production workload, with traceability and metrics suitable for incident response and capacity planning.

Multi-tenant governance

The gateway centralizes access control and delegation. Organizations can define budgets, quotas, and permissions across teams and users, issue service accounts, and delegate limited administration without granting platform-wide access. This consolidates governance that would otherwise be scattered across application code and provider dashboards.

Prompt Lifecycle Management and Shadow Mode

As LLM usage matures, prompts shift from static strings embedded in code to runtime configuration with operational impact. A prompt change can alter behavior, cost, latency, and policy compliance immediately, without a redeploy. For operations teams, this makes prompt management part of the production control surface.

In mature gateway architectures, prompts are treated as versioned artifacts managed through a control plane. Each version is immutable once published and identified by a unique version or alias. Applications reference a logical prompt name, while the gateway determines which version is active in each environment. This allows updates and rollbacks without changing application binaries.

The lifecycle typically follows a consistent operational flow. Prompts are authored and tested, published as new versions, and deployed via aliases such as production or staging. Older versions remain available for rollback and audit, so any output can be traced back to the exact prompt logic in effect at the time.

Shadow mode
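Shadow mode typically means exercising a candidate prompt version against live traffic without its output ever reaching the user, so a change can be compared with the current production version before promotion. Below is a minimal sketch of that pattern combined with the alias-based version resolution described above; `registry`, `llm`, and `evaluator_log` are hypothetical interfaces used for illustration, not any specific gateway's API.

```python
import threading

def serve_request(question, registry, llm, evaluator_log):
    """Resolve a versioned prompt by alias, answer with it, and shadow a candidate.

    Assumed interfaces:
      - registry.get(name, alias) -> object with .version and .render(**vars)
      - llm.complete(text) -> str
      - evaluator_log.record(...) stores paired outputs for offline comparison
    """
    # 1. The application references a logical prompt name; the control plane
    #    decides which immutable version the "production" alias points at.
    prompt = registry.get("support-answer", alias="production")
    answer = llm.complete(prompt.render(question=question))

    # 2. Shadow mode: run the candidate version on the same input, off the
    #    request path, and log both outputs for later comparison.
    def run_shadow():
        candidate = registry.get("support-answer", alias="staging")
        shadow_answer = llm.complete(candidate.render(question=question))
        evaluator_log.record(
            prompt_versions=(prompt.version, candidate.version),
            question=question,
            served=answer,
            shadow=shadow_answer,
        )

    threading.Thread(target=run_shadow, daemon=True).start()

    # 3. Only the production version's answer is returned to the user.
    return answer
```

Because every logged output carries the prompt version that produced it, the rollback and audit trail described above falls out of the telemetry rather than requiring a separate mechanism.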