G360 Technologies

Author name: anuroop

Uncategorized

Every Token Has a Price: Why LLM Cost Telemetry Is Now Production Infrastructure

A team ships an internal assistant that “just summarizes docs.” Usage triples after rollout. Two weeks later, finance flags a spike in LLM spend. Engineering cannot answer basic questions: Which app caused it? Which prompts? Which users? Which model? Which retries or agent loops? The system is working. The bill is not explainable.

This is not a failure of the model. It is a failure of visibility. Between 2023 and 2025, AI observability and FinOps moved from optional tooling to core production infrastructure for LLM applications. The driver is straightforward: LLM costs are variable per request, difficult to attribute after the fact, and can scale faster than traditional cloud cost controls.

Unlike traditional compute, where costs correlate roughly with traffic, LLM costs can spike without any change in user volume. A longer prompt, a retrieval payload that grew, an agent loop that ran one extra step: each of these changes the bill, and none of them are visible without instrumentation built for this purpose.

Context: A Three-Year Shift

Research shows a clear timeline in how this capability matured:

2023: Early, purpose-built LLM observability tools emerge (Helicone, LangChain’s early LangSmith development). The core problem was visibility into prompts, models, and cost drivers across providers. At this stage, most teams had no way to answer “why did that request cost what it cost.”

2024: LLM systems move from pilot to production more broadly. This is the point where cost management becomes operational, not experimental. LangSmith’s general availability signals that observability workflows are becoming standard expectations, not optional add-ons.

2025: Standardization accelerates. OpenTelemetry LLM semantic conventions enter the OpenTelemetry spec in January 2025. Enterprise LLM API spend grows rapidly.
The question shifts from “should we instrument” to “how fast can we instrument.” Across these phases, “observability” expands from latency and error rates into token usage, per-request cost, prompt versions, and evaluation signals.

How the Mechanism Works

This section describes the technical pattern that research indicates is becoming standard, separating the build pattern from interpretation.

1. The AI Gateway Pattern as the Control Point

The dominant production architecture for LLM observability and cost tracking is the “AI gateway” (or proxy).

What it does:

Why it matters mechanically: Because LLM usage is metered at the request level (tokens), the gateway becomes the most reliable place to measure tokens, compute cost, and attach organizational metadata. Without a gateway, instrumentation depends on every team doing it correctly. With a gateway, instrumentation happens once.

Typical request flow:

User request → Gateway (metadata capture) → Guardrails/policy checks → Model invocation → Response → Observability pipeline → Analytics

2. Token-Based Cost Telemetry

Token counts are the base unit for cost attribution.

Typical per-request capture fields:

Research emphasizes that cost complexity drivers appear only when measuring at this granularity: input versus output token price asymmetry, caching discounts, long-context tier pricing, retries, and fallback routing. None of these are visible in aggregate metrics.

3. OpenTelemetry Tracing and LLM Semantic Conventions

Distributed tracing is the backbone for stitching together an LLM request across multiple services. OpenTelemetry introduced standardized LLM semantic conventions (attributes) for capturing model, token usage, and request metadata. This matters because it makes telemetry portable across backends (Jaeger, Datadog, New Relic, Honeycomb, vendor-specific systems) and reduces re-instrumentation work when teams change tools.

4. Cost Attribution and Showback Models

Research describes three allocation approaches. Operationally, “showback” is the minimum viable step: make cost visible to the teams generating it, even without enforcing chargeback. Visibility alone changes behavior.

What Happens Without This Infrastructure

Consider a second scenario. A product team launches an AI-powered search feature. It uses retrieval-augmented generation: fetch documents, build context, call the model. Performance is good. Users are happy.

Three months later, the retrieval index has grown. Average context length has increased from 2,000 tokens to 8,000 tokens. The model is now hitting long-context pricing tiers. Costs have quadrupled, but traffic has only doubled.

Without token-level telemetry, this looks like “AI costs are growing with usage.” With token-level telemetry, this is diagnosable: context length per request increased, triggering a pricing tier change. The fix might be retrieval tuning, context compression, or a model swap. But without the data, there is no diagnosis, only a budget conversation with no actionable next step.

Analysis: Why This Matters Now

Three factors explain the timing:

LLM costs scale with usage variability, not just traffic. Serving a “similar number of users” can become dramatically more expensive if prompts grow, retrieval payloads expand, or agent workflows loop. Traditional capacity planning does not account for this.

LLM application success is not binary. Traditional telemetry answers “did the request succeed.” LLM telemetry needs to answer “was it good, how expensive was it, and what changed.” A 200 OK response tells you almost nothing about whether the interaction was worth its cost.

The cost surface is now architectural. Cost is a design constraint that affects routing, caching, evaluation workflows, and prompt or context construction.
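To make the attribution mechanics concrete, here is a minimal sketch of per-request cost computation from token counts. The prices, tier names, and context threshold below are invented placeholders, not any provider's actual rate card; real pricing also involves caching discounts and retry accounting that this sketch omits.

```python
# Hypothetical per-1M-token prices. Real providers publish their own rate
# cards, often with separate long-context tiers and cached-input discounts.
PRICING = {
    "standard":     {"input": 3.00, "output": 15.00},   # context <= 8k tokens
    "long_context": {"input": 6.00, "output": 22.50},   # context > 8k tokens
}
LONG_CONTEXT_THRESHOLD = 8_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Input and output tokens are priced asymmetrically, and crossing a
    # context-length threshold can move the whole request to a pricier tier.
    tier = "long_context" if input_tokens > LONG_CONTEXT_THRESHOLD else "standard"
    rates = PRICING[tier]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```

With these made-up rates, the diagnosis in the scenario above is visible in the numbers: a request whose retrieval payload grows from 2,000 to 9,000 input tokens gets several times more expensive even though its output length never changes.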
In this framing, cost management becomes something engineering owns at the system layer, not something finance reconciles after the invoice arrives.

Implications for Enterprises

Operational implications:

Technical implications:

The Quiet Risk: Agent Loops

One pattern deserves particular attention. Agentic workflows, where models call tools, evaluate results, and decide next steps, introduce recursive cost exposure. A simple example: an agent is asked to research a topic. It searches, reads, decides it needs more context, searches again, reads again, summarizes, decides the summary is incomplete, and loops. Each step incurs tokens. Without step-level telemetry and loop limits, a single user request can generate dozens of billable model calls.

Research flags this as an open problem. The guardrails are not yet standardized. Teams are implementing their own loop limits, step budgets, and circuit breakers. But without visibility into agent step counts and per-step costs, even well-intentioned guardrails cannot be tuned effectively.

Risks and Open Questions

These are open questions that research raises directly, not predictions.

Further Reading

Uncategorized

Retrieval Is the New Control Plane

A team ships a RAG assistant that nails the demo. Two weeks into production, answers start drifting. The policy document exists, but retrieval misses it. Permissions filter out sensitive content, but only after it briefly appeared in a prompt. The index lags three days behind a critical source update. A table gets flattened into gibberish. The system is up. Metrics look fine. But users stop trusting it, and humans quietly rebuild the manual checks the system was supposed to replace.

This is the norm, not the exception. Enterprise RAG has an awkward secret: most pilots work, and most production deployments underperform. The gap is not model quality. It is everything around the model: retrieval precision, access enforcement, index freshness, and the ability to explain why an answer happened. RAG is no longer a feature bolted onto a chatbot. It is knowledge infrastructure, and it fails like infrastructure fails: silently, gradually, and expensively.

The Maturity Gap

Between 2024 and 2026, enterprise RAG has followed a predictable arc. Early adopters treated it as a hallucination fix: point a model at documents, get grounded answers. That worked in demos. It broke in production.

The inflection points that emerged:

One pattern keeps recurring: organizations report high generative AI usage but struggle to attribute material business impact. The gap is not adoption. It is production discipline. The operational takeaway: every bullet above is a failure mode that prototypes ignore and production systems must solve. Hybrid retrieval, reranking, evaluation, observability, and freshness are not enhancements. They are the difference between a demo and a system you can defend in an incident review.

How Production RAG Actually Works

A mature RAG pipeline has five stages. Each one can fail independently, and failures compound. Naive RAG skips most of this: embed documents, retrieve by similarity, generate.
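Stripped to its essentials, that naive loop fits in a few lines. The sketch below is illustrative only: `embed` is a stand-in bag-of-words function rather than a real embedding model, and the "generation" step is just prompt assembly before a model call.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Stand-in embedding: word counts. A real system would call an
    # embedding model; this keeps the sketch self-contained.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def naive_rag(query: str, docs: list[str], k: int = 2) -> str:
    # Embed documents and query, retrieve top-k by similarity,
    # then stuff the retrieved context into a prompt for the model.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(embed(d), q), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Every line of this sketch is a place production systems add machinery: chunking before embedding, metadata during indexing, hybrid scoring and reranking during retrieval, access checks before prompt assembly, and logging after generation.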
Production RAG treats every stage as a control point with its own failure modes, observability, and operational requirements.

1. Ingestion and preprocessing

Documents flow in from collaboration tools, code repositories, and knowledge bases. They get cleaned, normalized, and chunked into retrievable units. If chunking is wrong, everything downstream is wrong.

2. Embedding and indexing

Chunks become vectors. Metadata gets attached: owner, sensitivity level, org, retention policy, version. This metadata is not decoration. It is the enforcement layer for every access decision that follows.

3. Hybrid retrieval and reranking

Vector search finds semantically similar content. Keyword search (BM25) finds exact matches. Reranking sorts the combined results by actual relevance. Skip any of these steps in a precision domain, and you get answers that feel right but are not.

4. Retrieval-time access enforcement

RBAC, ABAC, relationship-based access: the specific model matters less than the timing. Permissions must be enforced before content enters the prompt. Post-generation filtering is too late. The model already saw it.

5. Generation with attribution and logging

The model produces an answer. Mature systems capture everything: who asked, what was retrieved, what model version ran, which policies were checked, what was returned. Without this, debugging is guesswork.

Where Latency Budgets Get Spent

Users tolerate low-single-digit seconds for a response. That budget gets split across embedding lookup, retrieval, reranking, and generation. A common constraint: if reranking adds 200ms and you are already at 2.5 seconds, you either cut candidate count, add caching, or accept that reranking is a luxury you cannot afford. Caching, candidate reduction, and infrastructure acceleration are not optimizations. They are tradeoffs with direct quality implications.

A Hypothetical: The Compliance Answer That Wasn’t

A financial services firm deploys a RAG assistant for internal policy questions.
An analyst asks: “What’s our current position limit for emerging market equities?” The system retrieves a document from 2022. The correct policy, updated six months ago, exists in the index but ranks lower because the old document has more keyword overlap with the query. The assistant answers confidently with outdated limits. No alarm fires. The answer is well-formed and cited. The analyst follows it. The error surfaces three weeks later during an audit.

This is not a model failure. It is a retrieval failure, compounded by a freshness failure, invisible because the system had no evaluation pipeline checking for policy currency.

Why This Is Urgent Now

Three forces are converging:

Precision is colliding with semantic fuzziness. Vector search finds “similar” content. In legal, financial, and compliance contexts, “similar” can be dangerously wrong. Hybrid retrieval exists because pure semantic search cannot reliably distinguish “the policy that applies” from “a policy that sounds related.”

Security assumptions do not survive semantic search. Traditional IAM controls what users can access. Semantic search surfaces content by relevance, not permission. If sensitive chunks are indexed without enforceable metadata boundaries, retrieval can leak them into prompts regardless of user entitlement. Access filtering at retrieval time is not a nice-to-have. It is a control requirement.

Trust is measurable, and it decays. Evaluation frameworks like RAGAS treat answer quality like an SLO: set thresholds, detect regressions, block releases that degrade. Organizations that skip this step are running production systems with no quality signal until users complain.

A Hypothetical: The Permission That Filtered Too Late

A healthcare organization builds a RAG assistant for clinicians. Access controls exist: nurses see nursing documentation, physicians see physician notes, administrators see neither. The system implements post-generation filtering.
It retrieves all relevant content, generates an answer, then redacts anything the user should not see. A nurse asks about medication protocols. The system retrieves a physician note containing a sensitive diagnosis, uses it to generate context, then redacts the note from the citation list. The diagnosis language leaks into the answer anyway. The nurse sees information they were never entitled to access.

The retrieval was correct. The generation was correct. The filtering was correctly applied. The architecture was wrong.

What Production Readiness Actually Requires

Operational requirements:

Technical requirements:

Five Questions to Ask Before You Ship

If any answer is “I don’t know,” the system is not production-ready. It is a demo running in production.

Risks and Open Questions

Authorization failure modes. Post-filtering is risky if

Uncategorized

Why Your LLM Traffic Needs a Control Room

A team deploys an internal assistant by calling a single LLM provider API directly from the application. Usage grows quickly. One power user discovers that pasting entire documents into the chat gets better answers. A single conversation runs up 80,000 tokens. Then a regional slowdown hits, streaming responses stall mid-interaction, and support tickets spike. There is no central place to control usage, reroute traffic, or explain what happened.

As enterprises move LLM workloads from pilots into production, many are inserting an LLM gateway or proxy layer between applications and model providers. This layer addresses operational realities that traditional API gateways were not designed for: token-based economics, provider volatility, streaming behavior, and centralized governance.

There is a clear evolution. Early LLM integrations after 2022 were largely direct API calls optimized for speed of experimentation. By late 2023 through 2025, production guidance converged across open source and vendor platforms on a common architectural pattern: an AI-aware gateway that sits on the inference path and enforces usage, cost, routing, and observability controls. This pattern appears independently across open source projects (Apache APISIX, LiteLLM Proxy, Envoy AI Gateway) and commercial platforms (Kong, Azure API Management), which suggests the requirements are structural rather than vendor-driven. While implementations differ, the underlying mechanisms and tradeoffs are increasingly similar.

When It Goes Wrong

A prompt change ships on Friday afternoon. No code deploys, just a configuration update. By Monday, token consumption has tripled. The new prompt adds a “think step by step” instruction that inflates completion length across every request. There is no rollback history, no baseline to compare against, and no clear owner.

In another case, a provider’s regional endpoint starts returning 429 errors under load.
The application has no fallback configured. Users see spinning loaders, then timeouts. The team learns about the outage from a customer tweet.

A third team enables a new model for internal testing. No one notices that the model’s per-token price is four times higher than the previous default. The invoice arrives three weeks later.

These are not exotic edge cases. They are the default failure modes when LLM traffic runs without centralized control.

How the Mechanism Works

Token-aware rate limiting

LLM workloads are consumption-bound rather than request-bound. A gateway extracts token usage metadata from model responses and enforces limits on tokens, not calls. Limits can be applied hierarchically across dimensions such as API key, user, model, organization, route, or business tag. The research describes sliding window algorithms backed by shared state stores such as Redis to support distributed enforcement. Some gateways allow choosing which token category is counted, such as total tokens versus prompt or completion tokens. This replaces flat per-request throttles that are ineffective for LLM traffic.

Multi-provider routing and fallback

Gateways decouple applications from individual model providers. A single logical model name can map to multiple upstream providers or deployments, each with weights, priorities, and retry policies. If a provider fails, slows down, or returns rate-limit errors, the gateway can route traffic to the next configured option. This enables cost optimization, redundancy, and resilience without changing application code.

Cost tracking and budget enforcement

The gateway acts as the system of record for AI spend. After each request completes, token counts are multiplied by configured per-token prices and attributed across hierarchical budgets, commonly organization, team, user, and API key. Budgets can be enforced by provider, model, or tag. When a budget is exceeded, gateways can block requests or redirect traffic according to policy.
This converts LLM usage from an opaque expense into a governable operational resource.

Streaming preservation

Many LLM responses are streamed using Server-Sent Events or chunked transfer encoding. Gateways must proxy these streams transparently while still applying governance. A core challenge: token counts may only be finalized after a response completes, while enforcement decisions may need to happen earlier. Gateways address this through predictive limits based on request parameters and post-hoc adjustment when actual usage is known. A documented limitation is that fallback behavior is difficult to trigger once a streaming response is already in progress.

Request and response transformation

Providers expose incompatible APIs, schemas, and authentication patterns. Gateways normalize these differences and present a unified interface, often aligned with an OpenAI-compatible schema for client simplicity. Some gateways also perform request or response transformations, such as masking sensitive fields before forwarding a request or normalizing responses into a common structure for downstream consumers.

Observability and telemetry

Production gateways emit structured telemetry for token usage, latency, model selection, errors, and cost. Alignment with OpenTelemetry and OpenInference conventions enables correlation across prompts, retrievals, and model calls. This allows platform and operations teams to treat LLM inference like any other production workload, with traceability and metrics suitable for incident response and capacity planning.

Multi-tenant governance

The gateway centralizes access control and delegation. Organizations can define budgets, quotas, and permissions across teams and users, issue service accounts, and delegate limited administration without granting platform-wide access. This consolidates governance that would otherwise be scattered across application code and provider dashboards.
Prompt Lifecycle Management and Shadow Mode

As LLM usage matures, prompts shift from static strings embedded in code to runtime configuration with operational impact. A prompt change can alter behavior, cost, latency, and policy compliance immediately, without a redeploy. For operations teams, this makes prompt management part of the production control surface.

In mature gateway architectures, prompts are treated as versioned artifacts managed through a control plane. Each version is immutable once published and identified by a unique version or alias. Applications reference a logical prompt name, while the gateway determines which version is active in each environment. This allows updates and rollbacks without changing application binaries.

The lifecycle typically follows a consistent operational flow. Prompts are authored and tested, published as new versions, and deployed via aliases such as production or staging. Older versions remain available for rollback and audit, so any output can be traced back to the exact prompt logic in effect at the time.

Shadow mode

Uncategorized

Operation Bizarre Bazaar: The Resale Market for Stolen AI Access

A Timeline (Hypothetical, Based on Reported Patterns)

Hour 0: An engineering team deploys a self-hosted LLM endpoint for internal testing. Default port. No authentication. Public IP.

Hour 3: The endpoint appears in Shodan search results.

Hour 5: First automated probe arrives. Source: unknown scanning infrastructure.

Hour 6: A different operator tests placeholder API keys: sk-test, dev-key. Enumerates available models. Queries logging configuration.

Hour 8: Access is validated and listed for resale.

Day 4: Finance flags an unexplained $14,000 spike in inference costs. The endpoint appears to be functioning normally.

Day 7: The team discovers their infrastructure has been advertised on a Discord channel as part of a “unified LLM API gateway” offering 50% discounts.

More than 35,000 attack sessions over 40 days. Exploitation attempts within 2 to 8 hours of discovery. Researchers describe Operation Bizarre Bazaar as the first publicly attributed, large-scale LLMjacking campaign with a commercial marketplace for reselling unauthorized access to LLM infrastructure. It marks a shift in AI infrastructure threats: from isolated API misuse to an organized pipeline that discovers, validates, and monetizes access at scale.

The campaign targeted exposed Large Language Model endpoints and Model Context Protocol servers, focusing on common deployment mistakes: unauthenticated services, default ports, and development or staging environments with public IP addresses. Separately, GreyNoise Intelligence observed a concurrent reconnaissance campaign focused specifically on MCP endpoints, generating tens of thousands of sessions over a short period.

How the Mechanism Works

Operation Bizarre Bazaar operates as a three-layer supply chain with clear separation of roles.

Layer 1: Reconnaissance and discovery

Automated scanning infrastructure continuously searches for exposed LLM and MCP endpoints.
Targets are harvested from public indexing services such as Shodan and Censys. Exploitation attempts reportedly begin within hours of endpoints appearing in these services, suggesting continuous monitoring of scan results. Primary targets include Ollama instances on port 11434, OpenAI-compatible APIs on port 8000, MCP servers reachable from the internet, production chatbots without authentication or rate limiting, and development environments with public exposure.

Layer 2: Validation and capability checks

A second layer confirms whether discovered endpoints are usable and valuable. Operators test placeholder API keys, enumerate available models, run response quality checks, and probe logging configuration to assess detection risk.

Layer 3: Monetization through resale

Validated access is packaged and resold through a marketplace operating under silver.inc and the NeXeonAI brand, advertised via Discord and Telegram.

Attacker Economics

Element                  Detail
Resale pricing           40-60% below legitimate provider rates
Advertised inventory     Access to 30+ LLM providers
Payment methods          Cryptocurrency, PayPal
Distribution channels    Discord, Telegram
Marketing positioning    “Unified LLM API gateway”

The separation between scanning, validation, and resale allows each layer to operate independently. Discovery teams face minimal risk. Resellers maintain plausible distance from the initial compromise. The model scales.

What’s Actually at Risk

Compute theft is the obvious outcome: someone else runs inference on your infrastructure, and you pay the bill. But the attack surface extends further depending on what’s exposed. LLM endpoints may leak proprietary system prompts, fine-tuning data, or conversation logs if not properly isolated. MCP servers are designed to connect models to external systems. Depending on configuration, a compromised MCP endpoint could provide access to file systems, databases, cloud APIs, internal tools, or orchestration platforms.
Reconnaissance today may become lateral movement tomorrow. Credential exposure is possible if API keys, tokens, or secrets are passed through compromised endpoints or logged in accessible locations. The research notes describe both compute theft and potential data exposure, but do not quantify how often each outcome occurred.

Why This Matters Now

Two factors compress defender response timelines. First, the 2 to 8 hour window between public indexing and exploitation attempts means periodic security reviews are insufficient. Exposure becomes actionable almost immediately. Second, the resale marketplace changes attacker incentives. Operators no longer need to abuse access directly. They can monetize discovery and validation at scale, sustaining continuous targeting even when individual victims remediate quickly.

Implications for Enterprises

Operational

AI endpoints should be treated as internet-facing production services, even when intended for internal or experimental use. Unexpected inference cost spikes should be treated as potential security signals, not only budget anomalies. Reduced staffing periods may increase exposure if monitoring and response are delayed.

Technical

Authentication and network isolation are foundational controls for all LLM and MCP endpoints. Rate limiting and request pattern monitoring are necessary to detect high-volume validation and enumeration activity. MCP servers require particular scrutiny given their potential connectivity to internal systems.

Risks & Open Questions

Attribution confidence: Research links the campaign to specific aliases and infrastructure patterns, but the confidence level cannot be independently assessed.

MCP exploitation depth: Large-scale reconnaissance is described, but the extent to which probing progressed to confirmed lateral movement is not established.
Detection reliability: Behavioral indicators such as placeholder key usage and model enumeration may overlap with legitimate testing, raising questions about false positive rates.

Further Reading

Uncategorized

The Reprompt Attack on Microsoft Copilot

A user clicks a legitimate Microsoft Copilot link shared in an email. The page loads, a prompt executes, and the interface appears idle. The user closes the tab. Behind the scenes, Copilot continues executing instructions embedded in that link, querying user-accessible data and sending it to an external server, without further interaction or visibility. One click. No downloads, no attachments, no warnings. The user sees nothing.

This is Reprompt, an indirect prompt injection vulnerability disclosed in January 2026. Security researchers at Varonis Threat Labs demonstrated that by chaining three design behaviors in Copilot Personal, an attacker could achieve covert, single-click data exfiltration. Microsoft patched the issue on January 13, 2026. No in-the-wild exploitation has been confirmed.

Reprompt affected only Copilot Personal, the consumer-facing version of Microsoft’s AI assistant integrated into Windows and Edge. Microsoft 365 Copilot, used in enterprise tenants, was not vulnerable. The architectural difference matters: enterprise Copilot enforces tenant isolation, permission scoping, and integration with Microsoft Purview Data Loss Prevention. Consumer Copilot had none of these boundaries.

This distinction is central to understanding the vulnerability. Reprompt did not exploit a flaw in the underlying language model. It exploited product design decisions that prioritized frictionless user experience over session control and permission boundaries.

Varonis Threat Labs identified the vulnerability and disclosed it to Microsoft on August 31, 2025. Microsoft released a patch as part of its January 2026 Patch Tuesday cycle, and public disclosure followed.
The vulnerability was assigned CVE-2026-21521. Reprompt belongs to a broader class of indirect prompt injection attacks, where instructions hidden in untrusted content are ingested by an AI system and treated as legitimate commands. What made Reprompt notable was not a new model-level technique, but a practical exploit path created by compounding product choices.

How the Mechanism Works

Reprompt relied on three interconnected behaviors.

1. Parameter-to-prompt execution

Copilot Personal accepted prompts via the q URL parameter. When a user navigated to a URL such as copilot.microsoft.com/?q=Hello, the contents of the parameter were automatically executed as a prompt on page load. This behavior was intended to streamline user experience by pre-filling and submitting prompts. Researchers demonstrated that complex, multi-step instructions could be embedded in this parameter. When a user clicked a crafted link, Copilot executed the injected instructions immediately within the context of the user’s authenticated session.

2. Double-request safeguard bypass

Copilot implemented protections intended to prevent data exfiltration, such as blocking untrusted URLs or stripping sensitive information from outbound requests. However, these safeguards were enforced primarily on the initial request in a conversation. Attackers exploited this by instructing Copilot to repeat the same action twice, often framed as a quality check or retry. The first request triggered safeguards. The second request, executed within the same session, did not consistently reapply them. This allowed sensitive data to be included in outbound requests on the second execution.

3. Chain-request execution

Reprompt also enabled a server-controlled instruction loop. After the initial prompt executed, Copilot was instructed to fetch follow-on instructions from an attacker-controlled server. Each response from Copilot informed the next instruction returned by the server.
This enabled a staged extraction process where the attacker dynamically adjusted what data to request based on what Copilot revealed in earlier steps. Because later instructions were not embedded in the original URL, they were invisible to static inspection of the link itself.

What an Attack Could Look Like

Consider a realistic scenario based on the technical capabilities Reprompt enabled. An employee receives an email from what appears to be a colleague: “Here’s that Copilot prompt I mentioned for summarizing meeting notes.” The link points to copilot.microsoft.com with a long query string. Nothing looks suspicious. The employee clicks. Copilot opens, displays a brief loading state, then appears idle. The employee closes the tab and returns to work.

During those few seconds, the injected prompt instructed Copilot to search the user’s recent emails for messages containing “contract,” “offer,” or “confidential.” Copilot retrieved snippets. The prompt then instructed Copilot to summarize the results and send them to an external URL disguised as a logging endpoint. Because the prompt used the double-request technique, Copilot’s outbound data safeguards did not block the second request. Because the session persisted, follow-on instructions from the attacker’s server continued to execute after the tab closed.

The attacker received a structured summary of sensitive email content without the user ever knowing a query occurred. The employee saw a blank Copilot window for two seconds. The attacker received company data. This scenario is hypothetical, but every capability it describes was demonstrated in Varonis’s proof-of-concept research.

Why Existing Safeguards Failed

The Reprompt attack exposed several structural weaknesses.

Instruction indistinguishability

From the model’s perspective, there is no semantic difference between a prompt typed by a user and an instruction embedded in a URL or document. Both are treated as authoritative text.
This is a known limitation of instruction-following language models and makes deterministic prevention at the model layer infeasible. Session persistence without revalidation Copilot Personal sessions remained authenticated after the user closed the interface. This design choice optimized for convenience but allowed background execution of follow-on instructions without renewed user intent or visibility. Asymmetric safeguard enforcement Safeguards were applied inconsistently across request sequences. By focusing validation on the first request, the system assumed benign conversational flow. Reprompt violated that assumption by automating malicious multi-step sequences. Permission inheritance without boundaries Copilot Personal operated with the full permission set of the authenticated user. Any data the user could access, Copilot could query. There was no least-privilege enforcement or data scoping layer comparable to enterprise controls. CVE Registration and Classification The vulnerability was registered as CVE-2026-21521 with the following characteristics: A separate CVE, CVE-2026-24307, addressed a different information disclosure issue in Microsoft 365 Copilot and is unrelated to the Reprompt root cause.
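The asymmetric-enforcement failure described above has a well-known defensive counterpart: egress checks that are stateless and applied to every outbound request, so a "retry" receives exactly the same scrutiny as the first attempt. A minimal sketch of that pattern, with hypothetical allow-list and pattern rules (this is not Copilot's actual implementation):

```python
import re

# Hypothetical allow-list of destinations and sensitive-content patterns.
ALLOWED_DOMAINS = {"graph.microsoft.com", "login.microsoftonline.com"}
SENSITIVE_PATTERNS = [re.compile(p, re.I) for p in (r"confidential", r"password")]

def egress_allowed(url: str, payload: str) -> bool:
    """Return True only if the outbound request passes all checks.

    Crucially, this function keeps no per-conversation state: whether
    this is the first request or a "quality check" repeat, the same
    rules apply, so a double-request cannot bypass them.
    """
    domain = url.split("/")[2] if "://" in url else url.split("/")[0]
    if domain not in ALLOWED_DOMAINS:
        return False  # untrusted destination: block
    if any(p.search(payload) for p in SENSITIVE_PATTERNS):
        return False  # sensitive content in outbound payload: block
    return True

# A retried request is evaluated identically; there is no "already checked" flag.
assert egress_allowed("https://graph.microsoft.com/v1.0/me", "hello")
assert not egress_allowed("https://attacker.example/log", "hello")
assert not egress_allowed("https://graph.microsoft.com/v1.0/me", "password=hunter2")
```

The design point is that safeguard decisions are a pure function of the request itself, not of conversation history.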

Newsletter

The Enterprise AI Brief | Issue 4

The Enterprise AI Brief | Issue 4 Inside This Issue The Threat Room The Reprompt Attack on Microsoft Copilot A user clicks a Copilot link, watches it load, and closes the tab. The session keeps running. The data keeps flowing. Reprompt demonstrated what happens when AI assistants inherit user permissions, persist sessions silently, and cannot distinguish instructions from attacks. The vulnerability was patched. The architectural pattern that enabled it, ambient authority without session boundaries, still exists elsewhere. → Read the full article Operation Bizarre Bazaar: The Resale Market for Stolen AI Access Operation Bizarre Bazaar is not a single exploit. It is a supply chain: discover exposed LLM endpoints, validate access within hours, resell through a marketplace. A misconfigured test environment becomes a product listing within days. For organizations running internet-reachable LLM or MCP services, the window between exposure and exploitation is now measured in hours. → Read the full article The Operations Room Why Your LLM Traffic Needs a Control Room Most teams don’t plan for an LLM gateway until something breaks: a surprise invoice, a provider outage with no fallback, a prompt change that triples token consumption overnight. This article explains what these gateways actually do on the inference hot path, where the operational tradeoffs hide, and what questions to ask before your next production incident answers them for you. → Read the full article Retrieval Is the New Control Plane RAG is no longer a chatbot feature. It is production infrastructure, and the retrieval layer is where precision, access, and trust are won or lost. This piece breaks down what happens when you treat retrieval as a control plane: evaluation gates, access enforcement at query time, and the failure modes that stay invisible until an audit finds them.
→ Read the full article The Engineering Room Every Token Has a Price: Why LLM Cost Telemetry Is Now Production Infrastructure Usage triples. So does the bill. But no one can explain why. This is the observability gap that LLM cost telemetry solves: the gateway pattern, token-level attribution, and the instrumentation that turns opaque spend into actionable data. → Read the full article Demo-Ready Is Not Production-Ready A prompt fix ships. Tests pass. Two weeks later, production breaks. The culprit was not the model. This piece unpacks the evaluation stacks now gating enterprise GenAI releases: what each layer catches, what falls through, and why most teams still lack visibility into what’s actually being deployed. → Read the full article The Governance Room The AI You Didn’t Approve Is Already Inside Ask a compliance team how AI is used across their organization. Then check the network logs. The gap between those two answers is where regulatory risk now lives, and EU AI Act enforcement is about to make that gap harder to explain away. → Read the full article AI Compliance Is Becoming a Live System How long would it take you to show a regulator, today, how you monitor AI behavior in production? If the honest answer is “give us a few weeks,” you’re already behind. This piece breaks down how governance is shifting from scheduled reviews to always-on infrastructure, and offers three questions to pressure-test your current posture. → Read the full article

The Engineering Room

AI Agents Broke the Old Security Model. AI-SPM Is the First Attempt at Catching Up.

AI Agents Broke the Old Security Model. AI-SPM Is the First Attempt at Catching Up. A workflow agent is deployed to summarize inbound emails, pull relevant policy snippets from an internal knowledge base, and open a ticket when it detects a compliance issue. It works well until an external email includes hidden instructions that influence the agent’s tool calls. The model did not change. The agent’s access, tools, and data paths did. Enterprise AI agents are shifting risk from the model layer to the system layer: tools, identities, data connectors, orchestration, and runtime controls. In response, vendors are shipping AI Security Posture Management (AI-SPM) capabilities that aim to inventory agent architectures and prioritize risk based on how agents can act and what they can reach. (Microsoft) Agents are not just chat interfaces. They are software systems that combine a model, an orchestration framework, tool integrations, data retrieval pipelines, and an execution environment. In practice, a single “agent” is closer to a mini application than a standalone model endpoint. This shift is visible in vendor security guidance and platform releases. Microsoft’s Security blog frames agent posture as comprehensive visibility into “all AI assets” and the context around what each agent can do and what it is connected to. (Microsoft) Microsoft Defender for Cloud has also expanded AI-SPM coverage to include GCP Vertex AI, signaling multi-cloud posture expectations rather than single-platform governance. (Microsoft Learn) At the same time, cloud platforms are standardizing agent runtime building blocks. AWS documentation describes Amazon Bedrock AgentCore as modular services such as runtime, memory, gateway, and observability, with OpenTelemetry and CloudWatch-based tracing and dashboards. 
(AWS Documentation) On the governance side, the Cloud Security Alliance’s MAESTRO framework explicitly treats agentic systems as multi-layer environments where cross-layer interactions drive risk propagation. (Cloud Security Alliance) How the Mechanism Works AI-SPM is best understood as a posture layer that continuously answers four questions: what agents exist, what each one can do, what each one can reach, and how risky that combination is. Technically, many of these risks become visible only when you treat the agent as an execution path. Observability tooling for agent runtimes is increasingly built around tracing tool calls, state transitions, and execution metrics. AWS AgentCore observability documentation describes dashboards and traces across AgentCore resources and integration with OpenTelemetry. (AWS Documentation) Finally, tool standardization is tightening. The Model Context Protocol (MCP) specification added OAuth-aligned authorization requirements, including explicit resource indicators (RFC 8707), which specify exactly which backend resource a token can access. The goal is to reduce token misuse and confused deputy-style failures when connecting clients to tool servers. (Auth0) Analysis: Why This Matters Now The underlying change is that “AI risk” is less about what the model might say and more about what the system might do. Consider a multi-agent expense workflow. A coordinator agent receives requests, a validation agent checks policy compliance, and an execution agent submits approved payments to the finance system. Each agent has narrow permissions. But if the coordinator is compromised through indirect prompt injection (say, a malicious invoice PDF with hidden instructions), it can route fraudulent requests to the execution agent with fabricated approval flags. No single agent exceeded its permissions. The system did exactly what it was told. The breach happened in the orchestration logic, not the model. Agent deployments turn natural language into action. That action is mediated by tools, identities, data connectors, orchestration logic, and runtime controls. This shifts security ownership.
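The RFC 8707 resource-indicator mechanism mentioned above can be illustrated with a token request that names exactly one backend resource. All identifiers below are hypothetical; the point is the `resource` parameter, which binds the issued token to a single tool server:

```python
from urllib.parse import urlencode

# Sketch of an OAuth 2.0 token request using an RFC 8707 resource
# indicator. Endpoint, client id, code, and resource URI are invented
# for illustration.
def build_token_request(client_id: str, code: str, resource: str) -> str:
    params = {
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        # RFC 8707: the token is minted for one specific backend
        # resource, so a token issued for one MCP tool server cannot
        # be replayed against another (confused-deputy mitigation).
        "resource": resource,
    }
    return urlencode(params)

body = build_token_request(
    client_id="mcp-client-01",
    code="authz-code-abc",
    resource="https://tools.example.com/mcp",  # exact tool server URI
)
```

Authorization servers that honor the indicator scope the token's audience to that URI, which is what narrows the blast radius when a tool connection is abused.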
Model governance teams can no longer carry agent risk alone. Platform engineering owns runtimes and identity integration, security engineering owns detection and response hooks, and governance teams own evidence and control design. It also changes what “posture” means. Traditional CSPM and identity posture focus on static resources and permissions. Agents introduce dynamic execution: the same permission set becomes higher risk when paired with autonomy and untrusted inputs, especially when tool chains span multiple systems. What This Looks Like in Practice A security team opens their AI-SPM dashboard on Monday morning. They see: The finding is not that the agent has a vulnerability. The finding is that this combination of autonomy, tool access, and external input exposure creates a high-value target. The remediation options are architectural: add an approval workflow for refunds, restrict external input processing, or tighten retrieval-time access controls. This is the shift AI-SPM represents. Risk is not a CVE to patch. Risk is a configuration and capability profile to govern. Implications for Enterprises Operational implications Technical implications Risks and Open Questions AI-SPM addresses visibility gaps, but several failure modes remain structurally unsolved. Further Reading

The Governance Room

From Disclosure to Infrastructure: How Global AI Regulation Is Turning Compliance Into System Design

From Disclosure to Infrastructure: How Global AI Regulation Is Turning Compliance Into System Design An enterprise deploys an AI system for credit eligibility decisions. The privacy policy discloses automated decision-making and references human review on request. During an audit, regulators do not ask for the policy. They ask for logs, override records, retention settings, risk assessments, and evidence that human intervention works at runtime. The system passes disclosure review. It fails infrastructure review. Between 2025 and 2026, global AI and privacy regulation shifted enforcement away from policies and notices toward technical controls embedded in systems. Regulators increasingly evaluate whether compliance mechanisms actually operate inside production infrastructure. Disclosure alone no longer serves as sufficient evidence. Across jurisdictions, privacy and AI laws now share a common enforcement logic: accountability must be demonstrable through system behavior. This shift appears in the EU AI Act, GDPR enforcement patterns, California’s CPRA and ADMT rules, India’s DPDP Act, Australia’s Privacy Act reforms, UK data law updates, and FTC enforcement practice. Earlier regulatory models emphasized transparency through documentation. The current generation focuses on verifiable controls: logging, retention, access enforcement, consent transaction records, risk assessments, and post-deployment monitoring. In multiple jurisdictions, audits and inquiries are focusing on how AI systems are built, operated, and governed over time. Then Versus Now: The Same Question, Different Answers 2020: “How do you handle data subject access requests?” Acceptable answer: “Our privacy policy explains the process. Customers email our compliance team, and we respond within 30 days.” 2026: “How do you handle data subject access requests?” Expected answer: “Requests are logged in our consent management system with timestamps. 
Automated retrieval pulls data from three production databases and two ML training pipelines. Retention rules auto-delete after the statutory period. Here are the audit logs from the last 50 requests, including response times and any exceptions flagged for manual review.” The question is the same. The evidence threshold is not. How the Mechanism Works Regulatory requirements increasingly map to infrastructure features rather than abstract obligations. Logging and traceability High-risk AI systems under the EU AI Act must automatically log events, retain records for defined periods, and make logs audit-ready. Similar expectations appear in California ADMT rules, Australia’s automated decision-making framework, and India’s consent manager requirements. Logs must capture inputs, outputs, timestamps, system versions, and human interventions. Data protection by design GDPR Articles 25 and 32 require privacy and security controls embedded at design time: encryption, access controls, data minimization, pseudonymization or tokenization, and documented testing. Enforcement increasingly examines whether these controls are implemented and effective, not merely described. Risk assessment as a system process DPIAs under GDPR, AI Act risk management files, California CPRA assessments, and FTC expectations all require structured risk identification, mitigation, and documentation. These are no longer static documents. They tie to deployment decisions, monitoring, and change management. Human oversight at runtime Multiple regimes require meaningful human review, override capability, and appeal mechanisms. Auditors evaluate reviewer identity, authority, training, and logged intervention actions. Post-market monitoring and incident reporting The EU AI Act mandates continuous performance monitoring and defined incident reporting timelines. FTC enforcement emphasizes ongoing validation, bias testing, and corrective action. Compliance extends beyond launch into sustained operation. 
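The logging-and-traceability expectations above reduce to a concrete engineering artifact: an append-only record per decision that captures inputs, outputs, timestamps, system versions, and any human intervention. A minimal sketch, with illustrative field names (the exact schema would depend on the applicable regime):

```python
import json
import time
import uuid

# Sketch of an audit-ready decision log record covering the field set
# the regulations call for. Field names here are illustrative, not a
# mandated schema.
def log_decision(inputs: dict, output: str, model_version: str,
                 human_override=None) -> str:
    record = {
        "event_id": str(uuid.uuid4()),     # unique, citable identifier
        "timestamp": time.time(),          # when the decision occurred
        "model_version": model_version,    # which system produced it
        "inputs": inputs,                  # what the system saw
        "output": output,                  # what it decided
        "human_override": human_override,  # logged intervention, if any
    }
    # In practice this line would go to an append-only, retention-managed sink.
    return json.dumps(record)

line = log_decision({"applicant_id": "A-123"}, "flagged", "risk-model-2.4")
```

Producing records like this at decision time is what makes the 2026-style answer possible: the audit evidence is a query over logs, not a reconstruction exercise.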
What an Infrastructure Failure Looks Like Hypothetical scenario for illustration A multinational retailer uses an AI system to flag potentially fraudulent returns. The system has been in production for two years. Documentation is thorough: a DPIA on file, a privacy notice explaining automated decision-making, and a stated policy that customers can request human review of any flag. A regulator opens an inquiry after consumer complaints. The retailer produces its documentation confidently. Then the auditors ask: The retailer discovers that “human review” meant a store manager glancing at a screen and clicking approve. No structured logging. No override records. No way to demonstrate the review was meaningful. The request routing system existed in the privacy notice but had never been built. The DPIA was accurate when written. The system drifted. No monitoring caught it. The documentation said one thing. The infrastructure did another. Audit-Style Questions Enterprises Should Be Prepared to Answer Illustrative examples of evidence requests that align with the control patterns described above. On logging: On human oversight: On data minimization: On consent and opt-out: On incident response: Analysis This shift changes how compliance is proven. Regulators increasingly test technical truth: whether systems behave as stated when examined through logs, controls, and operational evidence. Disclosure remains necessary but no longer decisive. A system claiming opt-out, human review, or data minimization must demonstrate those capabilities through enforceable controls. Inconsistent implementation is now a compliance failure, not a documentation gap. The cross-jurisdictional convergence is notable. Despite different legal structures, the same control patterns recur. Logging, minimization, risk assessment, and oversight are becoming baseline expectations. 
Implications for Enterprises Architecture decisions AI systems must be designed with logging, access control, retention, and override capabilities as core components. Retrofitting after deployment is increasingly risky. Operational workflows Compliance evidence now lives in system outputs, audit trails, and monitoring dashboards. Legal, security, and engineering teams must coordinate on shared control ownership. Governance and tooling Model inventories, risk registers, consent systems, and monitoring pipelines are becoming core infrastructure. Manual processes do not scale. Vendor and third-party management Processor and vendor contracts are expected to mirror infrastructure-level safeguards. Enterprises remain accountable for outsourced AI capabilities. Risks and Open Questions Enforcement coordination remains uneven across regulators, raising the risk of overlapping investigations for the same incident. Mutual recognition of compliance assessments across jurisdictions is limited. Organizations operating globally face uncertainty over how many times systems must be audited and under which standards. Another open question is proportionality. Smaller or lower-risk deployments may struggle to interpret how deeply these infrastructure expectations apply. Guidance continues to evolve. Where This Is Heading One plausible direction is compliance as code: regulatory requirements expressed not as policy documents but as automated controls, continuous monitoring, and machine-readable audit trails. Early indicators point this way. The EU AI Act’s logging requirements assume systems can self-report. Consent management platforms are evolving toward real-time enforcement. Risk assessments are being linked to CI/CD pipelines.
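The compliance-as-code idea can be made concrete with a check that runs in CI and fails the build when deployed configuration drifts from documented policy. The config shape and the 180-day limit below are hypothetical:

```python
# Compliance-as-code sketch: fail CI when a system's retention
# configuration drifts from the documented policy. The policy limit
# and config format are invented for illustration.
POLICY_MAX_RETENTION_DAYS = 180  # from the documented retention policy

def check_retention(config: dict) -> list:
    """Return a list of violations; an empty list means compliant."""
    violations = []
    for store, days in config["retention_days"].items():
        if days > POLICY_MAX_RETENTION_DAYS:
            violations.append(
                f"{store}: {days}d exceeds {POLICY_MAX_RETENTION_DAYS}d"
            )
    return violations

# Deployed config as it would be read from the environment or repo.
deployed = {"retention_days": {"decision_logs": 90, "raw_prompts": 365}}
problems = check_retention(deployed)
```

A pipeline gate like this turns the risk-assessment claim ("we retain data no longer than policy allows") into a machine-verified assertion that re-runs on every deployment.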

The Operations Room

Enterprise GenAI Pilot Purgatory: Why the Demo Works and the Rollout Doesn’t

Enterprise GenAI Pilot Purgatory: Why the Demo Works and the Rollout Doesn’t A financial services team demos a GenAI assistant that summarizes customer cases flawlessly. The pilot uses a curated dataset of 200 cases. Leadership is impressed. The rollout expands. Two weeks in, a supervisor catches the assistant inventing a detail: a policy exception that never existed, stated with complete confidence. Word spreads. Within a month, supervisors are spot-checking every summary. The time savings vanish. Adoption craters. At the next steering committee, the project gets labeled “promising, but risky,” which in practice means: shelved. This is not a story about one failed pilot. It is the modal outcome. Across late 2025 and early 2026 research, a consistent pattern emerges: enterprises are running many GenAI pilots, but only a small fraction reach sustained production value. MIT’s Project NANDA report frames this as a “GenAI divide,” where most initiatives produce no measurable business impact while a small minority do. (MLQ) Model capability does not explain the gap. The recurring failure modes are operational and organizational: data readiness, workflow integration, governance controls, cost visibility, and measurement discipline. The pilots work. The production systems do not. Context: The Numbers Behind the Pattern Several large studies and industry analyses published across 2025 and early 2026 converge on high drop-off rates between proof of concept and broad deployment. The combined picture is not that enterprises are failing to try. It is that pilots are colliding with production realities, repeatedly, and often in the same ways. How Pilots Break: Five Failure Mechanisms Enterprise GenAI pilots often look like software delivery but behave more like socio-technical systems: model behavior, data pipelines, user trust, and governance controls all interact in ways that only surface at scale. In brief: Verification overhead erases gains. 
Production data breaks assumptions. Integration complexity compounds. Governance arrives late. Costs exceed forecasts. 1. The trust tax: When checking the AI costs more than doing the work When a system produces an incorrect output with high confidence, users respond rationally: they add checks. A summary gets reviewed. An extraction gets verified against the source. Over time, this verification work becomes a hidden operating cost. The math is simple but often ignored. If users must validate 80% of outputs, and validation takes 60% as long as doing the task manually, the net productivity gain is marginal or negative: at the pilot’s 10x speed, each task costs 0.1 of the manual time plus 0.8 × 0.6 = 0.48 in verification, a total of 0.58, which collapses the gain to roughly 1.7x before any rework. The pilot showed 10x speed. Production delivers 1.2x and new liability questions. In practice, enterprises often under-plan for verification workflows, including sampling rates, escalation paths, and accountability for sign-off. 2. The data cliff: When production data looks nothing like the pilot Pilots frequently rely on curated datasets, simplified access paths, and stable assumptions. Production introduces: Gartner’s data readiness warning captures this directly: projects without AI-ready data foundations are disproportionately likely to be abandoned. (gartner.com) The pilot worked because someone cleaned the data by hand. Production has no such luxury. 3. The integration trap: When “add more users” means “connect more systems” Scaling is rarely just adding seats. It is connecting to more systems, where each system brings its own auth model, data contracts, latency constraints, and change cycles. As integrations multiply, brittle glue code and one-off mappings become reliability risks. This is where many pilots stall: the demo works in isolation, but the end-to-end workflow fails when the CRM returns a null field, the document store times out, or the permissions model differs between regions. 4.
The governance gate: When security asks questions the pilot never answered Governance and security teams typically arrive late in the process and ask the questions that pilots postponed: When these questions are answered late, or poorly, the cheapest option is often “pause the rollout.” Projects that treated governance as a final checkbox discover it is actually a design constraint. 5. The budget shock: When production costs dwarf pilot costs As pilots move toward production, enterprises add the costs they skipped at the start: monitoring, evaluation, retraining or prompt/version control, integration hardening, governance operations, and user enablement. An IDC survey of large enterprises, summarized in a January 2026 analysis, reported that most organizations saw costs exceed expectations and many lacked visibility into where costs originate. (Maiven – AI Factory for Enterprise) The pilot budget assumed inference costs. The production budget requires an operating model. What Success Looks Like: A Counter-Example Consider a contrasting scenario. A logistics company pilots a GenAI system to classify and route supplier inquiries. Before expanding, the team: The pilot-to-production transition still surfaces issues. But they are identified through structured monitoring, not user complaints. The system reaches steady-state production in four months rather than stalling in indefinite “extended pilot.” The difference is not the model. It is the operating infrastructure around the model. Analysis: Why This Is Surfacing Now The pilot-to-production gap is becoming visible because three dynamics are converging: The novelty phase is over. Executives now have enough pilots to see patterns. They can compare dozens of initiatives and recognize that impressive demos do not equal durable adoption. Organizations are starting to ask why their fourth and fifth pilots look like their first. (Harvard Business Review) Agentic approaches raise the stakes. 
As enterprises move from copilots (which suggest) to agents (which act), the required controls expand. Least privilege, change management, approval workflows, and auditability become central design constraints, not nice-to-haves. Gartner’s forecast that a large share of agentic AI projects will be canceled by 2027 explicitly cites cost, unclear business value, and inadequate risk controls. (gartner.com) Measurement pressure is increasing. Forrester reports many firms have AI in production but fewer measure financial impact. That mismatch forces a reckoning in budget cycles: what did we actually get for this spend? (Forrester) This is less a verdict on GenAI capability and more a forcing function for enterprise operating models. The technology works. The organizational machinery to deploy it reliably does not, yet. Implications for Enterprises Operational Technical Risks and Open Questions Further Reading

The Operations Room

Agentic AI in Production: The System Worked. The Outcome Was Wrong.

Agentic AI in Production: The System Worked. The Outcome Was Wrong. An AI system flags a billing anomaly in a customer account. No human reviews it. The system corrects the record, triggers a payment adjustment, updates the ledger, and notifies the customer. All actions are technically correct. One input field was stale. Three days later, the customer calls. The adjustment reversed a legitimate charge. Finance spends four hours tracing the discrepancy across three systems. The ledger has already reconciled. Downstream reports have already been sent to leadership. The agent, meanwhile, continues operating normally. Nothing in its logs indicates a failure. The system did exactly what it was designed to do. The outcome was still wrong. Agentic AI no longer advises. It acts. Roughly two-thirds of enterprises now run agentic pilots, but fewer than one in eight have reached production scale. The bottleneck is not model capability. It is governance and operational readiness. Between 2024 and 2026, enterprises shifted from advisory AI tools to systems capable of executing multi-step workflows. Early deployments framed agents as copilots. Current systems increasingly decompose goals, plan actions, and modify system state without human initiation. The pilot-to-production gap reflects architectural, data, and governance limitations rather than failures in reasoning or planning capability. This transition reframes AI risk. Traditional AI failures were informational. Agentic failures are transactional. How the Mechanism Works Every layer below is a potential failure point. Most pilots enforce some. Production requires all. This is why pilots feel fine: partial coverage works when volume is low and humans backstop every edge case. At scale, the gaps compound. Data ingestion and context assembly. Agents pull real-time data from multiple enterprise systems. Research shows production agents integrate an average of eight or more sources. 
Data freshness, schema consistency, lineage, and access context are prerequisites. Errors at this layer propagate forward. Reasoning and planning. Agents break objectives into sub-tasks using multi-step reasoning, retrieval-augmented memory, and dependency graphs. This allows parallel execution and failure handling but increases exposure to compounding error when upstream inputs are flawed. Governance checkpoints. Before acting, agents pass through policy checks, confidence thresholds, and risk constraints. Low-confidence or high-impact actions are escalated. High-volume, low-risk actions proceed autonomously. Human oversight models. Enterprises deploy agents under three patterns: human-in-control for high-stakes actions, human-in-the-loop for mixed risk, and limited autonomy where humans intervene only on anomalies. Execution and integration. Actions are performed through APIs, webhooks, and delegated credentials. Mature implementations enforce rate limits, scoped permissions, and reversible operations to contain blast radius. Monitoring and feedback. Systems log every decision path, monitor behavioral drift, classify failure signatures, and feed outcomes back into future decision thresholds. The mechanism is reliable only when every layer is enforced. Missing controls at any point convert reasoning errors into system changes. Analysis: Why This Matters Now Agentic AI introduces agency risk. The system no longer only informs decisions. It executes them. This creates three structural shifts. First, data governance priorities change. Privacy remains necessary, but freshness and integrity become operational requirements. Acting on correct but outdated data produces valid actions with harmful outcomes. Second, reliability engineering changes. Traditional systems assume deterministic flows. Agentic systems introduce nondeterministic but valid paths to a goal. Monitoring must track intent alignment and loop prevention, not just uptime. 
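The governance-checkpoint layer described above can be sketched as a simple gate: high-confidence, low-impact actions proceed autonomously; anything below a confidence floor or above an impact limit escalates to a human. The thresholds and action names here are hypothetical:

```python
# Sketch of a governance checkpoint for agent actions. Thresholds are
# invented for illustration; real systems would load them from policy.
CONFIDENCE_FLOOR = 0.85   # below this, the model is too unsure to act
IMPACT_LIMIT_USD = 500.0  # above this, the blast radius needs a human

def gate(action: str, confidence: float, impact_usd: float) -> str:
    """Return 'execute' or 'escalate' for a proposed agent action."""
    if confidence < CONFIDENCE_FLOOR:
        return "escalate"   # low confidence: defer to a human
    if impact_usd > IMPACT_LIMIT_USD:
        return "escalate"   # high impact: require approval
    return "execute"        # high-volume, low-risk: proceed autonomously

assert gate("reroute_shipment", 0.95, 120.0) == "execute"
assert gate("refund_customer", 0.95, 2_000.0) == "escalate"
assert gate("adjust_ledger", 0.60, 50.0) == "escalate"
```

The value of a gate like this is that the autonomy boundary is explicit and testable, rather than implicit in prompt wording.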
Third, human oversight models evolve. Human-in-the-loop review does not scale when agents operate continuously. Enterprises are moving toward human-on-the-loop supervision, where humans manage exceptions, thresholds, and shutdowns rather than individual actions. These shifts explain why pilots succeed while production deployments stall. Pilots tolerate manual review, brittle integrations, and informal governance. Production systems cannot. What This Looks Like When It Works The pattern that succeeds in production separates volume from judgment. A logistics company deploys an agent to manage carrier selection and shipment routing. The agent operates continuously, processing thousands of decisions per day. Each action is scoped: the agent can select carriers and adjust routes within cost thresholds but cannot renegotiate contracts or override safety holds. Governance is embedded. Confidence below a set threshold triggers escalation. Actions above a dollar limit require human approval. Every decision is logged with full context, and weekly reviews sample flagged cases for drift. The agent handles volume. Humans handle judgment. Neither is asked to do the other’s job. Implications for Enterprises Operational architecture. Integration layers become core infrastructure. Point-to-point connectors fail under scale. Event-driven architectures outperform polling-based designs in both cost and reliability. Governance design. Policies must be enforced as code, not documents. Authority boundaries, data access scopes, confidence thresholds, and escalation logic must be explicit and machine-enforced. Risk management. Enterprises must implement staged autonomy, rollback mechanisms, scoped kill switches, and continuous drift detection. These controls enable autonomy rather than limiting it. Organizational roles. Ownership shifts from model teams to platform, data, and governance functions. 
Managing agent fleets becomes an ongoing operational responsibility, not a deployment milestone. Vendor strategy. Embedded agent platforms gain advantage because governance, integration, and observability are native. This is visible in production deployments from Salesforce, Oracle, ServiceNow, and Ramp. Risks and Open Questions Responsibility attribution. When agents execute compliant individual actions that collectively cause harm, accountability remains unclear across developers, operators, and policy owners. Escalation design. Detecting when an agent should stop and defer remains an open engineering challenge. Meta-cognitive uncertainty detection is still immature. Multi-agent failure tracing. In orchestrated systems, errors propagate across agents. Consider: Agent A flags an invoice discrepancy. Agent B, optimizing cash flow, delays payment. Agent C, managing vendor relationships, issues a goodwill credit. Each followed policy. The combined result is a cash outflow, a confused vendor, and an unresolved invoice. No single agent failed. Root-cause analysis becomes significantly harder. Cost control. Integration overhead, monitoring, and governance often exceed model inference costs. Many pilots underestimate this operational load. Further Reading McKinsey QuantumBlack Deloitte Tech Trends 2026 Gartner agentic AI forecasts Process Excellence Network Databricks glossary on agentic AI Oracle Fusion AI Agent documentation Salesforce Agentforce architecture ServiceNow NowAssist technical briefings