
What Federal IT Leaders Need to Know Before Scaling 

Federal agencies are moving quickly to adopt AI-powered development tools, driven by the promise of faster delivery and increased productivity. But many organizations are facing an unexpected challenge as they scale: costs are rising faster and less predictably than traditional cloud models would suggest.  

In AI systems, those costs are largely driven by tokens—the units of text a model processes, including both what users send in and what the model generates in response. 

What’s driving that increase isn’t developer usage alone. It’s the hidden system activity behind every request—context loading, tool orchestration, and session management—that most teams never see. While developers may enter a short prompt, the system often sends significantly more data behind the scenes. 

At RIVA, we tracked usage across a development team using an AI Gateway (Claude on AWS Bedrock) over several months. What we found was striking: less than 1% of tokens came from developer input, while nearly all remaining usage (98.5%) came from tooling overhead—largely invisible without the right telemetry in place.  

That hidden layer has real cost implications. In our case, Bedrock spend increased by more than 900% in just three months. 

For federal IT leaders responsible for balancing innovation with fiscal discipline, this creates a new kind of financial operations (FinOps) problem—one where usage is dynamic, attribution is unclear, and traditional cost controls fall short. 

This post shares what we learned in an effort to help federal IT leaders plan with clearer data, stronger visibility, and fewer surprises. 

What Developers See vs. What the Model Receives
When a developer enters a simple request like “fix the bug in auth.js,” it appears small and straightforward. In reality, the request sent to the model is significantly larger. Before the model even begins generating a response, the tool may have attached system instructions, tool definitions, conversation history, file contents, and broader workspace context. 
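The gap between what a developer types and what the model receives can be sketched with a toy payload. Everything below is illustrative: the context fields and their sizes are placeholders, not any specific tool's payload, and the chars/4 token estimate is a rough heuristic rather than a real tokenizer.

```python
# Illustrative sketch: how a short typed prompt becomes a much larger request
# once system context is attached. Sizes are placeholders; token counts use a
# rough chars/4 heuristic, not a real tokenizer.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

typed = "fix the bug in auth.js"

# Hypothetical context a coding agent might attach alongside the typed prompt.
payload = {
    "system_instructions": "You are a coding assistant." + " policy text " * 200,
    "tool_schemas": '{"name": "read_file", "input_schema": {}}' * 40,
    "conversation_history": "previous turns..." * 300,
    "file_contents": "// auth.js\n" + "function login() { /* ... */ }\n" * 150,
    "user_message": typed,
}

total = sum(approx_tokens(v) for v in payload.values())
print(f"typed input:  ~{approx_tokens(typed)} tokens")
print(f"full request: ~{total} tokens")
print(f"typed share:  {approx_tokens(typed) / total:.1%}")
```

Even with made-up sizes, the typed prompt ends up as a fraction of a percent of the request, which is the same shape our telemetry shows.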

Where Do the Tokens Actually Go?
[Chart: production telemetry, 6,066 requests across 4 developers over 62 days. 99.6% of tokens are not what the developer typed.]

  • Tooling overhead: 98.5% (system prompts, tool schemas, conversation history, cached context, workspace state) 
  • Model output: 1.3% (the AI’s actual response plus internal reasoning tokens) 
  • Developer input: 0.4% (what the developer actually typed; avg 247 tokens per request) 

Source: RIVA AI Gateway telemetry | Claude Sonnet 4.5 on AWS Bedrock | Agentic coding tools

Where the Tokens Actually Go
Our telemetry across 6,066 requests tells a consistent story: 

  • 98.5% of tokens come from tooling overhead 
  • 1.3% from the model’s response 
  • 0.4% from developer input 

Different agentic tools can generate 2 to 5 times more overhead for the same task. Without telemetry into this layer, organizations are not measuring their AI spend—they’re estimating it. 
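A quick back-of-envelope shows what these shares imply per request, assuming the per-request mix matches the aggregate telemetry:

```python
# Back-of-envelope from the telemetry shares reported above, assuming the
# per-request mix matches the aggregate numbers.
avg_typed_tokens = 247    # developer input, measured average per request
typed_share = 0.004       # developer input is 0.4% of all tokens

implied_total = avg_typed_tokens / typed_share   # full request, all sources
overhead = implied_total * 0.985                 # tooling-overhead portion

print(f"implied tokens per request: ~{round(implied_total):,}")
print(f"of which tooling overhead:  ~{round(overhead):,}")
```

Under that assumption, a 247-token prompt rides along with a request on the order of 60,000 tokens, nearly all of it overhead the developer never sees.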

What This Means for Cost
The productivity gains from agentic AI tools are real. The question is not whether to adopt them—it is how to budget for them with sufficient insight into actual consumption. 

Subscription tools such as GitHub Copilot, Claude Cowork, and Cursor bundle usage into flat pricing. This simplifies budgeting but limits transparency. API-based tools, by contrast, provide clearer insight into usage, which becomes essential as adoption grows. 

What AI Coding Tools Actually Consume
[Chart: measured from a 62-day gateway pilot, 4 developers, 6,066 requests]

  • $47 per developer per month (gateway-measured, with caching; observation only, no limits in place) 
  • $157 per developer per month (gateway-measured, without caching; same usage, no cache) 

But the full Bedrock bill told a different story: Oct ’25 $173, Nov ’25 $652, Dec ’25 $768, Jan ’26 $1,812. That is 10x growth in three months, with no limits in place.

Annual planning range at 50 developers (gateway-measured rate): $28,200 with caching, $94,200 without. Actual account costs may be 3-5x higher if usage bypasses the gateway.

Industry context (published data): 
  • GitHub Copilot Business: $19/user/month (subscription) 
  • Cursor Pro: $20/user/month (subscription) 
  • RIVA gateway (measured): $47/dev/month (API, cached) 
  • RIVA total Bedrock (account-level): $175+/dev/month 
  • Claude Code avg (Anthropic published): ~$180/month 
  • Heavy agentic usage (reported): $800-$2,800/month 

Sources: RIVA AI Gateway telemetry | AWS Bedrock billing | Anthropic published data | GitHub, Cursor pricing pages

What AI Spend Actually Looks Like in Practice 

RIVA’s initial approach relied on direct Bedrock access, where reporting was available but fragmented across cloud-native logs, billing data, and manual reconciliation. That made it difficult to attribute usage consistently at the per-developer or per-workflow level, or to apply practical team-level controls without an additional gateway layer. 

Introducing an AI Gateway changed that. 

Over a 62-day period, we routed a four-developer team through the gateway and observed costs of $47 per developer per month with caching enabled. Without caching, the same usage would have reached $157 per developer per month. 

The gateway functioned purely as an observation layer—providing per-request and per-developer attribution that the cloud bill alone could not offer. 

Across the broader AWS Bedrock account—including usage outside the gateway—costs grew from $173 per month to over $1,800 per month in three months. When we compared gateway data with the full cloud bill, the effective cost exceeded $175 per developer per month. 

The difference was not caused by gateway controls, but by gaps in visibility. Usage outside the gateway was not apparent until reconciled against the invoice. 

At scale, this becomes meaningful quickly. At 50 developers, even a modest $47 per developer per month translates to $28,200 annually with caching, or $94,200 without it. If total usage is not fully visible, the actual cost can be significantly higher. 
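The planning math is straightforward to reproduce:

```python
# Annual planning range from the gateway-measured rates above.
developers = 50
monthly_cached = 47       # $/developer/month, caching enabled
monthly_uncached = 157    # $/developer/month, same usage without caching

annual_cached = developers * monthly_cached * 12      # $28,200
annual_uncached = developers * monthly_uncached * 12  # $94,200

print(f"with caching:    ${annual_cached:,}/year")
print(f"without caching: ${annual_uncached:,}/year")
```

These figures assume all usage flows through the gateway; any traffic outside it sits on top of this range.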

The Cache Tax on Real Work Patterns
Cloud providers can use caching—storing repeated prompt content so it can be reused instead of processed from scratch—to reduce cost. But the savings depend on cache settings and whether work resumes before the cache expires. When the cache is no longer active, the repeated context must be rebuilt, which can sharply increase the cost of an otherwise routine session. 

To understand the impact, our team modeled an 80-turn, 3-hour coding session using Bedrock’s default 5-minute cache window. We included three common interruptions: a code review, a lunch break, and a standup. In each case, the pause caused the cache to expire, forcing the next turn to rebuild context rather than continue at the lower cached rate. 

How Normal Breaks Create Cost Spikes
[Chart: an 80-turn, 3-hour coding session with a 5-minute cache window]

  • Session cost with caching: $5.42 
  • Same session without caching: $26.05 
  • Caching savings: 79% 
  • Cache rebuild cost: $1.33 (24.6% of total) 
  • Rebuild turn vs. steady-state turn: 7.1x 

Three breaks longer than 5 minutes triggered $1.33 in cache rebuilds, nearly a quarter of the session’s total cost. Each spike is the price of a coffee break, a code review, or a standup meeting. As context grows through the session, each rebuild gets more expensive.

Source: Modeled session | Claude Sonnet 4.5 on AWS Bedrock | 5-min cache TTL

Those rebuilds added 24.6% to the total session cost. Each rebuild turn costs more than seven times a steady-state turn. 
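The dynamic behind those rebuild spikes can be sketched with a toy simulation. The per-token rates, context growth, and break placement below are assumptions for illustration, not the actual session model or Bedrock pricing behind our figures:

```python
# Simplified sketch of the cache-expiry dynamic. All rates and session
# parameters are illustrative assumptions, not actual Bedrock pricing.

FULL_RATE = 3.00 / 1_000_000    # $/token to process uncached input (assumed)
CACHED_RATE = 0.30 / 1_000_000  # $/token to reuse cached input (assumed)

def session_cost(turns=80, ctx_growth=2_000, break_turns=(25, 45, 65)):
    """Cost of a session whose context grows each turn. A break longer than
    the cache TTL expires the cache, so the next turn reprocesses the whole
    accumulated context at the full rate instead of the cached rate."""
    cost, context = 0.0, 10_000               # starting context (tokens)
    for turn in range(1, turns + 1):
        rebuild = turn == 1 or (turn - 1) in break_turns
        cost += context * (FULL_RATE if rebuild else CACHED_RATE)
        context += ctx_growth                 # history keeps accumulating
    return cost

with_breaks = session_cost()
no_breaks = session_cost(break_turns=())
share = (with_breaks - no_breaks) / with_breaks
print(f"with 3 breaks: ${with_breaks:.2f}")
print(f"no breaks:     ${no_breaks:.2f}")
print(f"rebuild share of total: {share:.0%}")
```

Even with made-up rates, three mid-session breaks push rebuild costs to roughly a quarter of the total, and later breaks cost more because the context they force the system to reprocess has grown.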

As sessions grow longer, costs compound in ways that are not always intuitive. When context is lost, output quality can decline, leading to rework and additional token usage. 

Why This Matters for Federal Organizations
For federal IT leaders, the risk is not just AI adoption. It is adopting without cost attribution, usage visibility, and governance.  

Traditional cloud FinOps models were built around predictable consumption. Agentic AI introduces a more dynamic model, where the system determines how much compute, or processing power, is used. 

This introduces several key risks: 

  • Uncontrolled spend without per-developer attribution 
  • Shadow usage outside procurement visibility 
  • Misaligned tool selection based on features rather than cost 
  • Mission impact when unplanned AI costs divert funding from mission priorities 

What to Do Next
You do not need to solve AI governance all at once. Start with a 30-day gateway pilot across a small team. This provides baseline data for budgeting, attribution for procurement, and insight into usage patterns. Then reconcile that data against your cloud bill—because differences between the two often reveal the most important insights. 
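That reconciliation step can be as simple as comparing gateway-attributed totals against the invoice; the figures below are made up for illustration:

```python
# Illustrative reconciliation: compare gateway-attributed spend against the
# full cloud invoice to surface usage the gateway never saw.
# All dollar amounts here are hypothetical.

gateway_by_dev = {"dev-a": 52.10, "dev-b": 44.80, "dev-c": 49.30, "dev-d": 41.95}
bedrock_invoice = 612.40   # full account spend for the month (hypothetical)

attributed = sum(gateway_by_dev.values())
unattributed = bedrock_invoice - attributed

print(f"attributed via gateway: ${attributed:.2f}")
print(f"outside the gateway:    ${unattributed:.2f} "
      f"({unattributed / bedrock_invoice:.0%} of the bill)")
```

A large unattributed remainder is exactly the signal we saw in our own pilot: usage was bypassing the gateway and only appeared once reconciled against the invoice.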

At RIVA, we’re taking this a step further by introducing a hybrid model: using subscription-based tools for predictable usage, with pay-as-you-go access to Bedrock models for more demanding workloads, all governed through a single gateway. This creates a controlled baseline while preserving flexibility as usage scales. 

The key is visibility. Without it, organizations are not measuring AI cost; they are estimating it. The hidden token cost of AI tools is not a reason to slow adoption—it is a reason to plan with better data. 

 

Want to learn more about how RIVA is approaching AI governance?
Reach out to Bernie Pineau to start a conversation.

 

This post is Part 1 of a three-part series on governing AI cost in federal environments. In Part 2, we’ll move from visibility to control—breaking down how to track usage, attribute costs across teams and tools, and establish the governance layer required to manage AI consumption at scale.