
OCC: Making AI Agents Earn Their Compute

The problem with AI agents today

Every time an AI agent makes a tool call, runs a verifier, or speaks in a debate, it costs real money. GPUs aren't free. But today's agent systems allocate compute uniformly: every agent gets equal turns, every retrieval call draws on the same budget, and every debate round burns the same GPU-seconds.

This is like giving every employee in a company the same salary regardless of what they produce. Inevitably, some agents produce garbage while consuming the same resources as high-performing ones.

Introducing OCC: Oracle-Credit-Compute

OCC is a system where AI agents earn credits by proving their actions actually help. Think of it as a micro-economy inside your AI system:

  1. Impact Oracle: A rule-based scorer that evaluates whether an agent action produced measurable value. It uses no neural network, which means no self-reinforcing bias loops.

  2. Credit Ledger: Credits are non-transferable (no laundering), decay over time (no hoarding), and are scoped to specific capabilities (retrieval credits ≠ write credits). Every transaction is logged with provenance.

  3. Resource Broker: Gates access to expensive operations. An agent with retrieval credits can't use them for shell execution. The broker has 6 decision levels: ALLOW, DENY, REQUIRE_APPROVAL, DOWNGRADE, ESCALATE, ASK_JUSTIFICATION.

  4. GRPO Reward Hook: Compatible with reinforcement learning (TRL's GRPO trainer). The reward formula balances correctness, evidence support, calibration, abstention utility, and resource cost, while penalizing confident-wrong answers and gaming.
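
The ledger and broker rules above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the occ-stack implementation: `ToyLedger`, `decide`, the half-life parameter, and the two-outcome gate are all hypothetical names and simplifications introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToyLedger:
    """Hypothetical ledger: balances are keyed by (agent, capability)."""
    half_life_s: float = 3600.0  # assumed decay half-life, not taken from OCC
    _balances: dict = field(default_factory=lambda: defaultdict(float))
    _updated: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def _decay(self, key, now):
        # Exponential decay since the last update: b * 0.5 ** (dt / half_life)
        last = self._updated.get(key, now)
        self._balances[key] *= 0.5 ** ((now - last) / self.half_life_s)
        self._updated[key] = now

    def earn(self, agent, capability, amount, now):
        key = (agent, capability)
        self._decay(key, now)
        self._balances[key] += amount
        self.log.append(("earn", agent, capability, amount, now))

    def balance(self, agent, capability, now):
        key = (agent, capability)
        self._decay(key, now)
        return self._balances[key]

    def spend(self, agent, capability, cost, now):
        # There is deliberately no transfer(): credits never move between agents.
        if self.balance(agent, capability, now) < cost:
            return False
        self._balances[(agent, capability)] -= cost
        self.log.append(("spend", agent, capability, cost, now))
        return True


def decide(ledger, agent, capability, cost, now):
    """Two-outcome gate for illustration; the real broker has six decisions."""
    return "ALLOW" if ledger.spend(agent, capability, cost, now) else "DENY"


ledger = ToyLedger(half_life_s=100.0)
ledger.earn("agent_1", "retrieval", 4.0, now=0.0)
print(decide(ledger, "agent_1", "shell", 1.0, now=0.0))      # DENY: wrong scope
print(decide(ledger, "agent_1", "retrieval", 1.0, now=0.0))  # ALLOW: 3.0 left
print(round(ledger.balance("agent_1", "retrieval", now=100.0), 2))  # 1.5
```

Keying balances by (agent, capability) is what makes scoping cheap to enforce: a retrieval balance is simply invisible to a shell request, and the absence of any transfer method rules out laundering by construction.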

Real Results

Code Generation (HumanEval)

Using Qwen3-Coder-30B-A3B-Instruct (30B MoE, 3.3B active params, Apache 2.0):

| Strategy | Pass@1 | Tokens |
| --- | --- | --- |
| OCC tiered (128→1024 tokens) | 75.0% (123/164) | 21,043 |
| Fixed budget (1024 tokens) | 75.0% | 167,936 |
| Savings | | 87.5% |

62.8% of HumanEval problems are solvable in just 128 tokens. Why burn 1024 tokens on every problem when most need a fraction of that?
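
A tiered budget of this kind could be sketched as below. This is a minimal illustration, not the occ-stack pipeline: `solve_tiered`, `generate`, and `passes` are hypothetical names, and the budget accounting (summing the cap of every tier attempted) is an assumption about how one might count cost. The last lines only re-derive the savings figure from the token counts reported in the table.

```python
from typing import Callable, Sequence, Tuple


def solve_tiered(
    prompt: str,
    generate: Callable[[str, int], str],
    passes: Callable[[str], bool],
    tiers: Sequence[int] = (128, 1024),
) -> Tuple[str, bool, int]:
    """Try each budget in order; stop at the first completion that passes.

    Returns (completion, solved, budget_spent), where budget_spent sums the
    max_new_tokens caps of every tier attempted (an upper bound on tokens used).
    """
    spent = 0
    completion = ""
    for budget in tiers:
        completion = generate(prompt, budget)
        spent += budget
        if passes(completion):
            return completion, True, spent
    return completion, False, spent


def fake_generate(prompt: str, max_new_tokens: int) -> str:
    # Stand-in for a model call: the "solution" needs 300 characters of room.
    return "x" * min(300, max_new_tokens)


result = solve_tiered("def f():", fake_generate, lambda c: len(c) == 300)
print(result[1:])  # (True, 1152): the 128 tier failed, the 1024 tier passed

# Re-deriving the reported savings from the table's token counts.
fixed_tokens = 164 * 1024          # 167,936
savings = 1 - 21_043 / fixed_tokens
print(f"{savings:.1%}")            # 87.5%
```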

Multi-Agent Debate

30 factual yes/no questions, 3 honest + 1 adversarial agent per topic:

| Method | Accuracy | Tokens |
| --- | --- | --- |
| Equal turns (1 round) | 53.3% | 61,440 |
| OCC credit allocation (3 rounds with broker) | 83.3% | 138,752 |

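That is a 30-point gain in accuracy (56% relative), though the OCC run used about 2.3× the tokens, so the comparison is not iso-compute. The broker denied 12 low-quality agent-turns: agents that couldn't earn credits got shut out.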

Anti-Gaming: 100% Detection

Non-transferable + decaying + capability-scoped credits prevent all 8 tested attack types with zero credit leakage: spam, hoarding, indirect transfer, verbose debate, over-abstention, overuse retrieval, confidence manipulation, and collusion.

What's novel?

The RS-OS taxonomy (arXiv:2605.02801), a comprehensive May 2026 survey of 84 papers on multi-agent resource allocation, confirms that no prior system combines:

  • Non-transferable credits (prevents laundering between colluding agents)
  • Exponential decay (prevents hoarding across tasks)
  • Capability-scoped access (retrieval rights ≠ file-write rights)
  • Cost-adjusted marginal impact reward (punishes confident-wrong, rewards abstention)

OCC directly addresses 4 of the 15 open problems identified in RS-OS.
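
To make the cost-adjusted reward idea concrete, here is a toy scoring function. The post does not give the actual OCC formula, so everything here is an assumption for illustration: the name `toy_reward`, the functional form, and the default weights are hypothetical and cover only correctness, confidence, abstention, and cost.

```python
def toy_reward(correct, confidence, abstained, cost, *,
               abstain_value=0.2, wrong_penalty=1.0, cost_weight=0.1):
    """Illustrative only: the weights and functional form are assumptions."""
    if abstained:
        base = abstain_value                 # small, safe payoff for abstaining
    elif correct:
        base = confidence                    # reward scales with stated confidence
    else:
        base = -wrong_penalty * confidence   # confident-wrong answers hurt most
    return base - cost_weight * cost         # every action pays for its compute


print(round(toy_reward(True, 0.9, False, cost=1.0), 2))   # 0.8
print(round(toy_reward(False, 0.9, False, cost=1.0), 2))  # -1.0
print(round(toy_reward(False, 0.0, True, cost=0.0), 2))   # 0.2
```

Under these assumed weights, an agent that is unsure does better by abstaining than by guessing with high stated confidence, which is the behavior the reward is meant to encourage.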

The 9 Failed Jobs (that taught us everything)

Before the 75% HumanEval result, we ran 9 H200 jobs with Qwen2.5-Coder-7B-Instruct, all at 0% pass@1. The problem wasn't model capability (Qwen2.5-Coder gets 88% on HumanEval). It was prompt engineering:

  • Instruct models wrap code in "Here is a Python solution..." prose, no matter how strongly you tell them not to
  • Concatenating prompt + generation creates IndentationErrors if the indentation doesn't match
  • Chat template vs. completion format: the difference is 0% vs. 75% pass@1

The fix was simple: completion format (raw function signature, no chat template), stop-token trimming, and switching to a stronger model (Qwen3-Coder-30B). Pipeline engineering is everything.
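
The completion-format fix and stop-token trimming described above might look roughly like this. It is a sketch, not the occ-stack code: `trim_at_stop`, `build_program`, and the particular `STOP_SEQUENCES` are assumptions (the stop strings follow a common HumanEval convention of cutting at the next top-level definition).

```python
# Assumed stop strings: anything that starts a new top-level block.
STOP_SEQUENCES = ("\ndef ", "\nclass ", "\nif __name__", "\nprint(")


def trim_at_stop(generation: str, stops=STOP_SEQUENCES) -> str:
    """Cut the generation at the earliest stop sequence, if any appears."""
    cut = len(generation)
    for stop in stops:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]


def build_program(signature_prompt: str, generation: str) -> str:
    """Completion format: the model continues the raw signature, so the body's
    indentation lines up with the prompt and the two can be concatenated."""
    return signature_prompt + trim_at_stop(generation)


prompt = 'def add(a, b):\n    """Return a + b."""\n'
raw = "    return a + b\n\ndef unrelated():\n    pass\n"
program = build_program(prompt, raw)

namespace = {}
exec(program, namespace)        # would raise IndentationError on a mismatch
print(namespace["add"](2, 3))   # 5
```

Because the model is asked to continue the function body directly, there is no prose wrapper to strip, and trimming only has to remove whatever the model generates after the function ends.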

Try it

All code is open-source at narcolepticchicken/occ-stack.

from occ.oracle.oracle import ImpactOracle
from occ.ledger.ledger import CreditLedger
from occ.broker.broker import ResourceBroker

# Score an agent action (action, context, and result come from your own agent loop)
oracle = ImpactOracle()
score = oracle.score(action, context, result)

# Earn credits based on verified impact
ledger = CreditLedger()
entry = ledger.earn("agent_1", "task_1", "action_1",
                     earned=score["reward_value"],
                     oracle_score=score["raw_score"])

# Check if agent can access a resource
broker = ResourceBroker(ledger, oracle)
decision = broker.decide("agent_1", "retrieval", context)
# → Decision.ALLOW or Decision.DENY

What's next

  1. GRPO training with the OCC reward hook on a 1.5B model: validates the reward end-to-end
  2. Iso-compute debate: a 3-round equal-turns baseline for a fair comparison with OCC
  3. Raise the short-tier budget to 256 tokens: many HumanEval failures are 128-token truncation artifacts

Built with ML Intern on Hugging Face. All code open-source. Real LLM results on H200 with Qwen3-Coder-30B-A3B-Instruct (Apache 2.0).

Repository: https://huggingface.co/narcolepticchicken/occ-stack