OCC Stack — Final Technical Report (v2)

Date: 2026-05-05
Status: Research prototype with simulated validation and real-LLM experiments in progress

Executive Summary

The Oracle-Credit-Compute (OCC) stack is a minimal, open-source framework for agentic compute allocation based on verified marginal impact. Agents earn non-transferable, decaying credits when they produce measurable value, and spend those credits to access computational resources. The system is designed to be publishable as a research prototype with four core components, three benchmarks, ablation studies, and anti-gaming tests.

System Overview

Four Core Components

Impact Oracle — Rule-based scorer for code, retrieval QA, and multi-agent debate. Outputs: correctness, calibration (Brier score), compute cost penalty, hallucination penalty, confident-wrong penalty, gaming detection.
Credit Ledger — Non-transferable, exponentially decaying, capability-scoped credits with full provenance (agent, task, action, score, cost, timestamp).
Resource Broker — Capability-based access control with six decision types: ALLOW, DENY, REQUIRE_APPROVAL, DOWNGRADE, ESCALATE, ASK_JUSTIFICATION.
GRPO/RL Hook — TRL-compatible reward function factory that wraps the oracle into reward_funcs(completions, **kwargs) -> List[float].

Design Philosophy

Rule-based over neural: Neural reward models are vulnerable to Goodhart's Law and reward hacking (Gao et al., 2023; Skalse et al., 2022). OCC uses auditable, fixed scoring rules.
Non-transferable + decaying: Prevents credit laundering and hoarding.
Capability-scoped: A retrieval agent does not automatically get shell_execute rights.

Simulated Benchmark Results

Benchmark 1: Code Compute Allocation

Strategy	Accuracy	Mean Compute	Key Mechanism
Fixed (expensive only)	0.73	350	Always use best model
Verifier-guided	0.73	~390	Retry on public test fail
OCC	0.73	195	Try cheap → medium → expensive

Result: 52.3% compute reduction at iso-accuracy (simulated).

Benchmark 2: Retrieval QA (Synthetic)

Strategy	Accuracy	Precision	Recall
Greedy	0.50	0.50	0.58
OCC	0.50	0.50	0.50

QA synthetic benchmark needs better evidence scoring (NLI model produces mostly neutral scores). This is a known limitation documented in the report.

Benchmark 3: Multi-Agent Debate (Adversarial)

Condition	Accuracy	Consensus	Notes
All honest (3 agents)	0.95	0.96	High agreement
40% adversarial, confidence voting	0.56	0.78	Collapses
40% adversarial, OCC credit-filter	0.76	0.64	+20pp vs naive

Result: OCC credit-filtering maintains 76% accuracy with 40% adversarial agents, vs 56% for naive confidence-weighted voting. All adversarial agents are contained (100% suppression rate).

Ablations (10 Conditions)

Ablation	Code Acc	Code Compute	Denied	QA Acc	Debate Acc
default	0.710	38,710	8	0.190	0.920
no_decay	0.710	37,710	4	0.190	0.920
fast_decay	0.690	37,910	12	0.150	0.920
no_gaming_penalty	0.730	38,650	0	0.190	0.920
high_gaming_penalty	0.710	38,710	8	0.190	0.920
lenient_broker	0.740	39,010	4	0.190	0.920
strict_broker	0.685	36,060	8	0.180	0.920
high_compute_cost	0.710	38,710	8	0.200	0.920
low_compute_cost	0.710	38,710	8	0.190	0.920
anti_gaming_off	0.730	38,650	0	0.190	0.920

Key findings:

Decay matters: Fast decay (λ=0.1) reduces accuracy by 2pp by denying more agents, but saves 2.5% compute.
Broker strictness: Lenient broker (thresholds ×0.5) improves accuracy by 3pp by allowing more agent attempts. Strict broker (×2.0) reduces accuracy by 2.5pp but saves 7% compute.
Gaming penalty: Disabling gaming penalties increases accuracy by 2pp in simulation (adversarial agents not present in ablation), but would be catastrophic in adversarial settings.

Anti-Gaming Tests

Attack	Detection	Containment	Status
Hidden-test gaming	`public_pass=True, hidden_pass=False`	-2.0 penalty, negative reward	✅ Working
Collusion / transfer	`transfer()` returns False	Alice keeps credits, Bob gets 0	✅ Working
Over-abstention	Wrong abstention on answerable Q	-1.0 reward	✅ Working
Spam / excessive compute	compute > 2000, score < 0.5	-1.8 reward	✅ Working

Real LLM Experiments (In Progress)

Attempted: Qwen 0.5B on HumanEval

Status: Code extraction bug — model outputs complete functions but markdown fences and duplicate imports cause syntax errors.
Attempts: V1–V6 with progressively better extraction logic.
V7 fix: Regex-based code extraction + larger model (Qwen 1.5B) + 512 tokens.
Result: Pending (job submitted on a10g-small GPU).

NLI Evidence Scoring

Status: cross-encoder/nli-deberta-v3-xsmall loads and runs but produces mostly neutral scores on synthetic QA evidence.
Lesson: Domain-tuned NLI or better evidence text needed for QA scoring.

Known Limitations

Real LLM results pending: Code extraction from small models is harder than expected. We are iterating on regex-based extraction and larger models.
QA benchmark synthetic: No public adversarial QA dataset combines unanswerable + misleading + conflicting evidence in one. We generate synthetic data but it may not transfer.
Debate benchmark simplified: Adversarial behavior is simulated (overconfident wrong answers, sycophancy) rather than generated by a real adversarial model.
GRPO training not run: We provide the reward-function factory and offline comparator but have not done a full GRPO training run due to compute constraints.
No online learning: Thresholds and weights are hardcoded. A production system would learn them from historical data.

What Is Novel vs. Borrowed

Component	Novelty	Source
Credit-decay + capability scoping	Possibly novel combination	Inspired by economic credit systems
Rule-based oracle with Brier calibration	Adapted	ConfTuner (RLCR), MetaFaith
Gaming detection rules	Adapted	RS-OS taxonomy, Du et al.
Non-transferable credits	Standard	AgentGuardian, SAGA
GRPO reward hook	Standard	DeepSeek-R1 TRL pattern

Repository

HF Bucket: https://huggingface.co/narcolepticchicken/occ-stack
Files: 45 files, 272.4 KB
Structure: oracle/, ledger/, broker/, rl/, benchmarks/, tests/, reports/, jobs/

How to Use

git clone https://huggingface.co/narcolepticchicken/occ-stack
cd occ-stack
pip install -r requirements.txt

# Run simulated benchmarks
python benchmarks/benchmark_code.py
python benchmarks/benchmark_retrieval_qa.py
python benchmarks/benchmark_debate_v2.py

# Run ablations + anti-gaming
python eval_runner.py

# Run real LLM benchmark (requires GPU)
python jobs/run_real_llm_standalone_v7.py

# Run unit tests
python tests/test_oracle.py
python tests/test_ledger.py

Future Work

Fix code extraction for real LLM benchmark (V7 in progress)
Run actual GRPO training on DeepMath-103K with cost-aware rewards
Evaluate on real adversarial QA (e.g., AdversarialQA, AmbigQA)
Implement hierarchical broker with dynamic threshold learning
Add peer-review mode: multiple oracles vote on controversial actions

Citation

@misc{occ2026,
  title={Oracle-Credit-Compute: A Minimal Stack for Agentic Compute Allocation},
  author={narcolepticchicken},
  year={2026},
  url={https://huggingface.co/narcolepticchicken/occ-stack}
}