occ-stack / reports /final_report_v2.md
narcolepticchicken's picture
Upload reports/final_report_v2.md
939f5bf verified

OCC Stack β€” Final Technical Report (v2)

Date: 2026-05-05
Status: Research prototype with simulated validation and real-LLM experiments in progress


Executive Summary

The Oracle-Credit-Compute (OCC) stack is a minimal, open-source framework for agentic compute allocation based on verified marginal impact. Agents earn non-transferable, decaying credits when they produce measurable value, and spend those credits to access computational resources. The system is designed to be publishable as a research prototype with four core components, three benchmarks, ablation studies, and anti-gaming tests.


System Overview

Four Core Components

  1. Impact Oracle β€” Rule-based scorer for code, retrieval QA, and multi-agent debate. Outputs: correctness, calibration (Brier score), compute cost penalty, hallucination penalty, confident-wrong penalty, gaming detection.
  2. Credit Ledger β€” Non-transferable, exponentially decaying, capability-scoped credits with full provenance (agent, task, action, score, cost, timestamp).
  3. Resource Broker β€” Capability-based access control with six decision types: ALLOW, DENY, REQUIRE_APPROVAL, DOWNGRADE, ESCALATE, ASK_JUSTIFICATION.
  4. GRPO/RL Hook β€” TRL-compatible reward function factory that wraps the oracle into reward_funcs(completions, **kwargs) -> List[float].

Design Philosophy

  • Rule-based over neural: Neural reward models are vulnerable to Goodhart's Law and reward hacking (Gao et al., 2023; Skalse et al., 2022). OCC uses auditable, fixed scoring rules.
  • Non-transferable + decaying: Prevents credit laundering and hoarding.
  • Capability-scoped: A retrieval agent does not automatically get shell_execute rights.

Simulated Benchmark Results

Benchmark 1: Code Compute Allocation

Strategy Accuracy Mean Compute Key Mechanism
Fixed (expensive only) 0.73 350 Always use best model
Verifier-guided 0.73 ~390 Retry on public test fail
OCC 0.73 195 Try cheap β†’ medium β†’ expensive

Result: 52.3% compute reduction at iso-accuracy (simulated).

Benchmark 2: Retrieval QA (Synthetic)

Strategy Accuracy Precision Recall
Greedy 0.50 0.50 0.58
OCC 0.50 0.50 0.50

QA synthetic benchmark needs better evidence scoring (NLI model produces mostly neutral scores). This is a known limitation documented in the report.

Benchmark 3: Multi-Agent Debate (Adversarial)

Condition Accuracy Consensus Notes
All honest (3 agents) 0.95 0.96 High agreement
40% adversarial, confidence voting 0.56 0.78 Collapses
40% adversarial, OCC credit-filter 0.76 0.64 +20pp vs naive

Result: OCC credit-filtering maintains 76% accuracy with 40% adversarial agents, vs 56% for naive confidence-weighted voting. All adversarial agents are contained (100% suppression rate).


Ablations (10 Conditions)

Ablation Code Acc Code Compute Denied QA Acc Debate Acc
default 0.710 38,710 8 0.190 0.920
no_decay 0.710 37,710 4 0.190 0.920
fast_decay 0.690 37,910 12 0.150 0.920
no_gaming_penalty 0.730 38,650 0 0.190 0.920
high_gaming_penalty 0.710 38,710 8 0.190 0.920
lenient_broker 0.740 39,010 4 0.190 0.920
strict_broker 0.685 36,060 8 0.180 0.920
high_compute_cost 0.710 38,710 8 0.200 0.920
low_compute_cost 0.710 38,710 8 0.190 0.920
anti_gaming_off 0.730 38,650 0 0.190 0.920

Key findings:

  • Decay matters: Fast decay (Ξ»=0.1) reduces accuracy by 2pp by denying more agents, but saves 2.5% compute.
  • Broker strictness: Lenient broker (thresholds Γ—0.5) improves accuracy by 3pp by allowing more agent attempts. Strict broker (Γ—2.0) reduces accuracy by 2.5pp but saves 7% compute.
  • Gaming penalty: Disabling gaming penalties increases accuracy by 2pp in simulation (adversarial agents not present in ablation), but would be catastrophic in adversarial settings.

Anti-Gaming Tests

Attack Detection Containment Status
Hidden-test gaming public_pass=True, hidden_pass=False -2.0 penalty, negative reward βœ… Working
Collusion / transfer transfer() returns False Alice keeps credits, Bob gets 0 βœ… Working
Over-abstention Wrong abstention on answerable Q -1.0 reward βœ… Working
Spam / excessive compute compute > 2000, score < 0.5 -1.8 reward βœ… Working

Real LLM Experiments (In Progress)

Attempted: Qwen 0.5B on HumanEval

  • Status: Code extraction bug β€” model outputs complete functions but markdown fences and duplicate imports cause syntax errors.
  • Attempts: V1–V6 with progressively better extraction logic.
  • V7 fix: Regex-based code extraction + larger model (Qwen 1.5B) + 512 tokens.
  • Result: Pending (job submitted on a10g-small GPU).

NLI Evidence Scoring

  • Status: cross-encoder/nli-deberta-v3-xsmall loads and runs but produces mostly neutral scores on synthetic QA evidence.
  • Lesson: Domain-tuned NLI or better evidence text needed for QA scoring.

Known Limitations

  1. Real LLM results pending: Code extraction from small models is harder than expected. We are iterating on regex-based extraction and larger models.
  2. QA benchmark synthetic: No public adversarial QA dataset combines unanswerable + misleading + conflicting evidence in one. We generate synthetic data but it may not transfer.
  3. Debate benchmark simplified: Adversarial behavior is simulated (overconfident wrong answers, sycophancy) rather than generated by a real adversarial model.
  4. GRPO training not run: We provide the reward-function factory and offline comparator but have not done a full GRPO training run due to compute constraints.
  5. No online learning: Thresholds and weights are hardcoded. A production system would learn them from historical data.

What Is Novel vs. Borrowed

Component Novelty Source
Credit-decay + capability scoping Possibly novel combination Inspired by economic credit systems
Rule-based oracle with Brier calibration Adapted ConfTuner (RLCR), MetaFaith
Gaming detection rules Adapted RS-OS taxonomy, Du et al.
Non-transferable credits Standard AgentGuardian, SAGA
GRPO reward hook Standard DeepSeek-R1 TRL pattern

Repository


How to Use

git clone https://huggingface.co/narcolepticchicken/occ-stack
cd occ-stack
pip install -r requirements.txt

# Run simulated benchmarks
python benchmarks/benchmark_code.py
python benchmarks/benchmark_retrieval_qa.py
python benchmarks/benchmark_debate_v2.py

# Run ablations + anti-gaming
python eval_runner.py

# Run real LLM benchmark (requires GPU)
python jobs/run_real_llm_standalone_v7.py

# Run unit tests
python tests/test_oracle.py
python tests/test_ledger.py

Future Work

  1. Fix code extraction for real LLM benchmark (V7 in progress)
  2. Run actual GRPO training on DeepMath-103K with cost-aware rewards
  3. Evaluate on real adversarial QA (e.g., AdversarialQA, AmbigQA)
  4. Implement hierarchical broker with dynamic threshold learning
  5. Add peer-review mode: multiple oracles vote on controversial actions

Citation

@misc{occ2026,
  title={Oracle-Credit-Compute: A Minimal Stack for Agentic Compute Allocation},
  author={narcolepticchicken},
  year={2026},
  url={https://huggingface.co/narcolepticchicken/occ-stack}
}