OCC Stack: Final Status
What Ships
✅ Done (production quality)
- Impact Oracle (`oracle/oracle.py`): Rule-based scoring for code/QA/debate. Detects hidden-test gaming, rewards abstention, and applies Brier-score calibration and a compute-cost penalty.
- Credit Ledger (`ledger/ledger.py`): Non-transferable, decaying, capability-scoped credits with full provenance.
- Resource Broker (`broker/broker.py`): Capability-based gating with 6 decision types and risk classes.
- GRPO/RL Hook (`rl/reward.py`, `rl/grpo_hook.py`): TRL-compatible reward function + offline comparator.
- Literature Review (`reports/literature_review.md` + RS-OS paper comparison in `reports/report.md`).
- Blog Post (`reports/blog_post.md`).
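The Impact Oracle's scoring ingredients (abstention reward, Brier-score calibration, compute-cost penalty) could be combined roughly as in the sketch below. This is not the code in `oracle/oracle.py`: the function name `score_answer`, the weights, and the way the terms are added up are all assumptions made for illustration.

```python
from typing import Optional


def brier_score(confidence: float, correct: bool) -> float:
    """Squared error between stated confidence and the 0/1 outcome (lower is better)."""
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2


def score_answer(
    correct: Optional[bool],        # None means the agent abstained
    confidence: float,              # stated probability of being correct, in [0, 1]
    tokens_used: int,
    abstain_reward: float = 0.2,    # hypothetical weight
    cost_per_token: float = 1e-5,   # hypothetical weight
) -> float:
    """Hypothetical reward: correctness, minus miscalibration, minus compute cost."""
    cost = cost_per_token * tokens_used
    if correct is None:
        # Abstaining earns a small fixed reward: better than a confident
        # wrong answer, worse than a correct one.
        return abstain_reward - cost
    base = 1.0 if correct else 0.0
    return base - brier_score(confidence, correct) - cost
```

With these illustrative weights, a confidently wrong answer scores below an abstention, which in turn scores below a correct answer, so guessing is not rewarded over admitting uncertainty.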
✅ Done (simulated benchmarks)
- Code Compute Allocation (`benchmarks/benchmark_code.py`): 52.3% compute savings at iso-accuracy.
- Retrieval QA (`benchmarks/benchmark_retrieval_qa.py`, `_nli.py`): OCC underperforms, an honest negative result.
- Debate v2 (`benchmarks/benchmark_debate_v2.py`): 43.2% savings at iso-accuracy, adversarial containment.
- Anti-Gaming (`eval_runner.py`): 100% hidden-test detection, credit exhaustion for spam.
- Ablations: 10 ablations measuring each mechanism's contribution.
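To illustrate how decaying, capability-scoped credits can lead to credit exhaustion for a spamming agent, here is a minimal sketch. It is not the implementation in `ledger/ledger.py`: the class `CreditLedger`, its method names, and the exponential half-life decay are assumptions chosen for the example.

```python
from collections import defaultdict


class CreditLedger:
    """Hypothetical ledger: credits are held per (agent, capability) and decay over time.

    There is deliberately no transfer method, so credits are non-transferable.
    """

    def __init__(self, half_life: float = 10.0):
        self.half_life = half_life           # time steps for a balance to halve
        self.balances = defaultdict(float)   # (agent, capability) -> credits
        self.last_update = {}                # (agent, capability) -> time of last change
        self.log = []                        # provenance: every grant and spend

    def _decay(self, key, now: float) -> None:
        # Apply exponential decay for the time elapsed since the last update.
        elapsed = now - self.last_update.get(key, now)
        self.balances[key] *= 0.5 ** (elapsed / self.half_life)
        self.last_update[key] = now

    def grant(self, agent: str, capability: str, amount: float,
              now: float, reason: str) -> None:
        key = (agent, capability)
        self._decay(key, now)
        self.balances[key] += amount
        self.log.append(("grant", agent, capability, amount, now, reason))

    def spend(self, agent: str, capability: str, amount: float, now: float) -> bool:
        """Deduct credits; refuse (return False) if the balance is too low."""
        key = (agent, capability)
        self._decay(key, now)
        if self.balances[key] < amount:
            return False
        self.balances[key] -= amount
        self.log.append(("spend", agent, capability, amount, now, ""))
        return True
```

An agent granted 3 credits for a hypothetical `debate_turn` capability can make three 1-credit requests and is refused from the fourth onward; credits left unused also shrink over time, so they cannot be hoarded.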
✅ Done (external validation)
- Debate v2 job `69fa273ab745af80fb373135`: COMPLETED. Results at `reports/debate_v2_results.json`.
  - OCC: 0.930 accuracy, 2,890 mean compute, a 43.2% saving vs. equal turns (5,087)
  - Confidence-weighted voting with adversarial agents: dangerous (amplifies overconfident wrong answers)
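The danger of confidence-weighted voting can be seen in a toy example. The votes below are made up for illustration and are not taken from `reports/debate_v2_results.json`; the function names are hypothetical as well.

```python
from collections import defaultdict


def majority_vote(votes):
    """votes: list of (answer, confidence) pairs. Each vote counts once."""
    tally = defaultdict(float)
    for answer, _confidence in votes:
        tally[answer] += 1.0
    return max(tally, key=tally.get)


def confidence_weighted_vote(votes):
    """Each vote counts in proportion to its self-reported confidence."""
    tally = defaultdict(float)
    for answer, confidence in votes:
        tally[answer] += confidence
    return max(tally, key=tally.get)


# Illustrative panel: 3 honest agents (moderately confident, correct answer "A")
# and 2 adversarial agents (maximally confident, wrong answer "B").
votes = [("A", 0.6), ("A", 0.6), ("A", 0.6), ("B", 1.0), ("B", 1.0)]

print(majority_vote(votes))             # A (3 votes vs. 2)
print(confidence_weighted_vote(votes))  # B (weight 2.0 vs. about 1.8)
```

Because confidence is self-reported, an adversarial minority can outweigh an honest majority simply by claiming certainty.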
⚠️ Blocked (real LLM)
- 4 GPU jobs attempted; all 4 failed due to model capability or infrastructure limits:
  - Qwen-Coder-0.5B: chat template mismatch, so all answers were wrong
  - Qwen-Coder-0.5B v2: chat template fixed; the model generates code but has a 0% pass rate (0.5B is too weak for HumanEval)
  - Qwen-Coder-0.5B v3: robust extraction, same 0% pass rate (model capability floor)
  - StarCoder2-3B: model loading timed out before generation (3B download too slow on the provisioned T4)
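For context on the "robust extraction" step in v3, the sketch below shows one common way to pull code out of a chat-style completion that may or may not wrap it in a Markdown fence. The actual extraction logic used in the jobs is not shown in this report, so the function name `extract_code` and the regular expression are assumptions.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

# Match a fenced block with an optional language tag and capture its body.
_FENCED = re.compile(
    re.escape(FENCE) + r"[\w+-]*[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_code(completion: str) -> str:
    """Return the first fenced block if there is one, else the stripped completion."""
    match = _FENCED.search(completion)
    if match:
        return match.group(1).strip()
    return completion.strip()
```

Extraction of this kind only removes formatting noise. It cannot repair incorrect code, which is consistent with v3 showing the same 0% pass rate as v2.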
❌ Not Done
- GRPO training (needs GPU + TRL, not attempted due to sandbox rate-limiting)
- Retrieval QA with domain-tuned NLI
- Real LLM results for code benchmark
Key Numbers
| Benchmark | OCC Result | Best Baseline | Savings |
|---|---|---|---|
| Code allocation (sim) | 0.780 acc, 8,350 tokens | 0.780 acc, 17,500 tokens | 52.3% |
| Debate v2 (40% adversarial) | 0.930 acc, 2,890 tokens | 0.930 acc, 5,087 tokens | 43.2% |
| Anti-gaming detection | 100% | n/a | n/a |
| Retrieval QA | 0.710 acc | 0.790 acc (RAG+verifier) | OCC loses |
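The savings figures in the table are consistent with defining savings as the fraction of baseline compute avoided at equal accuracy. That definition is inferred from the numbers rather than stated in the report, and the helper name `compute_savings` is hypothetical:

```python
def compute_savings(occ_cost: float, baseline_cost: float) -> float:
    """Fraction of baseline compute saved: 1 - occ / baseline."""
    return 1.0 - occ_cost / baseline_cost


print(f"{compute_savings(8_350, 17_500):.1%}")  # 52.3% (code allocation)
print(f"{compute_savings(2_890, 5_087):.1%}")   # 43.2% (debate v2)
```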
Honest Bottom Line
In simulated benchmarks, OCC works for code allocation and debate: the mechanisms (tiered escalation, credit-based turn allocation) are sound and backed by published literature. Real-LLM validation is the missing piece, blocked by model choice (the 0.5B model is too weak for HumanEval) and infrastructure (the 3B model download timed out on the provisioned T4). The system design, anti-gaming properties, and literature positioning are solid enough for a workshop paper.