OCC Stack: Final Status

What Ships

✅ Done (production quality)

  1. Impact Oracle (oracle/oracle.py): Rule-based scoring for code, QA, and debate. Detects hidden-test gaming, rewards abstention, scores calibration with the Brier score, and applies a compute-cost penalty.
  2. Credit Ledger (ledger/ledger.py): Non-transferable, decaying, capability-scoped credits with full provenance.
  3. Resource Broker (broker/broker.py): Capability-based gating with 6 decision types and risk classes.
  4. GRPO/RL Hook (rl/reward.py, rl/grpo_hook.py): TRL-compatible reward function plus an offline comparator.
  5. Literature Review (reports/literature_review.md + RS-OS paper comparison in reports/report.md).
  6. Blog Post (reports/blog_post.md).
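
To make the Impact Oracle's scoring ingredients concrete, here is a minimal sketch of how a Brier-score calibration penalty, an abstention reward, and a compute-cost penalty might combine into a single score. It is not taken from oracle/oracle.py: the `Outcome` schema, the `impact_score` function, and the weights `abstain_reward` and `cost_per_1k_tokens` are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """One scored agent action (hypothetical schema, not from oracle/oracle.py)."""
    confidence: Optional[float]  # stated probability of being correct; None = abstained
    correct: bool                # ground-truth result (ignored when abstaining)
    tokens_used: int             # compute spent on the action


def brier_penalty(confidence: float, correct: bool) -> float:
    """Squared error between the stated confidence and the 0/1 outcome."""
    target = 1.0 if correct else 0.0
    return (confidence - target) ** 2


def impact_score(
    o: Outcome,
    abstain_reward: float = 0.1,       # assumed small positive reward for abstaining
    cost_per_1k_tokens: float = 0.05,  # assumed compute price
) -> float:
    """Combine correctness, calibration, and compute cost into one score."""
    compute_cost = cost_per_1k_tokens * o.tokens_used / 1000.0
    if o.confidence is None:
        # Abstention: small reward, still pays for the compute it used.
        return abstain_reward - compute_cost
    base = 1.0 if o.correct else -1.0
    return base - brier_penalty(o.confidence, o.correct) - compute_cost


confident_wrong = impact_score(Outcome(confidence=0.9, correct=False, tokens_used=1000))
abstained = impact_score(Outcome(confidence=None, correct=False, tokens_used=200))
confident_right = impact_score(Outcome(confidence=0.9, correct=True, tokens_used=1000))
```

Under these assumed weights, a confidently wrong answer scores below an abstention, which in turn scores below a confidently correct answer. That ordering is what "rewards abstention" requires of any such scoring rule.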

✅ Done (simulated benchmarks)

  1. Code Compute Allocation (benchmarks/benchmark_code.py): 52.3% compute savings at iso-accuracy.
  2. Retrieval QA (benchmarks/benchmark_retrieval_qa.py, _nli.py): OCC underperforms the best baseline (0.710 vs. 0.790 accuracy); reported as an honest negative result.
  3. Debate v2 (benchmarks/benchmark_debate_v2.py): 43.2% savings at iso-accuracy, with adversarial containment.
  4. Anti-Gaming (eval_runner.py): 100% hidden-test detection; spam agents exhaust their credits.
  5. Ablations: 10 ablations measuring each mechanism's contribution.
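
The "credit exhaustion for spam" behavior follows from the ledger properties listed earlier (non-transferable, decaying, capability-scoped). The sketch below illustrates the idea under stated assumptions: exponential decay with a fixed half-life, a per-action cost, and no transfer operation at all. The class `DecayingLedger` and its methods are hypothetical and do not reproduce ledger/ledger.py.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DecayingLedger:
    """Toy ledger: balances are keyed by (agent, capability) and decay over time."""
    half_life: float = 10.0  # assumed half-life, in arbitrary time steps
    # (agent, capability) -> (balance at last update, time of last update)
    _balances: Dict[Tuple[str, str], Tuple[float, float]] = field(default_factory=dict)

    def balance(self, agent: str, capability: str, now: float) -> float:
        """Current balance after applying exponential decay since the last update."""
        amount, last = self._balances.get((agent, capability), (0.0, now))
        return amount * 0.5 ** ((now - last) / self.half_life)

    def grant(self, agent: str, capability: str, amount: float, now: float) -> None:
        """Add credits to one agent for one capability."""
        decayed = self.balance(agent, capability, now)
        self._balances[(agent, capability)] = (decayed + amount, now)

    def spend(self, agent: str, capability: str, cost: float, now: float) -> bool:
        """Deduct credits if affordable; return False once the agent is exhausted."""
        current = self.balance(agent, capability, now)
        if current < cost:
            return False
        self._balances[(agent, capability)] = (current - cost, now)
        return True

    # Deliberately no transfer() method: credits cannot move between agents.


ledger = DecayingLedger(half_life=10.0)
ledger.grant("spammer", "debate_turn", amount=4.0, now=0.0)

# A spammer who acts five times in a row can only afford four actions.
spam_results = [ledger.spend("spammer", "debate_turn", cost=1.0, now=0.0) for _ in range(5)]
```

In this toy version, an agent that spends without earning new credits is cut off after a bounded number of actions, and unused credits lose half their value every `half_life` steps, so hoarding does not help either.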

✅ Done (external validation)

  • Debate v2 job 69fa273ab745af80fb373135: COMPLETED. Results at reports/debate_v2_results.json.
    • OCC: 0.930 accuracy, 2,890 mean compute → 43.2% savings vs. the equal-turns baseline (5,087)
    • Confidence-weighted voting is dangerous when adversarial agents are present: it amplifies overconfident wrong answers
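
The failure mode of confidence-weighted voting can be reproduced with a toy example. The votes below are invented for illustration, not taken from reports/debate_v2_results.json: three honest agents answer "A" with moderate confidence, and two adversarial agents (40% of the panel) answer "B" with near-certain confidence.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vote = Tuple[str, float]  # (answer, stated confidence in [0, 1])


def majority_vote(votes: List[Vote]) -> str:
    """Each agent counts once, regardless of stated confidence."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, _confidence in votes:
        totals[answer] += 1.0
    return max(totals, key=totals.get)


def confidence_weighted_vote(votes: List[Vote]) -> str:
    """Each agent's vote is weighted by its self-reported confidence."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Illustrative panel: "A" is the correct answer.
votes: List[Vote] = [
    ("A", 0.6), ("A", 0.6), ("A", 0.6),  # honest, moderately confident
    ("B", 0.99), ("B", 0.99),            # adversarial, overconfident
]

majority_answer = majority_vote(votes)             # honest majority wins
weighted_answer = confidence_weighted_vote(votes)  # overconfident minority wins
```

Here the honest agents contribute a total weight of about 1.8 while the two adversaries contribute 1.98, so the weighted vote flips to the wrong answer even though the majority is correct. Self-reported confidence is free to inflate, which is why weighting by it rewards exactly the agents that should be contained.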

⚠️ Blocked (real LLM)

  • 4 GPU jobs attempted, 4 failed due to model capability or infrastructure:
    1. Qwen-Coder-0.5B: chat template mismatch → all answers wrong
    2. Qwen-Coder-0.5B v2: chat template fixed, model generates code but 0% pass rate (0.5B too weak for HumanEval)
    3. Qwen-Coder-0.5B v3: robust extraction, same 0% pass rate (model capability floor)
    4. StarCoder2-3B: model loading timed out before generation (3B download too slow on provisioned T4)

❌ Not Done

  • GRPO training (needs GPU + TRL, not attempted due to sandbox rate-limiting)
  • Retrieval QA with domain-tuned NLI
  • Real LLM results for code benchmark

Key Numbers

| Benchmark | OCC Result | Best Baseline | Savings |
|---|---|---|---|
| Code allocation (sim) | 0.780 acc, 8,350 tokens | 0.780 acc, 17,500 tokens | 52.3% |
| Debate v2 (40% adversarial) | 0.930 acc, 2,890 tokens | 0.930 acc, 5,087 tokens | 43.2% |
| Anti-gaming detection | 100% | n/a | n/a |
| Retrieval QA | 0.710 acc | 0.790 acc (RAG+verifier) | OCC loses |
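
The savings figures follow directly from the token counts at matched accuracy. As a quick check, the snippet below recomputes them as one minus the ratio of OCC compute to baseline compute; the function name `compute_savings` is just an illustrative helper.

```python
def compute_savings(occ_cost: float, baseline_cost: float) -> float:
    """Fraction of baseline compute saved by OCC at equal accuracy."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 1.0 - occ_cost / baseline_cost


code_savings = compute_savings(8_350, 17_500)   # code allocation (sim)
debate_savings = compute_savings(2_890, 5_087)  # debate v2 (40% adversarial)

print(f"Code allocation: {code_savings:.1%}")  # 52.3%
print(f"Debate v2:       {debate_savings:.1%}")  # 43.2%
```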

Honest Bottom Line

OCC works for code allocation and debate in the benchmarks run so far: the mechanisms (tiered escalation, credit-based turn allocation) are sound and backed by published literature. Real-LLM validation is the missing piece, blocked by model choice (the 0.5B model is too weak for HumanEval) and infrastructure (the 3B model download timed out on the provisioned T4). The system design, anti-gaming properties, and literature positioning are solid enough for a workshop paper.

Repository

https://huggingface.co/narcolepticchicken/occ-stack