	# OCC Stack: Final Status

	## What Ships

	### ✅ Done (production quality)
	1. Impact Oracle (`oracle/oracle.py`) — Rule-based scoring for code/QA/debate. Detects hidden-test gaming, rewards abstention, scores calibration with the Brier score, and applies a compute-cost penalty.
	2. Credit Ledger (`ledger/ledger.py`) — Non-transferable, decaying, capability-scoped credits with full provenance.
	3. Resource Broker (`broker/broker.py`) — Capability-based gating with 6 decision types and risk classes.
	4. GRPO/RL Hook (`rl/reward.py`, `rl/grpo_hook.py`) — TRL-compatible reward function + offline comparator.
	5. Literature Review (`reports/literature_review.md` + RS-OS paper comparison in `reports/report.md`).
	6. Blog Post (`reports/blog_post.md`).
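The Brier-score calibration term mentioned for the Impact Oracle can be illustrated with a minimal sketch. This is not the code in `oracle/oracle.py`; the function name `brier_score` and the example confidences are assumptions chosen for illustration. It shows only the standard definition: the mean squared gap between a stated confidence and the 0/1 outcome, so a confidently wrong answer costs far more than a hedged one.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[bool]) -> float:
    """Mean squared gap between confidence and the 0/1 outcome (lower is better)."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    total = sum((p - float(o)) ** 2 for p, o in zip(confidences, outcomes))
    return total / len(confidences)


# Hypothetical data: two agents answer the same three questions,
# and both get the last one wrong.
outcomes = [True, True, False]
calibrated = brier_score([0.8, 0.7, 0.4], outcomes)
overconfident = brier_score([0.99, 0.99, 0.99], outcomes)

print(f"calibrated={calibrated:.3f} overconfident={overconfident:.3f}")
```

Both agents have the same accuracy, but the overconfident one scores worse because its single wrong answer was stated with near certainty.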

	### ✅ Done (simulated benchmarks)
	1. Code Compute Allocation (`benchmarks/benchmark_code.py`) — 52.3% compute savings at iso-accuracy.
	2. Retrieval QA (`benchmarks/benchmark_retrieval_qa.py`, `_nli.py`) — OCC underperforms the RAG+verifier baseline (0.710 vs 0.790 accuracy); an honest negative result.
	3. Debate v2 (`benchmarks/benchmark_debate_v2.py`) — 43.2% savings at iso-accuracy, adversarial containment.
	4. Anti-Gaming (`eval_runner.py`) — 100% hidden-test detection, credit exhaustion for spam.
	5. Ablations — 10 ablations measuring each mechanism's contribution.
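One way to picture how decaying, capability-scoped credits could lead to credit exhaustion for spam is the sketch below. It is a hypothetical illustration, not the implementation in `ledger/ledger.py`: the class name `CreditLedger`, the multiplicative decay of 0.9 per step, the unit cost per submission, and the initial grant of 3 credits are all assumptions. Non-transferability is modeled simply by offering no transfer operation.

```python
from dataclasses import dataclass, field


@dataclass
class CreditLedger:
    """Hypothetical ledger keyed by (agent, capability); balances decay each step."""

    decay: float = 0.9  # assumed multiplicative decay per step
    balances: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # provenance of every event

    def grant(self, agent: str, capability: str, amount: float, reason: str) -> None:
        key = (agent, capability)
        self.balances[key] = self.balances.get(key, 0.0) + amount
        self.history.append(("grant", agent, capability, amount, reason))

    def spend(self, agent: str, capability: str, cost: float, reason: str) -> bool:
        key = (agent, capability)
        if self.balances.get(key, 0.0) < cost:
            self.history.append(("denied", agent, capability, cost, reason))
            return False
        self.balances[key] -= cost
        self.history.append(("spend", agent, capability, cost, reason))
        return True

    def step(self) -> None:
        """Apply one round of decay to every balance."""
        for key in self.balances:
            self.balances[key] *= self.decay


ledger = CreditLedger()
ledger.grant("spammer", "code", 3.0, reason="initial allowance")

# The spammer never earns new credits, so repeated submissions run dry.
results = []
for attempt in range(5):
    results.append(ledger.spend("spammer", "code", 1.0, reason=f"submission {attempt}"))
    ledger.step()

print(results)  # [True, True, False, False, False]
```

Under these assumed numbers, decay removes the third submission that the initial grant would otherwise have covered, and every event remains in `history` for auditing.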

	### ✅ Done (external validation)
	- Debate v2 job `69fa273ab745af80fb373135`: COMPLETED. Results at `reports/debate_v2_results.json`.
	- OCC: 0.930 accuracy at 2,890 mean compute (tokens), a 43.2% saving versus the equal-turns baseline (5,087).
	- Confidence-weighted voting is dangerous when adversarial agents are present: it amplifies overconfident wrong answers.
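The risk with confidence-weighted voting can be seen in a toy example. The sketch below is not the benchmark's voting code; the function names and the vote values are assumptions. It uses five agents, two of them adversarial (40%), and shows that self-reported confidence lets a minority overturn a correct majority.

```python
from collections import defaultdict
from typing import List, Tuple

Vote = Tuple[str, float]  # (answer, self-reported confidence)


def majority_vote(votes: List[Vote]) -> str:
    """Each agent counts once, regardless of stated confidence."""
    counts = defaultdict(float)
    for answer, _confidence in votes:
        counts[answer] += 1.0
    return max(counts, key=counts.get)


def confidence_weighted_vote(votes: List[Vote]) -> str:
    """Each agent's vote is weighted by its own stated confidence."""
    weights = defaultdict(float)
    for answer, confidence in votes:
        weights[answer] += confidence
    return max(weights, key=weights.get)


# Hypothetical round: three honest agents back the correct answer "A" with
# moderate confidence; two adversarial agents back "B" with near certainty.
votes = [("A", 0.60), ("A", 0.55), ("A", 0.60), ("B", 0.99), ("B", 0.99)]

print(majority_vote(votes))             # A
print(confidence_weighted_vote(votes))  # B
```

Here the honest weight sums to 1.75 and the adversarial weight to 1.98, so the weighted vote picks the wrong answer even though the adversaries are outnumbered.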

	### ⚠️ Blocked (real LLM)
	- 4 GPU jobs attempted; all 4 failed, due to prompt formatting, model capability, or infrastructure:
	1. Qwen-Coder-0.5B: chat template mismatch → all answers wrong
	2. Qwen-Coder-0.5B v2: chat template fixed, model generates code but 0% pass rate (0.5B too weak for HumanEval)
	3. Qwen-Coder-0.5B v3: robust extraction, same 0% pass rate (model capability floor)
	4. StarCoder2-3B: model loading timed out before generation (3B download too slow on provisioned T4)
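The "robust extraction" step in the third attempt refers to pulling code out of a free-form model response. The repository's actual extraction logic is not shown here, so the following is only a sketch of one common approach: the helper name `extract_code` and the fallback rule are assumptions. It prefers the longest fenced block and otherwise returns the raw text.

```python
import re

FENCE = "`" * 3  # built programmatically so this example can sit inside a fence

# Matches a fenced block with an optional python/py language tag.
_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code(response: str) -> str:
    """Return the longest fenced code block, or the stripped response if none."""
    blocks = _BLOCK.findall(response)
    if blocks:
        return max(blocks, key=len).strip("\n")
    return response.strip()


# Hypothetical model response mixing prose and code.
response = (
    "Sure, here is the function:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE + "\nHope that helps."
)

print(extract_code(response))
```

Because the pass rate stayed at 0% after this kind of cleanup, the report attributes the failure to model capability rather than to output parsing.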

	### ❌ Not Done
	- GRPO training (needs GPU + TRL, not attempted due to sandbox rate-limiting)
	- Retrieval QA with domain-tuned NLI
	- Real LLM results for code benchmark

	## Key Numbers

	| Benchmark | OCC Result | Best Baseline | Savings |
	|-----------|-----------|---------------|---------|
	| Code allocation (sim) | 0.780 acc, 8,350 tokens | 0.780 acc, 17,500 tokens | 52.3% |
	| Debate v2 (40% adversarial) | 0.930 acc, 2,890 tokens | 0.930 acc, 5,087 tokens | 43.2% |
	| Anti-gaming detection | 100% | — | — |
	| Retrieval QA | 0.710 acc | 0.790 (RAG+verifier) | OCC loses |
	## Honest Bottom Line

	OCC works for code allocation and debate in the benchmarks above: the mechanisms (tiered escalation, credit-based turn allocation) are sound and backed by published literature. Real-LLM validation is the missing piece, blocked by model choice (the 0.5B model is too weak for HumanEval) and infrastructure (the 3B model download timed out on the provisioned T4). The system design, anti-gaming properties, and literature positioning are solid enough for a workshop paper.

	## Repository

	https://huggingface.co/narcolepticchicken/occ-stack