narcolepticchicken committed on
Commit 365eeb5 · verified · 1 Parent(s): 81ca5aa

Upload reports/final_status.md

Files changed (1)
  1. reports/final_status.md +52 -0
reports/final_status.md ADDED
@@ -0,0 +1,52 @@
+ # OCC Stack: Final Status
+
+ ## What Ships
+
+ ### ✅ Done (production quality)
+ 1. **Impact Oracle** (`oracle/oracle.py`): Rule-based scoring for code, QA, and debate. Detects hidden-test gaming, rewards abstention, scores calibration with the Brier score, and penalizes compute cost.
+ 2. **Credit Ledger** (`ledger/ledger.py`): Non-transferable, decaying, capability-scoped credits with full provenance.
+ 3. **Resource Broker** (`broker/broker.py`): Capability-based gating with 6 decision types and risk classes.
+ 4. **GRPO/RL Hook** (`rl/reward.py`, `rl/grpo_hook.py`): TRL-compatible reward function plus an offline comparator.
+ 5. **Literature Review** (`reports/literature_review.md`, plus the RS-OS paper comparison in `reports/report.md`).
+ 6. **Blog Post** (`reports/blog_post.md`).
+
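The Brier-score calibration term listed for the Impact Oracle can be illustrated with a minimal sketch. The function name `brier_score` and the sample inputs are hypothetical and are not taken from `oracle/oracle.py`; the formula itself is the standard mean squared difference between a stated confidence and the 0/1 outcome.

```python
def brier_score(confidences, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better: 0.0 is perfect, and always answering 0.5 scores 0.25.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    squared_errors = [(p - o) ** 2 for p, o in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical agents (illustrative data only):
# one reasonably calibrated, one overconfident and wrong on two of three items.
calibrated = brier_score([0.9, 0.8, 0.2], [1, 1, 0])
overconfident = brier_score([0.99, 0.99, 0.99], [1, 0, 0])
```

Under this scoring the overconfident agent receives a much worse (higher) score than the calibrated one, which is the property a calibration penalty relies on.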
+ ### ✅ Done (simulated benchmarks)
+ 1. **Code Compute Allocation** (`benchmarks/benchmark_code.py`): 52.3% compute savings at iso-accuracy.
+ 2. **Retrieval QA** (`benchmarks/benchmark_retrieval_qa.py`, `_nli.py`): OCC underperforms; an honest negative result.
+ 3. **Debate v2** (`benchmarks/benchmark_debate_v2.py`): 43.2% savings at iso-accuracy, with adversarial containment.
+ 4. **Anti-Gaming** (`eval_runner.py`): 100% hidden-test detection; credit exhaustion for spam.
+ 5. **Ablations**: 10 ablations measuring each mechanism's contribution.
+
+ ### ✅ Done (external validation)
+ - Debate v2 job `69fa273ab745af80fb373135`: **COMPLETED**. Results at `reports/debate_v2_results.json`.
+ - OCC: 0.930 accuracy, 2,890 mean compute → **43.2% savings** vs equal turns (5,087)
+ - Confidence-weighted voting with adversarial agents: dangerous (amplifies overconfident wrong answers)
+
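The warning about confidence-weighted voting can be made concrete with a small sketch. The vote data below is hypothetical (three honest agents, two adversarial ones, matching a 40% adversarial share) and is not drawn from `reports/debate_v2_results.json`; the two aggregation functions are illustrative and are not the repository's implementation.

```python
from collections import Counter, defaultdict


def majority_vote(votes):
    """Return the answer chosen by the most agents, ignoring confidence."""
    counts = Counter(answer for answer, _ in votes)
    return counts.most_common(1)[0][0]


def confidence_weighted_vote(votes):
    """Return the answer with the largest summed self-reported confidence."""
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Hypothetical round: "A" is correct. Honest agents report moderate confidence,
# adversarial agents report near-certainty in the wrong answer "B".
votes = [("A", 0.6), ("A", 0.6), ("A", 0.6), ("B", 0.99), ("B", 0.99)]

majority_answer = majority_vote(votes)
weighted_answer = confidence_weighted_vote(votes)
```

Here the honest majority carries a total weight of about 1.8 while the two adversarial agents carry 1.98, so the weighted vote flips to the wrong answer even though a plain majority does not.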
+ ### ⚠️ Blocked (real LLM)
+ - 4 GPU jobs attempted; all 4 failed due to model capability or infrastructure:
+   1. Qwen-Coder-0.5B: chat template mismatch → all answers wrong
+   2. Qwen-Coder-0.5B v2: chat template fixed; the model generates code but has a 0% pass rate (0.5B too weak for HumanEval)
+   3. Qwen-Coder-0.5B v3: robust extraction, same 0% pass rate (model capability floor)
+   4. StarCoder2-3B: model loading timed out before generation (3B download too slow on provisioned T4)
+
+ ### ❌ Not Done
+ - GRPO training (needs GPU + TRL, not attempted due to sandbox rate-limiting)
+ - Retrieval QA with domain-tuned NLI
+ - Real LLM results for code benchmark
+
+ ## Key Numbers
+
+ | Benchmark | OCC Result | Best Baseline | Savings |
+ |-----------|-----------|---------------|---------|
+ | Code allocation (sim) | 0.780 acc, 8,350 tokens | 0.780 acc, 17,500 tokens | 52.3% |
+ | Debate v2 (40% adversarial) | 0.930 acc, 2,890 tokens | 0.930 acc, 5,087 tokens | 43.2% |
+ | Anti-gaming detection | 100% | n/a | n/a |
+ | Retrieval QA | 0.710 acc | 0.790 acc (RAG+verifier) | OCC loses |
+
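The savings column can be checked with a short calculation. This sketch assumes savings is defined as one minus the ratio of OCC compute to baseline compute at equal accuracy; the helper name `compute_savings` is hypothetical, and the inputs are the token counts from the table.

```python
def compute_savings(occ_cost, baseline_cost):
    """Fraction of compute saved relative to the baseline (assumed definition)."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 1.0 - occ_cost / baseline_cost


# Token counts taken from the Key Numbers table.
code_savings = compute_savings(8_350, 17_500)
debate_savings = compute_savings(2_890, 5_087)

print(f"Code allocation: {code_savings:.1%}")  # prints "Code allocation: 52.3%"
print(f"Debate v2: {debate_savings:.1%}")      # prints "Debate v2: 43.2%"
```

Both values reproduce the reported percentages under that definition.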
+ ## Honest Bottom Line
+
+ OCC works for code allocation (simulated) and debate: the mechanisms (tiered escalation, credit-based turn allocation) are sound and backed by published literature. Real-LLM validation is the missing piece, blocked by model choice (0.5B too weak) and infrastructure (the 3B download timed out before generation). The system design, anti-gaming properties, and literature positioning are solid enough for a workshop paper.
+
+ ## Repository
+
+ https://huggingface.co/narcolepticchicken/occ-stack