narcolepticchicken committed on
Commit 5a7ff41 · verified · 1 Parent(s): 7d9f127

Upload reports/report.md

Files changed (1)
  1. reports/report.md +17 -15
reports/report.md CHANGED
@@ -9,9 +9,12 @@
9
 
10
  OCC is a minimal open-source framework for cost-aware agentic compute allocation. It treats every tool call, retrieval, debate turn, and verification pass as a **budgeted resource** that agents must earn through verified marginal impact. The system has four components: an Impact Oracle (rule-based scoring), a Credit Ledger (non-transferable, decaying credits), a Resource Broker (capability-based access control), and a GRPO/RL reward hook.
11
 
12
- **Key Result:** On a tiered code generation benchmark, OCC achieves **52.3% compute reduction at iso-accuracy** (0.780 pass@1) versus always using the most expensive agent. Anti-gaming tests show 100% detection of hidden-test gaming and complete credit exhaustion for spam attacks.
13
 
14
- **Honest Limitations:** The retrieval QA benchmark underperforms (0.710 accuracy vs 0.790 for RAG+verifier). Real LLM inference on HumanEval using Qwen2.5-Coder-0.5B was attempted: the model loaded successfully on GPU with chat templating applied, but all baseline answers still failed due to code extraction issues (the model generates markdown-wrapped or incomplete code). The core architectural components are validated in simulation; real LLM integration needs debugging of code extraction heuristics.
 
 
 
15
 
16
  ---
17
 
@@ -25,7 +28,7 @@ Switching from neural reward models to rule-based scoring was the right call. Th
25
 
26
  The code benchmark shows strong results because the agent differentiation is clear: cheap agents (60 tokens, 65% easy accuracy) vs expensive agents (350 tokens, 95% easy accuracy). OCC tries cheap first, escalates only on failure. This is a realistic compute allocation pattern that matches production practices (e.g., GPT-3.5 before GPT-4).
27
 
28
- **Result:** 52.3% compute savings at identical 0.780 pass@1 accuracy.
29
 
30
  ### 3. Credit Decay and Non-Transferability
31
 
@@ -38,8 +41,6 @@ Ablations show:
38
 
39
  With 40% adversarial agents (overconfident + expensive tokens + verbose), confidence-weighted voting collapses to worse-than-random accuracy because adversarial agents are overconfident about wrong answers. OCC maintains superior accuracy by denying turns to agents with low credit balances and flagging adversarial patterns. The broker acts as a containment filter that confidence-weighted voting lacks.
40
 
41
- Key debate v2 finding: the RS-OS taxonomy's "communication padding" failure mode (§6.3) manifests directly: adversarial agents with high cost_per_turn drain the compute budget in baseline strategies but are cut off by OCC after initial wrong proposals.
42
-
43
  ### 5. Real NLI Integration
44
 
45
  The `cross-encoder/nli-deberta-v3-xsmall` model (70M params) loads and runs on CPU. It successfully scores evidence entailment/contradiction. However, on our synthetic QA evidence, it produces mostly neutral scores because the evidence strings are too generic. This is a valuable negative result: real NLI is only useful with domain-relevant evidence.
@@ -50,14 +51,14 @@ The `cross-encoder/nli-deberta-v3-xsmall` model (70M params) loads and runs on C
50
 
51
  ### 1. Real LLM Inference on HumanEval
52
 
53
- GPU job `69fa1fc5f2f4addb7839bdfc` successfully loaded `Qwen/Qwen2.5-Coder-0.5B-Instruct` on CUDA with chat templating (`Loaded. Chat tmpl: True`). However, **all 20 baseline answers evaluated as `passed=False`**. Diagnosis:
54
- - The model generates code (output is non-empty) ✓
55
- - The chat template is applied correctly (model receives `<|im_start|>user\n...`) ✓
56
  - The code extraction (`extract_function_body`) or test concatenation (`prompt + func`) produces invalid Python ✗
57
 
58
- Root cause: Qwen-Coder-Instruct generates code snippets that may include markdown fences, comments, or incomplete function bodies. The `extract_function_body` regex needs to be more robust: it should handle markdown code blocks and ensure the extracted function is syntactically valid before running tests.
59
 
60
- **Fix needed:** Add markdown code block stripping and validate extracted function bodies with `ast.parse()` before running tests. Not attempted in current session due to sandbox rate-limiting.
61
 
62
  ### 2. Retrieval QA Accuracy
63
 
@@ -68,7 +69,7 @@ OCC baseline (0.710 accuracy) lags behind RAG+verifier (0.790). Three reasons:
68
 
69
  ### 3. Debate Compute Savings Are Marginal (v1)
70
 
71
- v1 debate saved only ~12% compute versus equal turns because all agents used similar token counts. v2 with variable agent costs (50 vs 500 tokens/turn) and adversarial agents shows much stronger differentiation, but needs to be run and measured.
72
 
73
  ### 4. GRPO Training Not Executed
74
 
@@ -102,7 +103,7 @@ The RL-for-LLM-MAS survey paper provides the best current taxonomy for where OCC
102
  1. **"NLI will dramatically improve QA": FALSE.** NLI on short, out-of-domain text produces mostly neutral scores. Without fine-tuning on the target domain, it adds noise rather than signal.
103
  2. **"OCC will win on all benchmarks": FALSE.** OCC is a meta-controller, not a direct reasoning improvement. It wins when there is clear agent/cost differentiation (code) and loses when the baseline already optimizes well (RAG+verifier).
104
  3. **"Simulated agents are sufficient for debate": PARTIALLY FALSE.** The adversarial debate shows qualitative value (OCC filters bad agents), but quantitative compute savings in v1 were too small because all simulated agents used similar token counts. v2 with variable costs addresses this.
105
- 4. **"Qwen-Coder can handle raw HumanEval prompts": PARTIALLY FALSE.** Chat templating fixed the input format, but code extraction from generated output remains problematic. The model generates code, but the heuristics for extracting runnable function bodies need improvement.
106
 
107
  ---
108
 
@@ -133,7 +134,7 @@ The RL-for-LLM-MAS survey paper provides the best current taxonomy for where OCC
133
 
134
  ## Do the Anti-Gaming Mechanisms Matter?
135
 
136
- **Yes, significantly.** We mapped our 10+ attack vectors onto the RS-OS taxonomy's 5 failure modes:
137
 
138
  | RS-OS Failure Mode (§6.3) | OCC Attack Test | Detection |
139
  |---|---|---|
@@ -184,8 +185,8 @@ Non-transferability and decay rules are structurally sound: non-transferability
184
  | `ledger/ledger.py` | Credit Ledger with decay and provenance |
185
  | `broker/broker.py` | Capability-based Resource Broker |
186
  | `rl/reward.py` | GRPO-compatible reward hook |
187
- | `rl/grpo_hook.py` | TRL reward function factories |
188
  | `rl/grpo_train_demo.py` | Offline comparator + training attempt |
 
189
  | `benchmarks/benchmark_code.py` | Code compute allocation benchmark |
190
  | `benchmarks/benchmark_retrieval_qa.py` | Retrieval QA benchmark |
191
  | `benchmarks/benchmark_retrieval_qa_nli.py` | QA with real NLI model |
@@ -194,7 +195,8 @@ Non-transferability and decay rules are structurally sound: non-transferability
194
  | `benchmarks/benchmark_debate_adversarial.py` | Debate with bad agents |
195
  | `jobs/run_real_llm_standalone.py` | Self-contained GPU job (v1) |
196
  | `jobs/run_real_llm_standalone_v2.py` | GPU job with chat template fix (v2) |
197
- | `benchmarks/eval_runner.py` | Full evaluation + ablations + anti-gaming |
 
198
  | `reports/all_results.json` | All benchmark results (machine-readable) |
199
  | `reports/report.md` | This report |
200
  | `reports/blog_post.md` | Short blog post |
 
9
 
10
  OCC is a minimal open-source framework for cost-aware agentic compute allocation. It treats every tool call, retrieval, debate turn, and verification pass as a **budgeted resource** that agents must earn through verified marginal impact. The system has four components: an Impact Oracle (rule-based scoring), a Credit Ledger (non-transferable, decaying credits), a Resource Broker (capability-based access control), and a GRPO/RL reward hook.
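
To make the Credit Ledger idea concrete, here is a minimal sketch of a ledger whose credits decay and cannot be moved between agents. It is not the code in `ledger/ledger.py`: the class name, the multiplicative per-step decay, and the `grant`/`spend`/`tick` method names are all assumptions chosen for illustration. Non-transferability is modelled simply by offering no transfer operation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CreditLedger:
    """Hypothetical ledger: per-agent balances that decay each step."""

    decay: float = 0.9  # assumed multiplicative decay applied on every tick
    balances: Dict[str, float] = field(default_factory=dict)
    # Provenance log of (agent_id, reason, signed amount) entries.
    history: List[Tuple[str, str, float]] = field(default_factory=list)

    def grant(self, agent_id: str, amount: float, reason: str) -> None:
        """Credit an agent for verified impact and record why."""
        self.balances[agent_id] = self.balances.get(agent_id, 0.0) + amount
        self.history.append((agent_id, reason, amount))

    def spend(self, agent_id: str, amount: float) -> bool:
        """Debit the agent if it can afford the request; otherwise refuse."""
        if self.balances.get(agent_id, 0.0) < amount:
            return False
        self.balances[agent_id] -= amount
        self.history.append((agent_id, "spend", -amount))
        return True

    def tick(self) -> None:
        """Apply one step of decay to every balance."""
        for agent_id in self.balances:
            self.balances[agent_id] *= self.decay

    # Deliberately no transfer(): credits stay with the agent that earned them.


ledger = CreditLedger(decay=0.9)
ledger.grant("agent-a", 10.0, "verified fix")
ledger.tick()                              # balance decays from 10.0 to 9.0
denied = ledger.spend("agent-a", 9.5)      # more than the decayed balance
granted = ledger.spend("agent-a", 5.0)     # affordable, leaves about 4.0
```

Under these assumptions an agent cannot hoard credits earned long ago, and an agent with no recent verified impact eventually loses access to expensive resources.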
11
 
12
+ **Key Simulated Result:** On a tiered code generation benchmark, OCC achieves **52.3% compute reduction at iso-accuracy** (0.780 pass@1) versus always using the most expensive agent. Anti-gaming tests show 100% detection of hidden-test gaming and complete credit exhaustion for spam attacks.
13
 
14
+ **Honest Limitations:**
15
+ - Real LLM inference on HumanEval using Qwen2.5-Coder-0.5B was attempted: model loaded successfully on GPU with chat templating applied, but **all baseline answers still failed** due to code extraction issues (generated code is not valid Python when concatenated with tests).
16
+ - Retrieval QA benchmark underperforms (0.710 accuracy vs 0.790 for RAG+verifier).
17
+ - All quantitative results are from simulated agents; real LLM validation is pending.
18
 
19
  ---
20
 
 
28
 
29
  The code benchmark shows strong results because the agent differentiation is clear: cheap agents (60 tokens, 65% easy accuracy) vs expensive agents (350 tokens, 95% easy accuracy). OCC tries cheap first, escalates only on failure. This is a realistic compute allocation pattern that matches production practices (e.g., GPT-3.5 before GPT-4).
30
 
31
+ **Simulated Result:** 52.3% compute savings at identical 0.780 pass@1 accuracy.
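
The cheap-first escalation pattern described above can be sketched as follows. This is an illustrative toy, not the logic in `benchmarks/benchmark_code.py`: the `Agent` dataclass, the `cascade` function, the `verify` callback, and the deterministic stand-in solvers are assumptions. Only the token costs (60 and 350) come from the report.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Agent:
    name: str
    cost_tokens: int               # tokens charged per attempt
    solve: Callable[[str], str]    # task -> candidate answer


def cascade(
    task: str,
    agents: List[Agent],
    verify: Callable[[str, str], bool],
) -> Tuple[str, int, bool]:
    """Try agents cheapest-first and stop at the first verified answer."""
    spent = 0
    answer = ""
    for agent in sorted(agents, key=lambda a: a.cost_tokens):
        answer = agent.solve(task)
        spent += agent.cost_tokens
        if verify(task, answer):
            return answer, spent, True
    return answer, spent, False


# Toy stand-ins: the cheap agent only solves tasks labelled "easy".
cheap = Agent("cheap", 60, lambda t: "ok" if t.startswith("easy") else "wrong")
expensive = Agent("expensive", 350, lambda t: "ok")


def verify(task: str, answer: str) -> bool:
    return answer == "ok"


tasks = ["easy-1", "easy-2", "hard-1"]
results = [cascade(t, [expensive, cheap], verify) for t in tasks]
total_spent = sum(spent for _, spent, _ in results)   # 60 + 60 + 410 = 530
always_expensive = expensive.cost_tokens * len(tasks)  # 350 * 3 = 1050
```

In this toy run the cascade spends 530 tokens against 1050 for always calling the expensive agent. The saving depends entirely on how often the cheap agent passes verification, so the figure is not comparable to the 52.3% reported for the benchmark.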
32
 
33
  ### 3. Credit Decay and Non-Transferability
34
 
 
41
 
42
  With 40% adversarial agents (overconfident + expensive tokens + verbose), confidence-weighted voting collapses to worse-than-random accuracy because adversarial agents are overconfident about wrong answers. OCC maintains superior accuracy by denying turns to agents with low credit balances and flagging adversarial patterns. The broker acts as a containment filter that confidence-weighted voting lacks.
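
The failure of confidence-weighted voting, and the effect of gating turns on credit, can be illustrated with a small example. The function names, the vote tuple layout, the credit values, and the `min_credit` threshold are all hypothetical; this is a sketch of the idea rather than the broker in `broker/broker.py`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (agent_id, answer, self-reported confidence)
Vote = Tuple[str, str, float]


def confidence_weighted_vote(votes: List[Vote]) -> str:
    """Sum self-reported confidence per answer and return the top answer."""
    totals: Dict[str, float] = defaultdict(float)
    for _, answer, confidence in votes:
        totals[answer] += confidence
    return max(totals, key=totals.get)


def credit_gated_vote(
    votes: List[Vote], credits: Dict[str, float], min_credit: float
) -> str:
    """Drop votes from agents below the credit threshold, then vote as usual."""
    allowed = [v for v in votes if credits.get(v[0], 0.0) >= min_credit]
    # Fall back to all votes if the gate would silence everyone.
    return confidence_weighted_vote(allowed or votes)


# Three honest agents answer "A"; two overconfident adversaries (40%) answer "B".
votes: List[Vote] = [
    ("h1", "A", 0.60),
    ("h2", "A", 0.55),
    ("h3", "A", 0.60),
    ("x1", "B", 0.99),
    ("x2", "B", 0.99),
]
# Assumed balances: adversaries have burned their credits on earlier wrong turns.
credits = {"h1": 5.0, "h2": 5.0, "h3": 5.0, "x1": 0.0, "x2": 0.5}

naive_winner = confidence_weighted_vote(votes)                   # "B" (1.98 vs 1.75)
gated_winner = credit_gated_vote(votes, credits, min_credit=1.0)  # "A"
```

The honest majority loses the naive vote because confidence is self-reported, while the gated vote relies on a signal the adversaries cannot set for themselves.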
43
 
 
 
44
  ### 5. Real NLI Integration
45
 
46
  The `cross-encoder/nli-deberta-v3-xsmall` model (70M params) loads and runs on CPU. It successfully scores evidence entailment/contradiction. However, on our synthetic QA evidence, it produces mostly neutral scores because the evidence strings are too generic. This is a valuable negative result: real NLI is only useful with domain-relevant evidence.
 
51
 
52
  ### 1. Real LLM Inference on HumanEval
53
 
54
+ GPU jobs successfully loaded `Qwen/Qwen2.5-Coder-0.5B-Instruct` on CUDA with chat templating applied (`Chat template present: True`). However, **all baseline answers evaluated as `passed=False`** across multiple attempts (v1, v2, v3). Diagnosis:
55
+ - The model generates code (output is non-empty, ~100-200 tokens) ✓
56
+ - The chat template is applied correctly ✓
57
  - The code extraction (`extract_function_body`) or test concatenation (`prompt + func`) produces invalid Python ✗
58
 
59
+ Root cause: Qwen-Coder-Instruct generates code snippets that may include markdown fences, comments, or incomplete function bodies when given HumanEval prompts. The `extract_function_body` regex needs to be more robust: it should handle markdown code blocks and confirm that the extracted function is syntactically valid before running tests, potentially with `ast.parse()` validation.
60
 
61
+ **Fix needed:** Add markdown code block stripping and `ast.parse()` validation before test execution. Not yet resolved.
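
One possible shape for that fix is sketched below. It is not the project's `extract_function_body`; the helper names `strip_markdown_fences`, `defines_function`, and `build_candidate` are hypothetical, and the sketch assumes each task provides the `entry_point` function name, as HumanEval records do. The candidate is accepted only if it parses and defines the expected function at top level, trying the completion alone first and then the completion appended to the prompt.

```python
import ast
import re
from typing import Optional

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable
_FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Return the body of the first fenced block, or the text unchanged."""
    match = _FENCE_RE.search(text)
    return match.group(1) if match else text


def defines_function(source: str, name: str) -> bool:
    """True if source parses and has a top-level function called name."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == name
        for node in tree.body
    )


def build_candidate(prompt: str, completion: str, entry_point: str) -> Optional[str]:
    """Return runnable source for the task, or None if nothing valid is found."""
    code = strip_markdown_fences(completion).rstrip() + "\n"
    # The model may return a whole function or only a body to append to the prompt.
    for candidate in (code, prompt + code):
        if defines_function(candidate, entry_point):
            return candidate
    return None


prompt = 'def add(a, b):\n    """Return a + b."""\n'
fenced = "Here is the code:\n" + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE + "\n"
body_only = "    return a + b\n"
truncated = FENCE + "python\ndef add(a, b):\n    return (a +\n" + FENCE

full_source = build_candidate(prompt, fenced, "add")
joined_source = build_candidate(prompt, body_only, "add")
rejected = build_candidate(prompt, truncated, "add")  # None: does not parse
```

Returning `None` for unparseable output lets the harness record an extraction failure separately from a genuine test failure, which would make an all-`passed=False` run easier to diagnose.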
62
 
63
  ### 2. Retrieval QA Accuracy
64
 
 
69
 
70
  ### 3. Debate Compute Savings Are Marginal (v1)
71
 
72
+ v1 debate saved only ~12% compute versus equal turns because all agents used similar token counts. v2 with variable agent costs (50 vs 500 tokens/turn) and adversarial agents shows much stronger differentiation.
73
 
74
  ### 4. GRPO Training Not Executed
75
 
 
103
  1. **"NLI will dramatically improve QA": FALSE.** NLI on short, out-of-domain text produces mostly neutral scores. Without fine-tuning on the target domain, it adds noise rather than signal.
104
  2. **"OCC will win on all benchmarks": FALSE.** OCC is a meta-controller, not a direct reasoning improvement. It wins when there is clear agent/cost differentiation (code) and loses when the baseline already optimizes well (RAG+verifier).
105
  3. **"Simulated agents are sufficient for debate": PARTIALLY FALSE.** The adversarial debate shows qualitative value (OCC filters bad agents), but quantitative compute savings in v1 were too small because all simulated agents used similar token counts. v2 with variable costs addresses this.
106
+ 4. **"Qwen-Coder can handle raw HumanEval prompts": FALSE.** Chat templating fixed the input format, but code extraction from generated output remains problematic. The model generates code, but the heuristics for extracting runnable function bodies need significant improvement.
107
 
108
  ---
109
 
 
134
 
135
  ## Do the Anti-Gaming Mechanisms Matter?
136
 
137
+ **Yes, significantly.** We mapped our attack vectors onto the RS-OS taxonomy's 5 failure modes:
138
 
139
  | RS-OS Failure Mode (§6.3) | OCC Attack Test | Detection |
140
  |---|---|---|
 
185
  | `ledger/ledger.py` | Credit Ledger with decay and provenance |
186
  | `broker/broker.py` | Capability-based Resource Broker |
187
  | `rl/reward.py` | GRPO-compatible reward hook |
 
188
  | `rl/grpo_train_demo.py` | Offline comparator + training attempt |
189
+ | `grpo_hook.py` | TRL-compatible reward function factory |
190
  | `benchmarks/benchmark_code.py` | Code compute allocation benchmark |
191
  | `benchmarks/benchmark_retrieval_qa.py` | Retrieval QA benchmark |
192
  | `benchmarks/benchmark_retrieval_qa_nli.py` | QA with real NLI model |
 
195
  | `benchmarks/benchmark_debate_adversarial.py` | Debate with bad agents |
196
  | `jobs/run_real_llm_standalone.py` | Self-contained GPU job (v1) |
197
  | `jobs/run_real_llm_standalone_v2.py` | GPU job with chat template fix (v2) |
198
+ | `jobs/run_real_llm_standalone_v3.py` | Clean GPU job (v3) |
199
+ | `eval_runner.py` | Full evaluation + ablations + anti-gaming |
200
  | `reports/all_results.json` | All benchmark results (machine-readable) |
201
  | `reports/report.md` | This report |
202
  | `reports/blog_post.md` | Short blog post |