narcolepticchicken
/

occ-stack

ml-intern

narcolepticchicken committed 26 days ago

Commit

5a7ff41

verified ·

1 Parent(s): 7d9f127

Upload reports/report.md

Files changed (1)

reports/report.md +17 -15


@@ -9,9 +9,12 @@
 OCC is a minimal open-source framework for cost-aware agentic compute allocation. It treats every tool call, retrieval, debate turn, and verification pass as a **budgeted resource** that agents must earn through verified marginal impact. The system has four components: an Impact Oracle (rule-based scoring), a Credit Ledger (non-transferable, decaying credits), a Resource Broker (capability-based access control), and a GRPO/RL reward hook.
-**Key Result:** On a tiered code generation benchmark, OCC achieves **52.3% compute reduction at iso-accuracy** (0.780 pass@1) versus always using the most expensive agent. Anti-gaming tests show 100% detection of hidden-test gaming and complete credit exhaustion for spam attacks.
-**Honest Limitations:** The retrieval QA benchmark underperforms (0.710 accuracy vs 0.790 for RAG+verifier). Real LLM inference on HumanEval using Qwen2.5-Coder-0.5B was attempted: the model loaded successfully on GPU with chat templating applied, but all baseline answers still failed due to code extraction issues (the model generates markdown-wrapped or incomplete code). The core architectural components are validated in simulation; real LLM integration needs debugging of code extraction heuristics.
 ---
@@ -25,7 +28,7 @@ Switching from neural reward models to rule-based scoring was the right call. Th
 The code benchmark shows strong results because the agent differentiation is clear: cheap agents (60 tokens, 65% easy accuracy) vs expensive agents (350 tokens, 95% easy accuracy). OCC tries cheap first, escalates only on failure. This is a realistic compute allocation pattern that matches production practices (e.g., GPT-3.5 before GPT-4).
-**Result:** 52.3% compute savings at identical 0.780 pass@1 accuracy.
 ### 3. Credit Decay and Non-Transferability
@@ -38,8 +41,6 @@ Ablations show:
 With 40% adversarial agents (overconfident + expensive tokens + verbose), confidence-weighted voting collapses to worse-than-random accuracy because adversarial agents are overconfident about wrong answers. OCC maintains superior accuracy by denying turns to agents with low credit balances and flagging adversarial patterns. The broker acts as a containment filter that confidence-weighted voting lacks.
-Key debate v2 finding: the RS-OS taxonomy's "communication padding" failure mode (§6.3) manifests directly — adversarial agents with high cost_per_turn drain the compute budget in baseline strategies but are cut off by OCC after initial wrong proposals.
 ### 5. Real NLI Integration
 The `cross-encoder/nli-deberta-v3-xsmall` model (70M params) loads and runs on CPU. It successfully scores evidence entailment/contradiction. However, on our synthetic QA evidence, it produces mostly neutral scores because the evidence strings are too generic. This is a valuable negative result: real NLI is only useful with domain-relevant evidence.
@@ -50,14 +51,14 @@ The `cross-encoder/nli-deberta-v3-xsmall` model (70M params) loads and runs on C
 ### 1. Real LLM Inference on HumanEval
-GPU job `69fa1fc5f2f4addb7839bdfc` successfully loaded `Qwen/Qwen2.5-Coder-0.5B-Instruct` on CUDA with chat templating (`Loaded. Chat tmpl: True`). However, **all 20 baseline answers evaluated as `passed=False`**. Diagnosis:
-- The model generates code (output is non-empty) ✓
-- The chat template is applied correctly (model receives `<|im_start|>user\n...`) ✓
 - The code extraction (`extract_function_body`) or test concatenation (`prompt + func`) produces invalid Python ✗
-Root cause: Qwen-Coder-Instruct generates code snippets that may include markdown fences, comments, or incomplete function bodies. The `extract_function_body` regex needs to be more robust — handling markdown code blocks and ensuring the extracted function is syntactically valid before running tests.
-**Fix needed:** Add markdown code block stripping and validate extracted function bodies with `ast.parse()` before running tests. Not attempted in current session due to sandbox rate-limiting.
 ### 2. Retrieval QA Accuracy
@@ -68,7 +69,7 @@ OCC baseline (0.710 accuracy) lags behind RAG+verifier (0.790). Three reasons:
 ### 3. Debate Compute Savings Are Marginal (v1)
-v1 debate saved only ~12% compute versus equal turns because all agents used similar token counts. v2 with variable agent costs (50 vs 500 tokens/turn) and adversarial agents shows much stronger differentiation — but needs to be run and measured.
 ### 4. GRPO Training Not Executed
@@ -102,7 +103,7 @@ The RL-for-LLM-MAS survey paper provides the best current taxonomy for where OCC
 1. **"NLI will dramatically improve QA" — FALSE.** NLI on short, out-of-domain text produces mostly neutral scores. Without fine-tuning on the target domain, it adds noise rather than signal.
 2. **"OCC will win on all benchmarks" — FALSE.** OCC is a meta-controller, not a direct reasoning improvement. It wins when there is clear agent/cost differentiation (code) and loses when the baseline already optimizes well (RAG+verifier).
 3. **"Simulated agents are sufficient for debate" — PARTIALLY FALSE.** The adversarial debate shows qualitative value (OCC filters bad agents), but quantitative compute savings in v1 were too small because all simulated agents used similar token counts. v2 with variable costs addresses this.
-4. **"Qwen-Coder can handle raw HumanEval prompts" — PARTIALLY FALSE.** Chat templating fixed the input format, but code extraction from generated output remains problematic. The model generates code, but the heuristics for extracting runnable function bodies need improvement.
 ---
@@ -133,7 +134,7 @@ The RL-for-LLM-MAS survey paper provides the best current taxonomy for where OCC
 ## Do the Anti-Gaming Mechanisms Matter?
-**Yes, significantly.** We mapped our 10+ attack vectors onto the RS-OS taxonomy's 5 failure modes:
 | RS-OS Failure Mode (§6.3) | OCC Attack Test | Detection |
 |---|---|---|
@@ -184,8 +185,8 @@ Non-transferability and decay rules are structurally sound: non-transferability
 | `ledger/ledger.py` | Credit Ledger with decay and provenance |
 | `broker/broker.py` | Capability-based Resource Broker |
 | `rl/reward.py` | GRPO-compatible reward hook |
-| `rl/grpo_hook.py` | TRL reward function factories |
 | `rl/grpo_train_demo.py` | Offline comparator + training attempt |
 | `benchmarks/benchmark_code.py` | Code compute allocation benchmark |
 | `benchmarks/benchmark_retrieval_qa.py` | Retrieval QA benchmark |
 | `benchmarks/benchmark_retrieval_qa_nli.py` | QA with real NLI model |
@@ -194,7 +195,8 @@ Non-transferability and decay rules are structurally sound: non-transferability
 | `benchmarks/benchmark_debate_adversarial.py` | Debate with bad agents |
 | `jobs/run_real_llm_standalone.py` | Self-contained GPU job (v1) |
 | `jobs/run_real_llm_standalone_v2.py` | GPU job with chat template fix (v2) |
-| `benchmarks/eval_runner.py` | Full evaluation + ablations + anti-gaming |
 | `reports/all_results.json` | All benchmark results (machine-readable) |
 | `reports/report.md` | This report |
 | `reports/blog_post.md` | Short blog post |

 OCC is a minimal open-source framework for cost-aware agentic compute allocation. It treats every tool call, retrieval, debate turn, and verification pass as a **budgeted resource** that agents must earn through verified marginal impact. The system has four components: an Impact Oracle (rule-based scoring), a Credit Ledger (non-transferable, decaying credits), a Resource Broker (capability-based access control), and a GRPO/RL reward hook.
+**Key Simulated Result:** On a tiered code generation benchmark, OCC achieves **52.3% compute reduction at iso-accuracy** (0.780 pass@1) versus always using the most expensive agent. Anti-gaming tests show 100% detection of hidden-test gaming and complete credit exhaustion for spam attacks.
+**Honest Limitations:**
+- Real LLM inference on HumanEval using Qwen2.5-Coder-0.5B was attempted: model loaded successfully on GPU with chat templating applied, but **all baseline answers still failed** due to code extraction issues (generated code is not valid Python when concatenated with tests).
+- Retrieval QA benchmark underperforms (0.710 accuracy vs 0.790 for RAG+verifier).
+- All quantitative results are from simulated agents; real LLM validation is pending.
 ---
 The code benchmark shows strong results because the agent differentiation is clear: cheap agents (60 tokens, 65% easy accuracy) vs expensive agents (350 tokens, 95% easy accuracy). OCC tries cheap first, escalates only on failure. This is a realistic compute allocation pattern that matches production practices (e.g., GPT-3.5 before GPT-4).
+**Simulated Result:** 52.3% compute savings at identical 0.780 pass@1 accuracy.
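The cheap-first, escalate-on-failure pattern described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in `benchmarks/benchmark_code.py`: the `Tier` class, the `escalate` function, and the `passes` callback are hypothetical names, and the Credit Ledger and Resource Broker checks are omitted. Only the token costs (60 and 350) come from the report.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Tier:
    """Hypothetical agent tier: a name, a per-call token cost, and a solver."""
    name: str
    cost_tokens: int
    solve: Callable[[str], str]


def escalate(
    task: str,
    tiers: Sequence[Tier],
    passes: Callable[[str], bool],
) -> Tuple[Optional[str], int, Optional[str]]:
    """Try tiers in the given (cheapest-first) order and stop at the first pass.

    Returns (answer, tokens_spent, tier_name). Every attempted tier is charged,
    so a failed cheap attempt still adds to the total cost.
    """
    spent = 0
    for tier in tiers:
        spent += tier.cost_tokens
        answer = tier.solve(task)
        if passes(answer):
            return answer, spent, tier.name
    # No tier produced a passing answer; the compute is still spent.
    return None, spent, None
```

With these costs, a task the cheap tier solves is charged 60 tokens instead of 350, while a task that needs escalation is charged 60 + 350 = 410. Whether the policy saves compute overall therefore depends on how often the cheap tier succeeds.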
 ### 3. Credit Decay and Non-Transferability
 With 40% adversarial agents (overconfident + expensive tokens + verbose), confidence-weighted voting collapses to worse-than-random accuracy because adversarial agents are overconfident about wrong answers. OCC maintains superior accuracy by denying turns to agents with low credit balances and flagging adversarial patterns. The broker acts as a containment filter that confidence-weighted voting lacks.
 ### 5. Real NLI Integration
 The `cross-encoder/nli-deberta-v3-xsmall` model (70M params) loads and runs on CPU. It successfully scores evidence entailment/contradiction. However, on our synthetic QA evidence, it produces mostly neutral scores because the evidence strings are too generic. This is a valuable negative result: real NLI is only useful with domain-relevant evidence.
 ### 1. Real LLM Inference on HumanEval
+GPU jobs successfully loaded `Qwen/Qwen2.5-Coder-0.5B-Instruct` on CUDA with chat templating applied (`Chat template present: True`). However, **all baseline answers evaluated as `passed=False`** across multiple attempts (v1, v2, v3). Diagnosis:
+- The model generates code (output is non-empty, ~100-200 tokens) ✓
+- The chat template is applied correctly ✓
 - The code extraction (`extract_function_body`) or test concatenation (`prompt + func`) produces invalid Python ✗
+Root cause: Qwen-Coder-Instruct generates code snippets that may include markdown fences, comments, or incomplete function bodies when given HumanEval prompts. The `extract_function_body` regex needs to be more robust — handling markdown code blocks, ensuring the extracted function is syntactically valid before running tests, and potentially using `ast.parse()` validation.
+**Fix needed:** Add markdown code block stripping and `ast.parse()` validation before test execution. Not yet resolved.
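One way the fix described above might look is sketched below: strip a markdown fence if present, then accept a candidate only if `ast.parse()` succeeds and the target function is actually defined. The helper names (`strip_markdown_fences`, `defines_function`, `build_candidate`) are hypothetical and are not the repository's `extract_function_body`. The `entry_point` argument assumes the function name is available per task, as it is in the HumanEval `entry_point` field.

```python
import ast
import re
from typing import Optional

# Build the fence marker programmatically so this snippet can itself be fenced.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Return the body of the first fenced block, or the text unchanged."""
    match = FENCE_RE.search(text)
    return match.group(1) if match else text


def defines_function(source: str, name: str) -> bool:
    """True if `source` parses and defines a top-level function `name`."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == name
        for node in tree.body
    )


def build_candidate(prompt: str, completion: str, entry_point: str) -> Optional[str]:
    """Return runnable source for the task, or None if nothing validates.

    Tries the completion as a full definition first, then as a bare function
    body appended to the prompt.
    """
    code = strip_markdown_fences(completion)
    for candidate in (code, prompt + code):
        if defines_function(candidate, entry_point):
            return candidate
    return None
```

Checking for the function definition, not just a successful parse, matters because `ast.parse()` accepts some fragments that are not usable on their own, such as a bare top-level `return` statement. A `None` result can be logged as an extraction failure separately from a genuine test failure.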
 ### 2. Retrieval QA Accuracy
 ### 3. Debate Compute Savings Are Marginal (v1)
+v1 debate saved only ~12% compute versus equal turns because all agents used similar token counts. v2 with variable agent costs (50 vs 500 tokens/turn) and adversarial agents shows much stronger differentiation.
 ### 4. GRPO Training Not Executed
 1. **"NLI will dramatically improve QA" — FALSE.** NLI on short, out-of-domain text produces mostly neutral scores. Without fine-tuning on the target domain, it adds noise rather than signal.
 2. **"OCC will win on all benchmarks" — FALSE.** OCC is a meta-controller, not a direct reasoning improvement. It wins when there is clear agent/cost differentiation (code) and loses when the baseline already optimizes well (RAG+verifier).
 3. **"Simulated agents are sufficient for debate" — PARTIALLY FALSE.** The adversarial debate shows qualitative value (OCC filters bad agents), but quantitative compute savings in v1 were too small because all simulated agents used similar token counts. v2 with variable costs addresses this.
+4. **"Qwen-Coder can handle raw HumanEval prompts" — FALSE.** Chat templating fixed the input format, but code extraction from generated output remains problematic. The model generates code, but the heuristics for extracting runnable function bodies need significant improvement.
 ---
 ## Do the Anti-Gaming Mechanisms Matter?
+**Yes, significantly.** We mapped our attack vectors onto the RS-OS taxonomy's 5 failure modes:
 | RS-OS Failure Mode (§6.3) | OCC Attack Test | Detection |
 |---|---|---|
 | `ledger/ledger.py` | Credit Ledger with decay and provenance |
 | `broker/broker.py` | Capability-based Resource Broker |
 | `rl/reward.py` | GRPO-compatible reward hook |
 | `rl/grpo_train_demo.py` | Offline comparator + training attempt |
+| `grpo_hook.py` | TRL-compatible reward function factory |
 | `benchmarks/benchmark_code.py` | Code compute allocation benchmark |
 | `benchmarks/benchmark_retrieval_qa.py` | Retrieval QA benchmark |
 | `benchmarks/benchmark_retrieval_qa_nli.py` | QA with real NLI model |
 | `benchmarks/benchmark_debate_adversarial.py` | Debate with bad agents |
 | `jobs/run_real_llm_standalone.py` | Self-contained GPU job (v1) |
 | `jobs/run_real_llm_standalone_v2.py` | GPU job with chat template fix (v2) |
+| `jobs/run_real_llm_standalone_v3.py` | Clean GPU job (v3) |
+| `eval_runner.py` | Full evaluation + ablations + anti-gaming |
 | `reports/all_results.json` | All benchmark results (machine-readable) |
 | `reports/report.md` | This report |
 | `reports/blog_post.md` | Short blog post |