narcolepticchicken committed on
Commit
ea66c97
·
verified ·
1 Parent(s): 5a7ff41

Upload README.md

Files changed (1)
  1. README.md +40 -39
README.md CHANGED
@@ -9,17 +9,17 @@ Modern agent systems waste test-time compute because every tool call, retrieval,
9
  ## Core Architecture
10
 
11
  ```
12
- ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
13
- │  Impact Oracle  │────▶│  Credit Ledger  │────▶│ Resource Broker │
14
- │ (score action)  │     │  (earn/spend)   │     │  (allow/deny)   │
15
- └─────────────────┘     └─────────────────┘     └─────────────────┘
16
-          │                                              │
17
-          └──────────────────┬───────────────────────────┘
18
-                             │
19
-                      ┌──────────────┐
20
-                      │ GRPO/RL Hook │
21
-                      │ (reward func)│
22
-                      └──────────────┘
23
  ```
24
 
25
  ### 1. Impact Oracle (`oracle/`)
@@ -35,7 +35,7 @@ All scores are cost-adjusted: `reward = verified_impact - compute_cost * penalty
35
 
36
  - **Non-transferable** credits (laundering prevention)
37
  - **Exponential decay** on idle credits (hoarding prevention)
38
- - **Capability-scoped** rights (retrieval credits ≠ file-write credits)
39
  - **Full provenance** with oracle hash and reason
40
 
41
  ### 3. Resource Broker (`broker/`)
@@ -54,34 +54,36 @@ TRL-compatible reward function wrapping the Impact Oracle. Includes offline poli
54
  ## Installation
55
 
56
  ```bash
 
 
57
  pip install -e .
 
58
  # For NLI evidence scoring:
59
- pip install sentence-transformers
 
60
  # For real LLM inference:
61
- pip install transformers datasets
62
- # For GRPO training:
63
- pip install trl accelerate
64
  ```
65
 
66
  ## Quick Start
67
 
68
  ```bash
69
  # Run all benchmarks and ablations
70
- python -m benchmarks.eval_runner
71
 
72
  # Run individual benchmarks
73
  python -m benchmarks.benchmark_code
74
  python -m benchmarks.benchmark_retrieval_qa
75
  python -m benchmarks.benchmark_debate
76
 
77
- # Run with real NLI model (requires sentence-transformers)
78
- python -m benchmarks.benchmark_retrieval_qa_nli
79
-
80
  # Adversarial debate benchmark
81
  python -m benchmarks.benchmark_debate_adversarial
82
 
83
  # GRPO offline demonstrator
84
  python -m rl.grpo_train_demo
 
 
 
85
  ```
86
 
87
  ## Benchmark Results
@@ -96,33 +98,30 @@ python -m rl.grpo_train_demo
96
 
97
  OCC tries cheap agents first, escalates only on failure. At iso-accuracy (0.780 pass@1), it reduces compute by 52%.
98
 
99
- ### Code Compute Allocation (Real LLM - Qwen2.5-Coder-0.5B)
100
 
101
- GPU job running on T4. Script: `jobs/run_real_llm_standalone.py`
102
 
103
- ### Retrieval QA (with real NLI - cross-encoder/nli-deberta-v3-xsmall)
104
 
105
  | Strategy | Accuracy | ECE | Retrievals |
106
  |----------|----------|-----|------------|
107
  | Direct answer | 0.580 | 0.226 | 0 |
108
  | RAG baseline | 0.750 | 0.167 | 338 |
109
- | RAG + verifier | 0.790 | 0.151 | 344 |
110
  | OCC baseline | 0.710 | 0.201 | 227 |
111
- | **OCC + real NLI** | *needs calibration* | — | 220 |
112
-
113
- Note: OCC + NLI shows stronger evidence quality but broker thresholds are too conservative on neutral evidence. Needs tuning for production use.
114
 
115
- ### Multi-Agent Debate
116
 
117
- With 50% adversarial agents:
118
 
119
- | Strategy | Accuracy | Quality/Compute |
120
- |----------|----------|-----------------|
121
- | Equal turns | 0.760 | 0.001275 |
122
- | Confidence-weighted | **0.560** | 0.000924 |
123
- | **OCC credit allocation** | **0.760** | **0.001196** |
124
 
125
- OCC contains adversarial agents while confidence-weighted voting collapses (bad agents exploit high confidence).
126
 
127
  ### Anti-Gaming
128
 
@@ -131,6 +130,7 @@ OCC contains adversarial agents while confidence-weighted voting collapses (bad
131
  | Spam low-value actions | 100% credit exhaustion | Credits = 0 |
132
  | Hidden-test gaming | 100% oracle detection | Immediate penalty |
133
  | Over-abstention | 70% oracle penalization | Wrong abstentions punished |
 
134
 
135
  ## Project Structure
136
 
@@ -142,16 +142,17 @@ OCC contains adversarial agents while confidence-weighted voting collapses (bad
142
  /rl - GRPO reward hooks and offline comparator
143
  /benchmarks - Code, QA, and debate benchmarks
144
  /jobs - GPU job scripts for real LLM inference
145
- /reports - Evaluation results (JSON)
146
  /configs - Configuration files
147
  ```
148
 
149
- ## Limitations & Next Steps
150
 
151
- 1. **Retrieval QA** needs better NLI calibration. Real NLI scores are strong but broker thresholds are too aggressive on neutral evidence.
152
- 2. **All benchmarks use simulated agents** for tractability. Real LLM inference script (`jobs/run_real_llm_standalone.py`) is submitted as a GPU job.
153
  3. **GRPO training** hook is implemented but not trained on real data. Offline comparator validates the reward design.
154
  4. **Cost model** is token-count only. Real cost should include model size, latency, and API pricing.
 
155
 
156
  ## Citation
157
 
 
9
  ## Core Architecture
10
 
11
  ```
12
+ +-----------------+     +-----------------+     +-----------------+
13
+ |  Impact Oracle  |---->|  Credit Ledger  |---->| Resource Broker |
14
+ | (score action)  |     |  (earn/spend)   |     |  (allow/deny)   |
15
+ +-----------------+     +-----------------+     +-----------------+
16
+          |                                              |
17
+          +------------------+---------------------------+
18
+                             |
19
+                      +--------------+
20
+                      | GRPO/RL Hook |
21
+                      | (reward func)|
22
+                      +--------------+
23
  ```
24
 
25
  ### 1. Impact Oracle (`oracle/`)
 
35
 
36
  - **Non-transferable** credits (laundering prevention)
37
  - **Exponential decay** on idle credits (hoarding prevention)
38
+ - **Capability-scoped** rights (retrieval credits != file-write credits)
39
  - **Full provenance** with oracle hash and reason
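
The ledger properties listed above can be illustrated with a minimal sketch. This is not the repository's ledger code: the `CreditLedger` class, its method names, and the `half_life` parameter are hypothetical, chosen only to show how capability scoping, idle decay, non-transferability, and provenance logging might fit together.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CreditLedger:
    """Hypothetical per-agent ledger; names and defaults are illustrative."""

    half_life: float = 100.0  # assumed idle half-life, in arbitrary time units
    # (agent, capability) -> (balance, time of last update)
    _accounts: dict = field(default_factory=dict)
    # Provenance log: one record per earn or spend.
    log: list = field(default_factory=list)

    def balance(self, agent: str, capability: str, now: float) -> float:
        amount, last = self._accounts.get((agent, capability), (0.0, now))
        # Exponential decay on idle credits discourages hoarding.
        return amount * math.exp(-math.log(2) * (now - last) / self.half_life)

    def earn(self, agent, capability, amount, now, oracle_hash, reason):
        new_balance = self.balance(agent, capability, now) + amount
        self._accounts[(agent, capability)] = (new_balance, now)
        self.log.append((agent, capability, amount, oracle_hash, reason))

    def spend(self, agent, capability, amount, now) -> bool:
        current = self.balance(agent, capability, now)
        if amount > current:
            return False  # credits for one capability cannot cover another
        self._accounts[(agent, capability)] = (current - amount, now)
        self.log.append((agent, capability, -amount, None, "spend"))
        return True
```

Keying accounts by `(agent, capability)` and offering no transfer method is one simple way to make credits both capability-scoped and non-transferable; a real ledger would likely need persistence and stricter validation.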
40
 
41
  ### 3. Resource Broker (`broker/`)
 
54
  ## Installation
55
 
56
  ```bash
57
+ git clone https://huggingface.co/narcolepticchicken/occ-stack
58
+ cd occ-stack
59
  pip install -e .
60
+
61
  # For NLI evidence scoring:
62
+ pip install -e ".[nli]"
63
+
64
  # For real LLM inference:
65
+ pip install -e ".[train]"
 
 
66
  ```
67
 
68
  ## Quick Start
69
 
70
  ```bash
71
  # Run all benchmarks and ablations
72
+ python eval_runner.py
73
 
74
  # Run individual benchmarks
75
  python -m benchmarks.benchmark_code
76
  python -m benchmarks.benchmark_retrieval_qa
77
  python -m benchmarks.benchmark_debate
78
 
 
 
 
79
  # Adversarial debate benchmark
80
  python -m benchmarks.benchmark_debate_adversarial
81
 
82
  # GRPO offline demonstrator
83
  python -m rl.grpo_train_demo
84
+
85
+ # Real LLM code benchmark (requires GPU, ~30 min)
86
+ # See jobs/run_real_llm_standalone_v3.py for a self-contained GPU job
87
  ```
88
 
89
  ## Benchmark Results
 
98
 
99
  OCC tries cheap agents first, escalates only on failure. At iso-accuracy (0.780 pass@1), it reduces compute by 52%.
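
The cheap-first escalation policy can be sketched as follows. The `Agent` tuple layout, the `solve_with_escalation` function, and the `passes` callback are assumptions made for illustration; the benchmark's actual agent and verifier interfaces may differ.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical agent description: (name, cost per call, solver function).
Agent = Tuple[str, float, Callable[[str], str]]


def solve_with_escalation(
    task: str,
    agents: Sequence[Agent],
    passes: Callable[[str, str], bool],
) -> Tuple[Optional[str], float]:
    """Try agents from cheapest to most expensive; stop at the first pass."""
    spent = 0.0
    for _name, cost, solver in sorted(agents, key=lambda agent: agent[1]):
        spent += cost
        answer = solver(task)
        if passes(task, answer):
            return answer, spent
    # No agent passed: report failure along with the total compute spent.
    return None, spent
```

Under this scheme, compute is saved whenever a cheap agent passes, and the worst case costs the sum of all agents.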
100
 
101
+ ### Code Compute Allocation (Real LLM Qwen2.5-Coder-0.5B)
102
 
103
+ **Status: Attempted but blocked.** The model loads successfully on GPU and generates code, but code extraction heuristics do not yet produce valid runnable Python when concatenated with HumanEval tests. This is a known issue documented in `reports/report.md`. Fixes needed: markdown stripping, AST validation, better body replacement.
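
Two of the fixes named above, markdown stripping and AST validation, might look roughly like this. It is a sketch using only the standard library (`re`, `ast`); the `extract_python` helper is hypothetical and is not the code in `jobs/`, and body replacement is not covered.

```python
import ast
import re
from typing import Optional

_TICKS = "`" * 3  # a markdown code-fence delimiter
# Matches fenced blocks with an optional "python" or "py" language tag.
_FENCE = re.compile(
    _TICKS + r"(?:python|py)?[ \t]*\n(.*?)" + _TICKS,
    re.DOTALL | re.IGNORECASE,
)


def extract_python(completion: str) -> Optional[str]:
    """Return the first candidate that parses as Python, else None."""
    candidates = [match.group(1) for match in _FENCE.finditer(completion)]
    candidates.append(completion)  # fall back to the raw completion text
    for code in candidates:
        code = code.strip("\n")
        try:
            ast.parse(code)  # syntax check only; nothing is executed
        except SyntaxError:
            continue
        return code
    return None
```

Parsing before concatenating with the HumanEval tests separates "the model wrote invalid Python" from "the harness assembled it incorrectly".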
104
 
105
+ ### Retrieval QA (Simulated)
106
 
107
  | Strategy | Accuracy | ECE | Retrievals |
108
  |----------|----------|-----|------------|
109
  | Direct answer | 0.580 | 0.226 | 0 |
110
  | RAG baseline | 0.750 | 0.167 | 338 |
111
+ | RAG + verifier | **0.790** | 0.151 | 344 |
112
  | OCC baseline | 0.710 | 0.201 | 227 |
 
 
 
113
 
114
+ Note: OCC does not yet beat RAG + verifier on raw accuracy. OCC's value is compute savings and anti-gaming, not pure accuracy. See `reports/report.md` for analysis.
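
For readers unfamiliar with the ECE column: assuming it denotes expected calibration error, the standard equal-width-bin estimate is sketched below. The README does not state the binning used, so the 10-bin default and the function name are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sum of bin weight times |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece
```

Lower is better: a value of 0 means that stated confidence matches observed accuracy in every bin.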
115
 
116
+ ### Multi-Agent Debate (50% adversarial agents, v2)
117
 
118
+ | Strategy | Accuracy | Quality/Compute | Bad Agent Containment |
119
+ |----------|----------|-----------------|----------------------|
120
+ | Equal turns | 0.760 | 0.001275 | 0% |
121
+ | Confidence-weighted | **0.560** | 0.000924 | 0% |
122
+ | **OCC credit allocation** | **0.760** | **0.001196** | **100%** |
123
 
124
+ Confidence-weighted voting **made things worse**: adversarial agents are overconfident, so their wrong answers were amplified. OCC denied turns to adversarial agents entirely after their initial wrong proposals.
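
A toy comparison shows why the two aggregation rules diverge. The `Proposal` layout and both functions are hypothetical simplifications, not the benchmark code: the credit-gated rule here gives one equal vote to each agent that still holds credit.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical proposal: (agent, answer, self-reported confidence).
Proposal = Tuple[str, str, float]


def confidence_weighted(proposals: List[Proposal]) -> str:
    """Each answer scores the sum of its supporters' stated confidence."""
    scores: Dict[str, float] = defaultdict(float)
    for _agent, answer, confidence in proposals:
        scores[answer] += confidence
    return max(scores, key=scores.get)


def credit_gated(proposals: List[Proposal], credits: Dict[str, float]) -> Optional[str]:
    """One vote per agent that still holds credit; others are denied a turn."""
    scores: Dict[str, float] = defaultdict(float)
    for agent, answer, _confidence in proposals:
        if credits.get(agent, 0.0) > 0.0:
            scores[answer] += 1.0
    return max(scores, key=scores.get) if scores else None


# Half the agents are adversarial and overconfident, with exhausted credit.
proposals = [("good1", "A", 0.60), ("good2", "A", 0.55),
             ("bad1", "B", 0.99), ("bad2", "B", 0.98)]
credits = {"good1": 2.0, "good2": 1.5, "bad1": 0.0, "bad2": 0.0}
print(confidence_weighted(proposals))    # B
print(credit_gated(proposals, credits))  # A
```

Self-reported confidence is controlled by the agent, while credit is earned from verified outcomes, which is why gating on credit is harder to exploit in this toy setup.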
125
 
126
  ### Anti-Gaming
127
 
 
130
  | Spam low-value actions | 100% credit exhaustion | Credits = 0 |
131
  | Hidden-test gaming | 100% oracle detection | Immediate penalty |
132
  | Over-abstention | 70% oracle penalization | Wrong abstentions punished |
133
+ | Collusion in debate | Credit-based filtering | Adversarial agents excluded |
134
 
135
  ## Project Structure
136
 
 
142
  /rl - GRPO reward hooks and offline comparator
143
  /benchmarks - Code, QA, and debate benchmarks
144
  /jobs - GPU job scripts for real LLM inference
145
+ /reports - Evaluation results and technical report
146
  /configs - Configuration files
147
  ```
148
 
149
+ ## Limitations & Honest Assessment
150
 
151
+ 1. **All benchmarks use simulated agents** for tractability. Real LLM inference was attempted but code extraction heuristics need improvement.
152
+ 2. **Retrieval QA** needs better NLI calibration. Real NLI scores are strong but broker thresholds are too aggressive on neutral evidence.
153
  3. **GRPO training** hook is implemented but not trained on real data. Offline comparator validates the reward design.
154
  4. **Cost model** is token-count only. Real cost should include model size, latency, and API pricing.
155
+ 5. **OCC is a meta-controller, not a direct reasoning improvement.** It wins when there is clear agent/cost differentiation and loses when the baseline already optimizes well.
156
 
157
  ## Citation
158