Upload README.md
README.md CHANGED

@@ -9,17 +9,17 @@ Modern agent systems waste test-time compute because every tool call, retrieval,
## Core Architecture

```
-
-
-
-
-
-
-
-
-
-
-
```

### 1. Impact Oracle (`oracle/`)

@@ -35,7 +35,7 @@ All scores are cost-adjusted: `reward = verified_impact - compute_cost * penalty

- **Non-transferable** credits (laundering prevention)
- **Exponential decay** on idle credits (hoarding prevention)
- - **Capability-scoped** rights (retrieval credits
- **Full provenance** with oracle hash and reason
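
As a rough illustration of how the ledger properties listed above could fit together, here is a minimal sketch. It is not the repository's implementation: the class name `CreditLedger`, the `half_life` parameter, the method names, and the capability strings are all assumptions made for this example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CreditLedger:
    """Toy ledger (hypothetical API): balances are keyed by (agent, capability)."""

    half_life: float = 10.0  # assumed: idle steps after which a balance halves
    balances: dict = field(default_factory=dict)
    last_touched: dict = field(default_factory=dict)
    provenance: list = field(default_factory=list)

    def _apply_decay(self, key, now):
        # Exponential decay over the idle interval discourages hoarding.
        idle = now - self.last_touched.get(key, now)
        factor = math.exp(-math.log(2) * idle / self.half_life)
        self.balances[key] = self.balances.get(key, 0.0) * factor
        self.last_touched[key] = now

    def earn(self, agent, capability, amount, now, oracle_hash, reason):
        key = (agent, capability)
        self._apply_decay(key, now)
        self.balances[key] += amount
        # Every grant records which oracle verdict justified it.
        self.provenance.append(
            {"agent": agent, "capability": capability, "amount": amount,
             "oracle_hash": oracle_hash, "reason": reason}
        )

    def spend(self, agent, capability, amount, now):
        # Credits are scoped: a retrieval balance cannot pay for file writes.
        key = (agent, capability)
        self._apply_decay(key, now)
        if self.balances[key] < amount:
            return False
        self.balances[key] -= amount
        return True

    # Deliberately no transfer() method: credits are non-transferable.


ledger = CreditLedger(half_life=10.0)
ledger.earn("agent_a", "retrieval", 4.0, now=0, oracle_hash="abc123", reason="verified answer")
print(ledger.spend("agent_a", "file_write", 1.0, now=0))  # wrong capability, denied
print(ledger.spend("agent_a", "retrieval", 1.0, now=0))   # allowed
print(ledger.spend("agent_a", "retrieval", 2.0, now=10))  # denied after idle decay
```

Keying balances by `(agent, capability)` is one simple way to get scoping, and omitting any transfer operation is the simplest way to make credits non-transferable. The real ledger may enforce both differently.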

### 3. Resource Broker (`broker/`)

@@ -54,34 +54,36 @@ TRL-compatible reward function wrapping the Impact Oracle. Includes offline poli
## Installation

```bash
pip install -e .
# For NLI evidence scoring:
- pip install
# For real LLM inference:
- pip install
- # For GRPO training:
- pip install trl accelerate
```

## Quick Start

```bash
# Run all benchmarks and ablations
- python

# Run individual benchmarks
python -m benchmarks.benchmark_code
python -m benchmarks.benchmark_retrieval_qa
python -m benchmarks.benchmark_debate

- # Run with real NLI model (requires sentence-transformers)
- python -m benchmarks.benchmark_retrieval_qa_nli
-
# Adversarial debate benchmark
python -m benchmarks.benchmark_debate_adversarial

# GRPO offline demonstrator
python -m rl.grpo_train_demo
```

## Benchmark Results

@@ -96,33 +98,30 @@ python -m rl.grpo_train_demo

OCC tries cheap agents first and escalates only on failure. At iso-accuracy (0.780 pass@1), it reduces compute by 52%.
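
The cheap-first escalation described above can be sketched as a simple cascade. This is an illustrative assumption about the control flow, not the benchmark's actual code: `run_cascade`, the `(name, cost, solve)` tuple layout, and the toy agents are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# (name, cost per call, solver returning an answer or None); layout is assumed
Agent = Tuple[str, float, Callable[[str], Optional[str]]]


def run_cascade(task: str, agents: Sequence[Agent],
                verify: Callable[[str, str], bool]):
    """Try agents from cheapest to most expensive; stop at the first verified answer."""
    spent = 0.0
    for name, cost, solve in sorted(agents, key=lambda a: a[1]):
        spent += cost  # every attempt is paid for, including failed ones
        answer = solve(task)
        if answer is not None and verify(task, answer):
            return answer, name, spent
    return None, None, spent


# Toy agents: the small one only handles short tasks, the large one handles all.
agents = [
    ("large", 10.0, lambda t: t.upper()),
    ("small", 1.0, lambda t: t.upper() if len(t) <= 3 else None),
]
verify = lambda task, answer: answer == task.upper()

easy = run_cascade("abc", agents, verify)
hard = run_cascade("abcdef", agents, verify)
print(easy)
print(hard)
```

Easy tasks stop at the cheap agent, while hard tasks pay for both attempts. Any compute saving therefore depends on how often the cheap agent succeeds.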

- ### Code Compute Allocation (Real LLM

-

- ### Retrieval QA (

| Strategy | Accuracy | ECE | Retrievals |
|----------|----------|-----|------------|
| Direct answer | 0.580 | 0.226 | 0 |
| RAG baseline | 0.750 | 0.167 | 338 |
- | RAG + verifier | 0.790 | 0.151 | 344 |
| OCC baseline | 0.710 | 0.201 | 227 |
- | **OCC + real NLI** | *needs calibration* | - | 220 |
-
- Note: OCC + NLI shows stronger evidence quality but broker thresholds are too conservative on neutral evidence. Needs tuning for production use.

-

-

- | Strategy | Accuracy | Quality/Compute |
- |----------|----------|-----------------|
- | Equal turns | 0.760 | 0.001275 |
- | Confidence-weighted | **0.560** | 0.000924 |
- | **OCC credit allocation** | **0.760** | **0.001196** |

-

### Anti-Gaming

@@ -131,6 +130,7 @@ OCC contains adversarial agents while confidence-weighted voting collapses (bad
| Spam low-value actions | 100% credit exhaustion | Credits = 0 |
| Hidden-test gaming | 100% oracle detection | Immediate penalty |
| Over-abstention | 70% oracle penalization | Wrong abstentions punished |

## Project Structure

@@ -142,16 +142,17 @@ OCC contains adversarial agents while confidence-weighted voting collapses (bad
```
/rl - GRPO reward hooks and offline comparator
/benchmarks - Code, QA, and debate benchmarks
/jobs - GPU job scripts for real LLM inference
- /reports - Evaluation results
/configs - Configuration files
```

- ## Limitations &

- 1. **
- 2. **
3. **GRPO training** hook is implemented but not trained on real data. Offline comparator validates the reward design.
4. **Cost model** is token-count only. Real cost should include model size, latency, and API pricing.

## Citation

@@ -9,17 +9,17 @@ Modern agent systems waste test-time compute because every tool call, retrieval,
## Core Architecture

```
+ +-----------------+     +-----------------+     +-----------------+
+ |  Impact Oracle  |---->|  Credit Ledger  |---->| Resource Broker |
+ | (score action)  |     |  (earn/spend)   |     |  (allow/deny)   |
+ +-----------------+     +-----------------+     +-----------------+
+          |                                              |
+          +------------------+---------------------------+
+                             |
+                      +--------------+
+                      | GRPO/RL Hook |
+                      | (reward func)|
+                      +--------------+
```

### 1. Impact Oracle (`oracle/`)

@@ -35,7 +35,7 @@ All scores are cost-adjusted: `reward = verified_impact - compute_cost * penalty

- **Non-transferable** credits (laundering prevention)
- **Exponential decay** on idle credits (hoarding prevention)
+ - **Capability-scoped** rights (retrieval credits != file-write credits)
- **Full provenance** with oracle hash and reason

### 3. Resource Broker (`broker/`)

@@ -54,34 +54,36 @@ TRL-compatible reward function wrapping the Impact Oracle. Includes offline poli
## Installation

```bash
+ git clone https://huggingface.co/narcolepticchicken/occ-stack
+ cd occ-stack
pip install -e .
+
# For NLI evidence scoring:
+ pip install -e ".[nli]"
+
# For real LLM inference:
+ pip install -e ".[train]"
```

## Quick Start

```bash
# Run all benchmarks and ablations
+ python eval_runner.py

# Run individual benchmarks
python -m benchmarks.benchmark_code
python -m benchmarks.benchmark_retrieval_qa
python -m benchmarks.benchmark_debate

# Adversarial debate benchmark
python -m benchmarks.benchmark_debate_adversarial

# GRPO offline demonstrator
python -m rl.grpo_train_demo
+
+ # Real LLM code benchmark (requires GPU, ~30 min)
+ # See jobs/run_real_llm_standalone_v3.py for a self-contained GPU job
```

## Benchmark Results

@@ -96,33 +98,30 @@ python -m rl.grpo_train_demo

OCC tries cheap agents first and escalates only on failure. At iso-accuracy (0.780 pass@1), it reduces compute by 52%.

+ ### Code Compute Allocation (Real LLM: Qwen2.5-Coder-0.5B)

+ **Status: Attempted but blocked.** The model loads successfully on GPU and generates code, but the code extraction heuristics do not yet produce valid, runnable Python when concatenated with the HumanEval tests. This is a known issue documented in `reports/report.md`. Fixes needed: markdown stripping, AST validation, and better body replacement.
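
Two of the fixes named above, markdown stripping and AST validation, could look roughly like the following. This is a sketch under assumptions, not the planned fix: `extract_python` and the fence regex are hypothetical, and body replacement against the HumanEval prompt is not covered.

```python
import ast
import re
from typing import Optional

# The fence marker is built programmatically so this example nests cleanly in markdown.
FENCE = "`" * 3
FENCE_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_python(completion: str) -> Optional[str]:
    """Return the first candidate that parses as Python, or None if none does."""
    # Prefer fenced blocks; fall back to the raw completion if there are none.
    candidates = [m.group(1) for m in FENCE_RE.finditer(completion)]
    candidates.append(completion)
    for code in candidates:
        code = code.strip("\n")
        if not code.strip():
            continue
        try:
            ast.parse(code)  # syntax check only; nothing is executed
        except (SyntaxError, ValueError):
            continue
        return code
    return None


raw = (
    "Here is the fix:\n"
    + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
    + "\nHope it helps."
)
print(extract_python(raw))
```

`ast.parse` only confirms that the text is syntactically valid Python. It does not show that the extracted function matches the signature the tests expect.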

+ ### Retrieval QA (Simulated)

| Strategy | Accuracy | ECE | Retrievals |
|----------|----------|-----|------------|
| Direct answer | 0.580 | 0.226 | 0 |
| RAG baseline | 0.750 | 0.167 | 338 |
+ | RAG + verifier | **0.790** | 0.151 | 344 |
| OCC baseline | 0.710 | 0.201 | 227 |

+ Note: OCC does not yet beat RAG + verifier on raw accuracy. OCC's value lies in compute savings and anti-gaming, not pure accuracy. See `reports/report.md` for analysis.
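
For readers unfamiliar with the ECE column, the sketch below shows the common equal-width binned definition of expected calibration error. The README does not state how the benchmark computes ECE, so the binning scheme and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece


# Four answers at 0.9 confidence, three of them correct.
overconfident = expected_calibration_error([0.9] * 4, [True, True, True, False])
print(round(overconfident, 3))
```

Lower is better: a value of 0 means stated confidence matches observed accuracy in every bin.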

+ ### Multi-Agent Debate (50% adversarial agents, v2)

+ | Strategy | Accuracy | Quality/Compute | Bad Agent Containment |
+ |----------|----------|-----------------|----------------------|
+ | Equal turns | 0.760 | 0.001275 | 0% |
+ | Confidence-weighted | **0.560** | 0.000924 | 0% |
+ | **OCC credit allocation** | **0.760** | **0.001196** | **100%** |

+ Confidence-weighted voting **made things worse**: adversarial agents are overconfident, so their wrong answers were amplified. OCC denied turns to adversarial agents entirely after their initial wrong proposals.
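
The failure mode described above can be reproduced in a toy setting. The two functions below are hypothetical stand-ins, not the benchmark's aggregation code. The proposals, confidences, and credit balances are made-up numbers chosen only to show the effect.

```python
from collections import defaultdict


def confidence_weighted_vote(proposals):
    """proposals: list of (agent, answer, self-reported confidence)."""
    scores = defaultdict(float)
    for _, answer, confidence in proposals:
        scores[answer] += confidence
    return max(scores, key=scores.get)


def credit_gated_vote(proposals, credits, min_credit=1.0):
    """Only agents holding at least min_credit get a turn; votes are unweighted."""
    scores = defaultdict(float)
    for agent, answer, _ in proposals:
        if credits.get(agent, 0.0) >= min_credit:
            scores[answer] += 1.0
    return max(scores, key=scores.get) if scores else None


# Half the agents are adversarial and overconfident; "A" is the correct answer.
proposals = [
    ("honest_1", "A", 0.60),
    ("honest_2", "A", 0.55),
    ("adv_1", "B", 0.99),
    ("adv_2", "B", 0.98),
]
# Assumed state after earlier rounds: adversarial agents have exhausted their credits.
credits = {"honest_1": 3.0, "honest_2": 2.0, "adv_1": 0.0, "adv_2": 0.0}

print(confidence_weighted_vote(proposals))
print(credit_gated_vote(proposals, credits))
```

Self-reported confidence is controlled by the agent, while credits are granted by an external oracle. That is why gating on credits resists this attack in the toy example.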

### Anti-Gaming

@@ -131,6 +130,7 @@ OCC contains adversarial agents while confidence-weighted voting collapses (bad
| Spam low-value actions | 100% credit exhaustion | Credits = 0 |
| Hidden-test gaming | 100% oracle detection | Immediate penalty |
| Over-abstention | 70% oracle penalization | Wrong abstentions punished |
+ | Collusion in debate | Credit-based filtering | Adversarial agents excluded |

## Project Structure

@@ -142,16 +142,17 @@ OCC contains adversarial agents while confidence-weighted voting collapses (bad
```
/rl - GRPO reward hooks and offline comparator
/benchmarks - Code, QA, and debate benchmarks
/jobs - GPU job scripts for real LLM inference
+ /reports - Evaluation results and technical report
/configs - Configuration files
```

+ ## Limitations & Honest Assessment

+ 1. **All benchmarks use simulated agents** for tractability. Real LLM inference was attempted but code extraction heuristics need improvement.
+ 2. **Retrieval QA** needs better NLI calibration. Real NLI scores are strong but broker thresholds are too aggressive on neutral evidence.
3. **GRPO training** hook is implemented but not trained on real data. Offline comparator validates the reward design.
4. **Cost model** is token-count only. Real cost should include model size, latency, and API pricing.
+ 5. **OCC is a meta-controller, not a direct reasoning improvement.** It wins when there is clear agent/cost differentiation and loses when the baseline already optimizes well.
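
To make the cost-model limitation concrete, the sketch below contrasts a token-count cost with one possible composite cost that also accounts for model size, latency, and API pricing. Everything here is an assumption for illustration: the `CallCost` fields, the tokens-times-parameters compute proxy, and the unit weights are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCost:
    tokens: int
    params_b: float           # model size in billions of parameters
    latency_s: float          # wall-clock seconds for the call
    usd_per_1k_tokens: float  # API price, 0.0 for a local model


def token_only_cost(call: CallCost) -> float:
    """Token count only, as in the limitation described above."""
    return float(call.tokens)


def composite_cost(call: CallCost, w_compute=1.0, w_latency=1.0, w_price=1.0) -> float:
    """Hypothetical richer cost; the weights need tuning per deployment."""
    compute = call.tokens * call.params_b  # crude proxy: tokens x model size
    price = call.tokens / 1000 * call.usd_per_1k_tokens
    return w_compute * compute + w_latency * call.latency_s + w_price * price


small = CallCost(tokens=1000, params_b=0.5, latency_s=2.0, usd_per_1k_tokens=0.0)
large = CallCost(tokens=1000, params_b=70.0, latency_s=8.0, usd_per_1k_tokens=0.01)

print(token_only_cost(small) == token_only_cost(large))
print(composite_cost(small) < composite_cost(large))
```

Under a token-only model the two calls look equally expensive, so a broker has no reason to prefer the small model. The composite version restores that distinction.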

## Citation
