Add benchmark harness documentation to README
README.md
@@ -126,6 +126,97 @@ surgeon.modules["halluc_gate"].disable()
surgeon.save_cortex_modules("cortex_weights.pt")
```

## Benchmark Harness

Cortex includes a benchmark harness for comparing a base LLM against its Cortex-enhanced version. It evaluates both models on standard NLP benchmarks and on Cortex-specific capability tests.

### Standard Benchmarks

| Task | Type | Choices | Dataset | Few-Shot |
|------|------|---------|---------|----------|
| **HellaSwag** | Commonsense NLI | 4 | `Rowan/hellaswag` | 5-shot |
| **ARC-Easy** | Science QA | 3-5 | `allenai/ai2_arc` | 5-shot |
| **ARC-Challenge** | Science QA (hard) | 3-5 | `allenai/ai2_arc` | 5-shot |
| **PIQA** | Physical intuition | 2 | `gimmaru/piqa` | 0-shot |
| **WinoGrande** | Coreference | 2 | `allenai/winogrande` | 5-shot |
| **MMLU** | Multi-domain knowledge | 4 | `cais/mmlu` | 5-shot |
| **HaluEval** | Hallucination detection | 2 | `pminervini/HaluEval` | 0-shot |

### Cortex-Specific Benchmarks

| Task | Tests | Method |
|------|-------|--------|
| **Passkey Retrieval** | Long-context memory, attention to details | Generation + substring match at 128/256/512/1024 token contexts |
| **Multi-Hop Memory** | Compositional reasoning, fact chaining | Generation + answer extraction from 3-hop fact chains |
|
| 151 |
+
|
| 152 |
+
### Running Benchmarks
|
| 153 |
+
|
| 154 |
+
```bash
|
| 155 |
+
# Quick test (10 examples per task)
|
| 156 |
+
python -m benchmark.run_benchmark --n 10 --tasks hellaswag piqa
|
| 157 |
+
|
| 158 |
+
# Standard suite (50 examples, default tasks)
|
| 159 |
+
python -m benchmark.run_benchmark --n 50
|
| 160 |
+
|
| 161 |
+
# Full evaluation with all tasks
|
| 162 |
+
python -m benchmark.run_benchmark --n 0 --tasks hellaswag piqa arc-easy arc-challenge winogrande mmlu
|
| 163 |
+
|
| 164 |
+
# Custom model
|
| 165 |
+
python -m benchmark.run_benchmark --model meta-llama/Llama-3.2-1B --n 50
|
| 166 |
+
|
| 167 |
+
# Save JSON results
|
| 168 |
+
python -m benchmark.run_benchmark --n 50 --output results.json
|
| 169 |
+
|
| 170 |
+
# Skip memory benchmarks
|
| 171 |
+
python -m benchmark.run_benchmark --n 50 --no-memory
|
| 172 |
+
|
| 173 |
+
# Custom passkey test
|
| 174 |
+
python -m benchmark.run_benchmark --n 20 --passkey-lengths 128 256 512 1024 --n-passkey 10
|
| 175 |
+
```
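
For readers who want to see how the flags above fit together, the following is a rough sketch of an `argparse` parser that accepts the same options. Only the flag names come from the commands above. The helper name `build_parser` and every default value are assumptions, and the real `benchmark.run_benchmark` module may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags; defaults are assumed."""
    parser = argparse.ArgumentParser(prog="benchmark.run_benchmark")
    parser.add_argument("--model", default="HuggingFaceTB/SmolLM2-135M",
                        help="Hugging Face model id (assumed default)")
    parser.add_argument("--n", type=int, default=50,
                        help="examples per task (assumed default)")
    parser.add_argument("--tasks", nargs="+", default=None,
                        help="task names; None stands for the default task set")
    parser.add_argument("--output", default=None,
                        help="optional path for JSON results")
    parser.add_argument("--no-memory", action="store_true",
                        help="skip the memory benchmarks")
    parser.add_argument("--passkey-lengths", type=int, nargs="+",
                        default=[128, 256, 512, 1024],
                        help="context lengths for the passkey test (assumed default)")
    parser.add_argument("--n-passkey", type=int, default=10,
                        help="passkey examples per length (assumed default)")
    return parser


args = build_parser().parse_args(
    ["--n", "20", "--passkey-lengths", "128", "256", "--n-passkey", "10"]
)
```

Note that `argparse` converts dashes to underscores, so `--passkey-lengths` is read back as `args.passkey_lengths`.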

### Scoring Method

- **Multiple-choice tasks:** Log-likelihood scoring. The harness computes the average log-probability the model assigns to each continuation and picks the highest. This is the standard approach used by lm-evaluation-harness and the Open LLM Leaderboard.
- **Generation tasks:** Greedy decoding followed by a substring match against the expected answer.
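
The two scoring rules above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the harness's code: the function names `pick_choice` and `substring_match` are hypothetical, the per-token log-probabilities are assumed to have been computed by the model already, and the case-insensitive comparison is a choice made for this sketch.

```python
from typing import Sequence


def pick_choice(token_logprobs_per_choice: Sequence[Sequence[float]]) -> int:
    """Return the index of the continuation with the highest mean token log-probability."""
    if not token_logprobs_per_choice:
        raise ValueError("need at least one choice")
    means = []
    for logprobs in token_logprobs_per_choice:
        if not logprobs:
            raise ValueError("each choice needs at least one token")
        # Averaging (rather than summing) avoids penalizing longer continuations.
        means.append(sum(logprobs) / len(logprobs))
    return max(range(len(means)), key=means.__getitem__)


def substring_match(generated: str, expected: str) -> bool:
    """Check whether the expected answer occurs in the generated text (case-insensitive here)."""
    return expected.strip().lower() in generated.lower()


# A one-token choice at -1.0 versus a two-token choice at -0.6 per token:
# the sum favours the first (-1.0 > -1.2), the mean favours the second.
best = pick_choice([[-1.0], [-0.6, -0.6]])
```

The example at the bottom shows why the normalization matters: summed log-probabilities would select the shorter continuation, while the per-token average selects the one the model finds more likely token by token.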

### Example Output (SmolLM2-135M, n=20)

```
======================================================================
BENCHMARK SUMMARY: HuggingFaceTB/SmolLM2-135M
n=20 per task, device=cuda
======================================================================

Task                      Base    Cortex     Delta
--------------------------------------------------
hellaswag               0.3500    0.5000   +0.1500 ↑
piqa                    0.5000    0.5000   +0.0000
arc-easy                0.2500    0.4500   +0.2000 ↑
winogrande              0.6500    0.6500   +0.0000
passkey                 1.0000    0.8889   -0.1111 ↓
multi_hop               0.6250    0.2500   -0.3750 ↓

Cortex overhead: 4,296,134 params (3.19%)
======================================================================
```
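
> **Note:** Cortex modules are untrained at injection (zero-initialized gates). The degradation on generation tasks in this run (passkey -0.11, multi-hop -0.38) is expected, because these tasks require module training to improve. Standard log-likelihood tasks do not drop below the base scores because zero-init gates are nearly transparent. At n=20 each example moves accuracy by 0.05, so the positive deltas on hellaswag and arc-easy may reflect small-sample variation rather than a real gain.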
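
When reading deltas from a run this small, it can help to put a rough error bar on them. The sketch below uses the binomial standard error of an accuracy estimate, applied to the hellaswag row of the example output above. The helper name is hypothetical, and treating the base and Cortex scores as independent samples is a simplifying assumption, since both models are evaluated on the same examples.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy estimated from n examples."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# hellaswag row from the example output: base 0.35, Cortex 0.50, n=20.
base, cortex, n = 0.35, 0.50, 20
se_base = accuracy_standard_error(base, n)
se_cortex = accuracy_standard_error(cortex, n)
# Standard error of the difference, assuming independent estimates.
se_delta = math.sqrt(se_base**2 + se_cortex**2)
```

Here `se_delta` comes out at about 0.15, the same size as the observed +0.15 delta, so a larger `--n` would be needed before drawing conclusions from differences of this magnitude.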

### Programmatic Usage

```python
from benchmark.runner import BenchmarkRunner

runner = BenchmarkRunner(model_name="HuggingFaceTB/SmolLM2-135M")
results = runner.run_comparison(
    tasks=["hellaswag", "piqa", "arc-easy"],
    n=50,
    include_memory=True,
    passkey_lengths=[128, 256, 512],
)
BenchmarkRunner.print_summary(results)
```

## Design Principles

### 1. Zero-Init for Stable Injection