theapemachine/cortex

theapemachine committed on Apr 27

Commit 9522390 (verified), 1 parent: 18adc2c

Add benchmark harness documentation to README

Files changed (1)

README.md +91 -0

@@ -126,6 +126,97 @@ surgeon.modules["halluc_gate"].disable()
 surgeon.save_cortex_modules("cortex_weights.pt")
 ```
 surgeon.save_cortex_modules("cortex_weights.pt")
 ```
+## Benchmark Harness
+Cortex includes a benchmark harness for comparing a base LLM against its Cortex-enhanced version. It evaluates both on standard NLP benchmarks and on Cortex-specific capability tests.
+### Standard Benchmarks
+| Task | Type | Choices | Dataset | Few-Shot |
+|------|------|---------|---------|----------|
+| **HellaSwag** | Commonsense NLI | 4 | `Rowan/hellaswag` | 5-shot |
+| **ARC-Easy** | Science QA | 3-5 | `allenai/ai2_arc` | 5-shot |
+| **ARC-Challenge** | Science QA (hard) | 3-5 | `allenai/ai2_arc` | 5-shot |
+| **PIQA** | Physical intuition | 2 | `gimmaru/piqa` | 0-shot |
+| **WinoGrande** | Coreference | 2 | `allenai/winogrande` | 5-shot |
+| **MMLU** | Multi-domain knowledge | 4 | `cais/mmlu` | 5-shot |
+| **HaluEval** | Hallucination detection | 2 | `pminervini/HaluEval` | 0-shot |
+### Cortex-Specific Benchmarks
+| Task | Tests | Method |
+|------|-------|--------|
+| **Passkey Retrieval** | Long-context memory, attention to detail | Generation + substring match at 128-, 256-, 512-, and 1024-token contexts |
+| **Multi-Hop Memory** | Compositional reasoning, fact chaining | Generation + answer extraction from 3-hop fact chains |
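To make the passkey row concrete, here is a minimal sketch of how such a test could be built and scored. It is not the harness's actual code: `build_passkey_prompt`, `passkey_correct`, the filler text, and the prompt wording are all hypothetical, and context length is approximated by word count rather than real tokens.

```python
import random

# Hypothetical filler text; the real harness may use something different.
FILLER = "The grass is green. The sky is blue. The sun is yellow. "


def build_passkey_prompt(passkey: str, approx_words: int, rng: random.Random) -> str:
    """Hide a passkey sentence at a random position inside filler text.

    Assumption: length is measured in whitespace-separated words, which only
    approximates the token counts (128/256/512/1024) used by the benchmark.
    """
    needle = f"The passkey is {passkey}. Remember it."
    words_per_repeat = len(FILLER.split())
    filler_words = (FILLER * (approx_words // words_per_repeat + 1)).split()
    filler_words = filler_words[:approx_words]
    cut = rng.randint(0, len(filler_words))  # insertion point for the needle
    context = " ".join(filler_words[:cut] + [needle] + filler_words[cut:])
    return f"{context}\nWhat is the passkey? The passkey is"


def passkey_correct(generated: str, passkey: str) -> bool:
    """Substring match: the answer counts if the passkey appears anywhere."""
    return passkey in generated


prompt = build_passkey_prompt("48213", 128, random.Random(0))
print(passkey_correct(" 48213.", "48213"))
```

Substring matching is lenient by design: the model gets credit even if it wraps the passkey in extra words.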
+### Running Benchmarks
+```bash
+# Quick test (10 examples per task)
+python -m benchmark.run_benchmark --n 10 --tasks hellaswag piqa
+# Standard suite (50 examples, default tasks)
+python -m benchmark.run_benchmark --n 50
+# Full evaluation (--n 0) on six standard tasks
+python -m benchmark.run_benchmark --n 0 --tasks hellaswag piqa arc-easy arc-challenge winogrande mmlu
+# Custom model
+python -m benchmark.run_benchmark --model meta-llama/Llama-3.2-1B --n 50
+# Save JSON results
+python -m benchmark.run_benchmark --n 50 --output results.json
+# Skip memory benchmarks
+python -m benchmark.run_benchmark --n 50 --no-memory
+# Custom passkey test
+python -m benchmark.run_benchmark --n 20 --passkey-lengths 128 256 512 1024 --n-passkey 10
+```
+### Scoring Method
+- **Multiple-choice tasks:** Log-likelihood scoring. The harness computes the average log-probability the model assigns to each candidate continuation and picks the highest-scoring one. This is the standard approach used by lm-evaluation-harness and the Open LLM Leaderboard.
+- **Generation tasks:** Greedy decoding, then a substring match against the expected answer.
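The multiple-choice rule can be sketched in a few lines. This is an illustration only, assuming "average" means the per-token mean of the continuation's log-probabilities; `mean_logprob` and `pick_choice` are hypothetical names, and the log-probabilities are made-up numbers rather than model output.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized score: mean log-probability over continuation tokens."""
    return sum(token_logprobs) / len(token_logprobs)


def pick_choice(choice_token_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the continuation with the highest mean log-probability."""
    scores = [mean_logprob(lp) for lp in choice_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up per-token log-probabilities for three candidate continuations.
choices = [
    [-1.0, -1.0, -1.0, -1.0],  # mean -1.0, sum -4.0
    [-1.5, -1.5],              # mean -1.5, sum -3.0
    [-3.0],                    # mean -3.0, sum -3.0
]
print(pick_choice(choices))  # 0
```

Averaging matters here: a plain sum would favor the shorter second continuation, while the mean picks the first, so longer answers are not penalized for length alone.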
+### Example Output (SmolLM2-135M, n=20)
+```
+======================================================================
+BENCHMARK SUMMARY: HuggingFaceTB/SmolLM2-135M
+n=20 per task, device=cuda
+======================================================================
+Task                       Base   Cortex    Delta
+--------------------------------------------------
+hellaswag                0.3500   0.5000  +0.1500 ↑
+piqa                     0.5000   0.5000  +0.0000
+arc-easy                 0.2500   0.4500  +0.2000 ↑
+winogrande               0.6500   0.6500  +0.0000
+passkey                  1.0000   0.8889  -0.1111 ↓
+multi_hop                0.6250   0.2500  -0.3750 ↓
+Cortex overhead: 4,296,134 params (3.19%)
+======================================================================
+```
+> **Note:** Cortex modules are untrained at injection (zero-initialized gates). The degradation on generation tasks (passkey, multi-hop) is expected; these tasks require module training to improve. Standard log-likelihood tasks do not degrade because zero-init gates are nearly transparent. With n=20, a single example shifts accuracy by 0.05, so the deltas above carry substantial sampling noise in either direction.
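When reading deltas from a run this small, it helps to estimate the sampling noise. The sketch below uses the textbook binomial standard error of an accuracy estimate; it is not part of the harness, and `accuracy_std_error` is a hypothetical helper. The accuracies are taken from the example output.

```python
import math


def accuracy_std_error(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


# (task, base accuracy, Cortex accuracy) from the n=20 example run.
rows = [("hellaswag", 0.35, 0.50), ("arc-easy", 0.25, 0.45)]
for task, base, cortex in rows:
    se = accuracy_std_error(base, 20)
    print(f"{task}: delta={cortex - base:+.2f}, base std. error~{se:.3f}")
```

At n=20 the standard error of a single accuracy is roughly 0.1, so a delta of 0.15 to 0.20 is suggestive at best; a larger `--n` is needed before drawing conclusions.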
+### Programmatic Usage
+```python
+from benchmark.runner import BenchmarkRunner
+runner = BenchmarkRunner(model_name="HuggingFaceTB/SmolLM2-135M")
+results = runner.run_comparison(
+    tasks=["hellaswag", "piqa", "arc-easy"],
+    n=50,
+    include_memory=True,
+    passkey_lengths=[128, 256, 512],
+)
+BenchmarkRunner.print_summary(results)
+```
 ## Design Principles
 ### 1. Zero-Init for Stable Injection