# Batch Validation — 5 Cascade-Only Instances
**Job:** [6a04d3a33308d79117b8f24c](https://huggingface.co/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c)
**Status:** RUNNING
**Started:** 2026-05-13
## Instances
These are the cascade-only instances from the corrected report: each was solved by T1 or T2 but not by either T4 model.
1. `django__django-11815`
2. `django__django-13089`
3. `django__django-13807`
4. `django__django-14315` (in a single-instance test, T2 produced a valid 3,997-character diff)
5. `matplotlib__matplotlib-25224`
## Approach
- **T1:** Llama-3.1-8B-Instruct (25-turn max)
- **T2:** Llama-3.3-70B-Instruct (20-turn max)
- **Protocol:** The model edits files by emitting `<edit path='file'>content</edit>` blocks; the patch is then generated externally with `git diff`
- **Validation:** `git apply --check` on the generated patch
- **Cost:** $0 inference (HF free) + ~$2-3 compute
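The approach above can be sketched roughly as follows. This is a minimal illustration, not the job's actual code: the function names (`parse_edits`, `patch_applies`, `run_cascade`) and the `run_tier` callback are hypothetical, and the regex assumes edit blocks use a single-quoted `path` attribute exactly as shown in the protocol line.

```python
import re
import subprocess
from typing import Callable, List, Optional, Tuple

# Tier names, models, and turn limits are taken from the list above.
TIERS = [
    ("T1", "Llama-3.1-8B-Instruct", 25),
    ("T2", "Llama-3.3-70B-Instruct", 20),
]

# Assumption: edits look like <edit path='file'>content</edit>.
EDIT_RE = re.compile(r"<edit path='([^']+)'>(.*?)</edit>", re.DOTALL)


def parse_edits(reply: str) -> List[Tuple[str, str]]:
    """Extract (path, content) pairs from a model reply."""
    return EDIT_RE.findall(reply)


def patch_applies(repo_dir: str, patch_text: str) -> bool:
    """Return True if `git apply --check` accepts the patch in repo_dir."""
    if not patch_text.strip():
        return False  # an empty diff is never a valid patch
    result = subprocess.run(
        ["git", "apply", "--check"],  # with no file argument, reads the patch from stdin
        cwd=repo_dir,
        input=patch_text,
        text=True,
        capture_output=True,
    )
    return result.returncode == 0


def run_cascade(
    instance_id: str,
    run_tier: Callable[[str, str, int], str],
    validate: Callable[[str], bool],
) -> Optional[str]:
    """Try each tier in order; return the first tier whose patch validates.

    `run_tier` is a hypothetical callback that runs the agent loop for one
    model and returns the resulting `git diff` text.
    """
    for tier, model, max_turns in TIERS:
        patch = run_tier(instance_id, model, max_turns)
        if validate(patch):
            return tier
    return None
```

Passing the validator in as a callback keeps the escalation logic separate from git, so the cascade order can be exercised without a checked-out repository.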
## Expected Result
If the cascade thesis holds, we should see:
- 3-5 instances producing valid patches
- Mix of T1 and T2 solves
- Evidence that cheap models can solve instances that frontier models miss
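One way to check these expectations once the job finishes is to tally which tier, if any, produced a valid patch for each instance. The sketch below is illustrative only: `summarize` and its input mapping are hypothetical, and the 3-to-5 threshold simply mirrors the first bullet above.

```python
from collections import Counter
from typing import Dict, Optional


def summarize(solved_by: Dict[str, Optional[str]]) -> dict:
    """Summarize cascade outcomes.

    `solved_by` maps an instance ID to the tier that produced a valid
    patch ("T1" or "T2"), or None if no tier did.
    """
    counts = Counter(tier for tier in solved_by.values() if tier is not None)
    valid = sum(counts.values())
    return {
        "valid_patches": valid,
        "by_tier": dict(counts),
        "meets_target": 3 <= valid <= 5,  # the "3-5 instances" expectation
        "mixed_tiers": counts["T1"] > 0 and counts["T2"] > 0,
    }
```

The input would be filled in from the job logs; no outcomes are assumed here.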
## Check Logs
```bash
curl -s "https://huggingface.co/api/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c/logs"
```