# Batch Validation — 5 Cascade-Only Instances
**Job:** [6a04d3a33308d79117b8f24c](https://huggingface.co/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c)
**Status:** RUNNING
**Started:** 2026-05-13
## Instances
The corrected report lists these as cascade-only instances, meaning T1 or T2 solved them but neither T4 model did:
1. `django__django-11815`
2. `django__django-13089`
3. `django__django-13807`
4. `django__django-14315` (single-instance test: T2 produced a valid 3,997-character diff)
5. `matplotlib__matplotlib-25224`
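The "cascade-only" selection above is a set difference. A minimal sketch, assuming each tier's solved instances are available as a set of instance IDs; the function name and the placeholder IDs are hypothetical and are not taken from the report:

```python
def cascade_only(t1: set, t2: set, t4_a: set, t4_b: set) -> set:
    """Instances solved by T1 or T2 but by neither T4 model."""
    return (t1 | t2) - (t4_a | t4_b)


if __name__ == "__main__":
    # Placeholder IDs for illustration only; these are not real results.
    t1 = {"inst-a", "inst-b"}
    t2 = {"inst-b", "inst-c"}
    t4_a = {"inst-a"}
    t4_b = {"inst-d"}
    print(sorted(cascade_only(t1, t2, t4_a, t4_b)))
```

An instance solved by any T4 model drops out, even if a cheaper tier also solved it.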
## Approach
- **T1:** Llama-3.1-8B-Instruct (max 25 turns)
- **T2:** Llama-3.3-70B-Instruct (max 20 turns)
- **Protocol:** The model edits files with `<edit path='file'>content</edit>` blocks; the patch is then generated externally with `git diff`
- **Validation:** `git apply --check` on generated patch
- **Cost:** $0 inference (HF free) + ~$2-3 compute
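The protocol and validation steps above can be sketched as follows. This is a rough illustration, not the job's actual harness. It assumes each edit block carries the full new file content, that the repository checkout is a clean git working copy, and that the edits are not staged. All function names are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# Tag shape taken from the protocol line: <edit path='file'>content</edit>
EDIT_RE = re.compile(r"<edit path='([^']+)'>(.*?)</edit>", re.DOTALL)


def parse_edits(model_output: str) -> list:
    """Return (path, content) pairs for every edit block in the model output."""
    return [(m.group(1), m.group(2)) for m in EDIT_RE.finditer(model_output)]


def write_edits(repo: Path, edits: list) -> None:
    """Write each edit into the checkout (whole-file replacement is an assumption)."""
    for rel_path, content in edits:
        target = repo / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)


def diff_and_check(repo: Path) -> tuple:
    """Build the patch with `git diff`, then validate it with `git apply --check`."""
    patch = subprocess.run(
        ["git", "diff"], cwd=repo, capture_output=True, text=True, check=True
    ).stdout
    if not patch.strip():
        return patch, False  # no tracked file changed, so there is nothing to validate
    # The working tree already contains the edits, so check against the index
    # (still at the base commit) rather than the working tree.
    check = subprocess.run(
        ["git", "apply", "--check", "--cached"],
        cwd=repo, input=patch, capture_output=True, text=True,
    )
    return patch, check.returncode == 0
```

Because `git diff` only reports tracked files, an edit that creates a new file would not appear in the patch under this sketch.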
## Expected Result
If the cascade thesis holds, we should see:
- 3-5 instances producing valid patches
- Mix of T1 and T2 solves
- Evidence that cheap models can solve instances that frontier models miss
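Once the job finishes, the first two expectations can be checked mechanically. A minimal sketch, assuming results are collected as a mapping from instance ID to the tier whose patch passed validation (`None` if neither did); this record shape is hypothetical, not the job's output format:

```python
from collections import Counter
from typing import Dict, Optional


def summarize(results: Dict[str, Optional[str]]) -> dict:
    """Compare validated patches against the expected result."""
    solved = {inst: tier for inst, tier in results.items() if tier is not None}
    by_tier = Counter(solved.values())
    return {
        "valid_patches": len(solved),
        "by_tier": dict(by_tier),
        # Expectation from this note: 3-5 instances produce valid patches.
        "meets_count_target": 3 <= len(solved) <= 5,
        # Expectation from this note: a mix of T1 and T2 solves.
        "mixed_tiers": by_tier["T1"] > 0 and by_tier["T2"] > 0,
    }
```

The third expectation is a claim about the comparison with T4, so it depends on the corrected report and cannot be derived from this job's results alone.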
## Check Logs
```bash
curl -s "https://huggingface.co/api/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c/logs"
```