| # Batch Validation — 5 Cascade-Only Instances |
|
|
| **Job:** [6a04d3a33308d79117b8f24c](https://huggingface.co/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c) |
| **Status:** RUNNING |
| **Started:** 2026-05-13 |
|
|
| ## Instances |
|
|
| These are the cascade-only instances from the corrected report: each was solved by T1 or T2 but not by either T4 model. |
|
|
| 1. `django__django-11815` |
| 2. `django__django-13089` |
| 3. `django__django-13807` |
| 4. `django__django-14315` (single-instance test: T2 produced a valid 3,997-character diff) |
| 5. `matplotlib__matplotlib-25224` |
|
|
| ## Approach |
|
|
| - **T1:** Llama-3.1-8B-Instruct (25-turn max) |
| - **T2:** Llama-3.3-70B-Instruct (20-turn max) |
| - **Protocol:** The model edits files with `<edit path='file'>content</edit>` blocks; the patch is then produced externally with `git diff` |
| - **Validation:** `git apply --check` on the generated patch |
| - **Cost:** $0 inference (HF free) + ~$2-3 compute |
|
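The validation step above can be sketched as a small shell helper. This is a minimal illustration, not the job's actual harness: the function name `validate_patch` and the paths in the usage comment are hypothetical. It assumes the check runs against a clean checkout at the instance's base commit, since a patch produced by `git diff` will not apply to a working tree that already contains the edits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if a non-empty patch applies cleanly.
#   $1 = path to a clean repository checkout
#   $2 = ABSOLUTE path to the patch file (git -C changes directory first,
#        so a relative path would be resolved inside the repository)
validate_patch() {
  local repo_dir="$1" patch_file="$2"

  # Treat an empty or missing patch as a failure rather than a trivial pass.
  [ -s "$patch_file" ] || return 1

  # --check verifies the patch applies without modifying the working tree.
  git -C "$repo_dir" apply --check "$patch_file" 2>/dev/null
}

# Example usage (paths are placeholders):
#   if validate_patch /work/repos/django /work/patches/django__django-14315.diff; then
#     echo "valid patch"
#   else
#     echo "invalid patch"
#   fi
```

A zero exit status here only means the patch is well-formed and applies to the checkout; it says nothing about whether the instance's tests pass.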
|
| ## Expected Result |
|
|
| If the cascade thesis holds, we should see: |
| - 3-5 instances producing valid patches |
| - Mix of T1 and T2 solves |
| - Evidence that cheap models can solve instances that frontier models miss |
|
|
| ## Check Logs |
|
|
| ```bash |
| curl -s "https://huggingface.co/api/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c/logs" |
| ``` |
|
|