# Batch Validation — 5 Cascade-Only Instances

**Job:** [6a04d3a33308d79117b8f24c](https://huggingface.co/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c)
**Status:** RUNNING
**Started:** 2026-05-13

## Instances

These are the cascade-only instances from the corrected report: each was solved by T1 or T2 but NOT by either T4 model.

1. `django__django-11815` 
2. `django__django-13089`
3. `django__django-13807`
4. `django__django-14315` (single-instance test: T2 produced a valid 3,997-character diff)
5. `matplotlib__matplotlib-25224`

## Approach

- **T1:** Llama-3.1-8B-Instruct (25-turn max)
- **T2:** Llama-3.3-70B-Instruct (20-turn max)
- **Protocol:** The model edits files with `<edit path='file'>content</edit>` blocks; the patch is produced externally with `git diff`
- **Validation:** `git apply --check` on generated patch
- **Cost:** $0 inference (HF free) + ~$2-3 compute
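
The approach above can be sketched roughly as follows. This is a minimal illustration, not the job's actual code: the function names (`parse_edits`, `patch_applies`, `run_cascade`), the tolerance for double-quoted paths, and the shape of the per-tier `attempt` callables are all assumptions. Only the `<edit path='file'>content</edit>` format, the T1-then-T2 ordering, and the `git apply --check` validation come from the list above.

```python
import re
import subprocess
from typing import Callable, Optional

# Matches <edit path='file'>content</edit>. Accepting double quotes as well
# is an assumption; the protocol above only shows single quotes.
EDIT_RE = re.compile(r"<edit path=(['\"])(.+?)\1>(.*?)</edit>", re.DOTALL)


def parse_edits(reply: str) -> list[tuple[str, str]]:
    """Return (path, new_content) pairs for every edit block in a model reply."""
    return [(m.group(2), m.group(3)) for m in EDIT_RE.finditer(reply)]


def patch_applies(repo_dir: str, patch_path: str) -> bool:
    """True if `git apply --check` accepts the patch inside repo_dir.

    patch_path should be absolute, since git runs with cwd=repo_dir.
    """
    result = subprocess.run(
        ["git", "apply", "--check", patch_path],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0


def run_cascade(
    instance_id: str,
    tiers: list[tuple[str, Callable[[str], Optional[str]]]],
    is_valid: Callable[[str], bool],
) -> Optional[str]:
    """Try each tier in order; return the name of the first tier whose patch validates.

    Each attempt callable (hypothetical) runs one model on the instance and
    returns the resulting patch text, or None if it produced nothing.
    """
    for name, attempt in tiers:
        patch = attempt(instance_id)
        if patch and is_valid(patch):
            return name
    return None
```

Passing the validator in as a callable keeps the escalation logic independent of git, so the cascade can be exercised without a repository checkout.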

## Expected Result

If the cascade thesis holds, we should see:
- 3-5 instances producing valid patches
- Mix of T1 and T2 solves
- Evidence that cheap models can solve instances that frontier models miss
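
Once the job finishes, the first two expectations could be checked mechanically. The sketch below assumes a hypothetical `solved_by` mapping from instance ID to the tier that produced a valid patch (`"T1"`, `"T2"`, or `None`); neither that structure nor the `summarize` helper comes from the job itself.

```python
from collections import Counter
from typing import Optional


def summarize(solved_by: dict[str, Optional[str]]) -> dict:
    """Compare per-instance outcomes with the expected result above."""
    tiers = Counter(t for t in solved_by.values() if t is not None)
    valid = sum(tiers.values())
    return {
        "valid_patches": valid,
        "by_tier": dict(tiers),
        # Expectation 1: 3-5 instances produce valid patches.
        "meets_count": 3 <= valid <= 5,
        # Expectation 2: both T1 and T2 contribute at least one solve.
        "mixed_tiers": tiers["T1"] > 0 and tiers["T2"] > 0,
    }
```

The third expectation, that cheap models solve instances frontier models miss, follows from how the five instances were selected, so it needs no separate check here.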

## Check Logs

```bash
curl -s "https://huggingface.co/api/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c/logs"
```