agent-cost-optimizer / BATCH_VALIDATION.md

	# Batch Validation — 5 Cascade-Only Instances

	- Job: [6a04d3a33308d79117b8f24c](https://huggingface.co/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c)
	- Status: RUNNING
	- Started: 2026-05-13

	## Instances

	These are the cascade-only instances from the corrected report: instances solved by T1 or T2 but not by either T4 model.

	1. `django__django-11815`
	2. `django__django-13089`
	3. `django__django-13807`
	4. `django__django-14315` (in a single-instance test, T2 produced a valid 3,997-character diff)
	5. `matplotlib__matplotlib-25224`

	## Approach

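	- T1: Llama-3.1-8B-Instruct (max 25 turns)
	- T2: Llama-3.3-70B-Instruct (max 20 turns)
	- Protocol: the model edits files by emitting `<edit path='file'>content</edit>` blocks; the patch is then produced outside the model with `git diff`
	- Validation: `git apply --check` on the generated patch (confirms that the patch applies cleanly; it does not run tests)
	- Cost: $0 for inference (free on HF) plus roughly $2-3 of compute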
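The protocol and validation steps above can be sketched roughly as follows. This is a minimal illustration, not the job's actual code: the function names (`parse_edits`, `apply_edits`, `diff_and_check`) are hypothetical, and it assumes each edit block carries the full new content of the file and that either quote style may wrap the path.

```python
import re
import subprocess
from pathlib import Path

# Matches <edit path='...'>...</edit>. Accepting double quotes too is an assumption.
EDIT_RE = re.compile(
    r"<edit path=(['\"])(?P<path>.+?)\1>(?P<content>.*?)</edit>", re.DOTALL
)


def parse_edits(model_output: str) -> list[tuple[str, str]]:
    """Return (path, content) pairs for every edit block in the model output."""
    return [(m.group("path"), m.group("content")) for m in EDIT_RE.finditer(model_output)]


def apply_edits(repo: Path, edits: list[tuple[str, str]]) -> None:
    """Overwrite each target file with its edit content (assumed whole-file)."""
    for rel_path, content in edits:
        target = repo / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)


def diff_and_check(repo: Path) -> tuple[str, bool]:
    """Capture `git diff`, revert the tree, then run `git apply --check` on the patch."""
    # Note: plain `git diff` only covers tracked files; new untracked files are missed.
    patch = subprocess.run(
        ["git", "diff"], cwd=repo, capture_output=True, text=True, check=True
    ).stdout
    if not patch.strip():
        return patch, False  # an empty diff is not a usable patch
    # Restore tracked files so the check runs against the unmodified checkout.
    subprocess.run(["git", "checkout", "--", "."], cwd=repo, check=True)
    check = subprocess.run(
        ["git", "apply", "--check", "-"], cwd=repo, input=patch,
        capture_output=True, text=True,
    )
    return patch, check.returncode == 0
```

Reverting the working tree before the check matters in this sketch: running `git apply --check` on a tree that already contains the edits would reject the patch as not applicable.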

	## Expected Result

	If the cascade thesis holds, we should see:
	- Valid patches (passing `git apply --check`) for 3-5 of the 5 instances
	- A mix of T1 and T2 solves
	- Evidence that cheaper models can solve instances that frontier models miss
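One way to check a finished batch against these expectations is sketched below. It assumes the cascade tries T1 first and escalates to T2 only when T1 yields no valid patch. That ordering, the `Solver` signature, and the names `run_cascade` and `summarize` are illustrative assumptions, not taken from the job's code.

```python
from collections import Counter
from typing import Callable, Optional

# A solver takes (instance_id, max_turns) and returns a patch, or None on failure.
Solver = Callable[[str, int], Optional[str]]

# Tier names and turn limits as listed in the Approach section.
TIERS = [("T1", 25), ("T2", 20)]


def run_cascade(
    instance_id: str,
    solvers: dict[str, Solver],
    is_valid: Callable[[str], bool],
) -> Optional[str]:
    """Try tiers cheapest-first; return the first tier whose patch validates."""
    for tier, max_turns in TIERS:
        patch = solvers[tier](instance_id, max_turns)
        if patch and is_valid(patch):
            return tier
    return None  # no tier produced a valid patch


def summarize(results: dict[str, Optional[str]]) -> dict:
    """Compare per-instance outcomes with the expected result."""
    by_tier = Counter(tier for tier in results.values() if tier is not None)
    n_valid = sum(by_tier.values())
    return {
        "valid": n_valid,
        "by_tier": dict(by_tier),
        "meets_count": n_valid >= 3,  # "3-5 instances producing valid patches"
        "mixed": by_tier["T1"] > 0 and by_tier["T2"] > 0,  # mix of T1 and T2
    }
```

Here `is_valid` would wrap the `git apply --check` step, and `results` maps each instance ID to the tier that solved it, or to `None` if neither tier did.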

	## Check Logs

	```bash
	# Fetch the logs for the batch validation job
	curl -s "https://huggingface.co/api/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c/logs"
	```