
Batch Validation — 5 Cascade-Only Instances

  • Job: 6a04d3a33308d79117b8f24c
  • Status: RUNNING
  • Started: 2026-05-13

Instances

These are the cascade-only instances from the corrected report: each was solved by T1 or T2 but not by either T4 model.

  1. django__django-11815
  2. django__django-13089
  3. django__django-13807
  4. django__django-14315 (single-instance test: T2 produced a valid 3,997-character diff)
  5. matplotlib__matplotlib-25224

Approach

  • T1: Llama-3.1-8B-Instruct (max 25 turns)
  • T2: Llama-3.3-70B-Instruct (max 20 turns)
  • Protocol: File editing. The model emits <edit path='file'>content</edit> blocks, and the patch is generated externally with git diff
  • Validation: git apply --check on the generated patch
  • Cost: $0 for inference (free on HF) plus about $2-3 for compute
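
The approach above can be sketched roughly as follows. This is a minimal illustration, not the job's actual code: the function names (`parse_edits`, `patch_applies`, `run_cascade`) and the shape of the tier callables are assumptions, and the regex assumes each edit block carries the full replacement content for one file.

```python
import re
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

# Matches <edit path='some/file.py'>content</edit>, with single or double quotes.
EDIT_RE = re.compile(
    r"""<edit\s+path=(['"])(?P<path>.+?)\1\s*>(?P<content>.*?)</edit>""",
    re.DOTALL,
)


def parse_edits(model_output: str) -> List[Tuple[str, str]]:
    """Extract (path, content) pairs from a model response."""
    return [(m.group("path"), m.group("content")) for m in EDIT_RE.finditer(model_output)]


def patch_applies(repo_dir: str, patch_text: str) -> bool:
    """Return True if `git apply --check` accepts the patch inside repo_dir."""
    result = subprocess.run(
        ["git", "apply", "--check", "-"],  # "-" reads the patch from stdin
        cwd=repo_dir,
        input=patch_text,
        text=True,
        capture_output=True,
    )
    return result.returncode == 0


def run_cascade(
    tiers: Sequence[Tuple[str, Callable[[], str]]],
    is_valid: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Try tiers cheapest-first; return (tier_name, patch) for the first valid patch."""
    for name, solve in tiers:
        patch = solve()  # hypothetical: runs the agent loop and returns `git diff` output
        if patch and is_valid(patch):
            return name, patch
    return None
```

Here each tier callable would wrap one model's agent loop (T1 first, then T2), and `is_valid` would be `patch_applies` bound to a clean checkout of the instance's repository.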

Expected Result

If the cascade thesis holds, we should see:

  • 3-5 instances producing valid patches
  • Mix of T1 and T2 solves
  • Evidence that cheap models can solve instances that frontier models miss
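
One way to check the first two expectations once the job finishes is a small tally over per-instance outcomes. This is only a sketch: the `summarize` helper and its input format (instance ID mapped to the tier that produced a valid patch, or `None`) are assumptions, not part of the job's output.

```python
from collections import Counter
from typing import Mapping, Optional


def summarize(solved_by: Mapping[str, Optional[str]]) -> dict:
    """Tally outcomes; solved_by maps instance ID -> "T1", "T2", or None if unsolved."""
    tiers = Counter(t for t in solved_by.values() if t is not None)
    valid = sum(tiers.values())
    return {
        "valid_patches": valid,
        "by_tier": dict(tiers),
        "meets_count": valid >= 3,  # at least 3 of the instances produced valid patches
        "mixed_tiers": tiers["T1"] > 0 and tiers["T2"] > 0,  # both tiers contributed
    }
```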

Check Logs

curl -s "https://huggingface.co/api/jobs/narcolepticchicken/6a04d3a33308d79117b8f24c/logs"
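
The same request can be made from Python with the standard library. This sketch mirrors the curl command above and assumes the endpoint returns text without authentication; the helper names `logs_url` and `fetch_logs` are illustrative.

```python
import urllib.parse
import urllib.request


def logs_url(namespace: str, job_id: str) -> str:
    """Build the logs URL used in the curl command above."""
    ns = urllib.parse.quote(namespace, safe="")
    jid = urllib.parse.quote(job_id, safe="")
    return f"https://huggingface.co/api/jobs/{ns}/{jid}/logs"


def fetch_logs(namespace: str, job_id: str, timeout: float = 30.0) -> str:
    """Fetch the job logs as text; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(logs_url(namespace, job_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_logs("narcolepticchicken", "6a04d3a33308d79117b8f24c")` would retrieve the logs for this job. If the endpoint requires a token, that is not handled here.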