| # demo_loop.td — Self-improvement loop (Phase 2) | |
| # The core TD cycle: diagnose -> synth -> train -> evaluate -> commit | |
| gate { | |
| must_pass = [canary, perplexity, thinking_mode] | |
| } | |
| budget { | |
| max_gpu_hours = 10 | |
| max_cost = 40.00 | |
| } | |
| load "Qwen/Qwen3-VL-8B-Instruct" as base | |
| # Step 1: Ask the model what it's bad at | |
| diagnose base -> weaknesses.json | |
| # Step 2: Generate training data targeting those weaknesses | |
| synth base from web_curated filter cherry_llm -> synth_data.jsonl | |
| # Step 3: Train with GRPO (64 steps = sweet spot from test_15) | |
| train base on "synth_data.jsonl" using grpo steps 64 | |
| # Step 4: Check if it actually got better | |
| eval base -> post_training_eval.json | |
| # Step 5: Only save if gates pass | |
| commit base | |