# ============================================================================ # td_start.td — The TD Self-Improvement Loop # ============================================================================ # # This is THE script. Run install.sh first, then: # python -m td_lang run td_start.td # # What it does: # 1. Loads the base model (Qwen3-VL-8B-Instruct) # 2. Merges in DeepSeek-R1 reasoning (safest merge first) # 3. Heals any damage from the merge # 4. Diagnoses weaknesses (mega diagnose: self-report + domain tests + speed) # 5. Generates synthetic training data for weak spots # 6. Trains with GRPO on the weak spots # 7. Runs the arena (real RL with memory + curiosity + anti-lying) # 8. Evaluates the result # 9. Saves a snapshot (so we can rollback if something goes wrong) # 10. Commits the improved model # # After this works, Phase 2 is: add mimo, llama, falcon merges and # run the self-improvement loop in a repeat block. # # Estimated time: 2-4 hours on dual RTX 4090 # ============================================================================ # --- Safety nets --- gate { must_pass = [canary, perplexity, thinking_mode] } budget { max_gpu_hours = 24.0 max_cost = 100.0 } # --- Reward rules (what counts as "good" during GRPO training) --- reward_contract { verifiers = [code_compiles, math_correct, no_hallucination] min_reward = 0.3 } # --- Step 1: Load the base model --- load "Qwen/Qwen3-VL-8B-Instruct" as base # --- Step 2: Merge in DeepSeek-R1 reasoning --- # This is the safest merge (same architecture, 99.9% vocab overlap) # Gives us deep reasoning abilities from R1 merge "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B" into base using transport strength 0.5 # --- Step 2b: Merge in MiMo-7B reasoning --- # Medium risk: same layer count (36) and hidden_dim (4096) # MTP heads get dropped automatically (no Qwen3 equivalent) # Embeddings skipped (28% vocab overlap too low) merge "XiaomiMiMo/MiMo-7B-RL" into base using transport strength 0.15 # --- Step 3: Heal any merge damage --- # QLoRA fine-tune to smooth out rough edges from the merge heal base lora_r 32 epochs 2 # --- Step 4: Take a snapshot BEFORE training (safety net) --- snapshot base # --- Step 5: Mega diagnose — find weaknesses --- # Part 1: Ask the model "what are you bad at?" # Part 2: Test it on 12 questions (math, code, logic, factual) # Part 3: Measure per-layer speed diagnose base -> diagnose_results.json # --- Step 6: Generate synthetic training data for weak spots --- synth base from base filter cherry_llm -> synth_data.jsonl # --- Step 7: Train on weak spots with GRPO --- # The reward_contract verifiers are used automatically train base on "synth_data.jsonl" using grpo steps 100 lr 0.0001 # --- Step 8: STaR — learn from own correct reasoning --- # Generate multiple solutions, keep correct chains, train on them star base on "gsm8k" rounds 2 samples 8 # --- Step 9: Arena — real RL training --- # The model enters challenges, gets immediate reward/punishment, # remembers what worked, gets curiosity bonus for trying new things, # lying gets punished double arena base on "gsm8k" rounds 3 episodes 30 steps 32 curiosity 0.3 # --- Step 10: Evaluate the final result --- eval base -> final_eval.json # --- Step 11: Save the improved model --- snapshot base commit base # --- Done! --- # The model is now (hopefully) smarter than when we started. # Check final_eval.json to see how much it improved. # Check diagnose_results.json to see what was weak. # If results are good, next step: add more merges and run in a loop.