======================================================================
MEMORY ROUTING AGENT - FULL TRAINING PIPELINE
======================================================================
Experiment: memory_routing_v1
Output: training/experiments/memory_routing_v1_20251124_165000
Base model: meta-llama/Llama-3.1-8B
LoRA rank: 32
======================================================================
PHASE 1: SUPERVISED FINE-TUNING
======================================================================
Train: 800, Test: 200
Learning rate: 2.86e-04
Step 0: train_loss=3.4228, test_loss=2.6279, time=3.3s
Step 1: train_loss=2.5284, time=34.7s
Step 2: train_loss=2.0672, time=4.1s
Step 3: train_loss=1.7094, time=4.3s
Step 4: train_loss=1.5843, time=2.5s
Step 5: train_loss=1.4973, time=3.0s
Step 6: train_loss=1.3900, time=4.6s
Step 7: train_loss=1.4226, time=24.7s
Step 8: train_loss=1.3094, time=2.6s
Step 9: train_loss=1.3240, time=3.4s
Step 10: train_loss=1.1783, test_loss=1.1197, time=2.9s
Step 11: train_loss=1.1683, time=3.0s
Step 12: train_loss=1.2817, time=3.1s
Step 13: train_loss=0.9658, time=2.4s
Step 14: train_loss=0.8791, time=34.4s
Step 15: train_loss=0.7782, time=33.0s
Step 16: train_loss=0.7206, time=3.1s
Step 17: train_loss=0.6524, time=2.4s
Step 18: train_loss=0.5603, time=2.9s
Step 19: train_loss=0.5045, time=4.4s
Step 20: train_loss=0.4175, test_loss=0.3288, time=2.7s
Step 21: train_loss=0.3219, time=2.2s
Step 22: train_loss=0.3643, time=2.4s
Step 23: train_loss=0.3799, time=2.1s
Step 24: train_loss=0.3603, time=2.4s
Step 25: train_loss=0.5269, time=1.9s
Step 26: train_loss=0.3044, time=29.7s
Step 27: train_loss=0.2869, time=3.5s
Step 28: train_loss=0.2994, time=4.4s
Step 29: train_loss=0.3266, time=2.2s
Step 30: train_loss=0.3303, test_loss=0.2598, time=2.3s
Step 31: train_loss=0.2958, time=1.8s
Step 32: train_loss=0.3050, time=2.0s
Step 33: train_loss=0.3092, time=33.7s
Step 34: train_loss=0.2802, time=2.1s
Step 35: train_loss=0.3087, time=2.0s
Step 36: train_loss=0.3042, time=2.0s
Step 37: train_loss=0.4495, time=3.2s
Step 38: train_loss=0.2939, time=2.0s
Step 39: train_loss=0.2473, time=2.0s
Step 40: train_loss=0.2092, test_loss=0.2544, time=2.8s
Step 41: train_loss=0.2836, time=2.9s
Step 42: train_loss=0.2363, time=2.0s
Step 43: train_loss=0.2641, time=2.1s
Step 44: train_loss=0.2647, time=2.2s
Step 45: train_loss=0.2634, time=3.5s
Step 46: train_loss=0.2576, time=2.7s
Step 47: train_loss=0.2471, time=2.5s
Step 48: train_loss=0.2778, time=2.7s
Step 49: train_loss=0.2875, time=7.9s
Step 50: train_loss=0.4188, test_loss=0.2334, time=2.2s
Step 51: train_loss=0.2511, time=2.7s
Step 52: train_loss=0.1968, time=28.9s
Step 53: train_loss=0.2182, time=2.8s
Step 54: train_loss=0.2473, time=34.8s
Step 55: train_loss=0.2404, time=2.6s
Step 56: train_loss=0.2247, time=2.5s
Step 57: train_loss=0.2161, time=2.2s
Step 58: train_loss=0.2167, time=1.9s
Step 59: train_loss=0.2116, time=2.1s
Step 60: train_loss=0.2304, test_loss=0.2018, time=3.1s
Step 61: train_loss=0.2512, time=2.8s
Step 62: train_loss=0.2886, time=2.0s
Step 63: train_loss=0.2893, time=1.9s
Step 64: train_loss=0.2319, time=2.0s
Step 65: train_loss=0.1766, time=1.9s
Step 66: train_loss=0.2583, time=2.3s
Step 67: train_loss=0.2068, time=3.1s
Step 68: train_loss=0.2338, time=2.5s
Step 69: train_loss=0.2009, time=2.0s
Step 70: train_loss=0.1942, test_loss=0.1832, time=2.6s
Step 71: train_loss=0.2030, time=2.2s
Step 72: train_loss=0.1983, time=24.0s
Step 73: train_loss=0.2216, time=2.8s
Step 74: train_loss=0.2449, time=2.7s
Step 75: train_loss=0.3014, time=2.8s
Step 76: train_loss=0.2157, time=2.8s
Step 77: train_loss=0.2117, time=16.5s
Step 78: train_loss=0.2102, time=32.4s
Step 79: train_loss=0.2355, time=2.1s
Step 80: train_loss=0.2199, test_loss=0.1973, time=2.3s
Step 81: train_loss=0.2125, time=3.6s
Step 82: train_loss=0.2148, time=2.2s
Step 83: train_loss=0.1887, time=2.5s
Step 84: train_loss=0.1713, time=31.9s
Step 85: train_loss=0.2361, time=2.3s
Step 86: train_loss=0.1958, time=35.1s
Step 87: train_loss=0.2396, time=2.3s
Step 88: train_loss=0.2032, time=32.1s
Step 89: train_loss=0.1682, time=82.7s
Step 90: train_loss=0.1952, test_loss=0.1960, time=2.6s
Step 91: train_loss=0.2146, time=2.3s
Step 92: train_loss=0.1845, time=28.6s
Step 93: train_loss=0.2103, time=3.3s
Step 94: train_loss=0.1943, time=3.3s
Step 95: train_loss=0.1729, time=3.1s
Step 96: train_loss=0.1698, time=2.8s
Step 97: train_loss=0.2020, time=3.2s
Step 98: train_loss=0.1963, time=3.6s
Step 99: train_loss=0.2097, test_loss=0.1150, time=3.1s
Saving final SFT checkpoint...
SFT State checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/weights/sft_final
SFT Sampler checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/sampler_weights/sft_final_sampler

--- Evaluating: SFT Model ---
Evaluated 50/200
Evaluated 100/200
Evaluated 150/200
Evaluated 200/200
Any Match: 87.0%
Exact Match: 39.0%
F1: 69.2%
Mean Reward: 0.772

======================================================================
PHASE 2: REINFORCEMENT LEARNING
======================================================================
Training examples: 800
RL iterations: 15
Batch size: 32, Group size: 8
Loading SFT checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/weights/sft_final

--- Iteration 1/15 ---
Reward: 0.872 ± 0.192, Acc: 100.0%, Format: 100.0%
--- Iteration 2/15 ---
Reward: 0.842 ± 0.235, Acc: 100.0%, Format: 100.0%
--- Iteration 3/15 ---
Reward: 0.823 ± 0.247, Acc: 100.0%, Format: 100.0%
--- Iteration 4/15 ---
Reward: 0.901 ± 0.158, Acc: 100.0%, Format: 100.0%
--- Iteration 5/15 ---
Reward: 0.852 ± 0.214, Acc: 100.0%, Format: 100.0%
--- Iteration 6/15 ---
Reward: 0.843 ± 0.251, Acc: 99.6%, Format: 99.6%
--- Iteration 7/15 ---
Reward: 0.859 ± 0.214, Acc: 100.0%, Format: 100.0%
--- Iteration 8/15 ---
Reward: 0.899 ± 0.159, Acc: 100.0%, Format: 100.0%
--- Iteration 9/15 ---
Reward: 0.870 ± 0.175, Acc: 100.0%, Format: 100.0%
--- Iteration 10/15 ---
Reward: 0.866 ± 0.234, Acc: 99.6%, Format: 99.6%
--- Iteration 11/15 ---
Reward: 0.845 ± 0.238, Acc: 100.0%, Format: 100.0%
--- Iteration 12/15 ---
Reward: 0.908 ± 0.148, Acc: 100.0%, Format: 100.0%
--- Iteration 13/15 ---
Reward: 0.838 ± 0.234, Acc: 100.0%, Format: 100.0%
--- Iteration 14/15 ---
Reward: 0.899 ± 0.143, Acc: 100.0%, Format: 100.0%
--- Iteration 15/15 ---
Reward: 0.895 ± 0.147, Acc: 100.0%, Format: 100.0%
Saving final RL checkpoint...
RL checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/sampler_weights/rl_final

--- Evaluating: RL Model ---
Evaluated 50/200
Evaluated 100/200
Evaluated 150/200
Evaluated 200/200
Any Match: 90.0%
Exact Match: 42.5%
F1: 72.3%
Mean Reward: 0.792

======================================================================
TRAINING COMPLETE
======================================================================
Results saved to: training/experiments/memory_routing_v1_20251124_165000/results.json
Final Model: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/sampler_weights/rl_final
Comparison:
  SFT - F1: 69.2%, Any Match: 87.0%
  RL  - F1: 72.3%, Any Match: 90.0%
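The per-step SFT lines above follow a fixed `Step N: train_loss=X[, test_loss=Y], time=Zs` pattern, so the loss curve can be recovered from the raw log text. Below is a minimal sketch (not part of the pipeline; the `parse_steps` helper and the regex are illustrative assumptions) using Python's `re` module:

```python
import re

# Matches "Step N: train_loss=X, time=Zs" with an optional test_loss field,
# mirroring the log format shown above (an assumption, not the pipeline's code).
STEP_RE = re.compile(
    r"Step (\d+): train_loss=([\d.]+)"
    r"(?:, test_loss=([\d.]+))?, time=([\d.]+)s"
)

def parse_steps(log_text):
    """Return (step, train_loss, test_loss_or_None, seconds) tuples."""
    rows = []
    for m in STEP_RE.finditer(log_text):
        step, train, test, secs = m.groups()
        rows.append((int(step), float(train),
                     float(test) if test else None, float(secs)))
    return rows

sample = ("Step 0: train_loss=3.4228, test_loss=2.6279, time=3.3s "
          "Step 1: train_loss=2.5284, time=34.7s")
print(parse_steps(sample))
```

Steps where evaluation ran (every tenth step here) carry a non-`None` third field, so the test-loss curve can be filtered out of the same rows.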