======================================================================
MEMORY ROUTING AGENT - FULL TRAINING PIPELINE
======================================================================
Experiment: memory_routing_v1
Output: training/experiments/memory_routing_v1_20251124_165000
Base model: meta-llama/Llama-3.1-8B
LoRA rank: 32
======================================================================
PHASE 1: SUPERVISED FINE-TUNING
======================================================================
Train: 800, Test: 200
Learning rate: 2.86e-04
Step 0: train_loss=3.4228, test_loss=2.6279, time=3.3s
Step 1: train_loss=2.5284, time=34.7s
Step 2: train_loss=2.0672, time=4.1s
Step 3: train_loss=1.7094, time=4.3s
Step 4: train_loss=1.5843, time=2.5s
Step 5: train_loss=1.4973, time=3.0s
Step 6: train_loss=1.3900, time=4.6s
Step 7: train_loss=1.4226, time=24.7s
Step 8: train_loss=1.3094, time=2.6s
Step 9: train_loss=1.3240, time=3.4s
Step 10: train_loss=1.1783, test_loss=1.1197, time=2.9s
Step 11: train_loss=1.1683, time=3.0s
Step 12: train_loss=1.2817, time=3.1s
Step 13: train_loss=0.9658, time=2.4s
Step 14: train_loss=0.8791, time=34.4s
Step 15: train_loss=0.7782, time=33.0s
Step 16: train_loss=0.7206, time=3.1s
Step 17: train_loss=0.6524, time=2.4s
Step 18: train_loss=0.5603, time=2.9s
Step 19: train_loss=0.5045, time=4.4s
Step 20: train_loss=0.4175, test_loss=0.3288, time=2.7s
Step 21: train_loss=0.3219, time=2.2s
Step 22: train_loss=0.3643, time=2.4s
Step 23: train_loss=0.3799, time=2.1s
Step 24: train_loss=0.3603, time=2.4s
Step 25: train_loss=0.5269, time=1.9s
Step 26: train_loss=0.3044, time=29.7s
Step 27: train_loss=0.2869, time=3.5s
Step 28: train_loss=0.2994, time=4.4s
Step 29: train_loss=0.3266, time=2.2s
Step 30: train_loss=0.3303, test_loss=0.2598, time=2.3s
Step 31: train_loss=0.2958, time=1.8s
Step 32: train_loss=0.3050, time=2.0s
Step 33: train_loss=0.3092, time=33.7s
Step 34: train_loss=0.2802, time=2.1s
Step 35: train_loss=0.3087, time=2.0s
Step 36: train_loss=0.3042, time=2.0s
Step 37: train_loss=0.4495, time=3.2s
Step 38: train_loss=0.2939, time=2.0s
Step 39: train_loss=0.2473, time=2.0s
Step 40: train_loss=0.2092, test_loss=0.2544, time=2.8s
Step 41: train_loss=0.2836, time=2.9s
Step 42: train_loss=0.2363, time=2.0s
Step 43: train_loss=0.2641, time=2.1s
Step 44: train_loss=0.2647, time=2.2s
Step 45: train_loss=0.2634, time=3.5s
Step 46: train_loss=0.2576, time=2.7s
Step 47: train_loss=0.2471, time=2.5s
Step 48: train_loss=0.2778, time=2.7s
Step 49: train_loss=0.2875, time=7.9s
Step 50: train_loss=0.4188, test_loss=0.2334, time=2.2s
Step 51: train_loss=0.2511, time=2.7s
Step 52: train_loss=0.1968, time=28.9s
Step 53: train_loss=0.2182, time=2.8s
Step 54: train_loss=0.2473, time=34.8s
Step 55: train_loss=0.2404, time=2.6s
Step 56: train_loss=0.2247, time=2.5s
Step 57: train_loss=0.2161, time=2.2s
Step 58: train_loss=0.2167, time=1.9s
Step 59: train_loss=0.2116, time=2.1s
Step 60: train_loss=0.2304, test_loss=0.2018, time=3.1s
Step 61: train_loss=0.2512, time=2.8s
Step 62: train_loss=0.2886, time=2.0s
Step 63: train_loss=0.2893, time=1.9s
Step 64: train_loss=0.2319, time=2.0s
Step 65: train_loss=0.1766, time=1.9s
Step 66: train_loss=0.2583, time=2.3s
Step 67: train_loss=0.2068, time=3.1s
Step 68: train_loss=0.2338, time=2.5s
Step 69: train_loss=0.2009, time=2.0s
Step 70: train_loss=0.1942, test_loss=0.1832, time=2.6s
Step 71: train_loss=0.2030, time=2.2s
Step 72: train_loss=0.1983, time=24.0s
Step 73: train_loss=0.2216, time=2.8s
Step 74: train_loss=0.2449, time=2.7s
Step 75: train_loss=0.3014, time=2.8s
Step 76: train_loss=0.2157, time=2.8s
Step 77: train_loss=0.2117, time=16.5s
Step 78: train_loss=0.2102, time=32.4s
Step 79: train_loss=0.2355, time=2.1s
Step 80: train_loss=0.2199, test_loss=0.1973, time=2.3s
Step 81: train_loss=0.2125, time=3.6s
Step 82: train_loss=0.2148, time=2.2s
Step 83: train_loss=0.1887, time=2.5s
Step 84: train_loss=0.1713, time=31.9s
Step 85: train_loss=0.2361, time=2.3s
Step 86: train_loss=0.1958, time=35.1s
Step 87: train_loss=0.2396, time=2.3s
Step 88: train_loss=0.2032, time=32.1s
Step 89: train_loss=0.1682, time=82.7s
Step 90: train_loss=0.1952, test_loss=0.1960, time=2.6s
Step 91: train_loss=0.2146, time=2.3s
Step 92: train_loss=0.1845, time=28.6s
Step 93: train_loss=0.2103, time=3.3s
Step 94: train_loss=0.1943, time=3.3s
Step 95: train_loss=0.1729, time=3.1s
Step 96: train_loss=0.1698, time=2.8s
Step 97: train_loss=0.2020, time=3.2s
Step 98: train_loss=0.1963, time=3.6s
Step 99: train_loss=0.2097, test_loss=0.1150, time=3.1s
Saving final SFT checkpoint...
SFT State checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/weights/sft_final
SFT Sampler checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/sampler_weights/sft_final_sampler

--- Evaluating: SFT Model ---
Evaluated 50/200
Evaluated 100/200
Evaluated 150/200
Evaluated 200/200
Any Match: 87.0%
Exact Match: 39.0%
F1: 69.2%
Mean Reward: 0.772

======================================================================
PHASE 2: REINFORCEMENT LEARNING
======================================================================
Training examples: 800
RL iterations: 15
Batch size: 32, Group size: 8
Loading SFT checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/weights/sft_final

--- Iteration 1/15 ---
Reward: 0.872 ± 0.192, Acc: 100.0%, Format: 100.0%
--- Iteration 2/15 ---
Reward: 0.842 ± 0.235, Acc: 100.0%, Format: 100.0%
--- Iteration 3/15 ---
Reward: 0.823 ± 0.247, Acc: 100.0%, Format: 100.0%
--- Iteration 4/15 ---
Reward: 0.901 ± 0.158, Acc: 100.0%, Format: 100.0%
--- Iteration 5/15 ---
Reward: 0.852 ± 0.214, Acc: 100.0%, Format: 100.0%
--- Iteration 6/15 ---
Reward: 0.843 ± 0.251, Acc: 99.6%, Format: 99.6%
--- Iteration 7/15 ---
Reward: 0.859 ± 0.214, Acc: 100.0%, Format: 100.0%
--- Iteration 8/15 ---
Reward: 0.899 ± 0.159, Acc: 100.0%, Format: 100.0%
--- Iteration 9/15 ---
Reward: 0.870 ± 0.175, Acc: 100.0%, Format: 100.0%
--- Iteration 10/15 ---
Reward: 0.866 ± 0.234, Acc: 99.6%, Format: 99.6%
--- Iteration 11/15 ---
Reward: 0.845 ± 0.238, Acc: 100.0%, Format: 100.0%
--- Iteration 12/15 ---
Reward: 0.908 ± 0.148, Acc: 100.0%, Format: 100.0%
--- Iteration 13/15 ---
Reward: 0.838 ± 0.234, Acc: 100.0%, Format: 100.0%
--- Iteration 14/15 ---
Reward: 0.899 ± 0.143, Acc: 100.0%, Format: 100.0%
--- Iteration 15/15 ---
Reward: 0.895 ± 0.147, Acc: 100.0%, Format: 100.0%
Saving final RL checkpoint...
RL checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/sampler_weights/rl_final

--- Evaluating: RL Model ---
Evaluated 50/200
Evaluated 100/200
Evaluated 150/200
Evaluated 200/200
Any Match: 90.0%
Exact Match: 42.5%
F1: 72.3%
Mean Reward: 0.792

======================================================================
TRAINING COMPLETE
======================================================================
Results saved to: training/experiments/memory_routing_v1_20251124_165000/results.json
Final Model: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/sampler_weights/rl_final
Comparison:
  SFT - F1: 69.2%, Any Match: 87.0%
  RL  - F1: 72.3%, Any Match: 90.0%
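The per-step SFT lines above follow a fixed `Step N: train_loss=X[, test_loss=Y], time=Zs` pattern, so the loss curve can be recovered from the raw log text. Below is a minimal sketch (not part of the pipeline; the `parse_steps` helper and the regex are illustrative assumptions) using Python's `re` module:

```python
import re

# Matches "Step N: train_loss=X, time=Zs" with an optional test_loss field,
# mirroring the log format shown above (an assumption, not the pipeline's code).
STEP_RE = re.compile(
    r"Step (\d+): train_loss=([\d.]+)"
    r"(?:, test_loss=([\d.]+))?, time=([\d.]+)s"
)

def parse_steps(log_text):
    """Return (step, train_loss, test_loss_or_None, seconds) tuples."""
    rows = []
    for m in STEP_RE.finditer(log_text):
        step, train, test, secs = m.groups()
        rows.append((int(step), float(train),
                     float(test) if test else None, float(secs)))
    return rows

sample = ("Step 0: train_loss=3.4228, test_loss=2.6279, time=3.3s "
          "Step 1: train_loss=2.5284, time=34.7s")
print(parse_steps(sample))
```

Steps where evaluation ran (every tenth step here) carry a non-`None` third field, so the test-loss curve can be filtered out of the same rows.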