Retrying due to status code 502. text=
======================================================================
MEMORY ROUTING AGENT - FULL TRAINING PIPELINE
======================================================================
Experiment: memory_routing_v1
Output: training/experiments/memory_routing_v1_20251124_165000
Base model: meta-llama/Llama-3.1-8B
LoRA rank: 32

======================================================================
PHASE 1: SUPERVISED FINE-TUNING
======================================================================
Train: 800, Test: 200
Learning rate: 2.86e-04
Step   0: train_loss=3.4228, test_loss=2.6279, time=3.3s
Step   1: train_loss=2.5284, time=34.7s
Step   2: train_loss=2.0672, time=4.1s
Step   3: train_loss=1.7094, time=4.3s
Step   4: train_loss=1.5843, time=2.5s
Step   5: train_loss=1.4973, time=3.0s
Step   6: train_loss=1.3900, time=4.6s
Step   7: train_loss=1.4226, time=24.7s
Step   8: train_loss=1.3094, time=2.6s
Step   9: train_loss=1.3240, time=3.4s
Step  10: train_loss=1.1783, test_loss=1.1197, time=2.9s
Step  11: train_loss=1.1683, time=3.0s
Step  12: train_loss=1.2817, time=3.1s
Step  13: train_loss=0.9658, time=2.4s
Step  14: train_loss=0.8791, time=34.4s
Step  15: train_loss=0.7782, time=33.0s
Step  16: train_loss=0.7206, time=3.1s
Step  17: train_loss=0.6524, time=2.4s
Step  18: train_loss=0.5603, time=2.9s
Step  19: train_loss=0.5045, time=4.4s
Step  20: train_loss=0.4175, test_loss=0.3288, time=2.7s
Step  21: train_loss=0.3219, time=2.2s
Step  22: train_loss=0.3643, time=2.4s
Step  23: train_loss=0.3799, time=2.1s
Step  24: train_loss=0.3603, time=2.4s
Step  25: train_loss=0.5269, time=1.9s
Step  26: train_loss=0.3044, time=29.7s
Step  27: train_loss=0.2869, time=3.5s
Step  28: train_loss=0.2994, time=4.4s
Step  29: train_loss=0.3266, time=2.2s
Step  30: train_loss=0.3303, test_loss=0.2598, time=2.3s
Step  31: train_loss=0.2958, time=1.8s
Step  32: train_loss=0.3050, time=2.0s
Step  33: train_loss=0.3092, time=33.7s
Step  34: train_loss=0.2802, time=2.1s
Step  35: train_loss=0.3087, time=2.0s
Step  36: train_loss=0.3042, time=2.0s
Step  37: train_loss=0.4495, time=3.2s
Step  38: train_loss=0.2939, time=2.0s
Step  39: train_loss=0.2473, time=2.0s
Step  40: train_loss=0.2092, test_loss=0.2544, time=2.8s
Step  41: train_loss=0.2836, time=2.9s
Step  42: train_loss=0.2363, time=2.0s
Step  43: train_loss=0.2641, time=2.1s
Step  44: train_loss=0.2647, time=2.2s
Step  45: train_loss=0.2634, time=3.5s
Step  46: train_loss=0.2576, time=2.7s
Step  47: train_loss=0.2471, time=2.5s
Step  48: train_loss=0.2778, time=2.7s
Step  49: train_loss=0.2875, time=7.9s
Step  50: train_loss=0.4188, test_loss=0.2334, time=2.2s
Step  51: train_loss=0.2511, time=2.7s
Step  52: train_loss=0.1968, time=28.9s
Step  53: train_loss=0.2182, time=2.8s
Step  54: train_loss=0.2473, time=34.8s
Step  55: train_loss=0.2404, time=2.6s
Step  56: train_loss=0.2247, time=2.5s
Step  57: train_loss=0.2161, time=2.2s
Step  58: train_loss=0.2167, time=1.9s
Step  59: train_loss=0.2116, time=2.1s
Step  60: train_loss=0.2304, test_loss=0.2018, time=3.1s
Step  61: train_loss=0.2512, time=2.8s
Step  62: train_loss=0.2886, time=2.0s
Step  63: train_loss=0.2893, time=1.9s
Step  64: train_loss=0.2319, time=2.0s
Step  65: train_loss=0.1766, time=1.9s
Step  66: train_loss=0.2583, time=2.3s
Step  67: train_loss=0.2068, time=3.1s
Step  68: train_loss=0.2338, time=2.5s
Step  69: train_loss=0.2009, time=2.0s
Step  70: train_loss=0.1942, test_loss=0.1832, time=2.6s
Step  71: train_loss=0.2030, time=2.2s
Step  72: train_loss=0.1983, time=24.0s
Step  73: train_loss=0.2216, time=2.8s
Step  74: train_loss=0.2449, time=2.7s
Step  75: train_loss=0.3014, time=2.8s
Step  76: train_loss=0.2157, time=2.8s
Step  77: train_loss=0.2117, time=16.5s
Step  78: train_loss=0.2102, time=32.4s
Step  79: train_loss=0.2355, time=2.1s
Step  80: train_loss=0.2199, test_loss=0.1973, time=2.3s
Step  81: train_loss=0.2125, time=3.6s
Step  82: train_loss=0.2148, time=2.2s
Step  83: train_loss=0.1887, time=2.5s
Step  84: train_loss=0.1713, time=31.9s
Step  85: train_loss=0.2361, time=2.3s
Step  86: train_loss=0.1958, time=35.1s
Step  87: train_loss=0.2396, time=2.3s
Step  88: train_loss=0.2032, time=32.1s
Step  89: train_loss=0.1682, time=82.7s
Step  90: train_loss=0.1952, test_loss=0.1960, time=2.6s
Step  91: train_loss=0.2146, time=2.3s
Step  92: train_loss=0.1845, time=28.6s
Step  93: train_loss=0.2103, time=3.3s
Step  94: train_loss=0.1943, time=3.3s
Step  95: train_loss=0.1729, time=3.1s
Step  96: train_loss=0.1698, time=2.8s
Step  97: train_loss=0.2020, time=3.2s
Step  98: train_loss=0.1963, time=3.6s
Step  99: train_loss=0.2097, test_loss=0.1150, time=3.1s

Saving final SFT checkpoint...
SFT State checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/weights/sft_final
SFT Sampler checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/sampler_weights/sft_final_sampler

--- Evaluating: SFT Model ---
  Evaluated 50/200
  Evaluated 100/200
  Evaluated 150/200
  Evaluated 200/200
  Any Match: 87.0%
  Exact Match: 39.0%
  F1: 69.2%
  Mean Reward: 0.772

======================================================================
PHASE 2: REINFORCEMENT LEARNING
======================================================================
Training examples: 800
RL iterations: 15
Batch size: 32, Group size: 8

Loading SFT checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/weights/sft_final

--- Iteration 1/15 ---
  Reward: 0.872 ± 0.192, Acc: 100.0%, Format: 100.0%

--- Iteration 2/15 ---
  Reward: 0.842 ± 0.235, Acc: 100.0%, Format: 100.0%

--- Iteration 3/15 ---
  Reward: 0.823 ± 0.247, Acc: 100.0%, Format: 100.0%

--- Iteration 4/15 ---
  Reward: 0.901 ± 0.158, Acc: 100.0%, Format: 100.0%

--- Iteration 5/15 ---
  Reward: 0.852 ± 0.214, Acc: 100.0%, Format: 100.0%

--- Iteration 6/15 ---
  Reward: 0.843 ± 0.251, Acc: 99.6%, Format: 99.6%

--- Iteration 7/15 ---
  Reward: 0.859 ± 0.214, Acc: 100.0%, Format: 100.0%

--- Iteration 8/15 ---
  Reward: 0.899 ± 0.159, Acc: 100.0%, Format: 100.0%

--- Iteration 9/15 ---
  Reward: 0.870 ± 0.175, Acc: 100.0%, Format: 100.0%

--- Iteration 10/15 ---
  Reward: 0.866 ± 0.234, Acc: 99.6%, Format: 99.6%

--- Iteration 11/15 ---
  Reward: 0.845 ± 0.238, Acc: 100.0%, Format: 100.0%

--- Iteration 12/15 ---
  Reward: 0.908 ± 0.148, Acc: 100.0%, Format: 100.0%

--- Iteration 13/15 ---
  Reward: 0.838 ± 0.234, Acc: 100.0%, Format: 100.0%

--- Iteration 14/15 ---
  Reward: 0.899 ± 0.143, Acc: 100.0%, Format: 100.0%

--- Iteration 15/15 ---
  Reward: 0.895 ± 0.147, Acc: 100.0%, Format: 100.0%

Saving final RL checkpoint...
RL checkpoint: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/sampler_weights/rl_final

--- Evaluating: RL Model ---
  Evaluated 50/200
  Evaluated 100/200
  Evaluated 150/200
  Evaluated 200/200
  Any Match: 90.0%
  Exact Match: 42.5%
  F1: 72.3%
  Mean Reward: 0.792

======================================================================
TRAINING COMPLETE
======================================================================
Results saved to: training/experiments/memory_routing_v1_20251124_165000/results.json

Final Model: tinker://b6c9686e-b64d-5cd9-b9e5-a882b0f69d6a:train:0/sampler_weights/rl_final

Comparison:
  SFT  - F1: 69.2%, Any Match: 87.0%
  RL   - F1: 72.3%, Any Match: 90.0%