NorthernTribe-Research committed on
Commit 425a76b · verified · 1 Parent(s): 97a59ed

Autonomous Space trainer update
README.md CHANGED
@@ -2,182 +2,23 @@
  language:
  - en
  library_name: transformers
- pipeline_tag: text-generation
  datasets:
  - NorthernTribe-Research/UMSR-v1
  tags:
  - reasoning
- - instruction-following
- - structured-output
- - math
- - science
- - logic
- - strategy
  ---

- # UMSR-Reasoner-7B

- ## Overview

- UMSR-Reasoner-7B is a standalone 7B reasoning model for structured multi-step problem solving.
-
- It is optimized for tasks that require:
-
- - explicit reasoning traces
- - deterministic final-answer formatting
- - consistent performance across math, science, logic, and strategy domains
-
- ## Dataset
-
- - Primary dataset: https://huggingface.co/datasets/NorthernTribe-Research/UMSR-v1
-
- ## Training Strategy
-
- - student architecture: `NorthernTribe-Research/UMSR-Reasoner-7B`
- - teacher architecture: `NorthernTribe-Research/UMSR-Reasoner-7B` self-distillation (default)
- - objective: blended CE + KL distillation with temperature and weight scheduling
- - continuity: checkpointed autonomous training cycles with resume support

  ## Output Contract

- For best reliability, prompt the model to end with:
-
- ```text
- <final_answer>...</final_answer>
- ```
-
- Optional reasoning can be requested in:
-
- ```text
- <reasoning>...</reasoning>
- ```
-
- ## Model Tree
-
- | Variant | Repository | Purpose |
- |---|---|---|
- | Base FP model | `NorthernTribe-Research/UMSR-Reasoner-7B` | Primary inference and fine-tuning target |
- | INT8 runtime profile | `NorthernTribe-Research/UMSR-Reasoner-7B-INT8` | Lower-memory deployment |
- | NF4 runtime profile | `NorthernTribe-Research/UMSR-Reasoner-7B-NF4` | Max compression for constrained GPUs |
- | Smoke INT8 profile | `NorthernTribe-Research/UMSR-Reasoner-7B-Smoke-INT8` | Fast CI/smoke validation profile linked to the main model tree |
-
- ## Quickstart
-
- ```python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model_id = "NorthernTribe-Research/UMSR-Reasoner-7B"
-
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
- )
-
- messages = [
-     {"role": "system", "content": "Solve step by step and finish with <final_answer>...</final_answer>."},
-     {"role": "user", "content": "If 3x + 5 = 20, what is x?"},
- ]
-
- inputs = tokenizer.apply_chat_template(
-     messages,
-     add_generation_prompt=True,
-     return_tensors="pt",
- ).to(model.device)
-
- outputs = model.generate(
-     inputs,
-     max_new_tokens=256,
-     temperature=0.2,
-     top_p=0.9,
- )
-
- print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
- ```
-
- ## Quantized Runtime
-
- ### INT8
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-
- model_id = "NorthernTribe-Research/UMSR-Reasoner-7B"
- bnb_config = BitsAndBytesConfig(load_in_8bit=True)
-
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     device_map="auto",
-     quantization_config=bnb_config,
- )
- ```
-
- ### NF4
-
- ```python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-
- model_id = "NorthernTribe-Research/UMSR-Reasoner-7B"
- bnb_config = BitsAndBytesConfig(
-     load_in_4bit=True,
-     bnb_4bit_quant_type="nf4",
-     bnb_4bit_use_double_quant=True,
-     bnb_4bit_compute_dtype=torch.bfloat16,
- )
-
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     device_map="auto",
-     quantization_config=bnb_config,
- )
- ```
-
- ## Llamafile Packaging
-
- For single-binary deployment, use:
-
- ```bash
- python scripts/create_llamafile.py \
-     --gguf /path/to/UMSR-Reasoner-7B.Q4_K_M.gguf \
-     --runtime-bin tools/llamafile \
-     --output dist/UMSR-Reasoner-7B.Q4_K_M.llamafile \
-     --force
- ```
-
- ## Code-Aware Robust Evaluation
-
- `scripts/eval_reasoner.py` supports code-focused robustness checks:
-
- - code-task detection
- - Python code-block syntax validation
- - optional row-level unit-test execution
- - optional TensorFlow-backed multi-candidate scorer
-
- ## Trainer Integration
-
- An autonomous trainer Space is available for continuous training cycles against UMSR-v1. It supports:
-
- - teacher-student distillation mode with configurable in-house teacher model
- - live run telemetry (`live_progress.json`, `live_events.jsonl`) for real-time monitoring
- - scheduled or continuous operation
- - checkpoint auto-resume (`UMSR_RESUME_FROM_CHECKPOINT=auto`)
- - warmup-step and warmup-ratio control
- - push-to-hub automation
- - run monitoring through live dashboard and logs
-
- ## Best Practices
-
- - keep prompts explicit about output tags
- - validate final answers for high-stakes workflows
- - prefer domain-specific evaluation before production deployment
-
- ## Limitations

- - reasoning text may contain errors even when final format is correct
- - quality depends on prompt clarity and task scope
- - not suitable as a sole decision-maker for legal, medical, or safety-critical use
 
  language:
  - en
  library_name: transformers
  datasets:
  - NorthernTribe-Research/UMSR-v1
  tags:
  - reasoning
+ - autonomous-training
  ---

+ # UMSR Reasoner 7B

+ Standalone reasoning model trained from UMSR-v1 using the autonomous trainer Space.

+ - Dataset: https://huggingface.co/datasets/NorthernTribe-Research/UMSR-v1
+ - Base model: `sshleifer/tiny-gpt2`
+ - Model repo: `https://huggingface.co/NorthernTribe-Research/UMSR-Reasoner-7B`

  ## Output Contract

+ Use:

+ `<final_answer>...</final_answer>`
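Note: the output contract in the updated README can be consumed with a short tag parser. A minimal sketch (the regex and helper name are illustrative, not part of the repo):

```python
import re

def extract_final_answer(text: str):
    """Return the contents of the last <final_answer>...</final_answer> block, if any."""
    matches = re.findall(r"<final_answer>(.*?)</final_answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

print(extract_final_answer("3x = 15, so x = 5. <final_answer>5</final_answer>"))  # -> 5
```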
config.json CHANGED
@@ -18,7 +18,7 @@
  "n_inner": null,
  "n_layer": 2,
  "n_positions": 1024,
- "pad_token_id": null,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
@@ -34,7 +34,7 @@
  "max_length": 50
  }
  },
- "tie_word_embeddings": true,
  "transformers_version": "5.2.0",
  "use_cache": false,
  "vocab_size": 50257

  "n_inner": null,
  "n_layer": 2,
  "n_positions": 1024,
+ "pad_token_id": 50256,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,

  "max_length": 50
  }
  },
+ "tie_word_embeddings": false,
  "transformers_version": "5.2.0",
  "use_cache": false,
  "vocab_size": 50257
metrics/eval_metrics.json CHANGED
@@ -1,8 +1,8 @@
  {
  "epoch": 1.0,
- "eval_loss": 5.414119720458984,
- "eval_runtime": 0.7383,
- "eval_samples": 8,
- "eval_samples_per_second": 10.836,
- "eval_steps_per_second": 10.836
- }

  {
  "epoch": 1.0,
+ "eval_loss": 10.716222763061523,
+ "eval_runtime": 2.0679,
+ "eval_samples": 64,
+ "eval_samples_per_second": 30.95,
+ "eval_steps_per_second": 30.95
+ }
metrics/train_metrics.json CHANGED
@@ -1,16 +1,9 @@
  {
- "ce_weight_end": 0.5,
- "ce_weight_start": 0.5,
  "epoch": 1.0,
- "kd_weight_end": 0.5,
- "kd_weight_start": 0.5,
- "teacher_count": 1,
- "temperature_end": 1.5,
- "temperature_start": 2.0,
- "total_flos": 3359425752.0,
- "train_loss": 5.411877933301423,
- "train_runtime": 7.7818,
- "train_samples": 19,
- "train_samples_per_second": 2.442,
- "train_steps_per_second": 2.442
- }

  {
  "epoch": 1.0,
+ "total_flos": 39501942396.0,
+ "train_loss": 10.738998085260391,
+ "train_runtime": 49.7234,
+ "train_samples": 256,
+ "train_samples_per_second": 5.148,
+ "train_steps_per_second": 5.148
+ }
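Sanity check on the updated metrics: with a train batch size of 1 (see `trainer_state.json` below), samples per second equals steps per second, and both are simply row count divided by runtime:

```python
# Throughput is samples / runtime; with batch size 1, steps/s == samples/s.
print(round(64 / 2.0679, 2))    # 30.95  -> eval_samples_per_second
print(round(256 / 49.7234, 3))  # 5.148  -> train_samples_per_second
```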
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b75f6f94d4b470fd42d5cb3c7135ca7c1d669d0514eddce08abe5825cf9d5c48
  size 413296

  version https://git-lfs.github.com/spec/v1
+ oid sha256:8af636bf1ea0ba12e8e0ed9858fe7a4fb9ce267806245af516a5f024c1c370c1
  size 413296
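Both LFS pointers report `size 413296`, so the checkpoint's byte length is unchanged and only its content hash differs. A local copy can be checked against the pointer's `oid` like this (file path illustrative):

```python
import hashlib

# The digest of the actual file should match the sha256 oid in the LFS pointer.
with open("model.safetensors", "rb") as f:
    print("sha256:" + hashlib.sha256(f.read()).hexdigest())
```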
run_summary.json CHANGED
@@ -1,15 +1,17 @@
  {
- "ce_weight_end": 0.5,
- "ce_weight_start": 0.5,
  "dataset_id": "NorthernTribe-Research/UMSR-v1",
- "eval_rows": 8,
- "kd_weight_end": 0.5,
- "kd_weight_start": 0.5,
- "output_dir": "outputs/umsr_reasoner_7b_standalone",
- "student_model": "NorthernTribe-Research/UMSR-Reasoner-7B",
- "teacher_count": 1,
- "teacher_model": "NorthernTribe-Research/UMSR-Reasoner-7B",
- "temperature_end": 1.5,
- "temperature_start": 2.0,
- "train_rows": 19
  }

  {
+ "base_model": "sshleifer/tiny-gpt2",
+ "bf16": false,
+ "cuda_available": false,
  "dataset_id": "NorthernTribe-Research/UMSR-v1",
+ "device": "cpu",
+ "eval_rows": 64,
+ "finished_at": "2026-02-23T22:10:16.635458+00:00",
+ "fp16": false,
+ "mps_available": false,
+ "output_dir": "/app/runs/20260223_220918",
+ "target_repo_id": "NorthernTribe-Research/UMSR-Reasoner-7B",
+ "tie_word_embeddings": false,
+ "total_train_steps_estimate": 256,
+ "train_rows": 256,
+ "warmup_steps": 0
  }
tokenizer.json CHANGED
@@ -1,6 +1,11 @@
  {
  "version": "1.0",
- "truncation": null,
  "padding": null,
  "added_tokens": [
  {

  {
  "version": "1.0",
+ "truncation": {
+ "direction": "Right",
+ "max_length": 512,
+ "strategy": "LongestFirst",
+ "stride": 0
+ },
  "padding": null,
  "added_tokens": [
  {
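The new `truncation` block corresponds to right-side truncation at 512 tokens. A sketch of producing the same serialized settings with the `tokenizers` library (the method and arguments are standard `tokenizers` API; the file path is illustrative):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
# Serializes as {"direction": "Right", "max_length": 512,
#                "strategy": "LongestFirst", "stride": 0}.
tok.enable_truncation(max_length=512, stride=0,
                      strategy="longest_first", direction="right")
tok.save("tokenizer.json")
```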
trainer_state.json CHANGED
@@ -3,192 +3,282 @@
  "best_metric": null,
  "best_model_checkpoint": null,
  "epoch": 1.0,
- "eval_steps": 4,
- "global_step": 19,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
  {
- "epoch": 0.05263157894736842,
- "grad_norm": 0.374282568693161,
- "learning_rate": 0.0001,
- "loss": 5.411169052124023,
- "step": 1
  },
  {
- "epoch": 0.10526315789473684,
- "grad_norm": 0.4834381341934204,
- "learning_rate": 9.931806517013612e-05,
- "loss": 5.413699626922607,
- "step": 2
  },
  {
- "epoch": 0.15789473684210525,
- "grad_norm": 1.9866410493850708,
- "learning_rate": 9.729086208503174e-05,
- "loss": 5.407270431518555,
- "step": 3
  },
  {
- "epoch": 0.21052631578947367,
- "grad_norm": 0.13184547424316406,
- "learning_rate": 9.397368756032445e-05,
- "loss": 5.411037921905518,
- "step": 4
  },
  {
- "epoch": 0.21052631578947367,
- "eval_loss": 5.415262222290039,
- "eval_runtime": 0.6782,
- "eval_samples_per_second": 11.796,
- "eval_steps_per_second": 11.796,
- "step": 4
  },
  {
- "epoch": 0.2631578947368421,
- "grad_norm": 0.08807552605867386,
- "learning_rate": 8.945702546981969e-05,
- "loss": 5.413403034210205,
- "step": 5
  },
  {
- "epoch": 0.3157894736842105,
- "grad_norm": 0.3944004476070404,
- "learning_rate": 8.386407858128706e-05,
- "loss": 5.411047458648682,
- "step": 6
  },
  {
- "epoch": 0.3684210526315789,
- "grad_norm": 0.19243471324443817,
- "learning_rate": 7.734740790612136e-05,
- "loss": 5.410379886627197,
- "step": 7
  },
  {
- "epoch": 0.42105263157894735,
- "grad_norm": 1.1386080980300903,
- "learning_rate": 7.008477123264848e-05,
- "loss": 5.4107584953308105,
- "step": 8
  },
  {
- "epoch": 0.42105263157894735,
- "eval_loss": 5.414775848388672,
- "eval_runtime": 0.6096,
- "eval_samples_per_second": 13.123,
- "eval_steps_per_second": 13.123,
- "step": 8
  },
  {
- "epoch": 0.47368421052631576,
- "grad_norm": 1.593002200126648,
- "learning_rate": 6.227427435703997e-05,
- "loss": 5.406350135803223,
- "step": 9
  },
  {
- "epoch": 0.5263157894736842,
- "grad_norm": 0.2701893746852875,
- "learning_rate": 5.4128967273616625e-05,
- "loss": 5.415624141693115,
- "step": 10
  },
  {
- "epoch": 0.5789473684210527,
- "grad_norm": 2.5317883491516113,
- "learning_rate": 4.5871032726383386e-05,
- "loss": 5.4125213623046875,
- "step": 11
  },
  {
- "epoch": 0.631578947368421,
- "grad_norm": 0.18132847547531128,
- "learning_rate": 3.772572564296005e-05,
- "loss": 5.411061763763428,
- "step": 12
  },
  {
- "epoch": 0.631578947368421,
- "eval_loss": 5.414316177368164,
- "eval_runtime": 0.6242,
- "eval_samples_per_second": 12.817,
- "eval_steps_per_second": 12.817,
- "step": 12
  },
  {
- "epoch": 0.6842105263157895,
- "grad_norm": 0.25826773047447205,
- "learning_rate": 2.991522876735154e-05,
- "loss": 5.412132740020752,
- "step": 13
  },
  {
- "epoch": 0.7368421052631579,
- "grad_norm": 0.32883742451667786,
- "learning_rate": 2.2652592093878666e-05,
- "loss": 5.409476280212402,
- "step": 14
  },
  {
- "epoch": 0.7894736842105263,
- "grad_norm": 0.15471753478050232,
- "learning_rate": 1.6135921418712956e-05,
- "loss": 5.413200855255127,
- "step": 15
  },
  {
- "epoch": 0.8421052631578947,
- "grad_norm": 0.2990401089191437,
- "learning_rate": 1.0542974530180327e-05,
- "loss": 5.4155683517456055,
- "step": 16
  },
  {
- "epoch": 0.8421052631578947,
- "eval_loss": 5.414139747619629,
- "eval_runtime": 0.7326,
- "eval_samples_per_second": 10.92,
- "eval_steps_per_second": 10.92,
- "step": 16
  },
  {
- "epoch": 0.8947368421052632,
- "grad_norm": 2.2788937091827393,
- "learning_rate": 6.026312439675552e-06,
- "loss": 5.411947727203369,
- "step": 17
  },
  {
- "epoch": 0.9473684210526315,
- "grad_norm": 0.20979730784893036,
- "learning_rate": 2.7091379149682685e-06,
- "loss": 5.413649082183838,
- "step": 18
  },
  {
- "epoch": 1.0,
- "grad_norm": 0.20296302437782288,
- "learning_rate": 6.819348298638839e-07,
- "loss": 5.415382385253906,
- "step": 19
  },
  {
  "epoch": 1.0,
- "step": 19,
- "total_flos": 3359425752.0,
- "train_loss": 5.411877933301423,
- "train_runtime": 7.7818,
- "train_samples_per_second": 2.442,
- "train_steps_per_second": 2.442
  }
  ],
- "logging_steps": 1,
- "max_steps": 19,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 1,
- "save_steps": 8,
  "stateful_callbacks": {
  "TrainerControl": {
  "args": {
@@ -201,7 +291,7 @@
  "attributes": {}
  }
  },
- "total_flos": 3359425752.0,
  "train_batch_size": 1,
  "trial_name": null,
  "trial_params": null
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
  "epoch": 1.0,
6
+ "eval_steps": 25,
7
+ "global_step": 256,
8
  "is_hyper_param_search": false,
9
  "is_local_process_zero": true,
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
+ "epoch": 0.0390625,
14
+ "grad_norm": 0.6599220633506775,
15
+ "learning_rate": 9.6484375e-05,
16
+ "loss": 10.769454956054688,
17
+ "step": 10
18
  },
19
  {
20
+ "epoch": 0.078125,
21
+ "grad_norm": 0.7802038192749023,
22
+ "learning_rate": 9.257812500000001e-05,
23
+ "loss": 10.763140106201172,
24
+ "step": 20
25
  },
26
  {
27
+ "epoch": 0.09765625,
28
+ "eval_loss": 10.76131820678711,
29
+ "eval_runtime": 2.0075,
30
+ "eval_samples_per_second": 31.881,
31
+ "eval_steps_per_second": 31.881,
32
+ "step": 25
33
  },
34
  {
35
+ "epoch": 0.1171875,
36
+ "grad_norm": 0.17565655708312988,
37
+ "learning_rate": 8.8671875e-05,
38
+ "loss": 10.768231201171876,
39
+ "step": 30
40
  },
41
  {
42
+ "epoch": 0.15625,
43
+ "grad_norm": 0.4683985710144043,
44
+ "learning_rate": 8.4765625e-05,
45
+ "loss": 10.772106170654297,
46
+ "step": 40
 
47
  },
48
  {
49
+ "epoch": 0.1953125,
50
+ "grad_norm": 0.3628169298171997,
51
+ "learning_rate": 8.0859375e-05,
52
+ "loss": 10.764241790771484,
53
+ "step": 50
54
  },
55
  {
56
+ "epoch": 0.1953125,
57
+ "eval_loss": 10.752297401428223,
58
+ "eval_runtime": 1.997,
59
+ "eval_samples_per_second": 32.049,
60
+ "eval_steps_per_second": 32.049,
61
+ "step": 50
62
  },
63
  {
64
+ "epoch": 0.234375,
65
+ "grad_norm": 0.6150558590888977,
66
+ "learning_rate": 7.695312500000001e-05,
67
+ "loss": 10.760368347167969,
68
+ "step": 60
69
  },
70
  {
71
+ "epoch": 0.2734375,
72
+ "grad_norm": 0.605256974697113,
73
+ "learning_rate": 7.3046875e-05,
74
+ "loss": 10.758135986328124,
75
+ "step": 70
76
  },
77
  {
78
+ "epoch": 0.29296875,
79
+ "eval_loss": 10.743587493896484,
80
+ "eval_runtime": 2.0728,
81
+ "eval_samples_per_second": 30.876,
82
+ "eval_steps_per_second": 30.876,
83
+ "step": 75
84
  },
85
  {
86
+ "epoch": 0.3125,
87
+ "grad_norm": 0.3574382960796356,
88
+ "learning_rate": 6.9140625e-05,
89
+ "loss": 10.750237274169923,
90
+ "step": 80
91
  },
92
  {
93
+ "epoch": 0.3515625,
94
+ "grad_norm": 0.6408470869064331,
95
+ "learning_rate": 6.5234375e-05,
96
+ "loss": 10.751436614990235,
97
+ "step": 90
98
  },
99
  {
100
+ "epoch": 0.390625,
101
+ "grad_norm": 0.2543087303638458,
102
+ "learning_rate": 6.132812500000001e-05,
103
+ "loss": 10.753249359130859,
104
+ "step": 100
105
  },
106
  {
107
+ "epoch": 0.390625,
108
+ "eval_loss": 10.736135482788086,
109
+ "eval_runtime": 1.9495,
110
+ "eval_samples_per_second": 32.828,
111
+ "eval_steps_per_second": 32.828,
112
+ "step": 100
113
  },
114
  {
115
+ "epoch": 0.4296875,
116
+ "grad_norm": 0.29619327187538147,
117
+ "learning_rate": 5.7421875000000005e-05,
118
+ "loss": 10.720957946777343,
119
+ "step": 110
 
120
  },
121
  {
122
+ "epoch": 0.46875,
123
+ "grad_norm": 0.19426824152469635,
124
+ "learning_rate": 5.3515625e-05,
125
+ "loss": 10.75610580444336,
126
+ "step": 120
127
  },
128
  {
129
+ "epoch": 0.48828125,
130
+ "eval_loss": 10.730175018310547,
131
+ "eval_runtime": 2.0434,
132
+ "eval_samples_per_second": 31.32,
133
+ "eval_steps_per_second": 31.32,
134
+ "step": 125
135
  },
136
  {
137
+ "epoch": 0.5078125,
138
+ "grad_norm": 0.400691419839859,
139
+ "learning_rate": 4.9609375000000005e-05,
140
+ "loss": 10.729013061523437,
141
+ "step": 130
142
  },
143
  {
144
+ "epoch": 0.546875,
145
+ "grad_norm": 0.2926430106163025,
146
+ "learning_rate": 4.5703125e-05,
147
+ "loss": 10.718487548828126,
148
+ "step": 140
149
  },
150
  {
151
+ "epoch": 0.5859375,
152
+ "grad_norm": 0.24082393944263458,
153
+ "learning_rate": 4.1796875000000005e-05,
154
+ "loss": 10.736891174316407,
155
+ "step": 150
 
156
  },
157
  {
158
+ "epoch": 0.5859375,
159
+ "eval_loss": 10.725324630737305,
160
+ "eval_runtime": 2.1134,
161
+ "eval_samples_per_second": 30.283,
162
+ "eval_steps_per_second": 30.283,
163
+ "step": 150
164
  },
165
  {
166
+ "epoch": 0.625,
167
+ "grad_norm": 0.371867835521698,
168
+ "learning_rate": 3.7890625e-05,
169
+ "loss": 10.728065490722656,
170
+ "step": 160
171
  },
172
  {
173
+ "epoch": 0.6640625,
174
+ "grad_norm": 0.3023074269294739,
175
+ "learning_rate": 3.3984375000000004e-05,
176
+ "loss": 10.73614501953125,
177
+ "step": 170
178
+ },
179
+ {
180
+ "epoch": 0.68359375,
181
+ "eval_loss": 10.721412658691406,
182
+ "eval_runtime": 2.1412,
183
+ "eval_samples_per_second": 29.89,
184
+ "eval_steps_per_second": 29.89,
185
+ "step": 175
186
+ },
187
+ {
188
+ "epoch": 0.703125,
189
+ "grad_norm": 0.38127991557121277,
190
+ "learning_rate": 3.0078125e-05,
191
+ "loss": 10.722113037109375,
192
+ "step": 180
193
+ },
194
+ {
195
+ "epoch": 0.7421875,
196
+ "grad_norm": 0.3930608928203583,
197
+ "learning_rate": 2.6171875e-05,
198
+ "loss": 10.726372528076173,
199
+ "step": 190
200
+ },
201
+ {
202
+ "epoch": 0.78125,
203
+ "grad_norm": 0.3989920914173126,
204
+ "learning_rate": 2.2265625e-05,
205
+ "loss": 10.720677185058594,
206
+ "step": 200
207
+ },
208
+ {
209
+ "epoch": 0.78125,
210
+ "eval_loss": 10.718664169311523,
211
+ "eval_runtime": 2.1839,
212
+ "eval_samples_per_second": 29.305,
213
+ "eval_steps_per_second": 29.305,
214
+ "step": 200
215
+ },
216
+ {
217
+ "epoch": 0.8203125,
218
+ "grad_norm": 0.20773838460445404,
219
+ "learning_rate": 1.8359375e-05,
220
+ "loss": 10.722401428222657,
221
+ "step": 210
222
+ },
223
+ {
224
+ "epoch": 0.859375,
225
+ "grad_norm": 0.31527552008628845,
226
+ "learning_rate": 1.4453125e-05,
227
+ "loss": 10.71902084350586,
228
+ "step": 220
229
+ },
230
+ {
231
+ "epoch": 0.87890625,
232
+ "eval_loss": 10.71699047088623,
233
+ "eval_runtime": 2.0948,
234
+ "eval_samples_per_second": 30.552,
235
+ "eval_steps_per_second": 30.552,
236
+ "step": 225
237
+ },
238
+ {
239
+ "epoch": 0.8984375,
240
+ "grad_norm": 0.2394668161869049,
241
+ "learning_rate": 1.0546875e-05,
242
+ "loss": 10.706802368164062,
243
+ "step": 230
244
+ },
245
+ {
246
+ "epoch": 0.9375,
247
+ "grad_norm": 0.23905383050441742,
248
+ "learning_rate": 6.6406250000000005e-06,
249
+ "loss": 10.697933959960938,
250
+ "step": 240
251
+ },
252
+ {
253
+ "epoch": 0.9765625,
254
+ "grad_norm": 0.3481399416923523,
255
+ "learning_rate": 2.734375e-06,
256
+ "loss": 10.72392578125,
257
+ "step": 250
258
+ },
259
+ {
260
+ "epoch": 0.9765625,
261
+ "eval_loss": 10.716254234313965,
262
+ "eval_runtime": 2.164,
263
+ "eval_samples_per_second": 29.576,
264
+ "eval_steps_per_second": 29.576,
265
+ "step": 250
266
  },
267
  {
268
  "epoch": 1.0,
269
+ "step": 256,
270
+ "total_flos": 39501942396.0,
271
+ "train_loss": 10.738998085260391,
272
+ "train_runtime": 49.7234,
273
+ "train_samples_per_second": 5.148,
274
+ "train_steps_per_second": 5.148
275
  }
276
  ],
277
+ "logging_steps": 10,
278
+ "max_steps": 256,
279
  "num_input_tokens_seen": 0,
280
  "num_train_epochs": 1,
281
+ "save_steps": 25,
282
  "stateful_callbacks": {
283
  "TrainerControl": {
284
  "args": {
 
291
  "attributes": {}
292
  }
293
  },
294
+ "total_flos": 39501942396.0,
295
  "train_batch_size": 1,
296
  "trial_name": null,
297
  "trial_params": null
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a142cd858828160b38820775c31c8d19bc13769b29cdfea16c1835758c53a125
- size 5265

  version https://git-lfs.github.com/spec/v1
+ oid sha256:cd3f671622cfbd73cb8882721494529eb95691b413c227bd5e8e5d124f3ca09e
+ size 5201