User01110 committed
Commit ab05cb6 · verified · 1 parent: b9b6532

Upload checkpoint step 1,000

README.md ADDED
@@ -0,0 +1,158 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ pipeline_tag: text-generation
+ library_name: transformers
+ base_model: SupraLabs/Supra-1.5-50M-Base-exp
+ base_model_relation: finetune
+ datasets:
+ - nvidia/Nemotron-SFT-Instruction-Following-Chat-v2
+ - Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned
+ - MBZUAI/LaMini-instruction
+ - ketchup123/tulu-gsm8k-openmath-instruct-100k-LF
+ - NecroMOnk/khan-math-linear_algebra
+ - endurasolution/ron-math-dataset
+ - User01110/math-curated-dataset
+ - microsoft/orca-math-word-problems-200k
+ - TIGER-Lab/MathInstruct
+ - openai/gsm8k
+ - EleutherAI/arithmetic
+ - Programming-Language/codeagent-python
+ - jan-hq/multiturn_programming_binarized
+ - Cutecat6152/python-data-basic
+ - flytech/python-codes-25k
+ tags:
+ - sft
+ - exact-loss-trainer
+ - chatml
+ - python
+ - math
+ - code
+ - instruction-tuned
+ ---
+
+ # testing-50M
+
+ This is an experimental instruction-tuning (SFT) run starting from `SupraLabs/Supra-1.5-50M-Base-exp`.
+
+ ## Training Setup
+
+ | Field | Value |
+ | --- | --- |
+ | Base model | `SupraLabs/Supra-1.5-50M-Base-exp` |
+ | Base revision | `main` |
+ | Output repo | `User01110/testing-50M` |
+ | Sequence length | 1024 |
+ | Max optimizer steps | 20,000 |
+ | Per-device batch size | 128 |
+ | Gradient accumulation | 4 |
+ | Sample presentations per GPU | 10,240,000 |
+ | Max token slots per GPU | 10,485,760,000 |
+ | Learning rate | 2.00e-04 |
+ | Warmup steps | 100 |
+ | Weight decay | 0.05 |
+ | Save/push cadence | every 1,000 optimizer steps plus final |
+ | Loss masking | assistant-span-only from step 0 |
+ | Loss logging | printed `loss` is normalized by gradient accumulation; `raw_sum` is the Trainer sum over 4 microbatches |
+ | Gate logging | novelty score if the loaded architecture exposes `last_gate`; otherwise `n/a` |
+ | Prompt format | ChatML |
+ | System prompt | `You are a helpful assistant.` |
+
+ The stream randomly mixes math, coding, and conversation-heavy instruction sources. Each source is reopened when it is exhausted and keeps looping until the 20,000-step training cap is reached.
+
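The mixing behavior described above can be sketched roughly as follows. This is a minimal illustration, not the training script's actual loader: the function name `reloop_mix`, the toy sources, and the uniform choice over sources are all assumptions, since the README does not state the mixing weights.

```python
import itertools
import random
from typing import Callable, Dict, Iterable, Iterator, Tuple


def reloop_mix(
    sources: Dict[str, Callable[[], Iterable]],
    seed: int = 0,
) -> Iterator[Tuple[str, object]]:
    """Yield (source_name, example) forever, reopening exhausted sources.

    Hypothetical helper: each value in `sources` is a zero-argument factory
    that opens a fresh stream. Examples are assumed never to be None.
    """
    rng = random.Random(seed)
    names = sorted(sources)
    # Open every source once up front.
    streams = {name: iter(sources[name]()) for name in names}
    while True:
        name = rng.choice(names)  # assumption: uniform choice over sources
        example = next(streams[name], None)
        if example is None:
            # Source exhausted: reopen it and keep mixing.
            streams[name] = iter(sources[name]())
            example = next(streams[name], None)
            if example is None:
                continue  # an empty source is simply skipped
        yield name, example


# Toy sources standing in for the real datasets.
toy_sources = {
    "math": lambda: ["1+1=2", "2+2=4"],
    "code": lambda: ["print('hi')"],
}

# The consumer, not the stream, decides when to stop (here: 8 examples).
batch = list(itertools.islice(reloop_mix(toy_sources), 8))
```

Because the generator never terminates on its own, the stopping condition lives in the consumer, which matches the "until the training cap" behavior described in the text.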
+ Listed source rows before relooping: 35,728,143. The 20,000-step training budget presents 10,240,000 examples per GPU.
+
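The budget figures in the table can be reproduced from the batch settings. The short check below uses only values quoted in this README; the variable names are illustrative, and the number of GPUs is not stated, so everything is per GPU.

```python
# Values quoted in the Training Setup table.
per_device_batch = 128
grad_accum = 4
max_steps = 20_000
seq_len = 1024
listed_rows = 35_728_143

# Sequences consumed per optimizer step on one GPU.
samples_per_step = per_device_batch * grad_accum

# Sample presentations and padded token slots over the full run.
samples_per_gpu = samples_per_step * max_steps
token_slots_per_gpu = samples_per_gpu * seq_len

# Fraction of the listed rows that one GPU's budget could cover.
coverage = samples_per_gpu / listed_rows

# Presentations seen by the step-1,000 checkpoint in this commit.
samples_at_step_1000 = samples_per_step * 1_000

print(f"{samples_per_gpu:,} samples, {token_slots_per_gpu:,} token slots")
print(f"coverage of listed rows per GPU: {coverage:.3f}")
print(f"samples at step 1,000: {samples_at_step_1000:,}")
```

On a single GPU the budget covers under a third of the listed rows in aggregate, so relooping would mainly affect the smaller sources. Token slots are an upper bound, since sequences shorter than 1024 tokens leave slots unused.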
+ ## Prompt Template Compatibility
+
+ The uploaded tokenizer includes the ChatML special tokens and chat template, so inference and future SFT should not require manually adding `<|im_start|>` or `<|im_end|>`.
+
+ ChatML messages are rendered as:
+
+ ```text
+ <|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ <|im_start|>user
+ { user_message }<|im_end|>
+ <|im_start|>assistant
+ ```
+
+ The training script starts from the base checkpoint and:
+
+ - adds `<|im_start|>` and `<|im_end|>` once as tokenizer special tokens;
+ - resizes embeddings once;
+ - saves the tokenizer with `chat_template`;
+ - disables automatic post-processing during pretokenized SFT;
+ - keeps and saves the model context config with `max_position_embeddings >= 1024`.
+
+ The base model is loaded with an explicit `revision="main"`. Because `main` is a branch that can move, only a commit hash would guarantee that Transformers never fetches a newer remote modeling file during training.
+
+ Complete inference example:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ repo = "User01110/testing-50M"
+ tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     repo,
+     trust_remote_code=True,
+     torch_dtype="auto",
+     device_map="auto",
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "Explain what a neural network is in simple terms."},
+ ]
+ # Render the conversation with the bundled ChatML template.
+ prompt = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+ # The rendered prompt already carries its special tokens, so add none here.
+ inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
+
+ # Stop on the tokenizer EOS and also on <|im_end|>, which closes each ChatML turn.
+ stop_ids = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>")]
+
+ with torch.no_grad():
+     output = model.generate(
+         **inputs,
+         max_new_tokens=256,
+         do_sample=True,  # sampling must be on for temperature/top_k/top_p to apply
+         temperature=0.7,
+         top_k=40,
+         top_p=0.95,
+         repetition_penalty=1.2,
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=stop_ids,
+     )
+
+ # Decode only the newly generated tokens, not the prompt.
+ new_tokens = output[0, inputs["input_ids"].shape[-1]:]
+ text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
+ print(text)
+ ```
+
+ ## Dataset Mix
+
+ | Dataset | Config | Split | Rows | Schema | Mapping | Pass policy |
+ | --- | --- | --- | ---: | --- | --- | --- |
+ | nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 | default | reasoning_off | 1,068,273 | messages[{role, content}], uuid, license, used_in, reasoning | ChatML conversation turns; reasoning_off split only | reloops until max_steps |
+ | Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned | General-Distillation | train | 187,794 | conversations[{from, value}], input, output, domain, meta | human/gpt turns; assistant <think> blocks stripped | reloops until max_steps |
+ | Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned | General-Math | train | 76,727 | conversations[{from, value}], input, output, domain, meta | human/gpt turns; assistant <think> blocks stripped | reloops until max_steps |
+ | Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned | MultilingualSTEM | train | 89,997 | conversations[{from, value}], input, output, domain, meta | human/gpt turns; assistant <think> blocks stripped | reloops until max_steps |
+ | Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned | PHD-Science | train | 103,307 | conversations[{from, value}], input, output, domain, meta | human/gpt turns; assistant <think> blocks stripped | reloops until max_steps |
+ | MBZUAI/LaMini-instruction | default | train | 2,585,615 | instruction, response, instruction_source | instruction -> response | reloops until max_steps |
+ | ketchup123/tulu-gsm8k-openmath-instruct-100k-LF | default | train | 100,000 | conversations[{role, content}] | math conversations to ChatML turns | reloops until max_steps |
+ | NecroMOnk/khan-math-linear_algebra | default | train | 1,295,000 | messages[{role, content}], topic, subtopic | math tutor messages to ChatML turns | reloops until max_steps |
+ | endurasolution/ron-math-dataset | default | train | 29,226,764 | instruction, input, output | instruction + optional input -> output | reloops until max_steps |
+ | User01110/math-curated-dataset | default | train | 50,944 | id, source, prompt, index, model, response, chatml | prompt -> response; ignores source ChatML column and rebuilds clean ChatML | reloops until max_steps |
+ | microsoft/orca-math-word-problems-200k | default | train | 200,035 | question, answer | question -> answer | reloops until max_steps |
+ | TIGER-Lab/MathInstruct | default | train | 262,039 | source, instruction, output | instruction -> output | reloops until max_steps |
+ | openai/gsm8k | main | train | 7,473 | question, answer | question -> answer | reloops until max_steps |
+ | openai/gsm8k | socratic | train | 7,473 | question, answer | question -> answer | reloops until max_steps |
+ | EleutherAI/arithmetic | 10 validation subsets | validation | 20,000 | context, completion | direct parquet URLs to avoid dataset-script loader failure | reloops until max_steps |
+ | Programming-Language/codeagent-python | default | train | 296,837 | prompt, response | prompt -> response | reloops until max_steps |
+ | jan-hq/multiturn_programming_binarized | default | train | 100,139 | messages[{role, content}] | single/multiturn programming messages; all assistant spans labeled | reloops until max_steps |
+ | Cutecat6152/python-data-basic | default | train | 100 | id, instruction, response | instruction -> response | reloops until max_steps |
+ | flytech/python-codes-25k | default | train | 49,626 | instruction, input, output, text | instruction + optional input -> output | reloops until max_steps |
+
+ ## Notes
+
+ - Dataset schemas and row counts were checked through Hugging Face Dataset Viewer metadata where available.
+ - Multiturn/message datasets carry all assistant spans into the collator, so user/system text remains masked from step 0 while every assistant turn is supervised.
+ - Kimi assistant text has `<think>...</think>` blocks stripped before tokenization.
+ - Streaming source open/read failures are retried and reopened. Normal stream exhaustion reopens that source and continues mixing it until `max_steps`.
+ - RoPE buffers and tokenizer/model load are verified during final export.
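The assistant-span-only masking mentioned in the notes can be illustrated with plain token lists. This is a sketch under assumptions, not the collator used for this run: `build_labels` and the pre-segmented input are hypothetical, and the README does not say whether the `<|im_start|>assistant` header or the closing `<|im_end|>` falls inside the supervised span.

```python
from typing import List, Tuple

# -100 is the default ignore_index of torch.nn.CrossEntropyLoss,
# so positions labeled with it contribute nothing to the loss.
IGNORE_INDEX = -100


def build_labels(
    segments: List[Tuple[str, List[int]]],
) -> Tuple[List[int], List[int]]:
    """Concatenate (role, token_ids) segments and mask non-assistant tokens."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            # Every assistant turn is supervised, including later turns.
            labels.extend(token_ids)
        else:
            # System and user tokens stay in the context but are not trained on.
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy two-turn conversation with made-up token IDs.
segments = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),
    ("user", [8]),
    ("assistant", [9, 10]),
]
input_ids, labels = build_labels(segments)
print(labels)
```

Both assistant turns keep their token IDs as labels, while the system and user positions are replaced by the ignore index.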
chat_template.jinja ADDED
@@ -0,0 +1 @@
+ {% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + (message['content'] | trim) + '<|im_end|>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
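For readers who want to check the template's output without loading the tokenizer, the Jinja logic above can be mirrored in a few lines of Python. The function name `render_chatml` is made up for this sketch; in practice `tokenizer.apply_chat_template` should be preferred, since it uses the shipped template directly.

```python
from typing import Dict, List


def render_chatml(
    messages: List[Dict[str, str]],
    add_generation_prompt: bool = False,
) -> str:
    """Plain-Python equivalent of the one-line ChatML Jinja template."""
    parts = []
    for message in messages:
        # Jinja's `trim` filter strips surrounding whitespace, like str.strip().
        parts.append(
            "<|im_start|>" + message["role"] + "\n"
            + message["content"].strip() + "<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "  Hi  "},
    ],
    add_generation_prompt=True,
)
print(prompt)
```

Note that the template trims each message's content, so leading and trailing whitespace in a user message does not reach the model.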
config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "architectures": [
+     "LlamaForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 0,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 512,
+   "initializer_range": 0.02,
+   "intermediate_size": 1408,
+   "max_position_embeddings": 5120,
+   "mlp_bias": false,
+   "model_type": "llama",
+   "num_attention_heads": 8,
+   "num_hidden_layers": 12,
+   "num_key_value_heads": 4,
+   "pad_token_id": 1,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-06,
+   "rope_parameters": {
+     "factor": 1.0,
+     "rope_theta": 10000.0,
+     "rope_type": "linear",
+     "type": "linear"
+   },
+   "tie_word_embeddings": true,
+   "transformers_version": "5.10.2",
+   "use_cache": false,
+   "vocab_size": 32002
+ }
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 0,
+   "eos_token_id": [
+     2
+   ],
+   "pad_token_id": 1,
+   "transformers_version": "5.10.2"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:777493dc618f20fa153dc09ca84f0fb151e4f59a0593660c639200d807c20747
+ size 207161232
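As a sanity check, the LFS size can be compared against a parameter count derived from `config.json`. The estimate below assumes the standard Llama layout (bias-free attention and MLP projections, two RMSNorm weights per layer plus a final norm) and that the tied embedding matrix is stored once in float32; the variable names are illustrative.

```python
# Values from config.json.
hidden_size = 512
num_layers = 12
num_heads = 8
num_kv_heads = 4
head_dim = 64
intermediate_size = 1408
vocab_size = 32002
bytes_per_param = 4  # "dtype": "float32"

# Token embeddings; the output head reuses them (tie_word_embeddings: true).
embed = vocab_size * hidden_size

# Grouped-query attention: full-width Q and O, narrower K and V.
q_proj = hidden_size * num_heads * head_dim
kv_proj = 2 * hidden_size * num_kv_heads * head_dim
o_proj = num_heads * head_dim * hidden_size
attention = q_proj + kv_proj + o_proj

# Gated MLP: gate, up, and down projections.
mlp = 3 * hidden_size * intermediate_size

# Two RMSNorm weight vectors per layer.
norms = 2 * hidden_size

per_layer = attention + mlp + norms
total_params = embed + num_layers * per_layer + hidden_size  # + final norm

weight_bytes = total_params * bytes_per_param
lfs_size = 207_161_232  # from the model.safetensors pointer
remainder = lfs_size - weight_bytes

print(f"parameters: {total_params:,}")
print(f"weight bytes: {weight_bytes:,}, remainder: {remainder:,}")
```

Under these assumptions the count comes to about 51.8M parameters, in line with the "50M" name, and the few kilobytes left over are consistent with the safetensors header that precedes the tensor data.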
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "backend": "tokenizers",
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "extra_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "is_local": false,
+   "local_files_only": false,
+   "model_max_length": 1000000000,
+   "pad_token": "<pad>",
+   "tokenizer_class": "TokenizersBackend",
+   "unk_token": "<unk>"
+ }