File size: 6,458 Bytes
ab05cb6
 
 
 
 
 
 
 
 
 
 
 
6b49457
ab05cb6
 
 
6b49457
 
 
ab05cb6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6b49457
ab05cb6
 
6b49457
 
ab05cb6
 
 
 
 
 
 
 
 
 
6b49457
ab05cb6
6b49457
ab05cb6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6b49457
 
 
 
 
 
 
 
 
 
 
ab05cb6
 
 
 
 
6b49457
ab05cb6
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
base_model: SupraLabs/Supra-1.5-50M-Base-exp
base_model_relation: finetune
datasets:
- nvidia/Nemotron-SFT-Instruction-Following-Chat-v2
- microsoft/orca-math-word-problems-200k
- TIGER-Lab/MathInstruct
- User01110/math-curated-dataset
- Programming-Language/codeagent-python
- Cutecat6152/python-data-basic
- flytech/python-codes-25k
- QuixiAI/open-instruct-uncensored
- openai/gsm8k
- EleutherAI/arithmetic
tags:
- sft
- exact-loss-trainer
- chatml
- python
- math
- code
- instruction-tuned
---

# testing-50M

This is an experimental instruction SFT run from `SupraLabs/Supra-1.5-50M-Base-exp`.

## Training Setup

| Field | Value |
| --- | --- |
| Base model | `SupraLabs/Supra-1.5-50M-Base-exp` |
| Base revision | `main` |
| Output repo | `User01110/testing-50M` |
| Sequence length | 1024 |
| Max optimizer steps | 10,000 |
| Per-device batch size | 128 |
| Gradient accumulation | 4 |
| Sample presentations per GPU | 5,120,000 |
| Max token slots per GPU | 5,242,880,000 |
| Learning rate | 2.00e-04 |
| Warmup steps | 100 |
| Weight decay | 0.05 |
| Save/push cadence | every 1,000 optimizer steps plus final |
| Loss masking | assistant-span-only from step 0 |
| Loss logging | printed `loss` is normalized by gradient accumulation; `raw_sum` is the Trainer sum over 4 microbatches |
| Gate logging | novelty score if the loaded architecture exposes `last_gate`; otherwise `n/a` |
| Prompt format | ChatML |
| System prompt | `You are a helpful assistant.` |

The stream randomly mixes the selected instruction, math, and coding sources. Sources are reopened after exhaustion and keep relooping until the 10,000-step training cap finishes, except `Cutecat6152/python-data-basic`, which is capped at 3 passes.

Listed source rows before relooping: 3,718,915. The 10,000-step training budget presents 5,120,000 examples per GPU.

## Prompt Template Compatibility

The uploaded tokenizer includes the ChatML special tokens and chat template, so inference and future SFT should not require manually adding `<|im_start|>` or `<|im_end|>`.

ChatML messages are rendered as:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{ user_message }<|im_end|>
<|im_start|>assistant
```

This script starts from the base checkpoint, adds `<|im_start|>` and `<|im_end|>` once as tokenizer special tokens, resizes embeddings once, saves the tokenizer with `chat_template`, disables automatic post-processing during pretokenized SFT, and keeps/saves the model context config with `max_position_embeddings >= 1024`.

The base model is loaded with pinned revision `main` so Transformers will not silently fetch a newer remote modeling file during training.

Complete inference example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "User01110/testing-50M"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a neural network is in simple terms."},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        temperature=0.7,
        top_k=40,
        top_p=0.95,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

new_tokens = output[0, inputs["input_ids"].shape[-1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print(text)
```

## Dataset Mix

| Dataset | Config | Split | Rows | Schema | Mapping | Pass policy |
| --- | --- | --- | ---: | --- | --- | --- |
| nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 | default | reasoning_off | 1,068,273 | messages[{role, content, reasoning_content}] | user/assistant message pairs; reasoning_off only | reloops until max_steps |
| microsoft/orca-math-word-problems-200k | default | train | 200,035 | question, answer | user=question; assistant=answer | reloops until max_steps |
| TIGER-Lab/MathInstruct | default | train | 262,039 | source, instruction, output | user=instruction; assistant=output | reloops until max_steps |
| User01110/math-curated-dataset | default | train | 50,944 | id, source, prompt, index, model, response, chatml | user=prompt; assistant=response; rebuilds clean ChatML | reloops until max_steps |
| Programming-Language/codeagent-python | default | train | 296,837 | prompt, response | user=prompt; assistant=response | reloops until max_steps |
| Cutecat6152/python-data-basic | default | train | 100 | id, instruction, response | user=instruction; assistant=response | max 3 passes, 300 presentations max |
| flytech/python-codes-25k | default | train | 49,626 | instruction, input, output, text | user=instruction plus optional Input block; assistant=output | reloops until max_steps |
| QuixiAI/open-instruct-uncensored | default | train | 1,756,115 | dataset, id, messages[{role, content}] | user/assistant message pairs | reloops until max_steps |
| openai/gsm8k | main | train | 7,473 | question, answer | user=question; assistant=answer | reloops until max_steps |
| openai/gsm8k | socratic | train | 7,473 | question, answer | user=question; assistant=answer | reloops until max_steps |
| EleutherAI/arithmetic | 10 validation subsets | validation raw JSONL | 20,000 | context, completion | user=context with trailing Answer: stripped; assistant=completion | reloops until max_steps |

## Notes

- Dataset schemas and row counts were checked through Hugging Face Dataset Viewer metadata where available.
- Multiturn/message datasets carry all assistant spans into the collator, so user/system text remains masked from step 0 while every assistant turn is supervised.
- Streaming source open/read failures are retried and reopened. Normal stream exhaustion reopens that source and continues mixing it until `max_steps`; `python-data-basic` is dropped after 3 completed passes.
- RoPE buffers and tokenizer/model load are verified during final export.