---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
base_model: SupraLabs/Supra-1.5-50M-Base-exp
base_model_relation: finetune
datasets:
- nvidia/Nemotron-SFT-Instruction-Following-Chat-v2
- microsoft/orca-math-word-problems-200k
- TIGER-Lab/MathInstruct
- User01110/math-curated-dataset
- Programming-Language/codeagent-python
- Cutecat6152/python-data-basic
- flytech/python-codes-25k
- QuixiAI/open-instruct-uncensored
- openai/gsm8k
- EleutherAI/arithmetic
tags:
- sft
- exact-loss-trainer
- chatml
- python
- math
- code
- instruction-tuned
---

# testing-50M

This is an experimental instruction-tuned (SFT) checkpoint fine-tuned from `SupraLabs/Supra-1.5-50M-Base-exp`.

## Training Setup

| Field | Value |
| --- | --- |
| Base model | `SupraLabs/Supra-1.5-50M-Base-exp` |
| Base revision | `main` |
| Output repo | `User01110/testing-50M` |
| Sequence length | 1024 |
| Max optimizer steps | 10,000 |
| Per-device batch size | 128 |
| Gradient accumulation | 4 |
| Sample presentations per GPU | 5,120,000 |
| Max token slots per GPU | 5,242,880,000 |
| Learning rate | 2.00e-04 |
| Warmup steps | 100 |
| Weight decay | 0.05 |
| Save/push cadence | every 1,000 optimizer steps plus final |
| Loss masking | assistant-span-only from step 0 |
| Loss logging | printed `loss` is normalized by gradient accumulation; `raw_sum` is the Trainer sum over 4 microbatches |
| Gate logging | novelty score if the loaded architecture exposes `last_gate`; otherwise `n/a` |
| Prompt format | ChatML |
| System prompt | `You are a helpful assistant.` |

The stream randomly mixes the selected instruction, math, and coding sources. Sources are reopened after exhaustion and keep relooping until the 10,000-step training cap finishes, except `Cutecat6152/python-data-basic`, which is capped at 3 passes.
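
The relooping behaviour described above can be sketched as follows. This is a minimal illustration, not the training script itself: `mix_streams`, the `openers` callables, and the uniform choice between live sources are all assumptions, since the card only says the mix is random.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


def mix_streams(
    openers: Dict[str, Callable[[], Iterable]],
    max_passes: Dict[str, Optional[int]],
    total: int,
    seed: int = 0,
) -> Iterator[Tuple[str, object]]:
    """Yield `total` (source, row) pairs from randomly chosen live sources.

    Hypothetical helper: assumes every source is non-empty and that sources
    are picked uniformly at random.
    """
    rng = random.Random(seed)
    iters = {name: iter(opener()) for name, opener in openers.items()}
    passes = {name: 0 for name in openers}  # completed passes per source
    emitted = 0
    while emitted < total and iters:
        name = rng.choice(sorted(iters))
        try:
            row = next(iters[name])
        except StopIteration:
            passes[name] += 1
            cap = max_passes.get(name)
            if cap is not None and passes[name] >= cap:
                del iters[name]  # capped source is dropped for good
            else:
                iters[name] = iter(openers[name]())  # reopen and keep mixing
            continue
        emitted += 1
        yield name, row


# Toy stand-ins for the real datasets: one uncapped source, one capped at 3 passes.
sources = {
    "big": lambda: range(5),
    "python-data-basic": lambda: ["a", "b"],
}
caps = {"big": None, "python-data-basic": 3}
rows = list(mix_streams(sources, caps, total=200))
capped_count = sum(1 for name, _ in rows if name == "python-data-basic")
```

In this toy run the capped source can contribute at most 2 rows × 3 passes = 6 presentations, mirroring the 300-presentation ceiling for the 100-row `python-data-basic` set.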

Listed source rows before relooping: 3,718,915. The 10,000-step training budget presents 5,120,000 examples per GPU.
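
The per-GPU budget figures follow directly from the values in the Training Setup table. The variable names below are illustrative only:

```python
# Values copied from the Training Setup table.
max_steps = 10_000
per_device_batch = 128
grad_accum = 4
seq_len = 1024

samples_per_step = per_device_batch * grad_accum   # examples per optimizer step, per GPU
samples_per_gpu = max_steps * samples_per_step     # total sample presentations per GPU
token_slots_per_gpu = samples_per_gpu * seq_len    # upper bound: every slot filled to 1024

print(samples_per_gpu)      # 5120000
print(token_slots_per_gpu)  # 5242880000
```

The token figure is a maximum because it counts every sequence as using all 1024 positions.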

## Prompt Template Compatibility

The uploaded tokenizer includes the ChatML special tokens and chat template, so inference and future SFT should not require manually adding `<|im_start|>` or `<|im_end|>`.

ChatML messages are rendered as:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{ user_message }<|im_end|>
<|im_start|>assistant
```
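
For readers who want to see the layout spelled out, here is a minimal sketch of the same rendering in plain Python. `render_chatml` is a hypothetical helper, and the exact newline placement is an assumption based on the template shown above; the `chat_template` shipped with the tokenizer remains the authoritative version.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string (illustrative)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
    ]
)
```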

The training script starts from the base checkpoint and then:

- adds `<|im_start|>` and `<|im_end|>` once as tokenizer special tokens;
- resizes the embeddings once;
- saves the tokenizer together with its `chat_template`;
- disables automatic tokenizer post-processing during pretokenized SFT;
- keeps and saves the model context config with `max_position_embeddings >= 1024`.

The base model is loaded with an explicit `revision="main"`. Because `main` is a branch that can move, only a commit hash would guarantee that Transformers never fetches a newer remote modeling file during training.

Complete inference example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "User01110/testing-50M"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a neural network is in simple terms."},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,  # sampling must be enabled for temperature/top_k/top_p to take effect
        temperature=0.7,
        top_k=40,
        top_p=0.95,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

new_tokens = output[0, inputs["input_ids"].shape[-1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print(text)
```

## Dataset Mix

| Dataset | Config | Split | Rows | Schema | Mapping | Pass policy |
| --- | --- | --- | ---: | --- | --- | --- |
| nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 | default | reasoning_off | 1,068,273 | messages[{role, content, reasoning_content}] | user/assistant message pairs; reasoning_off only | reloops until max_steps |
| microsoft/orca-math-word-problems-200k | default | train | 200,035 | question, answer | user=question; assistant=answer | reloops until max_steps |
| TIGER-Lab/MathInstruct | default | train | 262,039 | source, instruction, output | user=instruction; assistant=output | reloops until max_steps |
| User01110/math-curated-dataset | default | train | 50,944 | id, source, prompt, index, model, response, chatml | user=prompt; assistant=response; rebuilds clean ChatML | reloops until max_steps |
| Programming-Language/codeagent-python | default | train | 296,837 | prompt, response | user=prompt; assistant=response | reloops until max_steps |
| Cutecat6152/python-data-basic | default | train | 100 | id, instruction, response | user=instruction; assistant=response | max 3 passes, 300 presentations max |
| flytech/python-codes-25k | default | train | 49,626 | instruction, input, output, text | user=instruction plus optional Input block; assistant=output | reloops until max_steps |
| QuixiAI/open-instruct-uncensored | default | train | 1,756,115 | dataset, id, messages[{role, content}] | user/assistant message pairs | reloops until max_steps |
| openai/gsm8k | main | train | 7,473 | question, answer | user=question; assistant=answer | reloops until max_steps |
| openai/gsm8k | socratic | train | 7,473 | question, answer | user=question; assistant=answer | reloops until max_steps |
| EleutherAI/arithmetic | 10 validation subsets | validation raw JSONL | 20,000 | context, completion | user=context with trailing Answer: stripped; assistant=completion | reloops until max_steps |
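
Two of the less obvious mappings in the table can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, and the exact layout of the "Input block", as well as the whitespace stripping, are guesses rather than the training script's actual behaviour.

```python
def map_flytech(row):
    """flytech/python-codes-25k: user = instruction plus an optional Input block."""
    user = row["instruction"].strip()
    extra = (row.get("input") or "").strip()
    if extra:
        user = f"{user}\n\nInput:\n{extra}"  # block layout is an assumption
    return {"user": user, "assistant": row["output"].strip()}


def map_arithmetic(row):
    """EleutherAI/arithmetic: drop the trailing 'Answer:' cue from the context."""
    context = row["context"].rstrip()
    if context.endswith("Answer:"):
        context = context[: -len("Answer:")].rstrip()
    return {"user": context, "assistant": row["completion"].strip()}


# Made-up rows in the documented schemas, for illustration only.
flytech_pair = map_flytech(
    {"instruction": "Sort the list.", "input": "[3, 1, 2]", "output": "sorted([3, 1, 2])"}
)
arithmetic_pair = map_arithmetic(
    {"context": "Question: What is 2 plus 3?\nAnswer:", "completion": " 5"}
)
```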

## Notes

- Dataset schemas and row counts were checked through Hugging Face Dataset Viewer metadata where available.
- Multiturn/message datasets carry all assistant spans into the collator, so user/system text remains masked from step 0 while every assistant turn is supervised.
- Streaming source open/read failures are retried and reopened. Normal stream exhaustion reopens that source and continues mixing it until `max_steps`; `python-data-basic` is dropped after 3 completed passes.
- RoPE buffers and tokenizer/model load are verified during final export.
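
The assistant-span-only masking mentioned in the notes can be illustrated with a small sketch. It assumes the collator receives half-open `(start, end)` token index spans for each assistant turn; `build_labels` is a hypothetical helper, and `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(input_ids, assistant_spans):
    """Copy token ids into labels only inside assistant spans; mask everything else."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:  # half-open [start, end)
        labels[start:end] = input_ids[start:end]
    return labels


# Toy multiturn sequence: two assistant turns at positions 3-4 and 8-9.
input_ids = list(range(10, 20))
labels = build_labels(input_ids, [(3, 5), (8, 10)])
```

With multiple spans per example, every assistant turn contributes to the loss while system and user tokens stay masked.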