---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
base_model: SupraLabs/Supra-1.5-50M-Base-exp
base_model_relation: finetune
datasets:
- nvidia/Nemotron-SFT-Instruction-Following-Chat-v2
- microsoft/orca-math-word-problems-200k
- TIGER-Lab/MathInstruct
- User01110/math-curated-dataset
- Programming-Language/codeagent-python
- Cutecat6152/python-data-basic
- flytech/python-codes-25k
- QuixiAI/open-instruct-uncensored
- openai/gsm8k
- EleutherAI/arithmetic
tags:
- sft
- exact-loss-trainer
- chatml
- python
- math
- code
- instruction-tuned
---

# testing-50M

This is an experimental instruction SFT run from `SupraLabs/Supra-1.5-50M-Base-exp`.

## Training Setup

| Field | Value |
| --- | --- |
| Base model | `SupraLabs/Supra-1.5-50M-Base-exp` |
| Base revision | `main` |
| Output repo | `User01110/testing-50M` |
| Sequence length | 1024 |
| Max optimizer steps | 10,000 |
| Per-device batch size | 128 |
| Gradient accumulation | 4 |
| Sample presentations per GPU | 5,120,000 |
| Max token slots per GPU | 5,242,880,000 |
| Learning rate | 2.00e-04 |
| Warmup steps | 100 |
| Weight decay | 0.05 |
| Save/push cadence | every 1,000 optimizer steps plus final |
| Loss masking | assistant-span-only from step 0 |
| Loss logging | printed `loss` is normalized by gradient accumulation; `raw_sum` is the Trainer sum over 4 microbatches |
| Gate logging | novelty score if the loaded architecture exposes `last_gate`; otherwise `n/a` |
| Prompt format | ChatML |
| System prompt | `You are a helpful assistant.` |

The stream randomly mixes the selected instruction, math, and coding sources. Sources are reopened after exhaustion and keep relooping until the 10,000-step training cap finishes, except `Cutecat6152/python-data-basic`, which is capped at 3 passes.

Listed source rows before relooping: 3,718,915.

The 10,000-step training budget presents 5,120,000 examples per GPU.
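
The budget figures in the table follow from the step count, batch size, accumulation factor, and sequence length. Below is a minimal arithmetic sketch, assuming each optimizer step consumes `per_device_batch_size * gradient_accumulation` examples per GPU and each example occupies one full 1024-token slot. The variable names are illustrative and are not taken from the training script.

```python
# Values copied from the Training Setup table.
max_optimizer_steps = 10_000
per_device_batch_size = 128
gradient_accumulation = 4
sequence_length = 1024

# Examples seen by one GPU over the whole run (assumption: every
# microbatch is full, so each optimizer step consumes 128 * 4 examples).
samples_per_gpu = max_optimizer_steps * per_device_batch_size * gradient_accumulation

# Upper bound on token positions per GPU (assumption: each example is
# padded or packed to exactly `sequence_length` positions).
token_slots_per_gpu = samples_per_gpu * sequence_length

print(f"{samples_per_gpu:,}")      # 5,120,000
print(f"{token_slots_per_gpu:,}")  # 5,242,880,000
```

The token-slot figure is a ceiling, not a count of supervised tokens: with assistant-span-only loss masking, only the assistant positions inside those slots contribute to the loss.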

## Prompt Template Compatibility

The uploaded tokenizer includes the ChatML special tokens and chat template, so inference and future SFT should not require manually adding `<|im_start|>` or `<|im_end|>`.

ChatML messages are rendered as:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{ user_message }<|im_end|>
<|im_start|>assistant
```

The training script starts from the base checkpoint, adds `<|im_start|>` and `<|im_end|>` once as tokenizer special tokens, resizes embeddings once, saves the tokenizer with `chat_template`, disables automatic post-processing during pretokenized SFT, and keeps/saves the model context config with `max_position_embeddings >= 1024`. The base model is loaded with pinned revision `main` so Transformers will not silently fetch a newer remote modeling file during training.

Complete inference example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "User01110/testing-50M"

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a neural network is in simple terms."},
]

# Render the conversation with the saved ChatML template and append the
# opening assistant tag so the model continues as the assistant.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# The template already contains the special tokens, so do not add them again.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        # Greedy decoding. Sampling parameters (temperature, top_k, top_p)
        # are ignored when do_sample=False, so they are omitted here; to
        # sample, set do_sample=True and pass them explicitly.
        do_sample=False,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Keep only the newly generated tokens, dropping the prompt.
new_tokens = output[0, inputs["input_ids"].shape[-1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print(text)
```

## Dataset Mix

| Dataset | Config | Split | Rows | Schema | Mapping | Pass policy |
| --- | --- | --- | ---: | --- | --- | --- |
| nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 | default | reasoning_off | 1,068,273 | messages[{role, content, reasoning_content}] | user/assistant message pairs; reasoning_off only | reloops until max_steps |
| microsoft/orca-math-word-problems-200k | default | train | 200,035 | question, answer | user=question; assistant=answer | reloops until max_steps |
| TIGER-Lab/MathInstruct | default | train | 262,039 | source, instruction, output | user=instruction; assistant=output | reloops until max_steps |
| User01110/math-curated-dataset | default | train | 50,944 | id, source, prompt, index, model, response, chatml | user=prompt; assistant=response; rebuilds clean ChatML | reloops until max_steps |
| Programming-Language/codeagent-python | default | train | 296,837 | prompt, response | user=prompt; assistant=response | reloops until max_steps |
| Cutecat6152/python-data-basic | default | train | 100 | id, instruction, response | user=instruction; assistant=response | max 3 passes, 300 presentations max |
| flytech/python-codes-25k | default | train | 49,626 | instruction, input, output, text | user=instruction plus optional Input block; assistant=output | reloops until max_steps |
| QuixiAI/open-instruct-uncensored | default | train | 1,756,115 | dataset, id, messages[{role, content}] | user/assistant message pairs | reloops until max_steps |
| openai/gsm8k | main | train | 7,473 | question, answer | user=question; assistant=answer | reloops until max_steps |
| openai/gsm8k | socratic | train | 7,473 | question, answer | user=question; assistant=answer | reloops until max_steps |
| EleutherAI/arithmetic | 10 validation subsets | validation raw JSONL | 20,000 | context, completion | user=context with trailing Answer: stripped; assistant=completion | reloops until max_steps |

## Notes

- Dataset schemas and row counts were checked through Hugging Face Dataset Viewer metadata where available.
- Multiturn/message datasets carry all assistant spans into the collator, so user/system text remains masked from step 0 while every assistant turn is supervised.
- Streaming source open/read failures are retried and reopened. Normal stream exhaustion reopens that source and continues mixing it until `max_steps`; `python-data-basic` is dropped after 3 completed passes.
- RoPE buffers and tokenizer/model load are verified during final export.
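
The assistant-span-only masking mentioned in the notes can be illustrated with a small sketch. This is not the collator used for this run. It assumes assistant spans are available as half-open `(start, end)` token index pairs, and it uses `-100`, the label value that PyTorch's cross-entropy loss ignores by default. `build_labels` and its arguments are hypothetical names.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default


def build_labels(
    input_ids: Sequence[int],
    assistant_spans: Sequence[Tuple[int, int]],
) -> List[int]:
    """Copy token ids into the labels only inside assistant spans.

    Hypothetical helper: every position outside an assistant span
    (system and user text) is set to IGNORE_INDEX and adds no loss.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels


# Toy two-turn conversation: positions 2-3 and 6-7 are assistant tokens.
toy_ids = [10, 11, 12, 13, 14, 15, 16, 17]
toy_labels = build_labels(toy_ids, [(2, 4), (6, 8)])
print(toy_labels)  # [-100, -100, 12, 13, -100, -100, 16, 17]
```

Because every assistant span is copied into the labels, a multi-turn example supervises all of its assistant turns, not just the last one.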
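
The reloop and pass-cap behaviour described in the notes might look roughly like the sketch below. It rests on several assumptions: sources are picked uniformly at random (the card only says the stream mixes them randomly), each source is modelled as a zero-argument callable that reopens it, and the retry logic for open/read failures is left out. `mix_sources`, `openers`, and `max_passes` are hypothetical names.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, Tuple


def mix_sources(
    openers: Dict[str, Callable[[], Iterable]],
    max_passes: Dict[str, int],
    num_samples: int,
    seed: int = 0,
) -> Iterator[Tuple[str, object]]:
    """Yield (source_name, row) pairs until num_samples rows are emitted.

    Assumes every source yields at least one row per pass; an empty,
    uncapped source would be reopened forever.
    """
    rng = random.Random(seed)
    active = {name: iter(open_fn()) for name, open_fn in openers.items()}
    passes = {name: 0 for name in openers}
    emitted = 0
    while emitted < num_samples and active:
        name = rng.choice(sorted(active))  # uniform choice is an assumption
        try:
            row = next(active[name])
        except StopIteration:
            passes[name] += 1
            cap = max_passes.get(name)
            if cap is not None and passes[name] >= cap:
                del active[name]  # capped source: drop after its last pass
            else:
                active[name] = iter(openers[name]())  # reopen and keep mixing
            continue
        emitted += 1
        yield name, row


# Toy stand-ins: a 5-row source that reloops and a 2-row source capped at 3 passes.
toy_openers = {
    "large_source": lambda: range(5),
    "python-data-basic": lambda: ["row_a", "row_b"],
}
samples = list(mix_sources(toy_openers, {"python-data-basic": 3}, num_samples=200))
capped = sum(1 for name, _ in samples if name == "python-data-basic")
print(len(samples), capped)
```

In this toy setup the capped source can contribute at most 2 rows x 3 passes = 6 presentations, mirroring the 100 rows x 3 passes = 300 presentation limit listed for `Cutecat6152/python-data-basic`.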