---
base_model: SupraLabs/Supra-1.5-50M-Base-exp
library_name: transformers
tags:
- sft
- chatml
- trl
- python
- math
- instruction-tuned
---

# supralabs-50M-testing

This is an experimental ChatML SFT run from `SupraLabs/Supra-1.5-50M-Base-exp`.

## Training Setup

| Field | Value |
| --- | --- |
| Base model | `SupraLabs/Supra-1.5-50M-Base-exp` |
| Output repo | `User01110/supralabs-50M-testing` |
| Sequence length | 1024 |
| Max optimizer steps | 10,000 |
| Per-device batch size | 128 |
| Gradient accumulation | 4 |
| Sample presentations per GPU | 5,120,000 |
| Max token slots per GPU | 5,242,880,000 |
| Learning rate | 2.00e-04 |
| Warmup steps | 100 |
| Weight decay | 0.05 |
| Save/push cadence | every 1,000 optimizer steps plus final |
| Loss mask | assistant response only |
| Chat format | ChatML |
| System prompt | `You are a helpful assistant.` |

The stream reloops datasets as needed to reach the fixed step budget. `Cutecat6152/python-data-basic` is capped at three passes because it has only 100 rows.

Unique one-pass source rows listed below: 3,667,971.

First-cycle source presentations with the `python-data-basic` cap included: 3,668,171.

The 10,000-step training budget presents 5,120,000 examples per GPU (10,000 steps × 128 samples × 4 accumulation steps), so larger sources are expected to reloop during training.

## ChatML Compatibility

The tokenizer is saved with:

| Token | Purpose |
| --- | --- |
| `<\|im_start\|>` | ChatML message start |
| `<\|im_end\|>` | ChatML message end |

The uploaded tokenizer includes the ChatML template, so inference and future SFT should not require manually adding these tokens again.
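
For reference, the layout these two tokens produce can be sketched in plain Python. This is a minimal sketch that assumes the conventional ChatML layout (role name after `<|im_start|>`, a newline, the content, then `<|im_end|>`). `render_chatml` is a hypothetical helper, not part of this repo, and the template actually shipped with the tokenizer may differ in whitespace details.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages, add_generation_prompt=True):
    """Hypothetical helper: render messages in the conventional ChatML layout."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


example = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
])
print(example)
```

In practice the tokenizer's own `apply_chat_template` should be preferred, since it uses the template stored with the model rather than an assumed one.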

Example prompt:

```python
from transformers import AutoTokenizer

# Load the tokenizer (with its ChatML template) from the output repo.
tokenizer = AutoTokenizer.from_pretrained("User01110/supralabs-50M-testing")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a neural network is in simple terms."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

## Dataset Mix

| Dataset | Config | Split | Rows | Schema | Mapping | Pass policy |
| --- | --- | --- | ---: | --- | --- | --- |
| nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 | default | reasoning_off | 1,068,273 | messages[{role, content, reasoning_content}] | user/assistant message pairs; reasoning_off only | reloops as needed |
| microsoft/orca-math-word-problems-200k | default | train | 200,035 | question, answer | user=question; assistant=answer | reloops as needed |
| TIGER-Lab/MathInstruct | default | train | 262,039 | instruction, output | user=instruction; assistant=output | reloops as needed |
| Programming-Language/codeagent-python | default | train | 296,837 | prompt, response | user=prompt; assistant=response | reloops as needed |
| Cutecat6152/python-data-basic | default | train | 100 | id, instruction, response | user=instruction; assistant=response | max 3 passes, 300 presentations max |
| flytech/python-codes-25k | default | train | 49,626 | instruction, input, output, text | user=instruction plus optional Input block; assistant=output | reloops as needed |
| QuixiAI/open-instruct-uncensored | default | train | 1,756,115 | dataset, id, messages[{role, content}] | user/assistant message pairs | reloops as needed |
| openai/gsm8k | main | train | 7,473 | question, answer | user=question; assistant=answer | reloops as needed |
| openai/gsm8k | socratic | train | 7,473 | question, answer | user=question; assistant=answer | reloops as needed |
| EleutherAI/arithmetic | 10 selected subsets | validation raw JSONL | 20,000 | context, completion | user=context with trailing `Answer:` stripped; assistant=completion | reloops as needed |

## Notes

- Dataset schemas and row counts were checked through Hugging Face Dataset Viewer metadata where available.
- Nemotron is loaded from the direct `reasoning_off.jsonl` file to avoid mixing in reasoning-on schema fields.
- EleutherAI arithmetic is loaded from raw JSONL files to avoid old dataset-script loading issues.
- RoPE buffers and tokenizer/model load are verified during final export.
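
The budget figures in the Training Setup table and the row totals quoted alongside the dataset mix follow from simple arithmetic, sketched below. The variable names are illustrative and are not taken from the training script.

```python
# Values copied from the Training Setup table.
max_steps = 10_000
per_device_batch = 128
grad_accum = 4
seq_len = 1024

presentations_per_gpu = max_steps * per_device_batch * grad_accum
token_slots_per_gpu = presentations_per_gpu * seq_len

# Row counts copied from the Dataset Mix table, in table order.
rows = [
    1_068_273, 200_035, 262_039, 296_837, 100,
    49_626, 1_756_115, 7_473, 7_473, 20_000,
]
unique_rows = sum(rows)

# python-data-basic (100 rows) may be shown up to 3 times,
# which adds 2 extra passes on top of the one-pass total.
first_cycle = unique_rows + 100 * (3 - 1)

# Rough number of passes over the first-cycle mix per GPU.
approx_passes = presentations_per_gpu / first_cycle

print(presentations_per_gpu)  # 5120000
print(token_slots_per_gpu)    # 5242880000
print(unique_rows)            # 3667971
print(first_cycle)            # 3668171
```

The ratio `approx_passes` comes out at roughly 1.4, which is why the larger sources are expected to reloop. How often each individual source repeats depends on how the stream interleaves the datasets, which this card does not specify.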