---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceTB/smol-smoltalk
language:
- en
pipeline_tag: text-generation
base_model:
- user-anto/Axiom-Dense-380M-Base
tags:
- causal-lm
- fine-tuned
- instruct-model
- custom-architecture
- pytorch
- tiktoken
- chatml
---

# Axiom-Dense-380M-Instruct

Axiom-Dense-380M-Instruct is an instruction-following, decoder-only causal language model. It was produced by Supervised Fine-Tuning (SFT) of the base model [Axiom-Dense-380M-Base](https://huggingface.co/user-anto/Axiom-Dense-380M-Base) on instruction-response conversational data.

## Quickstart

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "user-anto/Axiom-Dense-380M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu")

# Single-turn ChatML prompt that ends with an open assistant turn.
prompt = "<|im_start|>user\nWrite a short email to my team about meeting tomorrow.<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cpu")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,  # temperature and top_p only take effect when sampling
        temperature=0.2,
        top_p=0.85,
        repetition_penalty=1.15,
        no_repeat_ngram_size=3,
    )

print(tokenizer.decode(outputs[0]))
```

## Model Summary

- Model type: decoder-only Transformer (causal LM)
- Parameter count: 385,849,344
- Context length: 1,024 tokens
- Vocabulary: 100,277 (`tiktoken` `cl100k_base` with ChatML special tokens patched)
- Training objective: autoregressive supervised fine-tuning (SFT) with target masking (loss is computed only on the assistant's responses)
- Prompt format: ChatML (`<|im_start|>`, `<|im_end|>`)

## Architecture

This model keeps the same dense Transformer stack as the base model and uses added special tokens to delimit speaker turns.

- Hidden size: 1024
- Layers: 24
- Attention heads: 16
- KV heads: 8 (GQA)
- FFN multiplier: 2.6667 (rounded to an intermediate dimension of 2816)
- Normalization: RMSNorm
- Positional encoding: RoPE (`theta=10000`)
- Activation: SwiGLU
- Special tokens: `<|im_start|>` (100264) and `<|im_end|>` (100265) for ChatML boundaries

## Training Data

- Source dataset: `HuggingFaceTB/smol-smoltalk`
- Local dataset path during training: `data/smol-smoltalk`
- SFT targets: loss is computed only on assistant response tokens; prompt and user tokens are masked out.
- Total training tokens: 204,802,175 (~0.205B tokens)
- Validation tokens: 197,825

## SFT Training Setup

- Effective tokens per optimizer step: 319,488 (`batch_size=1`, `seq_len=1024`, `grad_accum=312`)
- Total optimizer steps: 641
- Optimizer: AdamW8bit (with bitsandbytes)
- LR schedule: warmup, constant phase, cosine decay
- Warmup steps: 51 (8% of training)
- Cosine decay phase: 102 steps (16% of training, starting at step 539)
- LR max/min: 3e-4 / 3e-5 (warmup starts from 1.5e-4)
- Weight decay: 0.1
- Precision: bfloat16
- Gradient checkpointing: enabled

## Evaluation Snapshot

- Pretraining base perplexity: 18.1233
- Best observed SFT eval loss: 1.2641 at step 630
- Best observed SFT eval perplexity: 3.5398 at step 630
- Final SFT step (640) eval loss: 1.2868
- Final SFT step (640) eval perplexity: 3.6210

SFT taught the model to follow the ChatML prompt format and substantially reduced perplexity on the conversational validation targets.

## Chat Format

This model uses the standard **ChatML** format.

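As a rough illustration of the ChatML layout, a list of role/content messages could be rendered into a prompt string with a small helper. The function name `render_chatml`, its `add_generation_prompt` flag, and the message dictionary layout are assumptions made for this sketch and are not part of the released code.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages as ChatML text (hypothetical helper, for illustration only)."""
    parts = []
    for message in messages:
        # Each turn has the shape: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Leave an assistant turn open so the model writes the reply.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [{"role": "user", "content": "Write a short email to my team about meeting tomorrow."}]
)
print(prompt)
```

With the default `add_generation_prompt=True`, the rendered string is identical to the prompt used in the Quickstart.
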
A typical chat turn looks like:

```text
<|im_start|>user
Write a short email to my team about meeting tomorrow.<|im_end|>
<|im_start|>assistant
Subject: Meeting Tomorrow...<|im_end|>
```

## Intended Use

- Assistant-style task completion
- Multi-turn conversational chat
- Zero-shot and few-shot instruction-following
- Educational use and custom model inference experimentation

## Out-of-Scope / Limitations

- Safety-critical domains (medical, legal, financial advice)
- Deployment in production without robust safety classifiers and filters
- Handling long contexts beyond the 1,024-token limit
- Language support beyond English (which dominates the smoltalk dataset)

## Tokenization

- Tokenizer: `tiktoken` with `cl100k_base` base ranks
- Patched special tokens:
  - `<|endoftext|>` = 100257 (EOS/PAD)
  - `<|im_start|>` = 100264
  - `<|im_end|>` = 100265
  - `<|endofprompt|>` = 100276
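
The special tokens listed above can be sketched in code. The snippet below shows one way such a patched encoding might be rebuilt with `tiktoken`, following the extension pattern shown in the `tiktoken` README. The encoding name `cl100k_chatml` and the helper `build_encoding` are hypothetical, and the released tokenizer files may have been constructed differently. `tiktoken` is imported lazily, and loading `cl100k_base` needs its rank file to be downloadable or already cached.

```python
# Special-token IDs as listed in the Tokenization section.
SPECIAL_TOKENS = {
    "<|endoftext|>": 100257,  # EOS/PAD
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|endofprompt|>": 100276,
}

# The highest special-token ID plus one equals the stated vocabulary size.
VOCAB_SIZE = max(SPECIAL_TOKENS.values()) + 1


def build_encoding():
    """Rebuild cl100k_base with the ChatML tokens patched in (assumed approach)."""
    import tiktoken  # lazy import: only needed when the encoding is actually built

    base = tiktoken.get_encoding("cl100k_base")
    return tiktoken.Encoding(
        name="cl100k_chatml",  # hypothetical name chosen for this sketch
        pat_str=base._pat_str,
        mergeable_ranks=base._mergeable_ranks,
        special_tokens={**base._special_tokens, **SPECIAL_TOKENS},
    )


print(VOCAB_SIZE)
```

Under these assumptions, calling `build_encoding().encode(text, allowed_special="all")` on a ChatML prompt should map `<|im_start|>` and `<|im_end|>` to the single IDs 100264 and 100265 instead of splitting them into ordinary sub-tokens.
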