---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceTB/smol-smoltalk
language:
- en
pipeline_tag: text-generation
base_model:
- user-anto/Axiom-Dense-380M-Base
tags:
- causal-lm
- fine-tuned
- instruct-model
- custom-architecture
- pytorch
- tiktoken
- chatml
---
# Axiom-Dense-380M-Instruct
Axiom-Dense-380M-Instruct is an instruction-following, decoder-only causal language model. It was produced by supervised fine-tuning (SFT) of the base model [Axiom-Dense-380M-Base](https://huggingface.co/user-anto/Axiom-Dense-380M-Base) on instruction-response conversational data.
## Quickstart
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "user-anto/Axiom-Dense-380M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu")

# ChatML prompt: one user turn, then an open assistant turn for the model to complete.
prompt = "<|im_start|>user\nWrite a short email to my team about meeting tomorrow.<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cpu")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,  # needed for temperature and top_p to take effect
        temperature=0.2,
        top_p=0.85,
        repetition_penalty=1.15,
        no_repeat_ngram_size=3,
    )

print(tokenizer.decode(outputs[0]))
```
## Model Summary
- Model type: decoder-only Transformer (causal LM)
- Parameter count: 385,849,344
- Context length: 1,024 tokens
- Vocabulary: 100,277 (`tiktoken` `cl100k_base` with ChatML special tokens patched)
- Training objective: Autoregressive supervised fine-tuning (SFT) using target masking (only computing loss on the assistant's responses)
- Prompt format: ChatML (`<|im_start|>`, `<|im_end|>`)
## Architecture
This model keeps the same dense Transformer stack as the base model and adds special tokens that delimit speaker turns.
- Hidden size: 1024
- Layers: 24
- Attention heads: 16
- KV heads: 8 (GQA)
- FFN multiplier: 2.6667 (1024 × 2.6667 ≈ 2731, rounded up to an intermediate dimension of 2816)
- Normalization: RMSNorm
- Positional encoding: RoPE (`theta=10000`)
- Activation: SwiGLU
- Special tokens: `<|im_start|>` (100264) and `<|im_end|>` (100265) for ChatML boundaries
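
As a sanity check, the listed dimensions can be turned into a rough parameter count. The sketch below assumes tied input/output embeddings, no bias terms, two RMSNorm weight vectors per layer, and one final RMSNorm; none of these are stated in this card, so treat them as assumptions.

```python
# Rough parameter count from the architecture listed above.
# Assumptions (not stated in the card): tied input/output embeddings,
# no bias terms, two RMSNorm weights per layer, one final RMSNorm.
hidden, layers, heads, kv_heads = 1024, 24, 16, 8
ffn_dim, vocab = 2816, 100_277

head_dim = hidden // heads          # 64
kv_dim = kv_heads * head_dim        # 512, because GQA shares K/V across heads

attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q and O, plus K and V
ffn = 3 * hidden * ffn_dim                        # SwiGLU: gate, up, down
norms = 2 * hidden                                # pre-attention and pre-FFN
per_layer = attn + ffn + norms

total = vocab * hidden + layers * per_layer + hidden  # embeddings + blocks + final norm
print(f"{total:,}")
```

Under these assumptions the total comes to 385,849,344, which matches the parameter count given in the Model Summary.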
## Training Data
- Source dataset: `HuggingFaceTB/smol-smoltalk`
- Local dataset path during training: `data/smol-smoltalk`
- SFT targets: loss is computed only on assistant response tokens; prompt and user tokens are masked out.
- Total training tokens: 204,802,175 (~0.205B tokens)
- Validation tokens: 197,825 tokens
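
The target masking described above can be sketched as follows. This is a minimal illustration, not the actual training code: the `build_labels` helper and the toy token IDs are hypothetical, and `-100` is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. Whether the real pipeline also trains on the closing `<|im_end|>` token is not stated in this card.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(segments):
    """Build (input_ids, labels) from (role, token_ids) pairs.

    Labels copy the input IDs for assistant tokens and hold IGNORE_INDEX
    everywhere else, so only assistant tokens contribute to the loss.
    """
    input_ids, labels = [], []
    for role, ids in segments:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels

# Toy token IDs, chosen only for illustration.
segments = [("user", [11, 12, 13]), ("assistant", [21, 22])]
input_ids, labels = build_labels(segments)
print(input_ids)
print(labels)
```

The usual one-position shift between inputs and next-token targets is left to the loss computation.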
## SFT Training Setup
- Effective tokens per optimizer step: 319,488 (`batch_size=1`, `seq_len=1024`, `grad_accum=312`)
- Total optimizer steps: 641
- Optimizer: AdamW8bit (with bitsandbytes)
- LR schedule: warmup, constant phase, cosine decay
- Warmup steps: 51 steps (8% of training)
- Cosine decay phase: 102 steps (16% of training, starting at step 539)
- LR max/min: 3e-4 / 3e-5 (warmup starts at 1.5e-4)
- Weight decay: 0.1
- Precision: bfloat16
- Gradient checkpointing: enabled
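
The schedule above can be approximated with a small function. This is a sketch under assumptions: the warmup is taken to be linear from 1.5e-4 to 3e-4, and steps are counted from 0; the card does not specify either detail, and `lr_at` is a hypothetical name.

```python
import math

MAX_LR, MIN_LR, WARMUP_START_LR = 3e-4, 3e-5, 1.5e-4
TOTAL_STEPS, WARMUP_STEPS, DECAY_START = 641, 51, 539

# Tokens consumed per optimizer step: batch_size * seq_len * grad_accum.
tokens_per_step = 1 * 1024 * 312

def lr_at(step):
    """Learning rate at a given optimizer step (linear warmup is an assumption)."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS
        return WARMUP_START_LR + frac * (MAX_LR - WARMUP_START_LR)
    if step < DECAY_START:
        return MAX_LR  # constant phase
    decay_steps = TOTAL_STEPS - DECAY_START  # 102 steps
    progress = min(1.0, (step - DECAY_START) / decay_steps)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

for step in (0, 51, 300, 539, 590, 641):
    print(step, f"{lr_at(step):.2e}")
```

The `tokens_per_step` value reproduces the 319,488 effective tokens per optimizer step listed above.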
## Evaluation Snapshot
- Pretraining base perplexity: 18.1233
- Best observed SFT eval loss: 1.2641 at step 630
- Best observed SFT eval perplexity: 3.5398 at step 630
- Final SFT step (640) eval loss: 1.2868
- Final SFT step (640) eval perplexity: 3.6210
After SFT, the model follows the ChatML prompt format, and eval perplexity on the conversational validation targets reaches 3.54 at the best checkpoint (step 630), against a pretraining base perplexity of 18.12.
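
The reported eval perplexities appear consistent with taking the exponential of the mean per-token eval loss (in nats). A quick check using the loss values from the list above:

```python
import math

# Eval losses copied from the Evaluation Snapshot list.
eval_losses = {"best (step 630)": 1.2641, "final (step 640)": 1.2868}

# Perplexity is exp(mean cross-entropy loss) when the loss is in nats.
perplexities = {name: math.exp(loss) for name, loss in eval_losses.items()}

for name, ppl in perplexities.items():
    print(f"{name}: perplexity ~ {ppl:.3f}")
```

Small differences in the last decimal place are expected because the losses listed here are already rounded.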
## Chat Format
This model uses the standard **ChatML** format. A typical chat turn looks like:
```text
<|im_start|>user
Write a short email to my team about meeting tomorrow.<|im_end|>
<|im_start|>assistant
Subject: Meeting Tomorrow...<|im_end|>
```
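
A small helper can build this format from a list of messages. The sketch below is illustrative: `to_chatml` is a hypothetical name, and the newline after each `<|im_end|>` follows the prompt string used in the Quickstart.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "Write a short email to my team about meeting tomorrow."},
]
prompt = to_chatml(messages)
print(prompt)
```

For multi-turn chat, append each earlier assistant reply to `messages` as its own entry before calling the helper again.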
## Intended Use
- Assistant-style task completion
- Multi-turn conversational chat
- Zero-shot and few-shot instruction-following
- Educational use and custom model inference experimentation
## Out-of-Scope / Limitations
- Safety-critical domains (medical, legal, financial advice)
- Deployment in production without robust safety classifiers and filters
- Handling long contexts beyond the 1,024-token limit
- Language support beyond English (which dominates the smoltalk dataset)
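
Because of the 1,024-token limit, the prompt and the generated tokens have to share one budget. One simple way to stay within it is to keep only the most recent prompt tokens; the helper below is a hypothetical sketch, not part of the model's code.

```python
CONTEXT_LENGTH = 1024  # context length from the Model Summary

def fit_to_context(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the most recent prompt tokens so prompt + generation fits the context."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return token_ids[-budget:]

# Stand-in token IDs for an over-long prompt.
prompt_ids = list(range(2000))
trimmed = fit_to_context(prompt_ids, max_new_tokens=128)
print(len(trimmed))
```

Plain left truncation can cut through a ChatML turn, so a real chat loop would more likely drop whole earlier turns instead.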
## Tokenization
- Tokenizer: `tiktoken` with `cl100k_base` base ranks
- Patched special tokens:
  - `<|endoftext|>` = 100257 (EOS/PAD)
  - `<|im_start|>` = 100264
  - `<|im_end|>` = 100265
  - `<|endofprompt|>` = 100276
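
For experimentation outside `transformers`, a similar tokenizer can be sketched directly with `tiktoken`, following the extension pattern from the `tiktoken` README. The encoding name `cl100k_chatml` is hypothetical, and merging with the base encoding's own special tokens is an assumption; the actual patching code for this model may differ.

```python
# Special-token IDs as listed above.
SPECIAL_TOKENS = {
    "<|endoftext|>": 100257,  # EOS/PAD
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|endofprompt|>": 100276,
}

# The highest ID plus one gives the vocabulary size from the Model Summary.
VOCAB_SIZE = max(SPECIAL_TOKENS.values()) + 1

def build_encoding():
    """Build a cl100k_base encoding extended with the ChatML special tokens."""
    import tiktoken  # imported lazily; downloads cl100k_base ranks on first use

    base = tiktoken.get_encoding("cl100k_base")
    return tiktoken.Encoding(
        name="cl100k_chatml",  # hypothetical name for this sketch
        pat_str=base._pat_str,
        mergeable_ranks=base._mergeable_ranks,
        special_tokens={**base._special_tokens, **SPECIAL_TOKENS},
    )

print(VOCAB_SIZE)
```

With the encoding built, `build_encoding().encode(text, allowed_special="all")` maps the ChatML markers to their single special-token IDs instead of splitting them into ordinary text tokens.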