---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceTB/smol-smoltalk
language:
- en
pipeline_tag: text-generation
base_model:
- user-anto/Axiom-Dense-380M-Base
tags:
- causal-lm
- fine-tuned
- instruct-model
- custom-architecture
- pytorch
- tiktoken
- chatml
---

<p align="center">
  <img src="./axiom_logo.png" width="220">
</p>

# Axiom-Dense-380M-Instruct

Axiom-Dense-380M-Instruct is an instruction-following, decoder-only causal language model. It was produced by supervised fine-tuning (SFT) of the base model [Axiom-Dense-380M-Base](https://huggingface.co/user-anto/Axiom-Dense-380M-Base) on instruction-response conversational data.

## Quickstart

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "user-anto/Axiom-Dense-380M-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu")

prompt = "<|im_start|>user\nWrite a short email to my team about meeting tomorrow.<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cpu")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,  # temperature and top_p only take effect when sampling
        temperature=0.2,
        top_p=0.85,
        repetition_penalty=1.15,
        no_repeat_ngram_size=3,
    )

print(tokenizer.decode(outputs[0]))
```
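
The decoded string typically contains the prompt and the ChatML markers along with the completion. A minimal sketch of pulling out just the assistant's reply, assuming the markers survive decoding as plain text; `extract_reply` is a hypothetical helper, not part of `transformers` or this repository:

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"

def extract_reply(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded ChatML string."""
    marker = f"{IM_START}assistant\n"
    # Keep everything after the last assistant header.
    _, found, tail = decoded.rpartition(marker)
    if not found:
        return decoded.strip()  # no assistant header: return the stripped input
    # Cut at the end-of-turn marker if the model emitted one.
    return tail.split(IM_END, 1)[0].strip()

example = (
    "<|im_start|>user\nSay hi.<|im_end|>\n"
    "<|im_start|>assistant\nHi there!<|im_end|>"
)
print(extract_reply(example))
```

If generation stops at `max_new_tokens` before `<|im_end|>` is produced, the helper returns the unfinished text as-is.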

## Model Summary

- Model type: decoder-only Transformer (causal LM)
- Parameter count: 385,849,344
- Context length: 1,024 tokens
- Vocabulary: 100,277 (`tiktoken` `cl100k_base` with ChatML special tokens patched in)
- Training objective: autoregressive supervised fine-tuning (SFT) with target masking (loss is computed only on the assistant's responses)
- Prompt format: ChatML (`<|im_start|>`, `<|im_end|>`)

## Architecture

This model keeps the same dense Transformer stack as the base model and uses added special tokens to delimit speaker turns.

- Hidden size: 1024
- Layers: 24
- Attention heads: 16
- KV heads: 8 (GQA)
- FFN multiplier: 2.6667 (intermediate dimension rounded up to 2816)
- Normalization: RMSNorm
- Positional encoding: RoPE (`theta=10000`)
- Activation: SwiGLU
- Special tokens: `<|im_start|>` (100264) and `<|im_end|>` (100265) for ChatML boundaries
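
As a sanity check, the dimensions listed above reproduce the stated parameter count of 385,849,344. The sketch below assumes bias-free linear layers, two RMSNorm weight vectors per layer plus a final one, a head dimension of 1024 / 16 = 64, and an output head tied to the input embedding. None of these details are stated in this card; they are assumptions that happen to match the published figure.

```python
hidden, layers, heads, kv_heads = 1024, 24, 16, 8
ffn_dim, vocab = 2816, 100_277
head_dim = hidden // heads  # 64 (assumed)

# Attention: Q and O are hidden x hidden; K and V are shrunk by GQA.
kv_dim = kv_heads * head_dim  # 512
attn = 2 * hidden * hidden + 2 * hidden * kv_dim

# SwiGLU feed-forward: gate, up, and down projections.
ffn = 3 * hidden * ffn_dim

# Two RMSNorm weight vectors per layer (assumed: pre-attention and pre-FFN).
norms = 2 * hidden

per_layer = attn + ffn + norms
embedding = vocab * hidden  # assumed shared with the output head
final_norm = hidden

total = layers * per_layer + embedding + final_norm
print(f"{total:,}")  # 385,849,344
```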

## Training Data

- Source dataset: `HuggingFaceTB/smol-smoltalk`
- Local dataset path during training: `data/smol-smoltalk`
- SFT targets: loss is computed only on assistant response tokens; prompt and user tokens are masked out.
- Total training tokens: 204,802,175 (~0.205B)
- Validation tokens: 197,825
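
Target masking of this kind is commonly implemented by setting the labels of non-assistant tokens to an ignore index. A minimal sketch, assuming the conversation is already tokenized into per-role segments and that the loss uses PyTorch's conventional ignore index of -100; the segment layout and `build_labels` are illustrative and not the training code used for this model:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(segments):
    """segments: list of (role, token_ids) pairs. Returns (input_ids, labels)."""
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)  # contributes to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # masked out
    return input_ids, labels

# Toy token IDs, not real cl100k_base encodings.
segments = [("user", [11, 12, 13]), ("assistant", [21, 22])]
input_ids, labels = build_labels(segments)
print(input_ids)
print(labels)
```

The one-position shift for next-token prediction is left to the loss computation, as is usual for causal LM training loops.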

## SFT Training Setup

- Effective tokens per optimizer step: 319,488 (`batch_size=1`, `seq_len=1024`, `grad_accum=312`)
- Total optimizer steps: 641
- Optimizer: AdamW8bit (with bitsandbytes)
- LR schedule: warmup, constant phase, cosine decay
- Warmup steps: 51 steps (8% of training)
- Cosine decay phase: 102 steps (16% of training, starting at step 539)
- LR max/min: 3e-4 / 3e-5 (warmup starts at 1.5e-4)
- Weight decay: 0.1
- Precision: bfloat16
- Gradient checkpointing: enabled
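
The schedule and batch arithmetic above can be sketched as follows. This assumes a linear warmup and 0-indexed steps, neither of which is stated in this card; `lr_at` is a hypothetical helper rather than the training script's scheduler.

```python
import math

MAX_LR, MIN_LR, WARMUP_START_LR = 3e-4, 3e-5, 1.5e-4
WARMUP_STEPS, DECAY_START, DECAY_STEPS = 51, 539, 102

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed, assumed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 1.5e-4 to 3e-4 (linear shape is an assumption).
        frac = step / WARMUP_STEPS
        return WARMUP_START_LR + frac * (MAX_LR - WARMUP_START_LR)
    if step < DECAY_START:
        return MAX_LR  # constant phase
    # Cosine decay from MAX_LR to MIN_LR over DECAY_STEPS steps.
    progress = min(1.0, (step - DECAY_START) / DECAY_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

# Tokens consumed per optimizer step: batch_size * seq_len * grad_accum.
tokens_per_step = 1 * 1024 * 312
print(tokens_per_step)  # 319488

for step in (0, 51, 538, 590, 640):
    print(step, f"{lr_at(step):.3e}")
```

With these numbers the constant phase covers steps 51 through 538, and the decay phase ends exactly at step 641 (539 + 102).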

## Evaluation Snapshot

- Pretraining base perplexity: 18.1233
- Best observed SFT eval loss: 1.2641 at step 630
- Best observed SFT eval perplexity: 3.5398 at step 630
- Final SFT step (640) eval loss: 1.2868
- Final SFT step (640) eval perplexity: 3.6210

After SFT the model follows the ChatML prompt format, and perplexity on the conversational validation targets reaches 3.54 at the best checkpoint (step 630), compared with the base perplexity of 18.12 listed above.
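
The perplexities listed above are consistent with the exponential of the eval losses. A quick check, assuming the loss is mean cross-entropy in nats per token (the card does not state the unit, but the numbers agree to within rounding):

```python
import math

# (eval loss, reported perplexity) pairs copied from the list above.
reported = {
    "best (step 630)": (1.2641, 3.5398),
    "final (step 640)": (1.2868, 3.6210),
}

for name, (loss, ppl) in reported.items():
    # Perplexity is exp(mean cross-entropy loss).
    print(f"{name}: exp({loss}) = {math.exp(loss):.4f}, reported {ppl}")
```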

## Chat Format

This model uses the standard **ChatML** format. A typical exchange looks like:

```text
<|im_start|>user
Write a short email to my team about meeting tomorrow.<|im_end|>
<|im_start|>assistant
Subject: Meeting Tomorrow...<|im_end|>
```
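
For multi-turn use, the same layout can be assembled from a list of messages. This card does not say whether the repository's tokenizer ships a chat template, so the sketch below builds the string by hand; `build_chatml_prompt` is a hypothetical helper name.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "Write a short email to my team about meeting tomorrow."},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

For this single user message the result matches the prompt string used in the Quickstart.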

## Intended Use

- Assistant-style task completion
- Multi-turn conversational chat
- Zero-shot and few-shot instruction-following
- Educational use and custom model inference experimentation

## Out-of-Scope / Limitations

- Safety-critical domains (medical, legal, financial advice)
- Deployment in production without robust safety classifiers and filters
- Contexts longer than the 1,024-token limit
- Languages other than English (which dominates the smoltalk dataset)

## Tokenization

- Tokenizer: `tiktoken` with `cl100k_base` base ranks
- Patched special tokens:
  - `<|endoftext|>` = 100257 (EOS/PAD)
  - `<|im_start|>` = 100264
  - `<|im_end|>` = 100265
  - `<|endofprompt|>` = 100276
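
The vocabulary size of 100,277 follows from the highest special-token ID plus one. The sketch below also shows one way such a patched encoding could be built with `tiktoken`, following the extension pattern from the tiktoken README. It is not necessarily how this repository's tokenizer was produced, the encoding name `cl100k_chatml` is made up for illustration, and `build_encoding` needs `tiktoken` installed plus access to the `cl100k_base` ranks.

```python
# Special-token IDs as listed above.
SPECIAL_TOKENS = {
    "<|endoftext|>": 100257,
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|endofprompt|>": 100276,
}

# Vocabulary size is the largest token ID plus one.
VOCAB_SIZE = max(SPECIAL_TOKENS.values()) + 1
print(VOCAB_SIZE)  # 100277

def build_encoding():
    """Build a cl100k_base encoding with the ChatML tokens registered."""
    import tiktoken  # imported lazily so the constants work without it

    base = tiktoken.get_encoding("cl100k_base")
    return tiktoken.Encoding(
        name="cl100k_chatml",  # hypothetical name
        pat_str=base._pat_str,
        mergeable_ranks=base._mergeable_ranks,
        special_tokens=SPECIAL_TOKENS,
    )
```

When encoding text that contains the markers, `tiktoken` requires them to be allowed explicitly, for example `enc.encode(text, allowed_special={"<|im_start|>", "<|im_end|>"})`.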