Instructions to use user-anto/Axiom-Dense-380M-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use user-anto/Axiom-Dense-380M-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="user-anto/Axiom-Dense-380M-Instruct", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("user-anto/Axiom-Dense-380M-Instruct", trust_remote_code=True, dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use user-anto/Axiom-Dense-380M-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "user-anto/Axiom-Dense-380M-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "user-anto/Axiom-Dense-380M-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/user-anto/Axiom-Dense-380M-Instruct

SGLang

How to use user-anto/Axiom-Dense-380M-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "user-anto/Axiom-Dense-380M-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "user-anto/Axiom-Dense-380M-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "user-anto/Axiom-Dense-380M-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "user-anto/Axiom-Dense-380M-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use user-anto/Axiom-Dense-380M-Instruct with Docker Model Runner:
```
docker model run hf.co/user-anto/Axiom-Dense-380M-Instruct
```

Axiom-Dense-380M-Instruct / README.md

user-anto

Update README.md

8735fed verified 7 days ago

preview code

raw

history blame contribute delete

4.49 kB

	---
	library_name: transformers
	license: apache-2.0
	datasets:
	- HuggingFaceTB/smol-smoltalk
	language:
	- en
	pipeline_tag: text-generation
	base_model:
	- user-anto/Axiom-Dense-380M-Base
	tags:
	- causal-lm
	- fine-tuned
	- instruct-model
	- custom-architecture
	- pytorch
	- tiktoken
	- chatml
	---

	<p align="center">
	<img src="./axiom_logo.png" width="220">
	</p>

	# Axiom-Dense-380M-Instruct

	Axiom-Dense-380M-Instruct is a fine-tuned, instruction-following decoder-only causal language model. It was trained by performing Supervised Fine-Tuning (SFT) on the base model [Axiom-Dense-380M-Base](https://huggingface.co/user-anto/Axiom-Dense-380M-Base) using instruction-response conversational data.

	# Quickstart

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_name = "user-anto/Axiom-Dense-380M-Instruct"

	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu")

	prompt = "<\|im_start\|>user\nWrite a short email to my team about meeting tomorrow.<\|im_end\|>\n<\|im_start\|>assistant\n"
	inputs = tokenizer(prompt, return_tensors="pt").to("cpu")

	with torch.no_grad():
	outputs = model.generate(
	**inputs,
	max_new_tokens=128,
	temperature=0.2,
	top_p=0.85,
	repetition_penalty=1.15,
	no_repeat_ngram_size=3,
	)

	print(tokenizer.decode(outputs[0]))
	```

	## Model Summary

	- Model type: decoder-only Transformer (causal LM)
	- Parameter count: 385,849,344
	- Context length: 1,024 tokens
	- Vocabulary: 100,277 (`tiktoken` `cl100k_base` with ChatML special tokens patched)
	- Training objective: Autoregressive supervised fine-tuning (SFT) using target masking (only computing loss on the assistant's responses)
	- Prompt format: ChatML (`<\|im_start\|>`, `<\|im_end\|>`)

	## Architecture

	This model preserves the same dense Transformer stack as the base model, but utilizes added special tokens to delimit speaker turns during inference.

	- Hidden size: 1024
	- Layers: 24
	- Attention heads: 16
	- KV heads: 8 (GQA)
	- FFN multiplier: 2.6667 (rounded to 2816 intermediate dimension)
	- Normalization: RMSNorm
	- Positional encoding: RoPE (`theta=10000`)
	- Activation: SwiGLU
	- Special tokens: `<\|im_start\|>` (100264) and `<\|im_end\|>` (100265) for ChatML boundaries

	## Training Data

	- Source dataset: `HuggingFaceTB/smol-smoltalk`
	- Local dataset path during training: `data/smol-smoltalk`
	- SFT targets: Computes loss only on assistant response tokens, masking out prompt and user tokens.
	- Total training tokens: 204,802,175 (~0.205B tokens)
	- Validation tokens: 197,825 tokens

	## SFT Training Setup

	- Effective tokens per optimizer step: 319,488 (`batch_size=1`, `seq_len=1024`, `grad_accum=312`)
	- Total optimizer steps: 641
	- Optimizer: AdamW8bit (with bitsandbytes)
	- LR schedule: warmup, constant phase, cosine decay
	- Warmup steps: 51 steps (8% of training)
	- Cosine decay phase: 102 steps (16% of training, starting at step 539)
	- LR max/min: 3e-4 / 3e-5 (initial learning rate starts at 1.5e-4 during warmup)
	- Weight decay: 0.1
	- Precision: bfloat16
	- Gradient checkpointing: enabled

	## Evaluation Snapshot

	- Pretraining base perplexity: 18.1233
	- Best observed SFT eval loss: 1.2641 at step 630
	- Best observed SFT eval perplexity: 3.5398 at step 630
	- Final SFT step (640) eval loss: 1.2868
	- Final SFT step (640) eval perplexity: 3.6210

	The SFT process successfully aligned the model to follow prompt formats and drastically reduced perplexity on conversational validation targets.

	## Chat Format

	This model uses the standard ChatML system format. A typical chat turn looks like:

	```text
	<\|im_start\|>user
	Write a short email to my team about meeting tomorrow.<\|im_end\|>
	<\|im_start\|>assistant
	Subject: Meeting Tomorrow...<\|im_end\|>
	```

	## Intended Use

	- Assistant-style task completion
	- Multi-turn conversational chat
	- Zero-shot and few-shot instruction-following
	- Educational use and custom model inference experimentation

	## Out-of-Scope / Limitations

	- Safety-critical domains (medical, legal, financial advice)
	- Deployment in production without robust safety classifiers and filters
	- Handling long contexts beyond the 1,024-token limit
	- Language support beyond English (which dominates the smoltalk dataset)

	## Tokenization

	- Tokenizer: `tiktoken` with `cl100k_base` base ranks
	- Patched special tokens:
	- `<\|endoftext\|>` = 100257 (EOS/PAD)
	- `<\|im_start\|>` = 100264
	- `<\|im_end\|>` = 100265
	- `<\|endofprompt\|>` = 100276