Trellis-506M-SFT
Supervised fine-tuned version of Trellis-506M, a 506M parameter LLaMA-style language model optimized for structured output tasks — JSON generation, function calling, schema compliance, and structured extraction.
Model Details
| Parameter | Value |
|---|---|
| Base model | mdonigian/trellis-pretraining |
| Architecture | LLaMA (LlamaForCausalLM) |
| Parameters | ~506M |
| Hidden size | 1,280 |
| Layers | 24 |
| Attention heads | 20 (10 KV heads, GQA 2:1) |
| Context length | 2,048 |
| Vocab size | 50,304 (50,277 base + 6 chat tokens + padding) |
| Precision | bfloat16 |
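A few derived quantities follow from the table above. The sketch below recomputes them. The padding multiple of 64 is an assumption (the card only says "padding"); a multiple of 128 gives the same result for these numbers.

```python
# Values copied from the Model Details table.
hidden_size = 1280
num_attention_heads = 20
num_kv_heads = 10
base_vocab = 50_277
chat_tokens = 6

# Per-head dimension: the hidden size is split evenly across query heads.
head_dim = hidden_size // num_attention_heads

# GQA group size: number of query heads sharing one key/value head.
gqa_group_size = num_attention_heads // num_kv_heads

# Width of the key (or value) projection under GQA.
kv_proj_width = num_kv_heads * head_dim


def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple (ceiling division, then scale back)."""
    return -(-n // multiple) * multiple


# ASSUMPTION: vocabulary padded to a multiple of 64; not stated in the card.
padded_vocab = pad_to_multiple(base_vocab + chat_tokens, 64)

print(head_dim, gqa_group_size, kv_proj_width, padded_vocab)
```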
SFT Training
Dataset
Trained on mdonigian/full-structured-instruction-sft-dataset (~95k examples), assembled from:
| Source | Target Count | Description |
|---|---|---|
| Glaive Function Calling v2 | ~70,000 | Multi-turn function calling conversations |
| UltraChat 200k | ~30,000 | General instruction-following dialogues |
| Hermes Function Calling v1 | ~20,000 | Single/multi-turn function calling + JSON mode |
| Synthetic JSON schema compliance | ~9,000 | Schema → correct JSON (generated with GPT-5-mini) |
| Synthetic structured extraction | ~5,000 | Text → structured JSON extraction (generated with GPT-5-mini) |
All examples are converted to a common chat format that uses custom special tokens (see Chat Format). Source-specific filtering includes deduplication, token-length capping at 2,048 tokens, and quality validation.
Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Effective batch size | 32 |
| Learning rate | 2e-5 |
| LR schedule | Cosine decay |
| Warmup | 10% of total steps |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Max sequence length | 2,048 |
| Optimizer | AdamW (fused) |
| Precision | bfloat16 |
| Seed | 42 |
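The learning-rate rows above can be read as a warmup-then-cosine curve. This is a minimal sketch, not the training code. It assumes linear warmup from zero and decay to zero, neither of which the table states. The step estimate assumes exactly 95,000 examples with no sequence packing. `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 2e-5          # "Learning rate" row
WARMUP_FRACTION = 0.10  # "Warmup" row


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step (assumed linear warmup, cosine decay to 0)."""
    warmup_steps = int(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Rough step count: ~95k examples / effective batch 32, over 3 epochs.
steps_per_epoch = math.ceil(95_000 / 32)
total_steps = steps_per_epoch * 3

for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, lr_at(step, total_steps))
```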
Training Details
- Framework: TRL `SFTTrainer`
- Attention: Flash Attention 2
- Compilation: `torch.compile` enabled
- Loss masking: Completion-only (loss computed only on assistant response tokens, not system/user/tool tokens)
- Hardware: NVIDIA B200
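Completion-only loss masking from the list above can be illustrated without any training framework. This sketch uses made-up token IDs and a hypothetical `build_labels` helper. It relies on the convention that label `-100` is ignored by PyTorch's cross-entropy loss. Whether the role marker and `<|end|>` tokens count toward the loss in the actual run is not stated, so the sketch simply treats each segment as a unit.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments):
    """Build (input_ids, labels) from a list of (role, token_ids) segments.

    Only assistant segments keep their token IDs as labels; every other
    position is set to IGNORE_INDEX so it contributes nothing to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Illustrative token IDs only; real IDs come from the tokenizer.
segments = [
    ("system", [11, 12, 13]),
    ("user", [21, 22]),
    ("assistant", [31, 32, 33, 34]),
]
input_ids, labels = build_labels(segments)
print(input_ids)
print(labels)
```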
Chat Format
All training data uses these special tokens:
<|system|>You are a helpful assistant that generates valid JSON.<|end|>
<|user|>Generate a user profile with name, email, and age.<|end|>
<|assistant|>{"name": "Alice Chen", "email": "alice@example.com", "age": 28}<|end|>
| Token | Purpose |
|---|---|
| `<\|system\|>` | System prompt |
| `<\|user\|>` | User message |
| `<\|assistant\|>` | Assistant response |
| `<\|tool_call\|>` | Function/tool call |
| `<\|tool_result\|>` | Tool execution result |
| `<\|end\|>` | End of turn |
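A prompt in this format can be assembled from a list of role/content messages. This is a minimal sketch with a hypothetical `render_chat` helper. It assumes turns are separated by a single newline, as in the example above. It leaves out `<|tool_call|>` and `<|tool_result|>` because their exact placement within a turn is not documented here.

```python
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|end|>"


def render_chat(messages, add_generation_prompt=True):
    """Render messages as '<|role|>content<|end|>' turns joined by newlines.

    With add_generation_prompt=True, a trailing '<|assistant|>' marker is
    appended so the model continues with the assistant reply.
    """
    turns = [
        f"{ROLE_TOKENS[m['role']]}{m['content']}{END_TOKEN}" for m in messages
    ]
    if add_generation_prompt:
        turns.append(ROLE_TOKENS["assistant"])
    return "\n".join(turns)


messages = [
    {"role": "system", "content": "You are a helpful assistant that generates valid JSON."},
    {"role": "user", "content": "Generate a user profile with name, email, and age."},
]
print(render_chat(messages))
```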
How to Use
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"mdonigian/trellis-sft",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mdonigian/trellis-sft")
prompt = """<|system|>You are a helpful assistant that generates valid JSON.<|end|>
<|user|>Generate a JSON object for a book with title, author, year, and genre.<|end|>
<|assistant|>"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
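Because the snippet above decodes with `skip_special_tokens=False`, the output still contains the prompt and the chat markers. One way to recover just the JSON payload is sketched below. `extract_assistant_json` is a hypothetical helper, and returning `None` on invalid JSON is an illustrative choice.

```python
import json


def extract_assistant_json(decoded: str):
    """Return the parsed JSON from the last assistant turn, or None if invalid."""
    # Keep only the text after the final assistant marker.
    reply = decoded.rsplit("<|assistant|>", 1)[-1]
    # Drop the end-of-turn marker and anything generated after it.
    reply = reply.split("<|end|>", 1)[0].strip()
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None


# Example decoded string in the card's chat format (content is illustrative).
decoded = (
    "<|system|>You are a helpful assistant that generates valid JSON.<|end|>\n"
    "<|user|>Generate a user profile with name, email, and age.<|end|>\n"
    '<|assistant|>{"name": "Alice Chen", "email": "alice@example.com", "age": 28}<|end|>'
)
print(extract_assistant_json(decoded))
```

A small model sampled at temperature 0.7 can emit malformed JSON, so callers should handle the `None` case, for example by retrying.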
Experimental Design
This model is one component of a controlled experiment comparing curated pretraining vs. standard pretraining for structured output tasks:
| Model | Pretraining | Parameters | SFT |
|---|---|---|---|
| Trellis-506M-SFT (this model) | Curated 20B tokens | ~506M | Identical |
| Pythia-410M-deduped-SFT | The Pile (uncurated) | ~410M | Identical |
| Pythia-1B-deduped-SFT | The Pile (uncurated) | ~1B | Identical |
All three models undergo identical SFT with the same dataset, hyperparameters, and training procedure. Post-SFT evaluation covers:
- Tier 1: Custom structured output benchmarks (JSON schema compliance, structured extraction, classification)
- Tier 2: General NLP benchmarks via `lm_eval` (HellaSwag, ARC, PIQA, Winogrande, MMLU)
- Tier 3: Code benchmarks (HumanEval, MBPP)
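The Tier 1 benchmarks are not specified in this card. Purely as an illustration of what a JSON schema compliance metric could look like, the sketch below checks required keys and primitive types against a small schema. The schema, the sample outputs, and the helper names are all hypothetical. A real evaluation would more likely use a full JSON Schema validator.

```python
import json

# Mapping from JSON Schema type names to Python types (subset only).
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def _matches(value, type_name: str) -> bool:
    # bool is a subclass of int in Python, so reject it for numeric types.
    if type_name in ("integer", "number") and isinstance(value, bool):
        return False
    return isinstance(value, TYPE_MAP[type_name])


def complies(output_text: str, schema: dict) -> bool:
    """True if output_text parses as a JSON object with all required keys and matching types."""
    try:
        obj = json.loads(output_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    if any(key not in obj for key in schema.get("required", [])):
        return False
    return all(
        _matches(obj[key], spec["type"])
        for key, spec in schema.get("properties", {}).items()
        if key in obj
    )


def compliance_rate(outputs, schema) -> float:
    """Fraction of outputs that comply with the schema."""
    return sum(complies(o, schema) for o in outputs) / len(outputs)


# Hypothetical schema and outputs, for illustration only.
schema = {
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
outputs = [
    '{"name": "Alice Chen", "age": 28}',  # complies
    '{"name": "Bob"}',                    # missing required key
    '{"name": "Eve", "age": "x"}',        # wrong type
    "not json",                           # does not parse
]
print(compliance_rate(outputs, schema))
```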
Limitations
- The 506M-parameter size limits general knowledge and complex reasoning
- Context length capped at 2,048 tokens
- No safety training, RLHF, or DPO alignment
- Optimized for structured output; general chat quality is limited
Citation
@misc{trellis-sft-2026,
title={Trellis-506M-SFT: Supervised Fine-Tuning for Structured Output},
author={Donigian, Matt},
year={2026},
url={https://huggingface.co/mdonigian/trellis-sft}
}