smol-llama 🦙
A 360M parameter LLaMA-style language model pre-trained from scratch on 6 billion tokens of web data. This model demonstrates that high-quality small language models can be trained efficiently on a single GPU.
Model Description
smol-llama is a compact implementation of the LLaMA architecture, featuring modern techniques like Grouped Query Attention (GQA), RoPE embeddings, and SwiGLU activations. It was trained on the ifkash/fineweb-6b dataset.
Model Architecture
| Component | Value |
|---|---|
| Parameters | 360M |
| Hidden Dimension | 960 |
| Layers | 32 |
| Attention Heads | 15 (Query) / 5 (KV) |
| Head Dimension | 64 |
| Context Length | 2048 |
| Vocabulary Size | 49,152 |
| Architecture | LLaMA-style decoder-only |
Key Features:
- Grouped Query Attention (GQA): 3:1 query-to-KV head ratio for efficient inference
- RoPE: Rotary Position Embeddings for better length generalization
- RMSNorm: Root Mean Square Layer Normalization
- SwiGLU: gated linear unit activation in the FFN
- Flash Attention 2: memory-efficient attention computation
- Gradient Checkpointing: trades recomputation for activation memory, enabling larger batch sizes
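For reference, the architecture table maps onto a small config object. The sketch below is illustrative only; the field names are not necessarily those of the ModelArgs class in utils/model.py.

from dataclasses import dataclass

@dataclass
class SmolLlamaConfig:            # illustrative; see utils/model.py for the real ModelArgs
    dim: int = 960                # hidden dimension
    n_layers: int = 32
    n_heads: int = 15             # query heads (960 / 15 = 64 per head)
    n_kv_heads: int = 5           # GQA: 3 query heads share each KV head
    vocab_size: int = 49_152
    max_seq_len: int = 2048
    rope: bool = True             # rotary position embeddings
    norm: str = "rmsnorm"         # RMSNorm pre-normalization
    ffn_activation: str = "swiglu"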
Training Details
Dataset
Trained on ifkash/fineweb-6b, a curated subset of the FineWeb dataset containing ~6 billion high-quality web tokens.
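The data can be pulled straight from the Hub with the datasets library. A minimal sketch; the split name ("train") and the "text" column are assumptions based on the usual FineWeb layout and are not specified by this card:

from datasets import load_dataset

# Stream ifkash/fineweb-6b instead of downloading it all up front
ds = load_dataset("ifkash/fineweb-6b", split="train", streaming=True)  # split name assumed
for example in ds.take(1):
    print(example["text"][:200])  # "text" column assumed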
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused) |
| Learning Rate | 3e-4 (peak) |
| LR Schedule | Cosine with linear warmup |
| Warmup Steps | 900 |
| Total Steps | 5,725 (~1 epoch) |
| Batch Size | 64 |
| Gradient Accumulation | 8 |
| Effective Batch Size | 512 sequences |
| Context Length | 2048 tokens |
| Tokens per Step | ~1M |
| Total Tokens | ~6B |
| Precision | bfloat16 |
| Gradient Clipping | 1.0 |
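The learning-rate schedule can be reproduced directly from the numbers above. A minimal sketch, assuming the cosine decays to zero (the card does not state a minimum LR):

import math

PEAK_LR = 3e-4
WARMUP_STEPS = 900
TOTAL_STEPS = 5_725

def lr_at(step: int) -> float:
    # Linear warmup to the peak LR, then cosine decay over the remaining steps
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))  # assumed floor of 0

print(lr_at(0), lr_at(900), lr_at(5_724))  # warmup start, peak, end of decay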
Infrastructure
| Resource | Specification |
|---|---|
| GPU | 1× NVIDIA H100 (80 GB PCIe) |
| Training Time | ~22 hours |
| Throughput | ~75,000 tokens/sec |
| Cloud Provider | RunPod |
| Cost | ~$53 (total) |
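As a quick sanity check, throughput multiplied by wall-clock time reproduces the token budget:

tokens_per_sec = 75_000
hours = 22
print(f"{tokens_per_sec * hours * 3600 / 1e9:.1f}B tokens")  # ~5.9B, consistent with the ~6B budget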
Training Loss
The model was trained for one full epoch over the dataset with checkpoints saved every 200 steps. Final training loss: ~2.8 (see training checkpoints for intermediate metrics).
Quick Start
Installation
pip install torch transformers accelerate
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
model_name = "ifkash/smol-llama"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Remove token_type_ids if present (not used by LLaMA models)
if 'token_type_ids' in inputs:
    del inputs['token_type_ids']

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Advanced Generation
# More controlled generation
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Batch Generation
# Decoder-only models are padded on the left for batched generation;
# make sure a pad token is set before calling the tokenizer with padding=True
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "Once upon a time",
    "The key to success is",
    "In the year 2050,",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
inputs.pop("token_type_ids", None)  # not used by LLaMA models

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
for i, output in enumerate(outputs):
    print(f"\nPrompt {i+1}: {prompts[i]}")
    print(f"Generated: {tokenizer.decode(output, skip_special_tokens=True)}")
Loading from Custom Checkpoint Format
If you want to load the original training checkpoints:
import torch
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("ifkash/smol-llama")

# Load custom checkpoint
checkpoint_path = "training_checkpoints/checkpoint_step_5000.pt"
ckpt = torch.load(checkpoint_path, map_location="cuda")

# Create model from scratch (you'll need the model definition)
from utils.model import Llama, ModelArgs
model = Llama(ModelArgs()).cuda().to(torch.bfloat16)

# Handle torch.compile prefix if present
state_dict = {k.replace("_orig_mod.", ""): v for k, v in ckpt['model'].items()}
model.load_state_dict(state_dict)
model.eval()

# Generate
def generate(prompt, max_tokens=50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
    with torch.no_grad():
        for _ in range(max_tokens):
            logits, _ = model(input_ids[:, -2048:])
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(input_ids[0])

print(generate("The meaning of life is"))
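The loop above decodes greedily (argmax). Swapping in temperature plus top-k sampling is a small change; this sketch keeps the same assumption that the custom model returns (logits, loss):

@torch.no_grad()
def generate_sampled(prompt, max_tokens=50, temperature=0.8, top_k=50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
    for _ in range(max_tokens):
        logits, _ = model(input_ids[:, -2048:])      # same (logits, loss) signature as above
        logits = logits[:, -1, :] / temperature      # scale the last-token logits
        topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)
        probs = torch.softmax(topk_vals, dim=-1)
        next_token = topk_idx.gather(-1, torch.multinomial(probs, num_samples=1))
        input_ids = torch.cat([input_ids, next_token], dim=1)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0])

print(generate_sampled("The meaning of life is"))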
Training Checkpoints
Intermediate training checkpoints are available in the training_checkpoints/ folder:
| Checkpoint | Steps | Tokens Seen | Loss |
|---|---|---|---|
| checkpoint_step_200.pt | 200 | ~200M | - |
| checkpoint_step_400.pt | 400 | ~400M | - |
| ... | ... | ... | - |
| checkpoint_step_4800.pt | 4,800 | ~4.8B | - |
| checkpoint_step_5000.pt | 5,000 | ~5B | - |
These checkpoints include full training state (model, optimizer, step, loss) and can be used to resume training or analyze training dynamics.
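A quick way to inspect a checkpoint's metadata without building the model. The 'model' key appears in the loading example above; the 'step' and 'loss' keys follow the description in this section and are otherwise assumptions:

import torch

ckpt = torch.load("training_checkpoints/checkpoint_step_5000.pt", map_location="cpu")
print("step:", ckpt.get("step"))             # key name assumed
print("loss:", ckpt.get("loss"))             # key name assumed
print("model tensors:", len(ckpt["model"]))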
Limitations
This is a small model trained on a limited dataset (~6B tokens) for demonstration purposes. As such, it has several limitations:
- Limited Knowledge: The model has only seen 6B tokens, compared to 100B+ for larger models
- Generalization: May not perform well on out-of-distribution tasks
- Factual Accuracy: Should not be relied upon for factual information
- Biases: Inherits biases present in the web-scraped training data
- No Instruction Tuning: This is a base model without instruction following or chat capabilities
- No Safety Alignment: Has not undergone safety training or RLHF
Intended Use
This model is intended for:
- ✅ Research and experimentation with small language models
- ✅ Educational purposes and learning about LLM pre-training
- ✅ Fine-tuning on downstream tasks (see the sketch after this list)
- ✅ Exploring efficient training techniques
- ✅ Prototyping and proof-of-concept projects
This model is not intended for:
- ❌ Production deployments without further fine-tuning
- ❌ Safety-critical applications
- ❌ Generating factual information without verification
- ❌ Applications requiring instruction following (use an instruction-tuned variant)
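Since fine-tuning is a listed use, here is a minimal causal-LM fine-tuning sketch using the transformers Trainer. The data file (my_corpus.txt) and every hyperparameter below are placeholders, not recommendations from this card:

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "ifkash/smol-llama"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token       # base model may lack a dedicated pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Placeholder dataset: any plain-text file, one document per line
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="smol-llama-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()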
Comparison with Similar Models
| Model | Parameters | Context | Training Tokens | Hardware |
|---|---|---|---|---|
| smol-llama | 360M | 2048 | 6B | 1× H100 (22 h) |
| SmolLM-360M | 360M | 2048 | 600B | - |
| Pythia-410M | 410M | 2048 | 300B | - |
Note: This model uses significantly fewer training tokens than similar-sized models, making it more accessible but potentially less capable on general tasks.
Training Code
The complete training code is available in the model repository:
# Clone the repository
git clone https://huggingface.co/ifkash/smol-llama
cd smol-llama
# Install dependencies
pip install torch transformers accelerate huggingface-hub wandb
# Run training (requires GPU)
python pretrain.py
See the repository files for complete implementation details including:
- Custom LLaMA architecture (utils/model.py)
- Rotary embeddings (utils/rotary.py)
- Data loading utilities (utils/data.py)
- Checkpoint management (utils/checkpoint.py)
- Learning rate scheduling (utils/lr_schedule.py)
Citation
If you use this model in your research, please cite:
@misc{smol-llama-2026,
  author = {ifkash},
  title = {smol-llama: A 360M Parameter LLaMA Model Trained From Scratch},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ifkash/smol-llama}
}
Also consider citing the FineWeb dataset:
@software{penedo2024fineweb,
  author = {Penedo, Guilherme and Kydlíček, Hynek and Lozhkov, Anton and Mitchell, Margaret and Raffel, Colin and Von Werra, Leandro and Wolf, Thomas},
  title = {FineWeb: decanting the web for the finest text data at scale},
  month = apr,
  year = 2024,
  url = {https://huggingface.co/datasets/HuggingFaceFW/fineweb}
}
Resources
- Model Repository: ifkash/smol-llama
- Training Dataset: ifkash/fineweb-6b
- Reference Implementation: HuggingFaceTB/SmolLM-360M
License
This model is released under the MIT License. See the LICENSE file for details.
Acknowledgments
- Inspired by HuggingFaceTB/SmolLM-360M
- Trained on FineWeb data
- Built with PyTorch and Transformers