smol-llama 🦙
A 360M parameter LLaMA-style language model pre-trained from scratch on 6 billion tokens of web data. This model demonstrates that high-quality small language models can be trained efficiently on a single GPU.
Model Description
smol-llama is a compact implementation of the LLaMA architecture, featuring modern techniques like Grouped Query Attention (GQA), RoPE embeddings, and SwiGLU activations. It was trained on the ifkash/fineweb-6b dataset.
Model Architecture
| Component | Value |
|---|---|
| Parameters | 360M |
| Hidden Dimension | 960 |
| Layers | 32 |
| Attention Heads | 15 (Query) / 5 (KV) |
| Head Dimension | 64 |
| Context Length | 2048 |
| Vocabulary Size | 49,152 |
| Architecture | LLaMA-style decoder-only |
Key Features:
- Grouped Query Attention (GQA): 3:1 query-to-KV head ratio for efficient inference
- RoPE: Rotary Position Embeddings for better length generalization
- RMSNorm: Root Mean Square Layer Normalization
- SwiGLU: gated linear unit activation in the FFN
- Flash Attention 2: memory-efficient attention computation
- Gradient Checkpointing: trades recomputation for activation memory, enabling larger batch sizes
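For reference, the architecture table maps onto a small config object. The sketch below is illustrative only; the field names are not necessarily those of the ModelArgs class in utils/model.py.

from dataclasses import dataclass

@dataclass
class SmolLlamaConfig:            # illustrative; see utils/model.py for the real ModelArgs
    dim: int = 960                # hidden dimension
    n_layers: int = 32
    n_heads: int = 15             # query heads (960 / 15 = 64 per head)
    n_kv_heads: int = 5           # GQA: 3 query heads share each KV head
    vocab_size: int = 49_152
    max_seq_len: int = 2048
    rope: bool = True             # rotary position embeddings
    norm: str = "rmsnorm"         # RMSNorm pre-normalization
    ffn_activation: str = "swiglu"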
Training Details
Dataset
Trained on ifkash/fineweb-6b, a curated subset of the FineWeb dataset containing ~6 billion high-quality web tokens.
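The data can be pulled straight from the Hub with the datasets library. A minimal sketch; the split name ("train") and the "text" column are assumptions based on the usual FineWeb layout and are not specified by this card:

from datasets import load_dataset

# Stream ifkash/fineweb-6b instead of downloading it all up front
ds = load_dataset("ifkash/fineweb-6b", split="train", streaming=True)  # split name assumed
for example in ds.take(1):
    print(example["text"][:200])  # "text" column assumed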
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused) |
| Learning Rate | 3e-4 (peak) |
| LR Schedule | Cosine with linear warmup |
| Warmup Steps | 900 |
| Total Steps | 5,725 (~1 epoch) |
| Batch Size | 64 |
| Gradient Accumulation | 8 |
| Effective Batch Size | 512 sequences |
| Context Length | 2048 tokens |
| Tokens per Step | ~1M |
| Total Tokens | ~6B |
| Precision | bfloat16 |
| Gradient Clipping | 1.0 |
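The learning-rate schedule can be reproduced directly from the numbers above. A minimal sketch, assuming the cosine decays to zero (the card does not state a minimum LR):

import math

PEAK_LR = 3e-4
WARMUP_STEPS = 900
TOTAL_STEPS = 5_725

def lr_at(step: int) -> float:
    # Linear warmup to the peak LR, then cosine decay over the remaining steps
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))  # assumed floor of 0

print(lr_at(0), lr_at(900), lr_at(5_724))  # warmup start, peak, end of decay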
Infrastructure
| Resource | Specification |
|---|---|
| GPU | 1× NVIDIA H100 (80 GB PCIe) |
| Training Time | ~22 hours |
| Throughput | ~75,000 tokens/sec |
| Cloud Provider | RunPod |
| Cost | ~$53 (total) |
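As a quick sanity check, throughput multiplied by wall-clock time reproduces the token budget:

tokens_per_sec = 75_000
hours = 22
print(f"{tokens_per_sec * hours * 3600 / 1e9:.1f}B tokens")  # ~5.9B, consistent with the ~6B budget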
Training Loss
The model was trained for one full epoch over the dataset with checkpoints saved every 200 steps. Final training loss: ~2.8 (see training checkpoints for intermediate metrics).
Quick Start
Installation
pip install torch transformers accelerate
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
model_name = "ifkash/smol-llama"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Remove token_type_ids if present (not used by LLaMA models)
if 'token_type_ids' in inputs:
    del inputs['token_type_ids']

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Advanced Generation
# More controlled generation
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Batch Generation
# Decoder-only models are padded on the left for batched generation;
# make sure a pad token is set before calling the tokenizer with padding=True
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "Once upon a time",
    "The key to success is",
    "In the year 2050,",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
inputs.pop("token_type_ids", None)  # not used by LLaMA models

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
for i, output in enumerate(outputs):
    print(f"\nPrompt {i+1}: {prompts[i]}")
    print(f"Generated: {tokenizer.decode(output, skip_special_tokens=True)}")
Loading from Custom Checkpoint Format
If you want to load the original training checkpoints:
import torch
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("ifkash/smol-llama")

# Load custom checkpoint
checkpoint_path = "training_checkpoints/checkpoint_step_5000.pt"
ckpt = torch.load(checkpoint_path, map_location="cuda")

# Create model from scratch (you'll need the model definition)
from utils.model import Llama, ModelArgs
model = Llama(ModelArgs()).cuda().to(torch.bfloat16)

# Handle torch.compile prefix if present
state_dict = {k.replace("_orig_mod.", ""): v for k, v in ckpt['model'].items()}
model.load_state_dict(state_dict)
model.eval()

# Generate
def generate(prompt, max_tokens=50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
    with torch.no_grad():
        for _ in range(max_tokens):
            logits, _ = model(input_ids[:, -2048:])
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(input_ids[0])

print(generate("The meaning of life is"))
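The loop above decodes greedily (argmax). Swapping in temperature plus top-k sampling is a small change; this sketch keeps the same assumption that the custom model returns (logits, loss):

@torch.no_grad()
def generate_sampled(prompt, max_tokens=50, temperature=0.8, top_k=50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
    for _ in range(max_tokens):
        logits, _ = model(input_ids[:, -2048:])      # same (logits, loss) signature as above
        logits = logits[:, -1, :] / temperature      # scale the last-token logits
        topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)
        probs = torch.softmax(topk_vals, dim=-1)
        next_token = topk_idx.gather(-1, torch.multinomial(probs, num_samples=1))
        input_ids = torch.cat([input_ids, next_token], dim=1)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0])

print(generate_sampled("The meaning of life is"))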
Training Checkpoints
Intermediate training checkpoints are available in the training_checkpoints/ folder:
| Checkpoint | Steps | Tokens Seen | Loss |
|---|---|---|---|
| checkpoint_step_200.pt | 200 | ~200M | - |
| checkpoint_step_400.pt | 400 | ~400M | - |
| ... | ... | ... | - |
| checkpoint_step_4800.pt | 4,800 | ~4.8B | - |
| checkpoint_step_5000.pt | 5,000 | ~5B | - |
These checkpoints include full training state (model, optimizer, step, loss) and can be used to resume training or analyze training dynamics.
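A quick way to inspect a checkpoint's metadata without building the model. The 'model' key appears in the loading example above; the 'step' and 'loss' keys follow the description in this section and are otherwise assumptions:

import torch

ckpt = torch.load("training_checkpoints/checkpoint_step_5000.pt", map_location="cpu")
print("step:", ckpt.get("step"))             # key name assumed
print("loss:", ckpt.get("loss"))             # key name assumed
print("model tensors:", len(ckpt["model"]))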
Limitations
This is a small model trained on a limited dataset (~6B tokens) for demonstration purposes. As such, it has several limitations:
- Limited Knowledge: The model has only seen 6B tokens, compared to 100B+ for larger models
- Generalization: May not perform well on out-of-distribution tasks
- Factual Accuracy: Should not be relied upon for factual information
- Biases: Inherits biases present in the web-scraped training data
- No Instruction Tuning: This is a base model without instruction following or chat capabilities
- No Safety Alignment: Has not undergone safety training or RLHF
Intended Use
This model is intended for:
- ✅ Research and experimentation with small language models
- ✅ Educational purposes and learning about LLM pre-training
- ✅ Fine-tuning on downstream tasks (see the sketch after this list)
- ✅ Exploring efficient training techniques
- ✅ Prototyping and proof-of-concept projects
This model is not intended for:
- ❌ Production deployments without further fine-tuning
- ❌ Safety-critical applications
- ❌ Generating factual information without verification
- ❌ Applications requiring instruction following (use an instruction-tuned variant)
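Since fine-tuning is a listed use, here is a minimal causal-LM fine-tuning sketch using the transformers Trainer. The data file (my_corpus.txt) and every hyperparameter below are placeholders, not recommendations from this card:

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "ifkash/smol-llama"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token       # base model may lack a dedicated pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Placeholder dataset: any plain-text file, one document per line
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="smol-llama-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()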
Comparison with Similar Models
| Model | Parameters | Context | Training Tokens | Hardware |
|---|---|---|---|---|
| smol-llama | 360M | 2048 | 6B | 1× H100 (22 h) |
| SmolLM-360M | 360M | 2048 | 600B | - |
| Pythia-410M | 410M | 2048 | 300B | - |
Note: This model uses significantly fewer training tokens than similar-sized models, making it more accessible but potentially less capable on general tasks.
Training Code
The complete training code is available in the model repository:
# Clone the repository
git clone https://huggingface.co/ifkash/smol-llama
cd smol-llama
# Install dependencies
pip install torch transformers accelerate huggingface-hub wandb
# Run training (requires GPU)
python pretrain.py
See the repository files for complete implementation details including:
- Custom LLaMA architecture (utils/model.py)
- Rotary embeddings (utils/rotary.py)
- Data loading utilities (utils/data.py)
- Checkpoint management (utils/checkpoint.py)
- Learning rate scheduling (utils/lr_schedule.py)
Citation
If you use this model in your research, please cite:
@misc{smol-llama-2026,
  author = {ifkash},
  title = {smol-llama: A 360M Parameter LLaMA Model Trained From Scratch},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ifkash/smol-llama}
}
Also consider citing the FineWeb dataset:
@software{penedo2024fineweb,
  author = {Penedo, Guilherme and Kydlíček, Hynek and Lozhkov, Anton and Mitchell, Margaret and Raffel, Colin and Von Werra, Leandro and Wolf, Thomas},
  title = {FineWeb: decanting the web for the finest text data at scale},
  month = apr,
  year = 2024,
  url = {https://huggingface.co/datasets/HuggingFaceFW/fineweb}
}
Resources
- Model Repository: ifkash/smol-llama
- Training Dataset: ifkash/fineweb-6b
- Reference Implementation: HuggingFaceTB/SmolLM-360M
License
This model is released under the MIT License. See the LICENSE file for details.
Acknowledgments
- Inspired by HuggingFaceTB/SmolLM-360M
- Trained on FineWeb data
- Built with PyTorch and Transformers