smol-llama 🦙

A 360M parameter LLaMA-style language model pre-trained from scratch on 6 billion tokens of web data. This model demonstrates that high-quality small language models can be trained efficiently on a single GPU.

Model Description

smol-llama is a compact implementation of the LLaMA architecture, featuring modern techniques like Grouped Query Attention (GQA), RoPE embeddings, and SwiGLU activations. It was trained on the ifkash/fineweb-6b dataset.

Model Architecture

| Component | Value |
|---|---|
| Parameters | 360M |
| Hidden Dimension | 960 |
| Layers | 32 |
| Attention Heads | 15 (Query) / 5 (KV) |
| Head Dimension | 64 |
| Context Length | 2048 |
| Vocabulary Size | 49,152 |
| Architecture | LLaMA-style decoder-only |
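As a sanity check, the dimensions in the table roughly reproduce the 360M figure. Note that the SwiGLU intermediate size (2560) and tied input/output embeddings are assumptions here; neither is stated in the table:

```python
# Rough parameter count from the architecture table.
# Assumptions: FFN intermediate size 2560, embeddings tied with the LM head.
d_model = 960
n_layers = 32
n_q_heads, n_kv_heads, head_dim = 15, 5, 64
vocab = 49_152
d_ff = 2560  # assumed SwiGLU intermediate size

embed = vocab * d_model                        # token embeddings (assumed tied)
attn = (d_model * n_q_heads * head_dim         # Q projection
        + 2 * d_model * n_kv_heads * head_dim  # K and V projections (GQA)
        + n_q_heads * head_dim * d_model)      # output projection
ffn = 3 * d_model * d_ff                       # SwiGLU: gate, up, down matrices
total = embed + n_layers * (attn + ffn)
print(f"~{total / 1e6:.0f}M parameters")       # lands near the stated 360M
```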

Key Features:

  • 🚀 Grouped Query Attention (GQA): 3:1 query-to-KV head ratio for efficient inference
  • 🔄 RoPE: Rotary Position Embeddings for better length generalization
  • 📊 RMSNorm: Root Mean Square Layer Normalization
  • ⚡ SwiGLU: gated linear unit activation in the FFN
  • 💾 Flash Attention 2: memory-efficient attention computation
  • 🎯 Gradient Checkpointing: trades recomputation for memory, enabling larger batches
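The GQA layout above (15 query heads sharing 5 KV heads) can be sketched in a few lines. This is an illustrative toy with random tensors, not the model's actual attention code:

```python
import torch

# Toy Grouped Query Attention with the 15:5 head layout.
# Each KV head serves a group of 3 query heads (the 3:1 ratio).
B, T = 2, 8                       # batch size, sequence length
n_q, n_kv, d_head = 15, 5, 64

q = torch.randn(B, n_q, T, d_head)
k = torch.randn(B, n_kv, T, d_head)
v = torch.randn(B, n_kv, T, d_head)

# Expand each KV head to cover its group of query heads
group = n_q // n_kv               # 3
k = k.repeat_interleave(group, dim=1)   # (B, 15, T, d_head)
v = v.repeat_interleave(group, dim=1)

# Standard causal scaled dot-product attention
scores = (q @ k.transpose(-2, -1)) / d_head**0.5
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))
out = torch.softmax(scores, dim=-1) @ v
print(out.shape)                  # torch.Size([2, 15, 8, 64])
```

The memory win is in the KV cache: only 5 heads of K and V are stored per layer instead of 15.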

Training Details

Dataset

Trained on ifkash/fineweb-6b, a curated subset of the FineWeb dataset containing ~6 billion high-quality web tokens.

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused) |
| Learning Rate | 3e-4 (peak) |
| LR Schedule | Cosine with linear warmup |
| Warmup Steps | 900 |
| Total Steps | 5,725 (~1 epoch) |
| Batch Size | 64 |
| Gradient Accumulation | 8 |
| Effective Batch Size | 512 sequences |
| Context Length | 2048 tokens |
| Tokens per Step | ~1M |
| Total Tokens | ~6B |
| Precision | bfloat16 |
| Gradient Clipping | 1.0 |
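The schedule and token budget above can be sketched as follows. The minimum LR floor of 0 is an assumption; the table does not state one:

```python
import math

# Cosine schedule with linear warmup, using the values from the table.
peak_lr, warmup_steps, total_steps = 3e-4, 900, 5_725
min_lr = 0.0  # assumed floor, not stated in the table

def lr_at(step):
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps      # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Token budget check: 512 sequences x 2048 tokens per optimizer step
tokens_per_step = 512 * 2048                # 1,048,576 (~1M)
total_tokens = tokens_per_step * total_steps
print(f"~{total_tokens / 1e9:.1f}B tokens")  # matches the ~6B figure
```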

Infrastructure

| Resource | Specification |
|---|---|
| GPU | 1× NVIDIA H100 (80GB PCIe) |
| Training Time | ~22 hours |
| Throughput | ~75,000 tokens/sec |
| Cloud Provider | RunPod |
| Cost | ~$53 (total) |
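A quick cross-check of these numbers, taking the stated figures as averages:

```python
# Throughput x wall-clock time should roughly reproduce the ~6B-token budget,
# and cost / time gives the implied hourly rate for the H100.
tokens_per_sec = 75_000
hours = 22
total_cost = 53.0

tokens_processed = tokens_per_sec * hours * 3600
hourly_rate = total_cost / hours
print(f"~{tokens_processed / 1e9:.1f}B tokens, ~${hourly_rate:.2f}/hour")
```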

Training Loss

The model was trained for one full epoch over the dataset with checkpoints saved every 200 steps. Final training loss: ~2.8 (see training checkpoints for intermediate metrics).

Quick Start

Installation

pip install torch transformers accelerate

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "ifkash/smol-llama"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Remove token_type_ids if present (not used by LLaMA models)
if 'token_type_ids' in inputs:
    del inputs['token_type_ids']

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Advanced Generation

# More controlled generation
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Batch Generation

prompts = [
    "Once upon a time",
    "The key to success is",
    "In the year 2050,",
]

# Decoder-only models should be left-padded for batched generation,
# and LLaMA-style tokenizers often define no pad token
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

for i, output in enumerate(outputs):
    print(f"\nPrompt {i+1}: {prompts[i]}")
    print(f"Generated: {tokenizer.decode(output, skip_special_tokens=True)}")

Loading from Custom Checkpoint Format

If you want to load the original training checkpoints:

import torch
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("ifkash/smol-llama")

# Load custom checkpoint (full training state includes optimizer tensors,
# so weights_only=False is needed on PyTorch >= 2.6)
checkpoint_path = "training_checkpoints/checkpoint_step_5000.pt"
ckpt = torch.load(checkpoint_path, map_location="cuda", weights_only=False)

# Create model from scratch (you'll need the model definition)
from utils.model import Llama, ModelArgs
model = Llama(ModelArgs()).cuda().to(torch.bfloat16)

# Handle torch.compile prefix if present
state_dict = {k.replace("_orig_mod.", ""): v for k, v in ckpt['model'].items()}
model.load_state_dict(state_dict)
model.eval()

# Generate
# Greedy decoding over a sliding 2048-token context window
def generate(prompt, max_tokens=50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
    
    with torch.no_grad():
        for _ in range(max_tokens):
            logits, _ = model(input_ids[:, -2048:])
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    
    return tokenizer.decode(input_ids[0])

print(generate("The meaning of life is"))

Training Checkpoints

Intermediate training checkpoints are available in the training_checkpoints/ folder:

| Checkpoint | Steps | Tokens Seen | Loss |
|---|---|---|---|
| checkpoint_step_200.pt | 200 | ~200M | - |
| checkpoint_step_400.pt | 400 | ~400M | - |
| ... | ... | ... | - |
| checkpoint_step_4800.pt | 4,800 | ~4.8B | - |
| checkpoint_step_5000.pt | 5,000 | ~5B | - |

These checkpoints include full training state (model, optimizer, step, loss) and can be used to resume training or analyze training dynamics.
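A minimal sketch for walking the checkpoints and printing the loss trajectory. The 'step' and 'loss' key names are assumptions based on the training-state description above:

```python
import glob
import torch

def checkpoint_summary(ckpt):
    # Format one checkpoint's training state; key names are assumptions
    return f"step {ckpt['step']:>5}: loss {ckpt['loss']:.3f}"

# Sort numerically by step (lexicographic sort would put 1000 before 200)
paths = sorted(
    glob.glob("training_checkpoints/checkpoint_step_*.pt"),
    key=lambda p: int(p.split("_")[-1].split(".")[0]),
)
for path in paths:
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    print(checkpoint_summary(ckpt))
```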

Limitations

This is a small model trained on a limited dataset (~6B tokens) for demonstration purposes. As such, it has several limitations:

  • Limited Knowledge: The model has only seen 6B tokens, compared to 100B+ for larger models
  • Generalization: May not perform well on out-of-distribution tasks
  • Factual Accuracy: Should not be relied upon for factual information
  • Biases: Inherits biases present in the web-scraped training data
  • No Instruction Tuning: This is a base model without instruction following or chat capabilities
  • No Safety Alignment: Has not undergone safety training or RLHF

Intended Use

This model is intended for:

  • ✅ Research and experimentation with small language models
  • ✅ Educational purposes and learning about LLM pre-training
  • ✅ Fine-tuning on downstream tasks
  • ✅ Exploring efficient training techniques
  • ✅ Prototyping and proof-of-concept projects

This model is not intended for:

  • ❌ Production deployments without further fine-tuning
  • ❌ Safety-critical applications
  • ❌ Generating factual information without verification
  • ❌ Applications requiring instruction following (use an instruction-tuned variant)

Comparison with Similar Models

| Model | Parameters | Context | Training Tokens | Hardware |
|---|---|---|---|---|
| smol-llama | 360M | 2048 | 6B | 1× H100 (22h) |
| SmolLM-360M | 360M | 2048 | 600B | - |
| Pythia-410M | 410M | 2048 | 300B | - |

Note: This model uses significantly fewer training tokens than similar-sized models, making it more accessible but potentially less capable on general tasks.
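In tokens-per-parameter terms (the ~20:1 ratio from the Chinchilla scaling study is a common rough reference point, not an exact target):

```python
# (parameters, training tokens) from the comparison table above
models = {
    "smol-llama":  (360e6, 6e9),
    "SmolLM-360M": (360e6, 600e9),
    "Pythia-410M": (410e6, 300e9),
}
for name, (params, tokens) in models.items():
    print(f"{name}: ~{tokens / params:.0f} tokens/param")
```

smol-llama sits near the ~20:1 compute-optimal regime, while the other two are trained far beyond it, which is where most of their capability gap comes from.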

Training Code

The complete training code is available in the model repository. Key components:

# Clone the repository
git clone https://huggingface.co/ifkash/smol-llama
cd smol-llama

# Install dependencies
pip install torch transformers accelerate huggingface-hub wandb

# Run training (requires GPU)
python pretrain.py

See the repository files for complete implementation details including:

  • Custom LLaMA architecture (utils/model.py)
  • Rotary embeddings (utils/rotary.py)
  • Data loading utilities (utils/data.py)
  • Checkpoint management (utils/checkpoint.py)
  • Learning rate scheduling (utils/lr_schedule.py)

Citation

If you use this model in your research, please cite:

@misc{smol-llama-2026,
  author = {ifkash},
  title = {smol-llama: A 360M Parameter LLaMA Model Trained From Scratch},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ifkash/smol-llama}
}

Also consider citing the FineWeb dataset:

@software{penedo2024fineweb,
  author = {Penedo, Guilherme and Kydlíček, Hynek and Lozhkov, Anton and Mitchell, Margaret and Raffel, Colin and Von Werra, Leandro and Wolf, Thomas},
  title = {FineWeb: decanting the web for the finest text data at scale},
  month = apr,
  year = 2024,
  url = {https://huggingface.co/datasets/HuggingFaceFW/fineweb}
}

License

This model is released under the MIT License. See the LICENSE file for details.
