Nano-Llama Base

A compact language model (~166M parameters) based on the Llama architecture with modern improvements:

  • 🔄 Rotary Position Embeddings (RoPE) - Better positional encoding
  • 📊 RMSNorm - More stable than LayerNorm
  • SwiGLU Activation - Enhanced feed-forward networks
  • 🎯 Trained on FineWeb - High-quality web text dataset

Model Details

Property            Value
------------------  -------------------------
Parameters          ~166M
Architecture        Transformer (Llama-style)
Context Length      1024 tokens
Vocabulary Size     50,304 (GPT-2 tokenizer)
Hidden Size         768
Layers              12
Attention Heads     12
Intermediate Size   2048
Precision           bfloat16 / float32
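
These values can be checked against the published configuration. A minimal sketch (the attribute names assume a standard Llama-style config and may be named differently in a custom one):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("AjithBharadwaj/nano-llama-base", trust_remote_code=True)

# Attribute names assume a Llama-style config; getattr avoids errors if a field is named differently
for name in ("hidden_size", "num_hidden_layers", "num_attention_heads",
             "intermediate_size", "vocab_size", "max_position_embeddings"):
    print(name, getattr(config, name, "n/a"))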

Quick Start

Installation

pip install transformers torch

Basic Usage (CPU)

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AjithBharadwaj/nano-llama-base",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("AjithBharadwaj/nano-llama-base")

# Generate text
inputs = tokenizer("Hello, I'm a language model,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

GPU Usage (Recommended)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Auto-detect device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load model on GPU with bfloat16 for better performance
model = AutoModelForCausalLM.from_pretrained(
    "AjithBharadwaj/nano-llama-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16  # Reduces memory by 50%
).to(device)

tokenizer = AutoTokenizer.from_pretrained("AjithBharadwaj/nano-llama-base")

# Generate on GPU
inputs = tokenizer("The future of AI is", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Using Pipeline (Easiest)

from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1

generator = pipeline(
    "text-generation",
    model="AjithBharadwaj/nano-llama-base",
    trust_remote_code=True,
    device=device
)

output = generator(
    "Once upon a time",
    max_length=100,
    do_sample=True,
    temperature=0.8,
    top_k=50
)
print(output[0]['generated_text'])

Training Details

Optimization

  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
  • Learning Rate: Cosine decay with warmup
  • Precision: Mixed precision training (bfloat16)
  • Gradient Clipping: 1.0
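
A minimal sketch of the training setup described above in PyTorch. Only the AdamW betas/epsilon, cosine decay with warmup, and the clipping norm come from this card; the peak learning rate, warmup steps, and total steps are illustrative placeholders, and the linear layer stands in for the actual model.

import math
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the actual nano-llama model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,  # placeholder peak LR; only betas/eps below are documented
    betas=(0.9, 0.95),
    eps=1e-8,
)

warmup_steps, total_steps = 1_000, 100_000  # placeholder schedule lengths

def lr_lambda(step):
    # Linear warmup, then cosine decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, clip gradients to 1.0 before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)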

Dataset

  • Source: FineWeb (high-quality filtered web text)
  • Preprocessing: GPT-2 tokenization
  • Context Window: 1024 tokens
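
A sketch of this preprocessing: tokenize documents with the GPT-2 tokenizer and pack them into 1024-token blocks. The packing scheme here is a common convention, not necessarily the exact pipeline used for training.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 1024

def pack_into_blocks(texts):
    # Concatenate tokenized documents (separated by EOS) and split into fixed-size blocks
    ids = []
    for text in texts:
        ids.extend(tokenizer(text)["input_ids"] + [tokenizer.eos_token_id])
    return [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]

blocks = pack_into_blocks(["Example document one.", "Example document two."])
print(f"{len(blocks)} full blocks of {block_size} tokens")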

Architecture Highlights

Rotary Position Embeddings (RoPE)

Enables better length extrapolation and relative position encoding without absolute position embeddings.
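
A minimal sketch of rotary embeddings applied to query/key tensors, using the classic interleaved-pair formulation (the in-repo implementation may differ in layout details):

import torch

def rotary_embedding(x, base=10000.0):
    # x: (..., seq_len, head_dim); rotate each channel pair by a position-dependent angle
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

q = torch.randn(1, 12, 16, 64)  # (batch, heads, seq_len, head_dim)
print(rotary_embedding(q).shape)  # torch.Size([1, 12, 16, 64])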

RMSNorm

Uses Root Mean Square Layer Normalization for improved training stability and slightly faster computation.
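
A minimal reference implementation of RMSNorm (a standard formulation; the in-repo version may differ in details such as the epsilon value):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the features; no mean subtraction, no bias
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * inv_rms)

norm = RMSNorm(768)
print(norm(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])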

SwiGLU Activation

Implements the Swish-Gated Linear Unit (SwiGLU) in the feed-forward network: the input is projected into two intermediate tensors, one is passed through SiLU (Swish) and used to gate the other elementwise, and the product is projected back to the hidden size.
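
A minimal sketch of such a block using the hidden size (768) and intermediate size (2048) from the table above (a standard Llama-style formulation; projection names are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, hidden_size=768, intermediate_size=2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU(x W_gate) elementwise-multiplied with (x W_up), then projected down
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

ffn = SwiGLUFeedForward()
print(ffn(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])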

Limitations

⚠️ Important Considerations:

  • This is a base model without instruction tuning or alignment
  • May generate biased, factually incorrect, or inappropriate content
  • Trained primarily on English text
  • Context limited to 1024 tokens
  • Not suitable for production without safety measures

Use responsibly and implement appropriate content filtering.

Model Card Authors

Ajith Bharadwaj

Citation

@misc{nano-llama-base-2026,
  author = {Ajith Bharadwaj},
  title = {Nano-Llama Base: A Compact Language Model with Llama Architecture},
  year = {2026},
  publisher = {HuggingFace Hub},
  url = {https://huggingface.co/AjithBharadwaj/nano-llama-base},
  note = {166M parameter transformer model with RoPE, RMSNorm, and SwiGLU}
}

License

MIT License - Free for commercial and research use.

Acknowledgments

  • Architecture inspired by Meta's Llama models
  • Trained on FineWeb dataset by HuggingFace
  • Built with PyTorch and Transformers library