# Nano-Llama Base

A compact language model (~166M parameters) based on the Llama architecture, with modern improvements:
- 🔄 Rotary Position Embeddings (RoPE) - Better positional encoding
- 📊 RMSNorm - More stable than LayerNorm
- ⚡ SwiGLU Activation - Enhanced feed-forward networks
- 🎯 Trained on FineWeb - High-quality web text dataset
## Model Details
| Property | Value |
|---|---|
| Parameters | ~166M |
| Architecture | Transformer (Llama-style) |
| Context Length | 1024 tokens |
| Vocabulary Size | 50,304 (GPT-2 tokenizer) |
| Hidden Size | 768 |
| Layers | 12 |
| Attention Heads | 12 |
| Intermediate Size | 2048 |
| Precision | bfloat16 / float32 |
## Quick Start

### Installation

```bash
pip install transformers torch
```
### Basic Usage (CPU)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AjithBharadwaj/nano-llama-base",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("AjithBharadwaj/nano-llama-base")

# Generate text
inputs = tokenizer("Hello, I'm a language model,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### GPU Usage (Recommended)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Auto-detect device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load model on GPU with bfloat16 for better performance
model = AutoModelForCausalLM.from_pretrained(
    "AjithBharadwaj/nano-llama-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16  # Roughly halves memory vs. float32
).to(device)
tokenizer = AutoTokenizer.from_pretrained("AjithBharadwaj/nano-llama-base")

# Generate on GPU
inputs = tokenizer("The future of AI is", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Using Pipeline (Easiest)

```python
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
generator = pipeline(
    "text-generation",
    model="AjithBharadwaj/nano-llama-base",
    trust_remote_code=True,
    device=device
)
output = generator(
    "Once upon a time",
    max_length=100,
    do_sample=True,
    temperature=0.8,
    top_k=50
)
print(output[0]["generated_text"])
```
## Training Details

### Optimization
- Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- Learning Rate: Cosine decay with warmup (see the sketch after this list)
- Precision: Mixed precision training (bfloat16)
- Gradient Clipping: 1.0
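A minimal sketch of this setup in PyTorch, assuming `model` is the module being trained; `warmup_steps`, `max_steps`, and the peak learning rate are illustrative placeholders, not the values used for this model:

```python
import math
import torch

# AdamW with the betas/eps listed above; the peak LR is a placeholder
optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-4, betas=(0.9, 0.95), eps=1e-8
)

def lr_lambda(step, warmup_steps=1_000, max_steps=100_000):
    # Linear warmup, then cosine decay toward zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, clip gradients before each step:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```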
### Dataset
- Source: FineWeb (high-quality filtered web text)
- Preprocessing: GPT-2 tokenization (sketched below)
- Context Window: 1024 tokens
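A sketch of this preprocessing, assuming the public FineWeb dataset id (`HuggingFaceFW/fineweb`) and its `text` column; the actual training pipeline may differ:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Stream FineWeb so the full dataset never has to fit on disk
dataset = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

block_size = 1024  # matches the model's context window
buffer = []
for example in dataset:
    # Tokenize each document, appending EOS as a document separator
    buffer.extend(tokenizer(example["text"])["input_ids"])
    buffer.append(tokenizer.eos_token_id)
    while len(buffer) >= block_size:
        block, buffer = buffer[:block_size], buffer[block_size:]
        # ... `block` is one fixed-length training sequence
```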
## Architecture Highlights

### Rotary Position Embeddings (RoPE)
Enables better length extrapolation and relative position encoding without absolute position embeddings.
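A minimal sketch of the idea, assuming an interleaved pairing of head dimensions (the model's own implementation may arrange pairs differently):

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate (even, odd) dimension pairs of x by position-dependent
    angles. x has shape (batch, seq_len, n_heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, dim/2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```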
### RMSNorm
Uses Root Mean Square Layer Normalization for improved training stability and slightly faster computation.
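For reference, a minimal RMSNorm in PyTorch (equivalent in spirit to the model's normalization layers; the `eps` value here is an assumption):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by the root-mean-square of the features; unlike
    LayerNorm there is no mean subtraction and no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```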
### SwiGLU Activation

Implements a Swish-Gated Linear Unit (SwiGLU) in the feed-forward network: a SiLU-activated gate projection modulates a parallel up projection before the result is projected back down.
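A minimal sketch with the dimensions from the table above (projection names are illustrative, not necessarily the checkpoint's module names):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Llama-style FFN: down_proj(SiLU(gate_proj(x)) * up_proj(x))."""
    def __init__(self, dim=768, hidden_dim=2048):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SiLU(x W_gate) gates the linear branch (x W_up)
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```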
## Limitations
⚠️ Important Considerations:
- This is a base model without instruction tuning or alignment
- May generate biased, factually incorrect, or inappropriate content
- Trained primarily on English text
- Context limited to 1024 tokens
- Not suitable for production without safety measures
Use responsibly and implement appropriate content filtering.
## Model Card Authors
Ajith Bharadwaj
## Citation

```bibtex
@misc{nano-llama-base-2026,
  author    = {Ajith Bharadwaj},
  title     = {Nano-Llama Base: A Compact Language Model with Llama Architecture},
  year      = {2026},
  publisher = {HuggingFace Hub},
  url       = {https://huggingface.co/AjithBharadwaj/nano-llama-base},
  note      = {166M parameter transformer model with RoPE, RMSNorm, and SwiGLU}
}
```
## License
MIT License - Free for commercial and research use.
## Acknowledgments
- Architecture inspired by Meta's Llama models
- Trained on FineWeb dataset by HuggingFace
- Built with PyTorch and the Transformers library