# Nano-Llama Base

A compact language model (~166M parameters) based on the Llama architecture, with modern improvements:
- 🔄 Rotary Position Embeddings (RoPE) - Better positional encoding
- 📊 RMSNorm - More stable than LayerNorm
- ⚡ SwiGLU Activation - Enhanced feed-forward networks
- 🎯 Trained on FineWeb - High-quality web text dataset
## Model Details
| Property | Value |
|---|---|
| Parameters | ~166M |
| Architecture | Transformer (Llama-style) |
| Context Length | 1024 tokens |
| Vocabulary Size | 50,304 (GPT-2 tokenizer) |
| Hidden Size | 768 |
| Layers | 12 |
| Attention Heads | 12 |
| Intermediate Size | 2048 |
| Precision | bfloat16 / float32 |
## Quick Start

### Installation

```bash
pip install transformers torch
```
### Basic Usage (CPU)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AjithBharadwaj/nano-llama-base",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("AjithBharadwaj/nano-llama-base")

# Generate text
inputs = tokenizer("Hello, I'm a language model,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### GPU Usage (Recommended)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Auto-detect device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load model on GPU with bfloat16 for better performance
model = AutoModelForCausalLM.from_pretrained(
    "AjithBharadwaj/nano-llama-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16  # Roughly halves memory vs. float32
).to(device)
tokenizer = AutoTokenizer.from_pretrained("AjithBharadwaj/nano-llama-base")

# Generate on GPU
inputs = tokenizer("The future of AI is", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Using Pipeline (Easiest)

```python
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
generator = pipeline(
    "text-generation",
    model="AjithBharadwaj/nano-llama-base",
    trust_remote_code=True,
    device=device
)
output = generator(
    "Once upon a time",
    max_length=100,
    do_sample=True,
    temperature=0.8,
    top_k=50
)
print(output[0]["generated_text"])
```
## Training Details

### Optimization
- Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- Learning Rate: Cosine decay with warmup (see the sketch after this list)
- Precision: Mixed precision training (bfloat16)
- Gradient Clipping: 1.0
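A minimal sketch of this setup in PyTorch, assuming `model` is the module being trained; `warmup_steps`, `max_steps`, and the peak learning rate are illustrative placeholders, not the values used for this model:

```python
import math
import torch

# AdamW with the betas/eps listed above; the peak LR is a placeholder
optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-4, betas=(0.9, 0.95), eps=1e-8
)

def lr_lambda(step, warmup_steps=1_000, max_steps=100_000):
    # Linear warmup, then cosine decay toward zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, clip gradients before each step:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```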
### Dataset
- Source: FineWeb (high-quality filtered web text)
- Preprocessing: GPT-2 tokenization (sketched below)
- Context Window: 1024 tokens
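A sketch of this preprocessing, assuming the public FineWeb dataset id (`HuggingFaceFW/fineweb`) and its `text` column; the actual training pipeline may differ:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Stream FineWeb so the full dataset never has to fit on disk
dataset = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

block_size = 1024  # matches the model's context window
buffer = []
for example in dataset:
    # Tokenize each document, appending EOS as a document separator
    buffer.extend(tokenizer(example["text"])["input_ids"])
    buffer.append(tokenizer.eos_token_id)
    while len(buffer) >= block_size:
        block, buffer = buffer[:block_size], buffer[block_size:]
        # ... `block` is one fixed-length training sequence
```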
## Architecture Highlights

### Rotary Position Embeddings (RoPE)
Enables better length extrapolation and relative position encoding without absolute position embeddings.
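A minimal sketch of the idea, assuming an interleaved pairing of head dimensions (the model's own implementation may arrange pairs differently):

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate (even, odd) dimension pairs of x by position-dependent
    angles. x has shape (batch, seq_len, n_heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, dim/2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```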
### RMSNorm
Uses Root Mean Square Layer Normalization for improved training stability and slightly faster computation.
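For reference, a minimal RMSNorm in PyTorch (equivalent in spirit to the model's normalization layers; the `eps` value here is an assumption):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by the root-mean-square of the features; unlike
    LayerNorm there is no mean subtraction and no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```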
### SwiGLU Activation

Implements a Swish-Gated Linear Unit (SwiGLU) in the feed-forward network: a SiLU-activated gate projection modulates a parallel up projection before the result is projected back down.
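A minimal sketch with the dimensions from the table above (projection names are illustrative, not necessarily the checkpoint's module names):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Llama-style FFN: down_proj(SiLU(gate_proj(x)) * up_proj(x))."""
    def __init__(self, dim=768, hidden_dim=2048):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SiLU(x W_gate) gates the linear branch (x W_up)
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```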
## Limitations
⚠️ Important Considerations:
- This is a base model without instruction tuning or alignment
- May generate biased, factually incorrect, or inappropriate content
- Trained primarily on English text
- Context limited to 1024 tokens
- Not suitable for production without safety measures
Use responsibly and implement appropriate content filtering.
## Model Card Authors
Ajith Bharadwaj
## Citation

```bibtex
@misc{nano-llama-base-2026,
  author    = {Ajith Bharadwaj},
  title     = {Nano-Llama Base: A Compact Language Model with Llama Architecture},
  year      = {2026},
  publisher = {HuggingFace Hub},
  url       = {https://huggingface.co/AjithBharadwaj/nano-llama-base},
  note      = {166M parameter transformer model with RoPE, RMSNorm, and SwiGLU}
}
```
## License
MIT License - Free for commercial and research use.
## Acknowledgments
- Architecture inspired by Meta's Llama models
- Trained on FineWeb dataset by HuggingFace
- Built with PyTorch and the Transformers library