Pothana Base v2 — 225M Telugu Language Model

A ~225M-parameter LLaMA-style decoder pretrained from scratch on a mixed Telugu (~91%) + English (~9%) corpus with a hybrid morfessor + BPE tokenizer. Designed as a strong base model for downstream retrieval-augmented and instruction fine-tuning on Telugu.

Status: pretrained base model. Not yet instruction-tuned or RAG-aligned.

Quick start

pip install "transformers>=4.40,<5.0" morfessor

⚠️ transformers 5.x is not supported yet. The tokenizers 0.22+ dependency in transformers 5.x has a WordLevel encoding regression that char-fragments our morfessor-segmented input. Pin to a 4.x release. Tested on 4.55–4.57.

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="dvitvaai/pothana-base-v2-225M",
    trust_remote_code=True,
)
print(pipe("నేను రేపు ఆఫీసుకు వెళ్లాలి"))

Or with the lower-level API:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dvitvaai/pothana-base-v2-225M", trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "dvitvaai/pothana-base-v2-225M", trust_remote_code=True,
)

# Raw Telugu input — the tokenizer runs morfessor v4 segmentation internally.
inputs = tokenizer("నేను రేపు ఆఫీసుకు వెళ్లాలి", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=80,
    do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.15,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))

trust_remote_code=True is required for:

- The model class (PothanaForCausalLM): LLaMA + QK-norm
- The tokenizer class (PothanaTokenizer): runs morfessor v4 segmentation on Telugu input and strips the @@ continuation prefix at decode

The morfessor package is required so the tokenizer can segment raw Telugu text the way training did. The morfessor model (morfessor_telugu.bin) and supporting files are shipped in the repo and loaded automatically.

Generation defaults: a generation_config.json is shipped with do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.15 because the model tends to fall into repetition loops under greedy decoding.

Architecture


Parameters	222M unique (370M on disk due to weight-sharing unroll)
Hidden size	768
Layers (unique)	24
Layers (effective, with weight sharing)	48
Attention heads	16 query, 4 key/value (GQA, ratio 4:1)
Head dim	48
Intermediate (SwiGLU)	2048
Activation	SwiGLU
Norm	RMSNorm (eps=1e-6)
Position encoding	RoPE, θ=500,000
QK-norm	yes (RMSNorm on Q and K, Llama 3.1 style)
Tied embeddings	no (lm_head separate from wte; +36M params for capacity)
Vocab size	47,831
Max context	4,096
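
The parameter figures in the table can be checked with a back-of-envelope count. The sketch below assumes bias-free linear layers and ignores the small RMSNorm weight vectors; the function names are illustrative and not taken from the repository.

```python
# Back-of-envelope parameter count from the architecture table.
# Assumptions (not confirmed by the repo): no bias terms, norm weights ignored.

VOCAB = 47_831
HIDDEN = 768
N_Q_HEADS = 16
N_KV_HEADS = 4
HEAD_DIM = 48
INTERMEDIATE = 2048
UNIQUE_LAYERS = 24
UNROLLED_LAYERS = 48


def block_params() -> int:
    """Weights in one transformer block (GQA attention + SwiGLU MLP)."""
    q_proj = HIDDEN * N_Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * N_KV_HEADS * HEAD_DIM  # K and V use the narrower GQA width
    o_proj = N_Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections
    return q_proj + kv_proj + o_proj + mlp


def total_params(n_layers: int) -> int:
    """Transformer blocks plus the untied input embedding and lm_head."""
    embeddings = 2 * VOCAB * HIDDEN
    return n_layers * block_params() + embeddings


unique = total_params(UNIQUE_LAYERS)
on_disk = total_params(UNROLLED_LAYERS)
print(f"unique:  {unique / 1e6:.1f}M")
print(f"on disk: {on_disk / 1e6:.1f}M")
```

Under these assumptions the count comes to roughly 222M unique and 371M unrolled parameters, in line with the table. Each embedding matrix alone is about 36.7M, which accounts for the "+36M" attributed to the untied lm_head.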

Weight sharing (MobileLLM-LS style)

24 unique transformer blocks; each unique block runs twice in sequence (block-wise weight sharing). HF representation unrolls this to 48 layers with duplicated weights, so standard from_pretrained() works without custom logic.
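
To make the unrolled layout concrete, here is a minimal sketch of which unique block each of the 48 unrolled layers would duplicate. It assumes immediate repetition (block 0 twice, then block 1 twice, and so on), which is how "runs twice in sequence" reads; the helper name is hypothetical.

```python
# Illustrative mapping from unrolled layer index to unique block index.
# Assumes immediate block-wise repetition: block 0 twice, then block 1 twice, ...

UNIQUE_BLOCKS = 24
REPEATS = 2


def unique_block_for(layer_idx: int) -> int:
    """Return which unique block an unrolled layer duplicates."""
    if not 0 <= layer_idx < UNIQUE_BLOCKS * REPEATS:
        raise IndexError(f"layer index out of range: {layer_idx}")
    return layer_idx // REPEATS


schedule = [unique_block_for(i) for i in range(UNIQUE_BLOCKS * REPEATS)]
print(schedule[:6])  # first three unique blocks, each applied twice
print(len(schedule), "unrolled layers")
```

Storing the duplicates explicitly costs disk space but lets a standard 48-layer loader read the checkpoint without knowing about the sharing.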

Tokenizer

Type: morfessor_bpe_telugu_v4 (custom)
Vocab: 47,831 tokens
- Telugu morphemes (segmented via Morfessor on the Sangraha Telugu corpus)
- BPE subwords for non-Telugu text (8000 merges) → enables English coverage
- Character fallback for OOV Telugu
- 4 base special tokens: <pad>=0, <unk>=1, <bos>=2, <eos>=3
- 9 reserved retrieval special tokens (IDs 47822–47830): <search>, </search>, <retrieved>, </retrieved>, <doc>, </doc>, <cite>, <think>, </think>. Unused during base pretraining — reserved for downstream retrieval fine-tuning.
Continuation marker: @@ prefix on morphemes that attach to the previous word (e.g., మా @@కు → మాకు).
Preprocessing: the PothanaTokenizer class runs morfessor v4 segmentation on Telugu input automatically. The morfessor_telugu.bin, suffix_set.json, and word_frequencies.txt sidecar files are shipped in the repo and loaded at first use. Requires pip install morfessor.
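
The continuation marker can be illustrated with a small standalone function. This is a hypothetical helper showing the idea only, not the PothanaTokenizer implementation, and it assumes tokens arrive as plain strings.

```python
# Hypothetical helper showing how "@@" continuation markers rejoin morphemes.
# This is not the PothanaTokenizer code, only the idea behind its decode step.

CONTINUATION = "@@"


def join_morphemes(tokens):
    """Glue '@@'-prefixed morphemes onto the preceding word; space-separate the rest."""
    words = []
    for tok in tokens:
        if tok.startswith(CONTINUATION):
            piece = tok[len(CONTINUATION):]
            if words:
                words[-1] += piece
            else:
                # A leading continuation with nothing before it becomes its own word.
                words.append(piece)
        else:
            words.append(tok)
    return " ".join(words)


print(join_morphemes(["మా", "@@కు"]))
```

With the example from the list above, the two morphemes are rejoined into the single word మాకు.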

English fertility

Measured on Wikipedia samples: ~1.81 tokens/word, 0% UNK rate. This is higher than a dedicated English BPE tokenizer typically achieves (roughly 1.3 tokens/word), which is acceptable since English is only ~9% of training data.
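
Fertility here means tokens produced per whitespace-delimited word. A minimal sketch of the measurement follows; `fertility` and `toy_tokenize` are hypothetical names, and the toy tokenizer only stands in for a real one.

```python
# Fertility = tokens produced per whitespace-delimited word.
# `tokenize` is any callable mapping a string to a list of tokens (an assumption here).

def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word across `texts`."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    if n_words == 0:
        raise ValueError("no words to measure")
    return n_tokens / n_words


def toy_tokenize(text):
    """Stand-in tokenizer: split each word into chunks of at most 4 characters."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


print(fertility(["tokenizers split words"], toy_tokenize))
```

To measure a real tokenizer, one would pass its tokenize method in place of `toy_tokenize`, over a sample of the target-language text.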

Training data

Telugu: ~3.07B tokens, sourced from the Sangraha corpus, morfessor-segmented
English: ~300M tokens (~10%) from wikimedia/wikipedia (20231101.en), tokenized via the BPE fallback
Mix: 3.37B total training tokens (91.2% Telugu / 8.8% English)
UNK rate on Telugu training set: 8.8e-6 (essentially zero)

The two languages are concatenated (train.bin is Telugu followed by English) and the dataloader uses random uniform sampling across the full file — sequences are effectively independent draws.
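
The sampling scheme can be sketched in a few lines. The real dataloader is not shown in this card, so `sample_batch` and the toy token stream below are assumptions that only illustrate uniform random offsets over one concatenated file.

```python
# Sketch of uniform random-offset sampling over one concatenated token stream.
# `sample_batch` and the toy data are hypothetical; the project's loader may differ.
import random


def sample_batch(tokens, batch_size, seq_len, rng):
    """Draw `batch_size` windows at independent uniform offsets.

    Returns (inputs, targets), where targets are the inputs shifted by one token.
    """
    max_start = len(tokens) - seq_len - 1
    if max_start < 0:
        raise ValueError("token stream is shorter than seq_len + 1")
    inputs, targets = [], []
    for _ in range(batch_size):
        start = rng.randint(0, max_start)  # inclusive on both ends
        inputs.append(tokens[start:start + seq_len])
        targets.append(tokens[start + 1:start + seq_len + 1])
    return inputs, targets


# Toy stream: ids 0-89 stand in for Telugu, ids 90-99 for the English tail.
stream = list(range(100))
x, y = sample_batch(stream, batch_size=4, seq_len=8, rng=random.Random(0))
print(x[0], y[0])
```

Because offsets are drawn over the whole file, each batch mixes the two languages roughly in proportion to their token counts, with no explicit shuffling step needed.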

Training procedure

Hardware: 1× NVIDIA B200 (192 GB HBM)
Wall time: 48.6 hours
Total steps: 8,000
Effective batch: 512 sequences × 4,096 tokens = 2.1M tokens/step
Total training tokens: ~16.8B (≈ 5 epochs over the 3.37B-token corpus)
Optimizer: AdamW, β=(0.9, 0.95), weight decay 0.1, grad clip 1.0
Learning rate: peak 5e-4 with WSD schedule (warmup 3,000 steps → stable to step 5,600 → linear decay to 5e-5 at step 8,000)
Loss: cross-entropy + z-loss (λ=1e-4) for output normalization
Mixed precision: bf16 with fp32 master weights
Throughput: ~95,800 tokens/sec sustained
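
The WSD learning-rate schedule listed above can be written as a small function. This is a sketch that assumes warmup is linear from zero (the card only gives its length); the constant and function names are illustrative.

```python
# Sketch of the warmup-stable-decay (WSD) schedule described above.
# Assumption: warmup rises linearly from 0; the card only states its length.

PEAK_LR = 5e-4
FINAL_LR = 5e-5
WARMUP_END = 3_000
STABLE_END = 5_600
TOTAL_STEPS = 8_000


def wsd_lr(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_END:
        return PEAK_LR * step / WARMUP_END
    if step <= STABLE_END:
        return PEAK_LR
    # Linear decay from PEAK_LR at STABLE_END to FINAL_LR at TOTAL_STEPS.
    progress = min(1.0, (step - STABLE_END) / (TOTAL_STEPS - STABLE_END))
    return PEAK_LR + progress * (FINAL_LR - PEAK_LR)


for s in (0, 1_500, 3_000, 5_600, 6_800, 8_000):
    print(s, f"{wsd_lr(s):.2e}")
```

The stable phase ends at step 5,600, which is 70% of the 8,000 total steps, leaving the final 30% for the linear decay.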

Loss trajectory

Step	val_loss (training-time)	notes
500	5.5729	start of training
1,500	4.0242
3,000	3.5619	warmup ends
5,000	3.3740	late stable-LR phase (decay starts after step 5,600)
6,000	3.2867	early decay
7,500	3.1856	last training-time eval
8,000 (final)	3.1631	deterministic eval, 40 batches × 8 × 4,096

Architectural tier-1 improvements over prior baselines (in this project)

Feature	Value	Why
Untied embeddings	+36M params for dedicated `lm_head`	Capacity improvement, ~0.05 NLL expected
QK-norm	RMSNorm on Q, K before RoPE	Long-context stability (Llama 3.1, Cosmos)
z-loss	λ=1e-4	Prevents logit drift (PaLM, Gemini)
4096 context	from 2048 baseline	Headroom for downstream retrieval
WSD schedule	70% stable / 30% linear decay	More efficient than cosine at this scale
10% English mix	~300M Wikipedia tokens	Cross-lingual capability for future retrieval over English sources
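
As an illustration of the z-loss row, the sketch below computes cross-entropy and the z penalty for a single token position in plain Python. It uses the common formulation λ·(log Z)², where Z is the softmax normalizer; whether this project's training code matches it exactly is an assumption.

```python
# Cross-entropy plus z-loss for a single position, in plain Python.
# Uses the common formulation z_penalty = lambda * (log Z)^2, with Z the softmax
# normalizer; the project's exact implementation is not shown in this card.
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(x)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def ce_with_z_loss(logits, target, z_coeff=1e-4):
    """Return (cross_entropy, z_penalty) for one token's logits."""
    log_z = logsumexp(logits)
    cross_entropy = log_z - logits[target]
    z_penalty = z_coeff * log_z ** 2
    return cross_entropy, z_penalty


ce, z = ce_with_z_loss([2.0, 0.5, -1.0], target=0)
print(f"ce={ce:.4f}  z={z:.6f}")

# Shifting every logit by a constant leaves the cross-entropy unchanged
# but raises the z penalty, which is what discourages logit drift.
ce_shift, z_shift = ce_with_z_loss([12.0, 10.5, 9.0], target=0)
print(f"ce={ce_shift:.4f}  z={z_shift:.6f}")
```

The total training loss in this formulation would be the sum of the two returned terms, averaged over positions.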

Evaluation

Final val loss (held-out Telugu + English mix): 3.1631 (perplexity ≈ 23.6)

Comparable models in this project's history:

Prior engram baseline (235M, 9000 steps, 2048 ctx, no QK-norm, tied emb): val 3.42 — 0.26 NLL worse

Base v2 therefore represents a ~23% perplexity reduction over the engram baseline (≈30.6 → ≈23.6).
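
The perplexity figures follow directly from the reported losses, assuming both are mean negative log-likelihoods in nats per token. A quick check:

```python
# Perplexity is exp(loss) when the loss is mean NLL in nats per token.
import math

v2_loss = 3.1631
baseline_loss = 3.42

v2_ppl = math.exp(v2_loss)
baseline_ppl = math.exp(baseline_loss)
reduction = 1.0 - v2_ppl / baseline_ppl

print(f"v2 perplexity:       {v2_ppl:.1f}")
print(f"baseline perplexity: {baseline_ppl:.1f}")
print(f"relative reduction:  {reduction:.1%}")
```

This gives perplexities of roughly 23.6 and 30.6, a relative reduction of about 23% (equivalently, the baseline perplexity is about 29% higher than that of Base v2).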

External benchmarks (IndicGLUE, TyDi-QA-Telugu, etc.) have not been run yet for this checkpoint and will be added when available.

Intended use

This is a pretrained base model, not an instruction-tuned model. It is suitable as a starting point for:

Telugu text continuation / completion experiments
Fine-tuning for downstream tasks (classification, NER, summarization)
Retrieval-augmented generation (RAG) fine-tuning: the special tokens for retrieval are already in the vocabulary; see the project notes in RETRIEVAL.md for the planned post-training pipeline (continued pretrain → SFT → DPO → verifier)
Research on small-scale Telugu language modeling

The model is not suitable for direct use as a chat assistant without further fine-tuning.

Limitations

No instruction tuning: zero-shot prompts will get continuation-style outputs, not Q&A-style responses.
Small parameter count (225M): limited factual knowledge; reasoning depth is modest.
Tokenizer needs morfessor: the PothanaTokenizer class runs morfessor segmentation internally, but requires pip install morfessor. First call is slow (~5–10s warming the segmentation cache from word_frequencies.txt); subsequent calls are fast.
English fertility is suboptimal (~1.81 tok/word; dedicated English BPE tokenizers are typically nearer ~1.3 tok/word, i.e. ~0.75 words per token), so English-heavy use cases would benefit from a different tokenizer.
Telugu Wikipedia and high-quality Telugu factual data are limited in the training corpus; the model's factual knowledge is heavily skewed toward what appears in Sangraha (general web Telugu).
No safety / alignment work has been done. The base model can produce toxic, biased, or fabricated content. Do not use in production without adding appropriate guardrails.

Citation

If you use this model, please reference:

@misc{pothana-base-v2-225M,
  title  = {Pothana Base v2: A 225M Telugu LLaMA-style language model with QK-norm},
  author = {Katrapati, Ganesh},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/dvitvaai/pothana-base-v2-225M}},
}

Acknowledgments

Training corpus from AI4Bharat Sangraha
English data from Wikimedia Foundation
Architecture inspired by LLaMA, MobileLLM-LS (weight sharing), Llama 3.1 (QK-norm)
Training schedule (WSD) follows recent recommendations from the SlowRun benchmark community

License

Apache 2.0. Free for research and commercial use with attribution.
