# Pothana Base 300M
A 345M parameter LLaMA-style language model trained from scratch on Telugu text.
Named after Bammera Pothana, the celebrated 15th-century Telugu poet who authored the Andhra Maha Bhagavatamu.
Developed by Dvitva AI.
## Model Details
| Property | Value |
|---|---|
| Model | pothana-base-300M |
| Architecture | LLaMA (RoPE + SwiGLU + RMSNorm) |
| Parameters | 345M |
| Hidden size | 1024 |
| Layers | 20 |
| Attention heads | 16 |
| Intermediate size | 2816 |
| Context length | 2048 |
| Vocab size | 86,071 |
| Tokenizer | Morfessor + BPE (Telugu morpheme-aware) |
| Training | Single GPU, bf16 mixed precision |
| Developed by | Dvitva AI |
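The 345M parameter count can be reproduced from the table above. A back-of-the-envelope sketch, assuming tied input/output embeddings (not stated in this card) and standard LLaMA layer shapes:

```python
# Rough parameter count from the configuration in the table above.
# Assumes tied input/output embeddings -- an assumption, not confirmed by the card.
vocab, d, layers, ffn = 86_071, 1024, 20, 2816

embedding = vocab * d                       # token embeddings (shared with lm_head if tied)
attention = 4 * d * d                       # Q, K, V, O projections per layer
mlp = 3 * d * ffn                           # SwiGLU: gate, up, and down projections
norms = 2 * d                               # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

total = embedding + layers * per_layer + d  # + final RMSNorm
print(f"{total / 1e6:.0f}M")                # ≈ 345M
```

Untied embeddings would add another ~88M parameters, so the tied-embedding assumption is what makes the total land at 345M.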
## Quick Start

### Using pipeline
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="dvitvaai/pothana-base-300M", trust_remote_code=True)
result = pipe("తెలుగు భాష", max_new_tokens=50, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])
```
**Note:** `trust_remote_code=True` is required for the custom tokenizer that handles `@@` morpheme joining. Without it, `@@` markers will appear in the output.
### Manual loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("dvitvaai/pothana-base-300M")
tokenizer = AutoTokenizer.from_pretrained("dvitvaai/pothana-base-300M", trust_remote_code=True)

# Input must be Morfessor-segmented (with @@ continuation markers)
segmented_text = "తెలుగు భాష చాలా అందమైన@@ ది"
inputs = tokenizer(segmented_text, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Tokenizer
This model uses a Morfessor + BPE hybrid tokenizer designed for Telugu:

- **Telugu text:** segmented into morphemes using Morfessor, with `@@` continuation markers
- **Non-Telugu text** (English, numbers, URLs): handled by BPE subword encoding
- **Fallback:** character-level encoding for out-of-vocabulary tokens

**Important:** the tokenizer expects pre-segmented input (with `@@` markers). For raw Telugu text, you need to run Morfessor segmentation first.
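The `@@` convention itself is simple to illustrate with plain string operations. A minimal sketch of how the continuation markers rejoin into surface text (an illustration of the convention, not the tokenizer's actual implementation):

```python
def join_morphemes(text: str) -> str:
    # "@@ " marks a morpheme that continues into the next token;
    # deleting the marker and the following space rejoins the word.
    return text.replace("@@ ", "")

print(join_morphemes("తెలుగు భాష చాలా అందమైన@@ ది"))
# → తెలుగు భాష చాలా అందమైనది
```

This is what the custom tokenizer loaded via `trust_remote_code=True` does for you at decode time.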
### Full pipeline (raw Telugu text)
For raw Telugu text, segment with Morfessor first:
```python
import re
import morfessor

# Load the Morfessor segmentation model
io = morfessor.MorfessorIO()
morf_model = io.read_binary_model_file("morfessor_telugu.bin")

TELUGU_RE = re.compile(r"[\u0C00-\u0C7F]+")

def segment_telugu(text, separator="@@"):
    tokens = []
    for word in text.split():
        if TELUGU_RE.fullmatch(word):
            # viterbi_segment returns (segments, score); keep the segments
            segments = morf_model.viterbi_segment(word)[0]
            for i, seg in enumerate(segments):
                tokens.append(seg + separator if i < len(segments) - 1 else seg)
        else:
            tokens.append(word)
    return " ".join(tokens)

# Segment, then tokenize and generate
raw_text = "తెలుగు భాష చాలా అందమైనది"
segmented = segment_telugu(raw_text)
inputs = tokenizer(segmented, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training
- Data: Telugu text corpus (Sangraha dataset)
- Preprocessing: Morfessor morpheme segmentation + BPE for non-Telugu
- Optimizer: AdamW (lr=3e-4, weight_decay=0.1, beta1=0.9, beta2=0.95)
- Schedule: Cosine LR decay with 500-step warmup
- Precision: bf16 mixed precision
- Hardware: Single GPU
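The schedule above can be sketched as a step-to-learning-rate function. A minimal sketch, assuming linear warmup; the total step count and floor learning rate are illustrative assumptions, not values from this card:

```python
import math

def lr_at(step, max_lr=3e-4, warmup=500, total=100_000, min_lr=3e-5):
    """Linear warmup for `warmup` steps, then cosine decay to `min_lr`.

    `total` and `min_lr` are assumed values for illustration only.
    """
    if step < warmup:
        return max_lr * step / warmup          # linear ramp from 0 to max_lr
    t = (step - warmup) / (total - warmup)     # decay progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
```

The peak of 3e-4 is reached exactly at step 500, after which the rate decays smoothly toward the floor.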
## Limitations
- This is a base model (not instruction-tuned) — it performs text completion, not instruction following
- The tokenizer requires Morfessor-segmented input for best results
- Trained primarily on Telugu text; limited multilingual capability
- Small model size (345M) limits reasoning and knowledge capacity
## License
Apache 2.0
## Citation
If you use this model, please cite:
```bibtex
@misc{pothana-base-300M,
  title={Pothana Base 300M: A Telugu Language Model},
  author={Dvitva AI},
  year={2025},
  url={https://huggingface.co/dvitvaai/pothana-base-300M}
}
```