Pothana Base 300M

A 345M-parameter LLaMA-style language model trained from scratch on Telugu text.

Named after Bammera Pothana, the celebrated 15th-century Telugu poet who authored the Andhra Maha Bhagavatamu.

Developed by Dvitva AI.

Model Details

Model: pothana-base-300M
Architecture: LLaMA (RoPE + SwiGLU + RMSNorm)
Parameters: 345M
Hidden size: 1024
Layers: 20
Attention heads: 16
Intermediate size: 2816
Context length: 2048
Vocab size: 86,071
Tokenizer: Morfessor + BPE (Telugu morpheme-aware)
Training: Single GPU, bf16 mixed precision
Developed by: Dvitva AI
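The 345M figure can be sanity-checked from the hyperparameters above. This is a rough count, assuming tied input/output embeddings and no biases (typical for LLaMA-style models, but an assumption here, not something the card states):

```python
# Rough parameter count for a LLaMA-style model from the card's hyperparameters.
vocab, hidden, layers, intermediate = 86071, 1024, 20, 2816

embed = vocab * hidden           # token embeddings (assumed tied with lm_head)
attn = 4 * hidden * hidden       # q, k, v, o projections per layer
mlp = 3 * hidden * intermediate  # SwiGLU: gate, up, and down projections
norms = 2 * hidden               # two RMSNorm weight vectors per layer

total = embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm
print(f"{total:,}")  # 345,079,808 ≈ 345M
```

The count only lands on 345M with tied embeddings; with a separate lm_head it would be roughly 433M.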

Quick Start

Using pipeline

from transformers import pipeline

pipe = pipeline("text-generation", model="dvitvaai/pothana-base-300M", trust_remote_code=True)
result = pipe("తెలుగు భాష", max_new_tokens=50, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])

Note: trust_remote_code=True is required for the custom tokenizer that handles @@ morpheme joining. Without it, @@ markers will appear in the output.
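The joining the custom tokenizer performs follows a simple convention: a morpheme carrying the @@ continuation marker is glued to the piece that follows it. A minimal sketch of that convention (for illustration only, not the tokenizer's actual implementation):

```python
def join_morphemes(text: str, separator: str = "@@") -> str:
    """Collapse continuation markers: a segment ending in '@@' joins the next one."""
    return text.replace(separator + " ", "")

print(join_morphemes("తెలుగు భాష చాలా అందమైన@@ ది"))
# -> తెలుగు భాష చాలా అందమైనది
```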

Manual loading

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("dvitvaai/pothana-base-300M")
tokenizer = AutoTokenizer.from_pretrained("dvitvaai/pothana-base-300M", trust_remote_code=True)

# Input must be Morfessor-segmented (with @@ continuation markers)
segmented_text = "తెలుగు భాష చాలా అందమైన@@ ది"
inputs = tokenizer(segmented_text, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Tokenizer

This model uses a Morfessor + BPE hybrid tokenizer designed for Telugu:

  • Telugu text: Segmented into morphemes using Morfessor with @@ continuation markers
  • Non-Telugu text (English, numbers, URLs): Handled by BPE subword encoding
  • Fallback: Character-level encoding for out-of-vocabulary tokens

Important: The tokenizer expects pre-segmented input (with @@ markers). For raw Telugu text, you need to run Morfessor segmentation first.
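The routing between the three paths can be sketched as follows. The helper name and return labels are hypothetical (the real routing logic ships with the model via trust_remote_code); only the Telugu Unicode range check mirrors what the card describes:

```python
import re

TELUGU_RE = re.compile(r"[\u0C00-\u0C7F]")  # Telugu Unicode block

def encoding_path(token: str) -> str:
    """Illustrative: decide which path a whitespace-split token would take."""
    if TELUGU_RE.search(token):
        return "morfessor"  # Telugu script -> morpheme segments with @@ markers
    return "bpe"            # English, numbers, URLs -> BPE subwords
    # (character-level fallback applies only to out-of-vocabulary pieces)

print(encoding_path("తెలుగు"), encoding_path("hello"), encoding_path("2024"))
# -> morfessor bpe bpe
```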

Full pipeline (raw Telugu text)

For raw Telugu text, segment with Morfessor first:

import morfessor

# Load Morfessor model
io = morfessor.MorfessorIO()
morf_model = io.read_binary_model_file("morfessor_telugu.bin")

import re

TELUGU_RE = re.compile(r"[\u0C00-\u0C7F]+")  # Telugu Unicode block

def segment_telugu(text, separator="@@"):
    """Morfessor-segment Telugu words; pass non-Telugu tokens through unchanged."""
    tokens = []
    for word in text.split():
        if TELUGU_RE.fullmatch(word):
            # Split into morphemes; mark all but the last with the separator
            segments = morf_model.viterbi_segment(word)[0]
            for i, seg in enumerate(segments):
                tokens.append(seg + separator if i < len(segments) - 1 else seg)
        else:
            tokens.append(word)
    return " ".join(tokens)

# Segment, then tokenize and generate
raw_text = "తెలుగు భాష చాలా అందమైనది"
segmented = segment_telugu(raw_text)
inputs = tokenizer(segmented, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training

  • Data: Telugu text corpus (Sangraha dataset)
  • Preprocessing: Morfessor morpheme segmentation + BPE for non-Telugu
  • Optimizer: AdamW (lr=3e-4, weight_decay=0.1, beta1=0.9, beta2=0.95)
  • Schedule: Cosine LR decay with 500-step warmup
  • Precision: bf16 mixed precision
  • Hardware: Single GPU
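The schedule above (cosine decay with a 500-step linear warmup and a 3e-4 peak) can be written out as a small function. The minimum learning rate is an assumption here (set to 10% of peak, a common choice; the card does not state it):

```python
import math

def lr_at(step: int, max_steps: int, peak_lr: float = 3e-4,
          warmup: int = 500, min_ratio: float = 0.1) -> float:
    """Cosine learning-rate decay with linear warmup."""
    if step < warmup:
        return peak_lr * step / warmup  # linear ramp from 0 to peak
    progress = (step - warmup) / max(1, max_steps - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # 1 -> 0 over training
    return peak_lr * (min_ratio + (1 - min_ratio) * cosine)

print(lr_at(250, 10_000))     # mid-warmup: 1.5e-4
print(lr_at(10_000, 10_000))  # end of training: 3e-5 (the assumed floor)
```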

Limitations

  • This is a base model (not instruction-tuned) — it performs text completion, not instruction following
  • The tokenizer requires Morfessor-segmented input for best results
  • Trained primarily on Telugu text; limited multilingual capability
  • Small model size (345M) limits reasoning and knowledge capacity

License

Apache 2.0

Citation

If you use this model, please cite:

@misc{pothana-base-300M,
  title={Pothana Base 300M: A Telugu Language Model},
  author={Dvitva AI},
  year={2025},
  url={https://huggingface.co/dvitvaai/pothana-base-300M}
}