dvitvaai/pothana-base-300M

@@ -12,14 +12,19 @@ library_name: transformers
 pipeline_tag: text-generation
 ---
-# Telugu LLaMA (345M)
 A **345M parameter** LLaMA-style language model trained **from scratch** on Telugu text.
 ## Model Details
 | | |
 |---|---|
 | **Architecture** | LLaMA (RoPE + SwiGLU + RMSNorm) |
 | **Parameters** | 345M |
 | **Hidden size** | 1024 |
@@ -30,27 +35,30 @@ A **345M parameter** LLaMA-style language model trained **from scratch** on Telu
 | **Vocab size** | 86,071 |
 | **Tokenizer** | Morfessor + BPE (Telugu morpheme-aware) |
 | **Training** | Single GPU, bf16 mixed precision |
-## Tokenizer
-This model uses a **Morfessor + BPE hybrid tokenizer** designed for Telugu:
-- **Telugu text**: Segmented into morphemes using [Morfessor](https://github.com/aalto-speech/morfessor) with `@@` continuation markers
-- **Non-Telugu text** (English, numbers, URLs): Handled by BPE subword encoding
-- **Fallback**: Character-level encoding for out-of-vocabulary tokens
-**Important**: The tokenizer expects **pre-segmented** input (with `@@` markers). For raw Telugu text, you need to run Morfessor segmentation first.
-## Usage
-### Basic usage (with pre-segmented text)
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
-model = AutoModelForCausalLM.from_pretrained("YOUR_USERNAME/telugu-llama-345m")
-tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/telugu-llama-345m", trust_remote_code=True)
 # Input must be Morfessor-segmented (with @@ continuation markers)
 segmented_text = "తెలుగు భాష చాలా అందమైన@@ ది"
@@ -68,6 +76,16 @@ with torch.no_grad():
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 ### Full pipeline (raw Telugu text)
 For raw Telugu text, segment with Morfessor first:
@@ -119,3 +137,16 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ## License
 Apache 2.0

 pipeline_tag: text-generation
 ---
+# Pothana Base 300M
 A **345M parameter** LLaMA-style language model trained **from scratch** on Telugu text.
+Named after [Bammera Pothana](https://en.wikipedia.org/wiki/Bammera_Pothana), the celebrated 15th-century Telugu poet who authored the *Andhra Maha Bhagavatamu*.
+Developed by **[Dvitva AI](https://dvitva.ai)**.
 ## Model Details
 | | |
 |---|---|
+| **Model** | pothana-base-300M |
 | **Architecture** | LLaMA (RoPE + SwiGLU + RMSNorm) |
 | **Parameters** | 345M |
 | **Hidden size** | 1024 |
 | **Vocab size** | 86,071 |
 | **Tokenizer** | Morfessor + BPE (Telugu morpheme-aware) |
 | **Training** | Single GPU, bf16 mixed precision |
+| **Developed by** | [Dvitva AI](https://dvitva.ai) |
+## Quick Start
+### Using pipeline
+```python
+from transformers import pipeline
+pipe = pipeline("text-generation", model="dvitvaai/pothana-base-300M", trust_remote_code=True)
+result = pipe("తెలుగు భాష", max_new_tokens=50, do_sample=True, temperature=0.8)
+print(result[0]["generated_text"])
+```
+> **Note**: `trust_remote_code=True` is required to load the custom tokenizer class, which joins morphemes at `@@` markers when decoding. Without it, `@@` markers will appear in the output.
+### Manual loading
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
+model = AutoModelForCausalLM.from_pretrained("dvitvaai/pothana-base-300M")
+tokenizer = AutoTokenizer.from_pretrained("dvitvaai/pothana-base-300M", trust_remote_code=True)
 # Input must be Morfessor-segmented (with @@ continuation markers)
 segmented_text = "తెలుగు భాష చాలా అందమైన@@ ది"
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
+## Tokenizer
+This model uses a **Morfessor + BPE hybrid tokenizer** designed for Telugu:
+- **Telugu text**: Segmented into morphemes using [Morfessor](https://github.com/aalto-speech/morfessor) with `@@` continuation markers
+- **Non-Telugu text** (English, numbers, URLs): Handled by BPE subword encoding
+- **Fallback**: Character-level encoding for out-of-vocabulary tokens
+**Important**: The tokenizer expects **pre-segmented** input (with `@@` markers). For raw Telugu text, you need to run Morfessor segmentation first.
 ### Full pipeline (raw Telugu text)
 For raw Telugu text, segment with Morfessor first:
 ## License
 Apache 2.0
+## Citation
+If you use this model, please cite:
+```
+@misc{pothana-base-300M,
+  title={Pothana Base 300M: A Telugu Language Model},
+  author={Dvitva AI},
+  year={2025},
+  url={https://huggingface.co/dvitvaai/pothana-base-300M}
+}
+```

tokenizer_class.py CHANGED

@@ -15,7 +15,6 @@ class TeluguTokenizer(PreTrainedTokenizerFast):
         # Strip @@ continuation markers:
         # "@@ " between tokens means "join to next token" (no space)
         text = text.replace("@@ ", "")
-        # Handle trailing @@ on last token (edge case)
-        if text.endswith("@@"):
-            text = text[:-2]
         return text

         # Strip @@ continuation markers:
         # "@@ " between tokens means "join to next token" (no space)
         text = text.replace("@@ ", "")
+        # Handle remaining @@ (before punctuation, end of string, etc.)
+        text = text.replace("@@", "")
         return text
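
The behavioural difference in the `tokenizer_class.py` hunk can be checked in isolation. The sketch below is a minimal illustration, not the repository's code: `decode_old` and `decode_new` are hypothetical standalone functions that copy only the string operations shown in the diff, and the sample strings are made up for the comparison. The rest of `TeluguTokenizer` is not shown in the diff and is not modelled here.

```python
def decode_old(text: str) -> str:
    """Pre-change logic from the diff: join on "@@ ", then trim one trailing "@@"."""
    text = text.replace("@@ ", "")
    if text.endswith("@@"):
        text = text[:-2]
    return text


def decode_new(text: str) -> str:
    """Post-change logic from the diff: join on "@@ ", then drop any leftover "@@"."""
    text = text.replace("@@ ", "")
    text = text.replace("@@", "")
    return text


# Illustrative inputs (assumed, not taken from the repository's tests).
stem = "అందమైన"
samples = [
    stem + "@@ ది",  # marker followed by a space: both versions join the morphemes
    stem + "@@",     # marker at end of string: both versions strip it
    stem + "@@.",    # marker directly before punctuation: only the new version strips it
]

results = [(decode_old(s), decode_new(s)) for s in samples]
for sample, (old, new) in zip(samples, results):
    print(repr(sample), "->", repr(old), "|", repr(new))
```

Under these assumptions the two versions agree on the first two inputs and differ only on the third, where the old logic leaves the `@@` in place because the marker is neither followed by a space nor at the end of the string.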