neshkatrapati committed commit bc96c4b (verified, parent a373109): Upload folder using huggingface_hub
README.md ADDED
---
language:
- te
license: apache-2.0
tags:
- telugu
- llama
- causal-lm
- morfessor
- from-scratch
library_name: transformers
pipeline_tag: text-generation
---

# Pothana Base 300M

A **345M parameter** LLaMA-style language model trained **from scratch** on Telugu text.

Named after [Bammera Pothana](https://en.wikipedia.org/wiki/Bammera_Pothana), the celebrated 15th-century Telugu poet who authored the *Andhra Maha Bhagavatamu*.

Developed by **[Dvitva AI](https://dvitva.ai)**.

## Model Details

| | |
|---|---|
| **Model** | pothana-base-300M |
| **Architecture** | LLaMA (RoPE + SwiGLU + RMSNorm) |
| **Parameters** | 345M |
| **Hidden size** | 1024 |
| **Layers** | 20 |
| **Attention heads** | 16 |
| **Intermediate size** | 2816 |
| **Context length** | 2048 |
| **Vocab size** | 86,075 |
| **Tokenizer** | Morfessor + BPE (Telugu morpheme-aware) |
| **Training** | Single GPU, bf16 mixed precision |
| **Developed by** | [Dvitva AI](https://dvitva.ai) |

## Quick Start

### Using pipeline

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="dvitvaai/pothana-base-300M", trust_remote_code=True)
result = pipe("తెలుగు భాష", max_new_tokens=50, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])
```

> **Note**: `trust_remote_code=True` is required for the custom tokenizer that handles `@@` morpheme joining. Without it, `@@` markers will appear in the output.

### Manual loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("dvitvaai/pothana-base-300M")
tokenizer = AutoTokenizer.from_pretrained("dvitvaai/pothana-base-300M", trust_remote_code=True)

# Input must be Morfessor-segmented (with @@ continuation markers)
segmented_text = "తెలుగు భాష చాలా అందమైన@@ ది"
inputs = tokenizer(segmented_text, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Tokenizer

This model uses a **Morfessor + BPE hybrid tokenizer** designed for Telugu:

- **Telugu text**: Segmented into morphemes using [Morfessor](https://github.com/aalto-speech/morfessor) with `@@` continuation markers
- **Non-Telugu text** (English, numbers, URLs): Handled by BPE subword encoding
- **Fallback**: Character-level encoding for out-of-vocabulary tokens

**Important**: The tokenizer expects **pre-segmented** input (with `@@` markers). For raw Telugu text, you need to run Morfessor segmentation first.
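The `@@` joining convention can be illustrated in plain Python (a minimal string-level sketch; the segmentation shown is just an example, real splits come from the Morfessor model):

```python
# Minimal sketch of the @@ continuation-marker convention.
# A token ending in @@ joins onto the token that follows it;
# all other token boundaries become ordinary spaces.
segments = ["తెలుగు", "భాష", "చాలా", "అందమైన@@", "ది"]

def join_morphemes(tokens):
    text = " ".join(tokens)
    # "@@ " means "glue to the next piece": drop the marker and the space
    return text.replace("@@ ", "").replace("@@", "")

print(join_morphemes(segments))  # తెలుగు భాష చాలా అందమైనది
```

This is the same rule the repo's custom `decode()` applies, which is why skipping `trust_remote_code=True` leaves raw `@@` markers in generated text.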
88
+
89
+ ### Full pipeline (raw Telugu text)
90
+
91
+ For raw Telugu text, segment with Morfessor first:
92
+
93
+ ```python
94
+ import morfessor
95
+
96
+ # Load Morfessor model
97
+ io = morfessor.MorfessorIO()
98
+ morf_model = io.read_binary_model_file("morfessor_telugu.bin")
99
+
100
+ def segment_telugu(text, separator="@@"):
101
+ import re
102
+ TELUGU_RE = re.compile(r"[\u0C00-\u0C7F]+")
103
+ tokens = []
104
+ for word in text.split():
105
+ if TELUGU_RE.fullmatch(word):
106
+ segments = morf_model.viterbi_segment(word)[0]
107
+ for i, seg in enumerate(segments):
108
+ tokens.append(seg + separator if i < len(segments) - 1 else seg)
109
+ else:
110
+ tokens.append(word)
111
+ return " ".join(tokens)
112
+
113
+ # Segment, then tokenize and generate
114
+ raw_text = "తెలుగు భాష చాలా అందమైనది"
115
+ segmented = segment_telugu(raw_text)
116
+ inputs = tokenizer(segmented, return_tensors="pt")
117
+ outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)
118
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
119
+ ```
120
+
121
+ ## Training
122
+
123
+ - **Data**: Telugu text corpus (Sangraha dataset)
124
+ - **Preprocessing**: Morfessor morpheme segmentation + BPE for non-Telugu
125
+ - **Optimizer**: AdamW (lr=3e-4, weight_decay=0.1, beta1=0.9, beta2=0.95)
126
+ - **Schedule**: Cosine LR decay with 500-step warmup
127
+ - **Precision**: bf16 mixed precision
128
+ - **Hardware**: Single GPU
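The schedule above can be sketched as a pure function of the step index (a minimal illustration; the peak LR of 3e-4 and 500-step warmup come from the bullets, while `total_steps` and the zero decay floor are assumptions, the actual training code may differ):

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=500, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    min_lr=0.0 is an assumption; many recipes decay to ~10% of peak instead.
    """
    if step < warmup_steps:
        # Linear ramp over the first warmup_steps updates
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, with `total_steps=10000` the LR reaches 3e-4 at step 499, is back to half the peak midway through the decay, and hits the floor at the final step.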

## Limitations

- This is a **base model** (not instruction-tuned) — it performs text completion, not instruction following
- The tokenizer requires **Morfessor-segmented input** for best results
- Trained primarily on Telugu text; limited multilingual capability
- Small model size (345M) limits reasoning and knowledge capacity

## License

Apache 2.0

## Citation

If you use this model, please cite:

```bibtex
@misc{pothana-base-300M,
  title={Pothana Base 300M: A Telugu Language Model},
  author={Dvitva AI},
  year={2025},
  url={https://huggingface.co/dvitvaai/pothana-base-300M}
}
```
config.json ADDED
```json
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "model_type": "llama",
  "torch_dtype": "float32",
  "hidden_size": 1024,
  "intermediate_size": 2816,
  "num_hidden_layers": 20,
  "num_attention_heads": 16,
  "num_key_value_heads": 16,
  "head_dim": 64,
  "max_position_embeddings": 2048,
  "rope_theta": 10000.0,
  "rope_scaling": null,
  "rms_norm_eps": 1e-06,
  "hidden_act": "silu",
  "attention_bias": false,
  "mlp_bias": false,
  "vocab_size": 86075,
  "tie_word_embeddings": true,
  "pad_token_id": 0,
  "bos_token_id": 2,
  "eos_token_id": 3,
  "attention_dropout": 0.0,
  "initializer_range": 0.02,
  "pretraining_tp": 1,
  "use_cache": true,
  "transformers_version": "4.40.0"
}
```
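As a sanity check, the model card's 345M figure can be reproduced from these hyperparameters (a sketch assuming the standard LLaMA layout: tied input/output embeddings per `tie_word_embeddings`, no attention/MLP biases, and two RMSNorms per layer plus a final one):

```python
# Rough parameter count from the config above.
vocab, hidden, inter, layers = 86075, 1024, 2816, 20

embeddings = vocab * hidden       # shared with the LM head (tied embeddings)
attention  = 4 * hidden * hidden  # q, k, v, o projections, no bias
mlp        = 3 * hidden * inter   # gate, up, down projections (SwiGLU)
norms      = 2 * hidden           # two RMSNorm weight vectors per layer
per_layer  = attention + mlp + norms

total = embeddings + layers * per_layer + hidden  # + final RMSNorm
print(f"{total / 1e6:.1f}M")  # 345.1M
```

At 4 bytes per parameter (`torch_dtype: float32`) this also matches the ~1.38 GB `model.safetensors` file below.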
generation_config.json ADDED
```json
{
  "_from_model_config": true,
  "bos_token_id": 2,
  "eos_token_id": 3,
  "pad_token_id": 0,
  "do_sample": true,
  "temperature": 0.8,
  "top_k": 50,
  "top_p": 0.95,
  "max_new_tokens": 200,
  "repetition_penalty": 1.1,
  "transformers_version": "4.40.0"
}
```
model.safetensors ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:236a8a7692f176c516db8a5c7448795000e1677de1c2798cb75c7d37aa6bee1f
size 1380356280
```
morfessor_telugu.bin ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:4bd3d98666025b6ad481f92c4e28d4a0b1fe6cdc8f268db6d11cd55367094b11
size 8652172
```
special_tokens_map.json ADDED
```json
{
  "bos_token": "<bos>",
  "eos_token": "<eos>",
  "unk_token": "<unk>",
  "pad_token": "<pad>"
}
```
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_class.py ADDED
```python
"""Custom Telugu tokenizer that handles @@ continuation marker stripping."""
from transformers import PreTrainedTokenizerFast


class TeluguTokenizer(PreTrainedTokenizerFast):
    """Telugu tokenizer with Morfessor @@ continuation marker support.

    Tokens ending with @@ are continuation pieces that join to the next token.
    This class overrides decode() to strip @@ markers and join morphemes:
    "రెడ్డి@@ గారు" → "రెడ్డిగారు"
    """

    def decode(self, token_ids, skip_special_tokens=False, **kwargs):
        text = super().decode(token_ids, skip_special_tokens=skip_special_tokens, **kwargs)
        # Strip @@ continuation markers:
        # "@@ " between tokens means "join to next token" (no space)
        text = text.replace("@@ ", "")
        # Handle remaining @@ (before punctuation, end of string, etc.)
        text = text.replace("@@", "")
        return text
```
tokenizer_config.json ADDED
```json
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "auto_map": {
    "AutoTokenizer": [
      null,
      "tokenizer_class.TeluguTokenizer"
    ]
  },
  "model_type": "llama",
  "bos_token": "<bos>",
  "eos_token": "<eos>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "add_bos_token": true,
  "add_eos_token": false,
  "clean_up_tokenization_spaces": false,
  "model_max_length": 2048,
  "extra_info": {
    "type": "morfessor_bpe_telugu",
    "separator": "@@",
    "note": "This tokenizer expects Morfessor-segmented text as input. For raw Telugu text, run Morfessor segmentation first using the included morfessor_telugu.bin model. Tokens ending with '@@' are continuation pieces that join to the next token. The decoder handles @@ removal automatically."
  }
}
```