neshkatrapati committed commit bc96c4b (verified, parent a373109): Upload folder using huggingface_hub
README.md ADDED
---
language:
- te
license: apache-2.0
tags:
- telugu
- llama
- causal-lm
- morfessor
- from-scratch
library_name: transformers
pipeline_tag: text-generation
---

# Pothana Base 300M

A **345M parameter** LLaMA-style language model trained **from scratch** on Telugu text.

Named after [Bammera Pothana](https://en.wikipedia.org/wiki/Bammera_Pothana), the celebrated 15th-century Telugu poet who authored the *Andhra Maha Bhagavatamu*.

Developed by **[Dvitva AI](https://dvitva.ai)**.

## Model Details

| | |
|---|---|
| **Model** | pothana-base-300M |
| **Architecture** | LLaMA (RoPE + SwiGLU + RMSNorm) |
| **Parameters** | 345M |
| **Hidden size** | 1024 |
| **Layers** | 20 |
| **Attention heads** | 16 |
| **Intermediate size** | 2816 |
| **Context length** | 2048 |
| **Vocab size** | 86,075 |
| **Tokenizer** | Morfessor + BPE (Telugu morpheme-aware) |
| **Training** | Single GPU, bf16 mixed precision |
| **Developed by** | [Dvitva AI](https://dvitva.ai) |

## Quick Start

### Using pipeline

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="dvitvaai/pothana-base-300M", trust_remote_code=True)
result = pipe("తెలుగు భాష", max_new_tokens=50, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])
```

> **Note**: `trust_remote_code=True` is required for the custom tokenizer that handles `@@` morpheme joining. Without it, `@@` markers will appear in the output.

### Manual loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("dvitvaai/pothana-base-300M")
tokenizer = AutoTokenizer.from_pretrained("dvitvaai/pothana-base-300M", trust_remote_code=True)

# Input must be Morfessor-segmented (with @@ continuation markers)
segmented_text = "తెలుగు భాష చాలా అందమైన@@ ది"
inputs = tokenizer(segmented_text, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Tokenizer

This model uses a **Morfessor + BPE hybrid tokenizer** designed for Telugu:

- **Telugu text**: Segmented into morphemes using [Morfessor](https://github.com/aalto-speech/morfessor) with `@@` continuation markers
- **Non-Telugu text** (English, numbers, URLs): Handled by BPE subword encoding
- **Fallback**: Character-level encoding for out-of-vocabulary tokens

**Important**: The tokenizer expects **pre-segmented** input (with `@@` markers). For raw Telugu text, you need to run Morfessor segmentation first.
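The `@@` joining convention can be illustrated in plain Python (a minimal string-level sketch; the segmentation shown is just an example, real splits come from the Morfessor model):

```python
# Minimal sketch of the @@ continuation-marker convention.
# A token ending in @@ joins onto the token that follows it;
# all other token boundaries become ordinary spaces.
segments = ["తెలుగు", "భాష", "చాలా", "అందమైన@@", "ది"]

def join_morphemes(tokens):
    text = " ".join(tokens)
    # "@@ " means "glue to the next piece": drop the marker and the space
    return text.replace("@@ ", "").replace("@@", "")

print(join_morphemes(segments))  # తెలుగు భాష చాలా అందమైనది
```

This is the same rule the repo's custom `decode()` applies, which is why skipping `trust_remote_code=True` leaves raw `@@` markers in generated text.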
88
+
89
+ ### Full pipeline (raw Telugu text)
90
+
91
+ For raw Telugu text, segment with Morfessor first:
92
+
93
+ ```python
94
+ import morfessor
95
+
96
+ # Load Morfessor model
97
+ io = morfessor.MorfessorIO()
98
+ morf_model = io.read_binary_model_file("morfessor_telugu.bin")
99
+
100
+ def segment_telugu(text, separator="@@"):
101
+ import re
102
+ TELUGU_RE = re.compile(r"[\u0C00-\u0C7F]+")
103
+ tokens = []
104
+ for word in text.split():
105
+ if TELUGU_RE.fullmatch(word):
106
+ segments = morf_model.viterbi_segment(word)[0]
107
+ for i, seg in enumerate(segments):
108
+ tokens.append(seg + separator if i < len(segments) - 1 else seg)
109
+ else:
110
+ tokens.append(word)
111
+ return " ".join(tokens)
112
+
113
+ # Segment, then tokenize and generate
114
+ raw_text = "తెలుగు భాష చాలా అందమైనది"
115
+ segmented = segment_telugu(raw_text)
116
+ inputs = tokenizer(segmented, return_tensors="pt")
117
+ outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)
118
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
119
+ ```
120
+
121
+ ## Training
122
+
123
+ - **Data**: Telugu text corpus (Sangraha dataset)
124
+ - **Preprocessing**: Morfessor morpheme segmentation + BPE for non-Telugu
125
+ - **Optimizer**: AdamW (lr=3e-4, weight_decay=0.1, beta1=0.9, beta2=0.95)
126
+ - **Schedule**: Cosine LR decay with 500-step warmup
127
+ - **Precision**: bf16 mixed precision
128
+ - **Hardware**: Single GPU
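The schedule above can be sketched as a pure function of the step index (a minimal illustration; the peak LR of 3e-4 and 500-step warmup come from the bullets, while `total_steps` and the zero decay floor are assumptions, the actual training code may differ):

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=500, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    min_lr=0.0 is an assumption; many recipes decay to ~10% of peak instead.
    """
    if step < warmup_steps:
        # Linear ramp over the first warmup_steps updates
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, with `total_steps=10000` the LR reaches 3e-4 at step 499, is back to half the peak midway through the decay, and hits the floor at the final step.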

## Limitations

- This is a **base model** (not instruction-tuned) — it performs text completion, not instruction following
- The tokenizer requires **Morfessor-segmented input** for best results
- Trained primarily on Telugu text; limited multilingual capability
- Small model size (345M) limits reasoning and knowledge capacity

## License

Apache 2.0

## Citation

If you use this model, please cite:

```bibtex
@misc{pothana-base-300M,
  title={Pothana Base 300M: A Telugu Language Model},
  author={Dvitva AI},
  year={2025},
  url={https://huggingface.co/dvitvaai/pothana-base-300M}
}
```
config.json ADDED
```json
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "model_type": "llama",
  "torch_dtype": "float32",
  "hidden_size": 1024,
  "intermediate_size": 2816,
  "num_hidden_layers": 20,
  "num_attention_heads": 16,
  "num_key_value_heads": 16,
  "head_dim": 64,
  "max_position_embeddings": 2048,
  "rope_theta": 10000.0,
  "rope_scaling": null,
  "rms_norm_eps": 1e-06,
  "hidden_act": "silu",
  "attention_bias": false,
  "mlp_bias": false,
  "vocab_size": 86075,
  "tie_word_embeddings": true,
  "pad_token_id": 0,
  "bos_token_id": 2,
  "eos_token_id": 3,
  "attention_dropout": 0.0,
  "initializer_range": 0.02,
  "pretraining_tp": 1,
  "use_cache": true,
  "transformers_version": "4.40.0"
}
```
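As a sanity check, the model card's 345M figure can be reproduced from these hyperparameters (a sketch assuming the standard LLaMA layout: tied input/output embeddings per `tie_word_embeddings`, no attention/MLP biases, and two RMSNorms per layer plus a final one):

```python
# Rough parameter count from the config above.
vocab, hidden, inter, layers = 86075, 1024, 2816, 20

embeddings = vocab * hidden       # shared with the LM head (tied embeddings)
attention  = 4 * hidden * hidden  # q, k, v, o projections, no bias
mlp        = 3 * hidden * inter   # gate, up, down projections (SwiGLU)
norms      = 2 * hidden           # two RMSNorm weight vectors per layer
per_layer  = attention + mlp + norms

total = embeddings + layers * per_layer + hidden  # + final RMSNorm
print(f"{total / 1e6:.1f}M")  # 345.1M
```

At 4 bytes per parameter (`torch_dtype: float32`) this also matches the ~1.38 GB `model.safetensors` file below.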
generation_config.json ADDED
```json
{
  "_from_model_config": true,
  "bos_token_id": 2,
  "eos_token_id": 3,
  "pad_token_id": 0,
  "do_sample": true,
  "temperature": 0.8,
  "top_k": 50,
  "top_p": 0.95,
  "max_new_tokens": 200,
  "repetition_penalty": 1.1,
  "transformers_version": "4.40.0"
}
```
model.safetensors ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:236a8a7692f176c516db8a5c7448795000e1677de1c2798cb75c7d37aa6bee1f
size 1380356280
```
morfessor_telugu.bin ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:4bd3d98666025b6ad481f92c4e28d4a0b1fe6cdc8f268db6d11cd55367094b11
size 8652172
```
special_tokens_map.json ADDED
```json
{
  "bos_token": "<bos>",
  "eos_token": "<eos>",
  "unk_token": "<unk>",
  "pad_token": "<pad>"
}
```
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_class.py ADDED
```python
"""Custom Telugu tokenizer that handles @@ continuation marker stripping."""
from transformers import PreTrainedTokenizerFast


class TeluguTokenizer(PreTrainedTokenizerFast):
    """Telugu tokenizer with Morfessor @@ continuation marker support.

    Tokens ending with @@ are continuation pieces that join to the next token.
    This class overrides decode() to strip @@ markers and join morphemes:
    "రెడ్డి@@ గారు" → "రెడ్డిగారు"
    """

    def decode(self, token_ids, skip_special_tokens=False, **kwargs):
        text = super().decode(token_ids, skip_special_tokens=skip_special_tokens, **kwargs)
        # Strip @@ continuation markers:
        # "@@ " between tokens means "join to next token" (no space)
        text = text.replace("@@ ", "")
        # Handle remaining @@ (before punctuation, end of string, etc.)
        text = text.replace("@@", "")
        return text
```
tokenizer_config.json ADDED
```json
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "auto_map": {
    "AutoTokenizer": [
      null,
      "tokenizer_class.TeluguTokenizer"
    ]
  },
  "model_type": "llama",
  "bos_token": "<bos>",
  "eos_token": "<eos>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "add_bos_token": true,
  "add_eos_token": false,
  "clean_up_tokenization_spaces": false,
  "model_max_length": 2048,
  "extra_info": {
    "type": "morfessor_bpe_telugu",
    "separator": "@@",
    "note": "This tokenizer expects Morfessor-segmented text as input. For raw Telugu text, run Morfessor segmentation first using the included morfessor_telugu.bin model. Tokens ending with '@@' are continuation pieces that join to the next token. The decoder handles @@ removal automatically."
  }
}
```