neshkatrapati committed on
Commit a6034dd · verified · 1 Parent(s): 715004d

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +42 -11
  2. tokenizer_class.py +2 -3
README.md CHANGED
@@ -12,14 +12,19 @@ library_name: transformers
  pipeline_tag: text-generation
  ---
  
- # Telugu LLaMA (345M)
+ # Pothana Base 300M
  
  A **345M parameter** LLaMA-style language model trained **from scratch** on Telugu text.
  
+ Named after [Bammera Pothana](https://en.wikipedia.org/wiki/Bammera_Pothana), the celebrated 15th-century Telugu poet who authored the *Andhra Maha Bhagavatamu*.
+
+ Developed by **[Dvitva AI](https://dvitva.ai)**.
+
  ## Model Details
  
  | | |
  |---|---|
+ | **Model** | pothana-base-300M |
  | **Architecture** | LLaMA (RoPE + SwiGLU + RMSNorm) |
  | **Parameters** | 345M |
  | **Hidden size** | 1024 |
@@ -30,27 +35,30 @@ A **345M parameter** LLaMA-style language model trained **from scratch** on Telu
  | **Vocab size** | 86,071 |
  | **Tokenizer** | Morfessor + BPE (Telugu morpheme-aware) |
  | **Training** | Single GPU, bf16 mixed precision |
+ | **Developed by** | [Dvitva AI](https://dvitva.ai) |
  
- ## Tokenizer
+ ## Quick Start
  
- This model uses a **Morfessor + BPE hybrid tokenizer** designed for Telugu:
+ ### Using pipeline
  
- - **Telugu text**: Segmented into morphemes using [Morfessor](https://github.com/aalto-speech/morfessor) with `@@` continuation markers
- - **Non-Telugu text** (English, numbers, URLs): Handled by BPE subword encoding
- - **Fallback**: Character-level encoding for out-of-vocabulary tokens
+ ```python
+ from transformers import pipeline
  
- **Important**: The tokenizer expects **pre-segmented** input (with `@@` markers). For raw Telugu text, you need to run Morfessor segmentation first.
+ pipe = pipeline("text-generation", model="dvitvaai/pothana-base-300M", trust_remote_code=True)
+ result = pipe("తెలుగు భాష", max_new_tokens=50, do_sample=True, temperature=0.8)
+ print(result[0]["generated_text"])
+ ```
  
- ## Usage
+ > **Note**: `trust_remote_code=True` is required for the custom tokenizer that handles `@@` morpheme joining. Without it, `@@` markers will appear in the output.
  
- ### Basic usage (with pre-segmented text)
+ ### Manual loading
  
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  import torch
  
- model = AutoModelForCausalLM.from_pretrained("YOUR_USERNAME/telugu-llama-345m")
- tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/telugu-llama-345m", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("dvitvaai/pothana-base-300M")
+ tokenizer = AutoTokenizer.from_pretrained("dvitvaai/pothana-base-300M", trust_remote_code=True)
  
  # Input must be Morfessor-segmented (with @@ continuation markers)
  segmented_text = "తెలుగు భాష చాలా అందమైన@@ ది"
@@ -68,6 +76,16 @@ with torch.no_grad():
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
  
+ ## Tokenizer
+
+ This model uses a **Morfessor + BPE hybrid tokenizer** designed for Telugu:
+
+ - **Telugu text**: Segmented into morphemes using [Morfessor](https://github.com/aalto-speech/morfessor) with `@@` continuation markers
+ - **Non-Telugu text** (English, numbers, URLs): Handled by BPE subword encoding
+ - **Fallback**: Character-level encoding for out-of-vocabulary tokens
+
+ **Important**: The tokenizer expects **pre-segmented** input (with `@@` markers). For raw Telugu text, you need to run Morfessor segmentation first.
+
  ### Full pipeline (raw Telugu text)
  
  For raw Telugu text, segment with Morfessor first:
@@ -119,3 +137,16 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ## License
  
  Apache 2.0
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```
+ @misc{pothana-base-300M,
+ title={Pothana Base 300M: A Telugu Language Model},
+ author={Dvitva AI},
+ year={2025},
+ url={https://huggingface.co/dvitvaai/pothana-base-300M}
+ }
+ ```
tokenizer_class.py CHANGED
@@ -15,7 +15,6 @@ class TeluguTokenizer(PreTrainedTokenizerFast):
          # Strip @@ continuation markers:
          # "@@ " between tokens means "join to next token" (no space)
          text = text.replace("@@ ", "")
-         # Handle trailing @@ on last token (edge case)
-         if text.endswith("@@"):
-             text = text[:-2]
+         # Handle remaining @@ (before punctuation, end of string, etc.)
+         text = text.replace("@@", "")
          return text
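
The behavioral difference in the `tokenizer_class.py` change can be seen with a small standalone sketch. The function names `strip_markers_old` and `strip_markers_new` are hypothetical; they only mirror the removed and added lines of the hunk above and are not the tokenizer's actual method, whose name and signature are not shown in this diff.

```python
def strip_markers_old(text: str) -> str:
    """Removed logic: join on "@@ " and trim only a trailing "@@"."""
    text = text.replace("@@ ", "")
    if text.endswith("@@"):
        text = text[:-2]
    return text


def strip_markers_new(text: str) -> str:
    """Added logic: join on "@@ ", then drop every remaining "@@"."""
    text = text.replace("@@ ", "")
    text = text.replace("@@", "")
    return text


samples = [
    "అందమైన@@ ది",    # marker between two tokens
    "అందమైన@@",       # marker at the end of the string
    "అందమైన@@, ది",   # marker directly before punctuation
]

for sample in samples:
    # Print both results side by side to compare the two versions.
    print(repr(strip_markers_old(sample)), repr(strip_markers_new(sample)))
```

Both versions agree on the first two samples. Only the third differs: the old logic leaves the `@@` before the comma in place, while the new logic removes it.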