neshkatrapati committed
Commit 715004d · verified · 1 Parent(s): a7f832e

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,199 +1,121 @@
 ---
+language:
+- te
+license: apache-2.0
+tags:
+- telugu
+- llama
+- causal-lm
+- morfessor
+- from-scratch
 library_name: transformers
-tags: []
+pipeline_tag: text-generation
 ---
 
-# Model Card for Model ID
-
-<!-- Provide a quick summary of what the model is/does. -->
-
+# Telugu LLaMA (345M)
 
+A **345M parameter** LLaMA-style language model trained **from scratch** on Telugu text.
 
 ## Model Details
 
-### Model Description
-
-<!-- Provide a longer summary of what this model is. -->
-
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-
-### Model Sources [optional]
-
-<!-- Provide the basic links for the model. -->
-
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-
-## Uses
-
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
-### Direct Use
-
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
-[More Information Needed]
-
-### Downstream Use [optional]
-
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
-[More Information Needed]
-
-### Out-of-Scope Use
-
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
-[More Information Needed]
-
-## Bias, Risks, and Limitations
-
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
-[More Information Needed]
-
-### Recommendations
-
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
-## How to Get Started with the Model
-
-Use the code below to get started with the model.
-
-[More Information Needed]
-
-## Training Details
-
-### Training Data
-
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
-[More Information Needed]
-
-### Training Procedure
-
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
-#### Preprocessing [optional]
-
-[More Information Needed]
-
-
-#### Training Hyperparameters
-
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
-#### Speeds, Sizes, Times [optional]
-
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
-[More Information Needed]
-
-## Evaluation
-
-<!-- This section describes the evaluation protocols and provides the results. -->
-
-### Testing Data, Factors & Metrics
-
-#### Testing Data
-
-<!-- This should link to a Dataset Card if possible. -->
-
-[More Information Needed]
-
-#### Factors
-
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
-[More Information Needed]
-
-#### Metrics
-
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
-[More Information Needed]
-
-### Results
-
-[More Information Needed]
-
-#### Summary
-
-
-
-## Model Examination [optional]
-
-<!-- Relevant interpretability work for the model goes here -->
-
-[More Information Needed]
-
-## Environmental Impact
-
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-
-## Technical Specifications [optional]
-
-### Model Architecture and Objective
-
-[More Information Needed]
-
-### Compute Infrastructure
-
-[More Information Needed]
-
-#### Hardware
-
-[More Information Needed]
-
-#### Software
-
-[More Information Needed]
-
-## Citation [optional]
-
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
-**BibTeX:**
-
-[More Information Needed]
-
-**APA:**
-
-[More Information Needed]
-
-## Glossary [optional]
-
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
-[More Information Needed]
-
-## More Information [optional]
-
-[More Information Needed]
-
-## Model Card Authors [optional]
-
-[More Information Needed]
-
-## Model Card Contact
-
-[More Information Needed]
+| | |
+|---|---|
+| **Architecture** | LLaMA (RoPE + SwiGLU + RMSNorm) |
+| **Parameters** | 345M |
+| **Hidden size** | 1024 |
+| **Layers** | 20 |
+| **Attention heads** | 16 |
+| **Intermediate size** | 2816 |
+| **Context length** | 2048 |
+| **Vocab size** | 86,071 |
+| **Tokenizer** | Morfessor + BPE (Telugu morpheme-aware) |
+| **Training** | Single GPU, bf16 mixed precision |
+
+## Tokenizer
+
+This model uses a **Morfessor + BPE hybrid tokenizer** designed for Telugu:
+
+- **Telugu text**: Segmented into morphemes using [Morfessor](https://github.com/aalto-speech/morfessor) with `@@` continuation markers
+- **Non-Telugu text** (English, numbers, URLs): Handled by BPE subword encoding
+- **Fallback**: Character-level encoding for out-of-vocabulary tokens
+
+**Important**: The tokenizer expects **pre-segmented** input (with `@@` markers). For raw Telugu text, run Morfessor segmentation first (see the full pipeline below).
+
+## Usage
+
+### Basic usage (with pre-segmented text)
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model = AutoModelForCausalLM.from_pretrained("YOUR_USERNAME/telugu-llama-345m")
+tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/telugu-llama-345m", trust_remote_code=True)
+
+# Input must be Morfessor-segmented (with @@ continuation markers)
+segmented_text = "తెలుగు భాష చాలా అందమైన@@ ది"
+inputs = tokenizer(segmented_text, return_tensors="pt")
+
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=100,
+        temperature=0.8,
+        top_k=50,
+        do_sample=True,
+    )
+
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+### Full pipeline (raw Telugu text)
+
+For raw Telugu text, segment with Morfessor first:
+
+```python
+import morfessor
+
+# Load the Morfessor model shipped with this repo
+io = morfessor.MorfessorIO()
+morf_model = io.read_binary_model_file("morfessor_telugu.bin")
+
+def segment_telugu(text, separator="@@"):
+    import re
+    TELUGU_RE = re.compile(r"[\u0C00-\u0C7F]+")
+    tokens = []
+    for word in text.split():
+        if TELUGU_RE.fullmatch(word):
+            segments = morf_model.viterbi_segment(word)[0]
+            for i, seg in enumerate(segments):
+                tokens.append(seg + separator if i < len(segments) - 1 else seg)
+        else:
+            tokens.append(word)
+    return " ".join(tokens)
+
+# Segment, then tokenize and generate
+raw_text = "తెలుగు భాష చాలా అందమైనది"
+segmented = segment_telugu(raw_text)
+inputs = tokenizer(segmented, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+## Training
+
+- **Data**: Telugu text corpus (Sangraha dataset)
+- **Preprocessing**: Morfessor morpheme segmentation + BPE for non-Telugu
+- **Optimizer**: AdamW (lr=3e-4, weight_decay=0.1, beta1=0.9, beta2=0.95)
+- **Schedule**: Cosine LR decay with 500-step warmup
+- **Precision**: bf16 mixed precision
+- **Hardware**: Single GPU
+
+## Limitations
+
+- This is a **base model** (not instruction-tuned): it performs text completion, not instruction following
+- The tokenizer requires **Morfessor-segmented input** for best results
+- Trained primarily on Telugu text; limited multilingual capability
+- Small model size (345M) limits reasoning and knowledge capacity
+
+## License
+
+Apache 2.0
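
The Training section above lists the optimizer and schedule, but the training script itself is not part of this commit. Below is a minimal sketch of that setup, assuming PyTorch, the `get_cosine_schedule_with_warmup` helper from `transformers`, and a `model` and `batch` already in scope; the total step count is a placeholder, not a value stated anywhere in the repo.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Hyperparameters as listed under "Training" in the README
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95)
)

# Cosine decay with a 500-step warmup; num_training_steps is hypothetical
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=100_000
)

# One step in bf16 mixed precision, as listed under "Precision"
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```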
config.json CHANGED
@@ -2,31 +2,29 @@
   "architectures": [
     "LlamaForCausalLM"
   ],
-  "attention_bias": false,
-  "attention_dropout": 0.0,
-  "bos_token_id": 2,
-  "dtype": "float32",
-  "eos_token_id": 3,
-  "head_dim": 64,
-  "hidden_act": "silu",
+  "model_type": "llama",
+  "torch_dtype": "float32",
   "hidden_size": 1024,
-  "initializer_range": 0.02,
   "intermediate_size": 2816,
-  "max_position_embeddings": 2048,
-  "mlp_bias": false,
-  "model_type": "llama",
-  "num_attention_heads": 16,
   "num_hidden_layers": 20,
+  "num_attention_heads": 16,
   "num_key_value_heads": 16,
-  "pad_token_id": 0,
-  "pretraining_tp": 1,
+  "head_dim": 64,
+  "max_position_embeddings": 2048,
+  "rope_theta": 10000.0,
+  "rope_scaling": null,
   "rms_norm_eps": 1e-06,
-  "rope_parameters": {
-    "rope_theta": 10000.0,
-    "rope_type": "default"
-  },
+  "hidden_act": "silu",
+  "attention_bias": false,
+  "mlp_bias": false,
+  "vocab_size": 86071,
   "tie_word_embeddings": true,
-  "transformers_version": "5.1.0",
+  "pad_token_id": 0,
+  "bos_token_id": 2,
+  "eos_token_id": 3,
+  "attention_dropout": 0.0,
+  "initializer_range": 0.02,
+  "pretraining_tp": 1,
   "use_cache": true,
-  "vocab_size": 86097
-}
+  "transformers_version": "4.40.0"
+}
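
As a sanity check, the shapes in this config reproduce the advertised 345M parameter count, with the tied embedding matrix counted once. A quick back-of-the-envelope calculation, assuming the standard LLaMA weight shapes (no biases, SwiGLU MLP):

```python
vocab, hidden, inter, layers = 86071, 1024, 2816, 20  # from config.json

embed = vocab * hidden          # token embeddings, tied with lm_head
attn = 4 * hidden * hidden      # q/k/v/o projections (16 heads x head_dim 64)
mlp = 3 * hidden * inter        # gate/up/down projections (SwiGLU)
norms = 2 * hidden              # input + post-attention RMSNorm weights
total = embed + layers * (attn + mlp + norms) + hidden  # plus final norm

print(f"{total:,}")  # 345,079,808 -> ~345M
# In float32 that is ~1.38 GB, consistent with the model.safetensors size
```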
generation_config.json CHANGED
@@ -1,13 +1,13 @@
 {
   "_from_model_config": true,
   "bos_token_id": 2,
-  "do_sample": true,
   "eos_token_id": 3,
-  "max_new_tokens": 200,
   "pad_token_id": 0,
-  "repetition_penalty": 1.1,
+  "do_sample": true,
   "temperature": 0.8,
   "top_k": 50,
   "top_p": 0.95,
-  "transformers_version": "5.1.0"
-}
+  "max_new_tokens": 200,
+  "repetition_penalty": 1.1,
+  "transformers_version": "4.40.0"
+}
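
These values act as generation defaults: `model.generate()` reads them from `generation_config.json` whenever a setting is not passed explicitly. A short sketch, assuming the `model` and `inputs` from the README examples:

```python
# No arguments: sampling with the stored defaults
# (do_sample=True, temperature=0.8, top_k=50, top_p=0.95,
#  repetition_penalty=1.1, max_new_tokens=200)
outputs = model.generate(**inputs)

# Per-call keyword arguments override the stored defaults
outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.6)

# Greedy decoding: switch sampling off explicitly
outputs = model.generate(**inputs, do_sample=False)
```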
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dc06d19946fdcce9ece4f0315227ca581f7e6d71ede1d29f06f20a54ffd960c6
-size 1380446424
+oid sha256:1d69ca1354ae042dedbffe8e81be61e706fbf8dd856e80e1bf02be9cec903f74
+size 1380339896
morfessor_telugu.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4bd3d98666025b6ad481f92c4e28d4a0b1fe6cdc8f268db6d11cd55367094b11
+size 8652172
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+{
+  "bos_token": "<bos>",
+  "eos_token": "<eos>",
+  "unk_token": "<unk>",
+  "pad_token": "<pad>"
+}
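
These named tokens pair with the ids declared in config.json (pad_token_id=0, bos_token_id=2, eos_token_id=3). A quick check, assuming the tokenizer loaded as in the README; the ids shown are what config.json implies, not values verified against tokenizer.json:

```python
for name in ("pad", "bos", "eos", "unk"):
    token = getattr(tokenizer, f"{name}_token")
    token_id = getattr(tokenizer, f"{name}_token_id")
    print(name, token, token_id)
# Expected from config.json: pad <pad> 0, bos <bos> 2, eos <eos> 3
```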
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_class.py ADDED
@@ -0,0 +1,21 @@
+"""Custom Telugu tokenizer that handles @@ continuation marker stripping."""
+from transformers import PreTrainedTokenizerFast
+
+
+class TeluguTokenizer(PreTrainedTokenizerFast):
+    """Telugu tokenizer with Morfessor @@ continuation marker support.
+
+    Tokens ending with @@ are continuation pieces that join to the next token.
+    This class overrides decode() to strip @@ markers and join morphemes:
+    "రెడ్డి@@ గారు" → "రెడ్డిగారు"
+    """
+
+    def decode(self, token_ids, skip_special_tokens=False, **kwargs):
+        text = super().decode(token_ids, skip_special_tokens=skip_special_tokens, **kwargs)
+        # Strip @@ continuation markers:
+        # "@@ " between tokens means "join to next token" (no space)
+        text = text.replace("@@ ", "")
+        # Handle trailing @@ on last token (edge case)
+        if text.endswith("@@"):
+            text = text[:-2]
+        return text
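
The override is plain string post-processing, so its effect can be checked without loading the model. A small illustration of the two steps, reusing the docstring's example:

```python
# What the base decoder would emit for two morpheme tokens
raw = "రెడ్డి@@ గారు"

# The same two steps decode() applies:
text = raw.replace("@@ ", "")  # join continuation pieces to the following token
if text.endswith("@@"):        # strip a dangling marker from truncated output
    text = text[:-2]

print(text)  # రెడ్డిగారు
```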
tokenizer_config.json CHANGED
@@ -1,17 +1,23 @@
 {
-  "backend": "tokenizers",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "auto_map": {
+    "AutoTokenizer": [
+      null,
+      "tokenizer_class.TeluguTokenizer"
+    ]
+  },
+  "model_type": "llama",
   "bos_token": "<bos>",
-  "clean_up_tokenization_spaces": false,
   "eos_token": "<eos>",
+  "unk_token": "<unk>",
+  "pad_token": "<pad>",
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "clean_up_tokenization_spaces": false,
+  "model_max_length": 2048,
   "extra_info": {
-    "note": "This tokenizer expects Morfessor-segmented text as input. For raw Telugu text, run Morfessor segmentation first using the included morfessor_telugu.bin model. Tokens ending with '@@' are continuation pieces that join to the next token. The decoder handles @@ removal automatically.",
+    "type": "morfessor_bpe_telugu",
     "separator": "@@",
-    "type": "morfessor_bpe_telugu"
-  },
-  "is_local": true,
-  "model_max_length": 2048,
-  "model_type": "llama",
-  "pad_token": "<pad>",
-  "tokenizer_class": "TokenizersBackend",
-  "unk_token": "<unk>"
-}
+    "note": "This tokenizer expects Morfessor-segmented text as input. For raw Telugu text, run Morfessor segmentation first using the included morfessor_telugu.bin model. Tokens ending with '@@' are continuation pieces that join to the next token. The decoder handles @@ removal automatically."
+  }
+}
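
The `auto_map` entry is what routes `AutoTokenizer` to the custom class: the `null` slot means no slow-tokenizer implementation is provided, and the second slot names `TeluguTokenizer` from `tokenizer_class.py`. Because this imports code from the repo, loading requires `trust_remote_code=True`, matching the README:

```python
from transformers import AutoTokenizer

# trust_remote_code=True allows importing tokenizer_class.py from the repo
tokenizer = AutoTokenizer.from_pretrained(
    "YOUR_USERNAME/telugu-llama-345m",  # placeholder repo id from the README
    trust_remote_code=True,
)
print(type(tokenizer).__name__)  # TeluguTokenizer
```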