neshkatrapati committed
Commit 715004d · verified · 1 Parent(s): a7f832e

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,199 +1,121 @@
 ---
+language:
+- te
+license: apache-2.0
+tags:
+- telugu
+- llama
+- causal-lm
+- morfessor
+- from-scratch
 library_name: transformers
-tags: []
+pipeline_tag: text-generation
 ---
 
-# Model Card for Model ID
-
-<!-- Provide a quick summary of what the model is/does. -->
-
+# Telugu LLaMA (345M)
 
+A **345M parameter** LLaMA-style language model trained **from scratch** on Telugu text.
 
 ## Model Details
 
-### Model Description
-
-<!-- Provide a longer summary of what this model is. -->
-
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-
-### Model Sources [optional]
-
-<!-- Provide the basic links for the model. -->
-
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-
-## Uses
-
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
-### Direct Use
-
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
-[More Information Needed]
-
-### Downstream Use [optional]
-
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
-[More Information Needed]
-
-### Out-of-Scope Use
-
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
-[More Information Needed]
-
-## Bias, Risks, and Limitations
-
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
-[More Information Needed]
-
-### Recommendations
-
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
-## How to Get Started with the Model
-
-Use the code below to get started with the model.
-
-[More Information Needed]
-
-## Training Details
-
-### Training Data
-
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
-[More Information Needed]
-
-### Training Procedure
-
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
-#### Preprocessing [optional]
-
-[More Information Needed]
-
-
-#### Training Hyperparameters
-
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
-#### Speeds, Sizes, Times [optional]
-
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
-[More Information Needed]
-
-## Evaluation
-
-<!-- This section describes the evaluation protocols and provides the results. -->
-
-### Testing Data, Factors & Metrics
-
-#### Testing Data
-
-<!-- This should link to a Dataset Card if possible. -->
-
-[More Information Needed]
-
-#### Factors
-
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
-[More Information Needed]
-
-#### Metrics
-
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
-[More Information Needed]
-
-### Results
-
-[More Information Needed]
-
-#### Summary
-
-
-
-## Model Examination [optional]
-
-<!-- Relevant interpretability work for the model goes here -->
-
-[More Information Needed]
-
-## Environmental Impact
-
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-
-## Technical Specifications [optional]
-
-### Model Architecture and Objective
-
-[More Information Needed]
-
-### Compute Infrastructure
-
-[More Information Needed]
-
-#### Hardware
-
-[More Information Needed]
-
-#### Software
-
-[More Information Needed]
-
-## Citation [optional]
-
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
-**BibTeX:**
-
-[More Information Needed]
-
-**APA:**
-
-[More Information Needed]
-
-## Glossary [optional]
-
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
-[More Information Needed]
-
-## More Information [optional]
-
-[More Information Needed]
-
-## Model Card Authors [optional]
-
-[More Information Needed]
-
-## Model Card Contact
-
-[More Information Needed]
+| | |
+|---|---|
+| **Architecture** | LLaMA (RoPE + SwiGLU + RMSNorm) |
+| **Parameters** | 345M |
+| **Hidden size** | 1024 |
+| **Layers** | 20 |
+| **Attention heads** | 16 |
+| **Intermediate size** | 2816 |
+| **Context length** | 2048 |
+| **Vocab size** | 86,071 |
+| **Tokenizer** | Morfessor + BPE (Telugu morpheme-aware) |
+| **Training** | Single GPU, bf16 mixed precision |
+
+## Tokenizer
+
+This model uses a **Morfessor + BPE hybrid tokenizer** designed for Telugu:
+
+- **Telugu text**: Segmented into morphemes using [Morfessor](https://github.com/aalto-speech/morfessor) with `@@` continuation markers
+- **Non-Telugu text** (English, numbers, URLs): Handled by BPE subword encoding
+- **Fallback**: Character-level encoding for out-of-vocabulary tokens
+
+**Important**: The tokenizer expects **pre-segmented** input (with `@@` markers). For raw Telugu text, run Morfessor segmentation first (see the full pipeline below).
+
+## Usage
+
+### Basic usage (with pre-segmented text)
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model = AutoModelForCausalLM.from_pretrained("YOUR_USERNAME/telugu-llama-345m")
+tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/telugu-llama-345m", trust_remote_code=True)
+
+# Input must be Morfessor-segmented (with @@ continuation markers)
+segmented_text = "తెలుగు భాష చాలా అందమైన@@ ది"
+inputs = tokenizer(segmented_text, return_tensors="pt")
+
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=100,
+        temperature=0.8,
+        top_k=50,
+        do_sample=True,
+    )
+
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+### Full pipeline (raw Telugu text)
+
+For raw Telugu text, segment with Morfessor first:
+
+```python
+import morfessor
+
+# Load the Morfessor model shipped with this repo
+io = morfessor.MorfessorIO()
+morf_model = io.read_binary_model_file("morfessor_telugu.bin")
+
+def segment_telugu(text, separator="@@"):
+    import re
+    TELUGU_RE = re.compile(r"[\u0C00-\u0C7F]+")
+    tokens = []
+    for word in text.split():
+        if TELUGU_RE.fullmatch(word):
+            segments = morf_model.viterbi_segment(word)[0]
+            for i, seg in enumerate(segments):
+                tokens.append(seg + separator if i < len(segments) - 1 else seg)
+        else:
+            tokens.append(word)
+    return " ".join(tokens)
+
+# Segment, then tokenize and generate
+raw_text = "తెలుగు భాష చాలా అందమైనది"
+segmented = segment_telugu(raw_text)
+inputs = tokenizer(segmented, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+## Training
+
+- **Data**: Telugu text corpus (Sangraha dataset)
+- **Preprocessing**: Morfessor morpheme segmentation + BPE for non-Telugu
+- **Optimizer**: AdamW (lr=3e-4, weight_decay=0.1, beta1=0.9, beta2=0.95)
+- **Schedule**: Cosine LR decay with 500-step warmup
+- **Precision**: bf16 mixed precision
+- **Hardware**: Single GPU
+
+## Limitations
+
+- This is a **base model** (not instruction-tuned): it performs text completion, not instruction following
+- The tokenizer requires **Morfessor-segmented input** for best results
+- Trained primarily on Telugu text; limited multilingual capability
+- Small model size (345M) limits reasoning and knowledge capacity
+
+## License
+
+Apache 2.0
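
The Training section above lists the optimizer and schedule, but the training script itself is not part of this commit. Below is a minimal sketch of that setup, assuming PyTorch, the `get_cosine_schedule_with_warmup` helper from `transformers`, and a `model` and `batch` already in scope; the total step count is a placeholder, not a value stated anywhere in the repo.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Hyperparameters as listed under "Training" in the README
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95)
)

# Cosine decay with a 500-step warmup; num_training_steps is hypothetical
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=100_000
)

# One step in bf16 mixed precision, as listed under "Precision"
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```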
config.json CHANGED
@@ -2,31 +2,29 @@
   "architectures": [
     "LlamaForCausalLM"
   ],
-  "attention_bias": false,
-  "attention_dropout": 0.0,
-  "bos_token_id": 2,
-  "dtype": "float32",
-  "eos_token_id": 3,
-  "head_dim": 64,
-  "hidden_act": "silu",
+  "model_type": "llama",
+  "torch_dtype": "float32",
   "hidden_size": 1024,
-  "initializer_range": 0.02,
   "intermediate_size": 2816,
-  "max_position_embeddings": 2048,
-  "mlp_bias": false,
-  "model_type": "llama",
-  "num_attention_heads": 16,
   "num_hidden_layers": 20,
+  "num_attention_heads": 16,
   "num_key_value_heads": 16,
-  "pad_token_id": 0,
-  "pretraining_tp": 1,
+  "head_dim": 64,
+  "max_position_embeddings": 2048,
+  "rope_theta": 10000.0,
+  "rope_scaling": null,
   "rms_norm_eps": 1e-06,
-  "rope_parameters": {
-    "rope_theta": 10000.0,
-    "rope_type": "default"
-  },
+  "hidden_act": "silu",
+  "attention_bias": false,
+  "mlp_bias": false,
+  "vocab_size": 86071,
   "tie_word_embeddings": true,
-  "transformers_version": "5.1.0",
+  "pad_token_id": 0,
+  "bos_token_id": 2,
+  "eos_token_id": 3,
+  "attention_dropout": 0.0,
+  "initializer_range": 0.02,
+  "pretraining_tp": 1,
   "use_cache": true,
-  "vocab_size": 86097
-}
+  "transformers_version": "4.40.0"
+}
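
As a sanity check, the shapes in this config reproduce the advertised 345M parameter count, with the tied embedding matrix counted once. A quick back-of-the-envelope calculation, assuming the standard LLaMA weight shapes (no biases, SwiGLU MLP):

```python
vocab, hidden, inter, layers = 86071, 1024, 2816, 20  # from config.json

embed = vocab * hidden          # token embeddings, tied with lm_head
attn = 4 * hidden * hidden      # q/k/v/o projections (16 heads x head_dim 64)
mlp = 3 * hidden * inter        # gate/up/down projections (SwiGLU)
norms = 2 * hidden              # input + post-attention RMSNorm weights
total = embed + layers * (attn + mlp + norms) + hidden  # plus final norm

print(f"{total:,}")  # 345,079,808 -> ~345M
# In float32 that is ~1.38 GB, consistent with the model.safetensors size
```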
generation_config.json CHANGED
@@ -1,13 +1,13 @@
 {
   "_from_model_config": true,
   "bos_token_id": 2,
-  "do_sample": true,
   "eos_token_id": 3,
-  "max_new_tokens": 200,
   "pad_token_id": 0,
-  "repetition_penalty": 1.1,
+  "do_sample": true,
   "temperature": 0.8,
   "top_k": 50,
   "top_p": 0.95,
-  "transformers_version": "5.1.0"
-}
+  "max_new_tokens": 200,
+  "repetition_penalty": 1.1,
+  "transformers_version": "4.40.0"
+}
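
These values act as generation defaults: `model.generate()` reads them from `generation_config.json` whenever a setting is not passed explicitly. A short sketch, assuming the `model` and `inputs` from the README examples:

```python
# No arguments: sampling with the stored defaults
# (do_sample=True, temperature=0.8, top_k=50, top_p=0.95,
#  repetition_penalty=1.1, max_new_tokens=200)
outputs = model.generate(**inputs)

# Per-call keyword arguments override the stored defaults
outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.6)

# Greedy decoding: switch sampling off explicitly
outputs = model.generate(**inputs, do_sample=False)
```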
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dc06d19946fdcce9ece4f0315227ca581f7e6d71ede1d29f06f20a54ffd960c6
-size 1380446424
+oid sha256:1d69ca1354ae042dedbffe8e81be61e706fbf8dd856e80e1bf02be9cec903f74
+size 1380339896
morfessor_telugu.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4bd3d98666025b6ad481f92c4e28d4a0b1fe6cdc8f268db6d11cd55367094b11
+size 8652172
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+{
+  "bos_token": "<bos>",
+  "eos_token": "<eos>",
+  "unk_token": "<unk>",
+  "pad_token": "<pad>"
+}
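
These named tokens pair with the ids declared in config.json (pad_token_id=0, bos_token_id=2, eos_token_id=3). A quick check, assuming the tokenizer loaded as in the README; the ids shown are what config.json implies, not values verified against tokenizer.json:

```python
for name in ("pad", "bos", "eos", "unk"):
    token = getattr(tokenizer, f"{name}_token")
    token_id = getattr(tokenizer, f"{name}_token_id")
    print(name, token, token_id)
# Expected from config.json: pad <pad> 0, bos <bos> 2, eos <eos> 3
```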
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_class.py ADDED
@@ -0,0 +1,21 @@
+"""Custom Telugu tokenizer that handles @@ continuation marker stripping."""
+from transformers import PreTrainedTokenizerFast
+
+
+class TeluguTokenizer(PreTrainedTokenizerFast):
+    """Telugu tokenizer with Morfessor @@ continuation marker support.
+
+    Tokens ending with @@ are continuation pieces that join to the next token.
+    This class overrides decode() to strip @@ markers and join morphemes:
+    "రెడ్డి@@ గారు" → "రెడ్డిగారు"
+    """
+
+    def decode(self, token_ids, skip_special_tokens=False, **kwargs):
+        text = super().decode(token_ids, skip_special_tokens=skip_special_tokens, **kwargs)
+        # Strip @@ continuation markers:
+        # "@@ " between tokens means "join to next token" (no space)
+        text = text.replace("@@ ", "")
+        # Handle trailing @@ on last token (edge case)
+        if text.endswith("@@"):
+            text = text[:-2]
+        return text
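
The override is plain string post-processing, so its effect can be checked without loading the model. A small illustration of the two steps, reusing the docstring's example:

```python
# What the base decoder would emit for two morpheme tokens
raw = "రెడ్డి@@ గారు"

# The same two steps decode() applies:
text = raw.replace("@@ ", "")  # join continuation pieces to the following token
if text.endswith("@@"):        # strip a dangling marker from truncated output
    text = text[:-2]

print(text)  # రెడ్డిగారు
```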
tokenizer_config.json CHANGED
@@ -1,17 +1,23 @@
 {
-  "backend": "tokenizers",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "auto_map": {
+    "AutoTokenizer": [
+      null,
+      "tokenizer_class.TeluguTokenizer"
+    ]
+  },
+  "model_type": "llama",
   "bos_token": "<bos>",
-  "clean_up_tokenization_spaces": false,
   "eos_token": "<eos>",
+  "unk_token": "<unk>",
+  "pad_token": "<pad>",
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "clean_up_tokenization_spaces": false,
+  "model_max_length": 2048,
   "extra_info": {
-    "note": "This tokenizer expects Morfessor-segmented text as input. For raw Telugu text, run Morfessor segmentation first using the included morfessor_telugu.bin model. Tokens ending with '@@' are continuation pieces that join to the next token. The decoder handles @@ removal automatically.",
+    "type": "morfessor_bpe_telugu",
     "separator": "@@",
-    "type": "morfessor_bpe_telugu"
-  },
-  "is_local": true,
-  "model_max_length": 2048,
-  "model_type": "llama",
-  "pad_token": "<pad>",
-  "tokenizer_class": "TokenizersBackend",
-  "unk_token": "<unk>"
-}
+    "note": "This tokenizer expects Morfessor-segmented text as input. For raw Telugu text, run Morfessor segmentation first using the included morfessor_telugu.bin model. Tokens ending with '@@' are continuation pieces that join to the next token. The decoder handles @@ removal automatically."
+  }
+}
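
The `auto_map` entry is what routes `AutoTokenizer` to the custom class: the `null` slot means no slow-tokenizer implementation is provided, and the second slot names `TeluguTokenizer` from `tokenizer_class.py`. Because this imports code from the repo, loading requires `trust_remote_code=True`, matching the README:

```python
from transformers import AutoTokenizer

# trust_remote_code=True allows importing tokenizer_class.py from the repo
tokenizer = AutoTokenizer.from_pretrained(
    "YOUR_USERNAME/telugu-llama-345m",  # placeholder repo id from the README
    trust_remote_code=True,
)
print(type(tokenizer).__name__)  # TeluguTokenizer
```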