---
license: apache-2.0
tags:
- multilingual
- text-generation
- indic-languages
- hindi
- punjabi
- small-model
pipeline_tag: text-generation
widget:
- text: "[EN] The weather today is"
  example_title: "English Generation"
- text: "[HI] आज का मौसम"
  example_title: "Hindi Generation"
- text: "[PA] ਅੱਜ ਦਾ ਮੌਸਮ"
  example_title: "Punjabi Generation"
language:
- en
- hi
- pa
datasets:
- ai4bharat/samanantar
- PredictiveManish/multilingual-corpus
metrics:
- accuracy
library_name: transformers
---

# Trimurti-LM: A 4.2M Parameter Multilingual Language Model

## Model Description

**Trimurti-LM** is a small, efficient multilingual language model trained from scratch on English, Hindi, and Punjabi text. It is named after the Hindu trinity (Brahma, Vishnu, Shiva), reflecting a three-fold capability: creating text, preserving meaning, and transforming across scripts.

**Key Features:**
- 🏗️ **Built from scratch** - No pre-trained weights used
- 🌐 **Multilingual** - Handles 3 languages with 3 different scripts (Latin, Devanagari, Gurmukhi)
- 💾 **Tiny footprint** - Only 4.2 million parameters
- ⚡ **Fast training** - 2.38 hours on a consumer GPU (GTX 1650, 4 GB)
- 🔤 **Smart tokenization** - Custom SentencePiece with byte fallback for Indic scripts (see the training sketch below)
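
A tokenizer with these properties can be trained with the SentencePiece trainer roughly as follows. This is a sketch, not the project's actual script: the corpus filename and option values are assumptions, chosen to match the 8000-token vocabulary and byte-fallback behaviour described here.

```python
import sentencepiece as spm

# Sketch: train an 8k-vocab SentencePiece model with byte fallback, so rare
# Devanagari/Gurmukhi sequences degrade to raw bytes instead of <unk>.
# "corpus_all.txt" (mixed EN/HI/PA text, one sentence per line) is a placeholder.
spm.SentencePieceTrainer.train(
    input="corpus_all.txt",
    model_prefix="multilingual_spm",   # produces multilingual_spm.model
    vocab_size=8000,
    character_coverage=0.9995,         # assumed: high coverage across three scripts
    byte_fallback=True,                # fall back to bytes for unseen characters
    model_type="bpe",                  # assumed; unigram also supports byte fallback
)
```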

## Model Specifications

| Aspect | Details |
|--------|---------|
| **Architecture** | GPT-2 style decoder-only Transformer |
| **Parameters** | 4,672,000 |
| **Hidden Size** | 256 |
| **Layers** | 4 |
| **Attention Heads** | 8 |
| **Context Length** | 128 tokens |
| **Vocabulary** | 8000 tokens (SentencePiece) |
| **Training Steps** | 5000 |
| **Training Time** | 2.38 hours |
| **Hardware** | NVIDIA GTX 1650 (4GB VRAM) |
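
For reference, the table corresponds to roughly this `transformers` configuration (a sketch; the actual training config isn't published, and details such as bias terms, MLP width, and weight tying affect the exact parameter count):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Sketch: a GPT-2 style config matching the specification table above.
config = GPT2Config(
    vocab_size=8000,   # SentencePiece vocabulary
    n_positions=128,   # context length
    n_embd=256,        # hidden size
    n_layer=4,         # transformer blocks
    n_head=8,          # attention heads (head dim = 256 / 8 = 32)
)
model = GPT2LMHeadModel(config)

# Exact count depends on implementation details (biases, MLP width, tying),
# so this may not reproduce the reported figure exactly.
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```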

## Training Data

The model was trained on a balanced multilingual corpus:
- **English**: 150,000 sentences
- **Hindi**: 150,000 sentences
- **Punjabi**: 150,000 sentences

**Sources:**
- Primary: AI4Bharat Samanantar dataset (filtered and processed)
- Secondary: Custom curated multilingual corpus

**Data Processing** (a sketch follows the list):
- Language tagging: `[EN]`, `[HI]`, `[PA]` prefixes
- Length filtering: 5-50 words per sentence
- Script validation for each language
- Deduplication and cleaning
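
A minimal version of this preprocessing could look like the sketch below. The tag format and the 5-50 word bounds come from the list above; the Unicode-range regexes and the function name are illustrative assumptions, not the project's actual code.

```python
import re

# Script checks: Latin for English, Devanagari for Hindi, Gurmukhi for Punjabi.
SCRIPT_RE = {
    "EN": re.compile(r"[A-Za-z]"),
    "HI": re.compile(r"[\u0900-\u097F]"),
    "PA": re.compile(r"[\u0A00-\u0A7F]"),
}

def preprocess(sentences, lang):
    """Tag, length-filter, script-validate, and deduplicate one language's sentences."""
    seen, out = set(), []
    for s in sentences:
        s = " ".join(s.split())               # cleaning: normalize whitespace
        words = len(s.split())
        if not 5 <= words <= 50:              # length filtering: 5-50 words
            continue
        if not SCRIPT_RE[lang].search(s):     # script validation
            continue
        if s in seen:                         # deduplication
            continue
        seen.add(s)
        out.append(f"[{lang}] {s}")           # language tagging
    return out
```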

## Performance

| Metric | Value | Notes |
|--------|-------|-------|
| **Final Loss** | 1.206 | Cross-entropy loss |
| **Perplexity** | 3.34 | e^1.206 ≈ 3.34 |
| **Top-1 Accuracy** | ~25% | Next-token prediction |
| **Top-5 Accuracy** | ~60% | Next-token prediction |
| **Language ID Accuracy** | 95% | With explicit tags |
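
Next-token top-k accuracy of this kind can be computed roughly as follows (a sketch: `eval_batch` is a hypothetical `[batch, seq_len]` tensor of token IDs, not part of the repository):

```python
import torch

def top_k_accuracy(model, eval_batch, k=5):
    """Fraction of positions where the true next token is in the model's top-k."""
    with torch.no_grad():
        logits = model(eval_batch).logits             # [batch, seq, vocab]
    preds = logits[:, :-1].topk(k, dim=-1).indices    # top-k predictions per position
    targets = eval_batch[:, 1:].unsqueeze(-1)         # shifted ground-truth tokens
    return (preds == targets).any(dim=-1).float().mean().item()

# k=1 gives top-1 accuracy, k=5 gives top-5 accuracy.
```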

## Usage

### Quick Start

```python
from transformers import GPT2LMHeadModel
import sentencepiece as spm
import torch

# Load the SentencePiece tokenizer and the model weights
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("multilingual_spm.model")
model = GPT2LMHeadModel.from_pretrained("PredictiveManish/Trimurti-LM")
model.eval()

# Prompts start with a language tag: [EN], [HI], or [PA]
prompt = "[EN] The weather is"
input_ids = tokenizer.encode(prompt)
input_tensor = torch.tensor([input_ids])

with torch.no_grad():
    output = model.generate(
        input_ids=input_tensor,
        max_length=50,
        temperature=0.7,
        do_sample=True,
        pad_token_id=0,
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)
```
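
The same pattern works for the other two languages; switch the tag to match the widget prompts above, e.g. `"[HI] आज का मौसम"` or `"[PA] ਅੱਜ ਦਾ ਮੌਸਮ"`.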

## Citation

Surely you're not going to cite this, but if you're ever in search of the world's smallest models:

```bibtex
@software{trimurti_lm_2024,
  title  = {Trimurti-LM: A 4.2M Parameter Multilingual Language Model},
  author = {Manish},
  year   = {2024},
  url    = {https://huggingface.co/PredictiveManish/Trimurti-LM},
  note   = {Trained from scratch on English, Hindi, and Punjabi with consumer hardware}
}
```