---
license: apache-2.0
tags:
- multilingual
- text-generation
- indic-languages
- hindi
- punjabi
- small-model
pipeline_tag: text-generation
widget:
- text: "[EN] The weather today is"
example_title: "English Generation"
- text: "[HI] आज का मौसम"
example_title: "Hindi Generation"
- text: "[PA] ਅੱਜ ਦਾ ਮੌਸਮ"
example_title: "Punjabi Generation"
language:
- en
- hi
- pa
datasets:
- ai4bharat/samanantar
- PredictiveManish/multilingual-corpus
library_name: transformers
---
# Trimurti-LM: A 4.2M Parameter Multilingual Language Model
## Model Description
**Trimurti-LM** is a small, efficient multilingual language model trained from scratch on English, Hindi, and Punjabi text. Named after the Hindu trinity (Brahma-Vishnu-Shiva), it represents the three-fold capability of creating text, preserving meaning, and transforming across scripts.
**Key Features:**
- 🏗️ **Built from scratch** - No pre-trained weights used
- 🌐 **Multilingual** - Handles 3 languages with 3 different scripts
- 💾 **Tiny footprint** - Only 4.2 million parameters
- ⚡ **Fast training** - 2.38 hours on consumer GPU (GTX 1650 4GB)
- 🔤 **Smart tokenization** - Custom SentencePiece with byte fallback for Indic scripts
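Byte fallback means characters outside the learned vocabulary are decomposed into their raw UTF-8 bytes rather than collapsing to a single `<unk>` token (in SentencePiece this corresponds to training with the `byte_fallback=True` option). A minimal illustration of the decomposition:

```python
# Byte fallback: an out-of-vocab character is represented by its raw
# UTF-8 bytes (SentencePiece renders these as <0xNN> piece tokens).
ch = "ਅ"  # Gurmukhi letter A (U+0A05)
byte_pieces = [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
print(byte_pieces)  # ['<0xE0>', '<0xA8>', '<0x85>']
```

This guarantees the tokenizer can always encode (and losslessly decode) any Indic script text, even rare characters never seen during tokenizer training.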
## Model Specifications
| Aspect | Details |
|--------|---------|
| **Architecture** | GPT-2 style decoder-only Transformer |
| **Parameters** | 4,672,000 (≈4.7M) |
| **Hidden Size** | 256 |
| **Layers** | 4 |
| **Attention Heads** | 8 |
| **Context Length** | 128 tokens |
| **Vocabulary** | 8000 tokens (SentencePiece) |
| **Training Steps** | 5000 |
| **Training Time** | 2.38 hours |
| **Hardware** | NVIDIA GTX 1650 (4GB VRAM) |
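For reference, these dimensions map onto a `transformers` `GPT2Config` roughly as follows (a sketch, not the authors' training code; fields not listed in the table are left at library defaults, which may differ from the actual setup):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Sketch: config fields matching the spec table above (assumed mapping).
config = GPT2Config(
    vocab_size=8000,   # SentencePiece vocabulary
    n_positions=128,   # context length
    n_embd=256,        # hidden size
    n_layer=4,         # transformer blocks
    n_head=8,          # attention heads
)
model = GPT2LMHeadModel(config)  # randomly initialized, untrained
```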
## Training Data
The model was trained on a balanced multilingual corpus:
- **English**: 150,000 sentences
- **Hindi**: 150,000 sentences
- **Punjabi**: 150,000 sentences
**Sources:**
- Primary: AI4Bharat Samanantar dataset (filtered and processed)
- Secondary: Custom curated multilingual corpus
**Data Processing:**
- Language tagging: `[EN]`, `[HI]`, `[PA]` prefixes
- Length filtering: 5-50 words per sentence
- Script validation for each language
- Deduplication and cleaning
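The processing steps above can be sketched as a small filter pipeline (hypothetical helper code, not the authors' actual scripts; the script checks below cover only the basic Devanagari and Gurmukhi Unicode blocks):

```python
import re

# Hypothetical sketch of the described pipeline: tag, length-filter,
# script-validate, deduplicate. Regexes are illustrative assumptions.
SCRIPT_PATTERNS = {
    "EN": re.compile(r"^[\u0000-\u007F]+$"),  # ASCII only
    "HI": re.compile(r"[\u0900-\u097F]"),     # Devanagari present
    "PA": re.compile(r"[\u0A00-\u0A7F]"),     # Gurmukhi present
}

def preprocess(sentences, lang):
    """Apply length filter (5-50 words), script check, tagging, dedup."""
    seen, out = set(), []
    for s in sentences:
        s = s.strip()
        if not (5 <= len(s.split()) <= 50):
            continue                          # length filtering
        if not SCRIPT_PATTERNS[lang].search(s):
            continue                          # script validation
        tagged = f"[{lang}] {s}"              # language tagging
        if tagged not in seen:                # deduplication
            seen.add(tagged)
            out.append(tagged)
    return out
```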
## Performance
| Metric | Value | Notes |
|--------|-------|-------|
| **Final Loss** | 1.206 | Cross-entropy loss |
| **Perplexity** | 3.34 | e^1.206 ≈ 3.34 |
| **Top-1 Accuracy** | ~25% | Next token prediction |
| **Top-5 Accuracy** | ~60% | Next token prediction |
| **Language ID Accuracy** | 95% | With explicit tags |
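Perplexity here is simply the exponential of the cross-entropy loss:

```python
import math

# Perplexity = exp(cross-entropy loss)
loss = 1.206
perplexity = math.exp(loss)
print(round(perplexity, 2))  # 3.34
```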
## Usage
### Quick Start
```python
from transformers import GPT2LMHeadModel
import sentencepiece as spm
import torch

# Load model and tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("multilingual_spm.model")
model = GPT2LMHeadModel.from_pretrained("PredictiveManish/Trimurti-LM")
model.eval()

# Generate text
prompt = "[EN] The weather is"
input_ids = tokenizer.encode(prompt)
input_tensor = torch.tensor([input_ids])

with torch.no_grad():
    output = model.generate(
        input_ids=input_tensor,
        max_length=50,
        temperature=0.7,
        do_sample=True,
        pad_token_id=0,
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)
```
## Citation
If you use Trimurti-LM in your work, please cite:
```bibtex
@software{trimurti_lm_2026,
title = {Trimurti-LM: A 4.2M Parameter Multilingual Language Model},
author = {Manish Tiwari},
year = {2026},
url = {https://huggingface.co/PredictiveManish/Trimurti-LM},
note = {Trained from scratch on English, Hindi, and Punjabi with consumer hardware}
}
```
### Primary Dataset
```bibtex
@inproceedings{samanantar_2021,
title = {Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
author = {Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
booktitle = {Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks},
year = {2021},
url = {https://arxiv.org/abs/2104.05596}
}
```