---
license: apache-2.0
tags:
- multilingual
- text-generation
- indic-languages
- hindi
- punjabi
- small-model
pipeline_tag: text-generation
widget:
- text: "[EN] The weather today is"
  example_title: "English Generation"
- text: "[HI] आज का मौसम"
  example_title: "Hindi Generation"
- text: "[PA] ਅੱਜ ਦਾ ਮੌਸਮ"
  example_title: "Punjabi Generation"
language:
- en
- hi
- pa
datasets:
- ai4bharat/samanantar
- PredictiveManish/multilingual-corpus
library_name: transformers
---

# Trimurti-LM: A 4.2M Parameter Multilingual Language Model

## Model Description

**Trimurti-LM** is a small, efficient multilingual language model trained from scratch on English, Hindi, and Punjabi text. Named after the Hindu trinity (Brahma, Vishnu, and Shiva), it represents a three-fold capability: creating text, preserving meaning, and transforming across scripts.

**Key Features:**

- 🏗️ **Built from scratch** - No pre-trained weights used
- 🌐 **Multilingual** - Handles 3 languages with 3 different scripts (Latin, Devanagari, Gurmukhi)
- 💾 **Tiny footprint** - Only 4.2 million parameters
- ⚡ **Fast training** - 2.38 hours on a consumer GPU (GTX 1650, 4 GB)
- 🔤 **Smart tokenization** - Custom SentencePiece model with byte fallback for Indic scripts

## Model Specifications

| Aspect | Details |
|--------|---------|
| **Architecture** | GPT-2-style decoder-only Transformer |
| **Parameters** | 4,672,000 |
| **Hidden Size** | 256 |
| **Layers** | 4 |
| **Attention Heads** | 8 |
| **Context Length** | 128 tokens |
| **Vocabulary** | 8,000 tokens (SentencePiece) |
| **Training Steps** | 5,000 |
| **Training Time** | 2.38 hours |
| **Hardware** | NVIDIA GTX 1650 (4 GB VRAM) |

## Training Data

The model was trained on a balanced multilingual corpus:

- **English**: 150,000 sentences
- **Hindi**: 150,000 sentences
- **Punjabi**: 150,000 sentences

**Sources:**

- Primary: AI4Bharat Samanantar dataset (filtered and processed)
- Secondary: Custom curated multilingual corpus

**Data Processing:**

- Language tagging: `[EN]`, `[HI]`, `[PA]` prefixes
- Length filtering: 5-50 words per sentence
- Script validation for each language
- Deduplication and cleaning

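The processing steps above (script validation aside) can be sketched as follows. The helper name is hypothetical; the actual pipeline is not published with the model.

```python
def preprocess(sentences, lang_tag):
    """Tag, length-filter, and deduplicate raw sentences for one language."""
    seen, cleaned = set(), []
    for s in sentences:
        s = " ".join(s.split())              # normalise whitespace
        if not 5 <= len(s.split()) <= 50:    # length filtering: 5-50 words
            continue
        if s in seen:                        # deduplication
            continue
        seen.add(s)
        cleaned.append(f"[{lang_tag}] {s}")  # language tagging
    return cleaned

result = preprocess(["The weather today is pleasant and mild", "Hi"], "EN")
print(result)  # → ['[EN] The weather today is pleasant and mild']
```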
## Performance

| Metric | Value | Notes |
|--------|-------|-------|
| **Final Loss** | 1.206 | Cross-entropy loss |
| **Perplexity** | 3.34 | e^1.206 ≈ 3.34 |
| **Top-1 Accuracy** | ~25% | Next-token prediction |
| **Top-5 Accuracy** | ~60% | Next-token prediction |
| **Language ID Accuracy** | 95% | With explicit tags |
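
The perplexity figure is simply the exponential of the mean cross-entropy loss (in nats per token), which can be checked directly:

```python
import math

final_loss = 1.206                    # mean cross-entropy loss (nats/token)
perplexity = math.exp(final_loss)
print(round(perplexity, 2))           # → 3.34
```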

## Usage

### Quick Start

```python
from transformers import GPT2LMHeadModel
import sentencepiece as spm
import torch

# Load the model and the SentencePiece tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("multilingual_spm.model")
model = GPT2LMHeadModel.from_pretrained("PredictiveManish/Trimurti-LM")

# Generate text from a language-tagged prompt
prompt = "[EN] The weather is"
input_ids = tokenizer.encode(prompt)
input_tensor = torch.tensor([input_ids])

with torch.no_grad():
    output = model.generate(
        input_ids=input_tensor,
        max_length=50,
        temperature=0.7,
        do_sample=True,
        pad_token_id=0,
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)
```
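
Because every training sentence is prefixed with a language tag, prompts must carry the matching `[EN]`/`[HI]`/`[PA]` tag. A small helper (hypothetical, not shipped with the model) makes this harder to get wrong:

```python
LANG_TAGS = {"en": "EN", "hi": "HI", "pa": "PA"}

def tag_prompt(text, lang):
    """Prefix a prompt with the language tag the model was trained on."""
    if lang not in LANG_TAGS:
        raise ValueError(f"unsupported language: {lang!r}")
    return f"[{LANG_TAGS[lang]}] {text}"

print(tag_prompt("The weather is", "en"))  # → [EN] The weather is
```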

## Citation

If you use Trimurti-LM in your work, please cite:

```bibtex
@software{trimurti_lm_2026,
  title  = {Trimurti-LM: A 4.2M Parameter Multilingual Language Model},
  author = {Manish Tiwari},
  year   = {2026},
  url    = {https://huggingface.co/PredictiveManish/Trimurti-LM},
  note   = {Trained from scratch on English, Hindi, and Punjabi with consumer hardware}
}
```

### Primary Dataset

```bibtex
@article{samanantar_2021,
  title   = {Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
  author  = {Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  journal = {Transactions of the Association for Computational Linguistics},
  year    = {2022},
  url     = {https://arxiv.org/abs/2104.05596}
}
```

---