---
license: apache-2.0
tags:
- multilingual
- text-generation
- indic-languages
- hindi
- punjabi
- small-model
pipeline_tag: text-generation
widget:
- text: "[EN] The weather today is"
  example_title: "English Generation"
- text: "[HI] आज का मौसम"
  example_title: "Hindi Generation"
- text: "[PA] ਅੱਜ ਦਾ ਮੌਸਮ"
  example_title: "Punjabi Generation"
language:
- en
- hi
- pa
datasets:
- ai4bharat/samanantar
- PredictiveManish/multilingual-corpus
library_name: transformers
---

# Trimurti-LM: A 4.2M Parameter Multilingual Language Model

## Model Description

**Trimurti-LM** is a small, efficient multilingual language model trained from scratch on English, Hindi, and Punjabi text. Named after the Hindu trinity (Brahma, Vishnu, and Shiva), it reflects a three-fold capability: creating text, preserving meaning, and transforming across scripts.

**Key Features:**

- 🏗️ **Built from scratch** - no pre-trained weights used
- 🌐 **Multilingual** - handles 3 languages with 3 different scripts
- 💾 **Tiny footprint** - only 4.2 million parameters
- ⚡ **Fast training** - 2.38 hours on a consumer GPU (GTX 1650, 4GB)
- 🔤 **Smart tokenization** - custom SentencePiece model with byte fallback for Indic scripts

## Model Specifications

| Aspect | Details |
|--------|---------|
| **Architecture** | GPT-2 style decoder-only Transformer |
| **Parameters** | 4,672,000 |
| **Hidden Size** | 256 |
| **Layers** | 4 |
| **Attention Heads** | 8 |
| **Context Length** | 128 tokens |
| **Vocabulary** | 8,000 tokens (SentencePiece) |
| **Training Steps** | 5,000 |
| **Training Time** | 2.38 hours |
| **Hardware** | NVIDIA GTX 1650 (4GB VRAM) |

## Training Data

The model was trained on a balanced multilingual corpus:

- **English**: 150,000 sentences
- **Hindi**: 150,000 sentences
- **Punjabi**: 150,000 sentences

**Sources:**

- Primary: AI4Bharat Samanantar dataset (filtered and processed)
- Secondary: Custom curated multilingual corpus

**Data Processing:**

- Language tagging: `[EN]`, `[HI]`, `[PA]` prefixes
- Length filtering: 5-50 words per sentence
- Script validation for each language
- Deduplication and cleaning

A minimal sketch of the tagging and filtering step is shown under Usage below.

## Performance

| Metric | Value | Notes |
|--------|-------|-------|
| **Final Loss** | 1.206 | Cross-entropy loss |
| **Perplexity** | 3.32 | exp(cross-entropy loss) |
| **Top-1 Accuracy** | ~25% | Next-token prediction |
| **Top-5 Accuracy** | ~60% | Next-token prediction |
| **Language ID Accuracy** | 95% | With explicit tags |

## Usage

### Quick Start

```python
from transformers import GPT2LMHeadModel
import sentencepiece as spm
import torch

# Load model and tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("multilingual_spm.model")
model = GPT2LMHeadModel.from_pretrained("PredictiveManish/Trimurti-LM")

# Generate text
prompt = "[EN] The weather is"
input_ids = tokenizer.encode(prompt)
input_tensor = torch.tensor([input_ids])

with torch.no_grad():
    output = model.generate(
        input_ids=input_tensor,
        max_length=50,
        temperature=0.7,
        do_sample=True,
        pad_token_id=0,
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)
```
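### Data Tagging and Filtering

The Data Processing steps above (language tagging, 5-50 word length filtering, deduplication) can be sketched as follows. This is a minimal illustration, not the exact pipeline used for this model; `tag_and_filter` is a hypothetical helper.

```python
def tag_and_filter(sentences, tag):
    """Prefix each sentence with its language tag; keep only 5-50 word,
    previously unseen lines. Illustrative only."""
    seen = set()
    for sentence in sentences:
        sentence = " ".join(sentence.split())  # normalize whitespace
        n_words = len(sentence.split())
        if 5 <= n_words <= 50 and sentence not in seen:
            seen.add(sentence)
            yield f"{tag} {sentence}"

# Example: tag English sentences for the training corpus
english = ["The weather today is pleasant and mild across the region."]
print(list(tag_and_filter(english, "[EN]")))
# ['[EN] The weather today is pleasant and mild across the region.']
```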
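### Training a Compatible Tokenizer

The tokenizer described above (8,000-token vocabulary, byte fallback, language-tag symbols) can be trained with SentencePiece roughly as follows. The input file name and the exact trainer options are assumptions, not the model's actual recipe.

```python
import sentencepiece as spm

# Sketch: train an 8,000-token tokenizer with byte fallback so that rare
# Indic characters decompose into byte tokens instead of becoming <unk>.
# "tagged_corpus.txt" (one tagged sentence per line) is a hypothetical file.
spm.SentencePieceTrainer.train(
    input="tagged_corpus.txt",
    model_prefix="multilingual_spm",
    vocab_size=8000,
    byte_fallback=True,
    character_coverage=0.9995,  # a common setting for multi-script corpora
    user_defined_symbols=["[EN]", "[HI]", "[PA]"],  # keep tags as single tokens
)
```

This produces `multilingual_spm.model`, the file loaded in the Quick Start example.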
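### Computing Perplexity

The perplexity reported above is the exponential of the cross-entropy loss. Here is a minimal sketch of how to reproduce it on held-out text; the evaluation sentence is illustrative, and a real measurement would average the loss over a corpus.

```python
import torch
import sentencepiece as spm
from transformers import GPT2LMHeadModel

tokenizer = spm.SentencePieceProcessor()
tokenizer.load("multilingual_spm.model")
model = GPT2LMHeadModel.from_pretrained("PredictiveManish/Trimurti-LM")
model.eval()

text = "[HI] आज का मौसम बहुत अच्छा है"  # illustrative held-out sentence
input_ids = torch.tensor([tokenizer.encode(text)])

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss.
    loss = model(input_ids=input_ids, labels=input_ids).loss

print(f"loss = {loss.item():.3f}, perplexity = {torch.exp(loss).item():.2f}")
```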
## Citation

If you use Trimurti-LM in your work, please cite:

```bibtex
@software{trimurti_lm_2026,
  title  = {Trimurti-LM: A 4.2M Parameter Multilingual Language Model},
  author = {Manish Tiwari},
  year   = {2026},
  url    = {https://huggingface.co/PredictiveManish/Trimurti-LM},
  note   = {Trained from scratch on English, Hindi, and Punjabi on consumer hardware}
}
```

### Primary Dataset

```bibtex
@inproceedings{samanantar_2021,
  title     = {Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
  author    = {Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  booktitle = {Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks},
  year      = {2021},
  url       = {https://arxiv.org/abs/2104.05596}
}
```

---