---
license: apache-2.0
tags:
- multilingual
- text-generation
- indic-languages
- hindi
- punjabi
- small-model
pipeline_tag: text-generation
widget:
- text: "[EN] The weather today is"
  example_title: "English Generation"
- text: "[HI] आज का मौसम"
  example_title: "Hindi Generation"
- text: "[PA] ਅੱਜ ਦਾ ਮੌਸਮ"
  example_title: "Punjabi Generation"
language:
- en
- hi
- pa
datasets:
- ai4bharat/samanantar
- PredictiveManish/multilingual-corpus
library_name: transformers
---

# Trimurti-LM: A 4.2M Parameter Multilingual Language Model

## Model Description

**Trimurti-LM** is a small, efficient multilingual language model trained from scratch on English, Hindi, and Punjabi text. Named after the Hindu trinity (Brahma-Vishnu-Shiva), it represents the three-fold capability of creating text, preserving meaning, and transforming across scripts.

**Key Features:**
- 🏗️ **Built from scratch** - No pre-trained weights used
- 🌐 **Multilingual** - Handles 3 languages with 3 different scripts
- 💾 **Tiny footprint** - Only 4.2 million parameters
- ⚡ **Fast training** - 2.38 hours on a consumer GPU (GTX 1650, 4 GB)
- 🔤 **Smart tokenization** - Custom SentencePiece with byte fallback for Indic scripts (sketched below)
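
As a rough illustration of the tokenization bullet, here is a minimal sketch of training such a tokenizer with SentencePiece. The corpus path, `model_type`, and `character_coverage` are assumptions; the card only specifies the vocabulary size and byte fallback:

```python
import sentencepiece as spm

# Illustrative training call; the corpus path and most options are assumptions.
# byte_fallback=True decomposes out-of-vocabulary characters into UTF-8 bytes,
# which keeps rare Devanagari and Gurmukhi sequences representable.
spm.SentencePieceTrainer.train(
    input="corpus.txt",               # hypothetical: one sentence per line, all three languages
    model_prefix="multilingual_spm",  # yields multilingual_spm.model / .vocab
    vocab_size=8000,                  # matches the spec table below
    model_type="unigram",             # assumption; the card does not state the algorithm
    byte_fallback=True,
    character_coverage=0.9995,        # common choice for multilingual corpora
)
```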

## Model Specifications

| Aspect | Details |
|--------|---------|
| **Architecture** | GPT-2 style decoder-only Transformer |
| **Parameters** | 4,672,000 (4.2M) |
| **Hidden Size** | 256 |
| **Layers** | 4 |
| **Attention Heads** | 8 |
| **Context Length** | 128 tokens |
| **Vocabulary** | 8000 tokens (SentencePiece) |
| **Training Steps** | 5000 |
| **Training Time** | 2.38 hours |
| **Hardware** | NVIDIA GTX 1650 (4GB VRAM) |
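
The table maps directly onto a `transformers` `GPT2Config`. A minimal sketch of instantiating the same shape, assuming every option not listed in the table is left at the `transformers` default:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Configuration mirroring the spec table; unlisted options stay at defaults.
config = GPT2Config(
    vocab_size=8000,  # SentencePiece vocabulary
    n_positions=128,  # context length
    n_embd=256,       # hidden size
    n_layer=4,        # transformer blocks
    n_head=8,         # attention heads
)
model = GPT2LMHeadModel(config)

# Count trainable parameters for this configuration.
print(sum(p.numel() for p in model.parameters() if p.requires_grad))
```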

## Training Data

The model was trained on a balanced multilingual corpus:
- **English**: 150,000 sentences
- **Hindi**: 150,000 sentences  
- **Punjabi**: 150,000 sentences

**Sources:**
- Primary: AI4Bharat Samanantar dataset (filtered and processed)
- Secondary: Custom curated multilingual corpus

**Data Processing** (sketched below):
- Language tagging: `[EN]`, `[HI]`, `[PA]` prefixes
- Length filtering: 5-50 words per sentence
- Script validation for each language
- Deduplication and cleaning
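
A minimal sketch of the tagging and length-filtering steps, assuming the inputs arrive as (sentence, language-code) pairs; the helper name is hypothetical and the thresholds simply restate the list above, since the actual pipeline is not published:

```python
# Illustrative preprocessing; the real pipeline is not published.
from typing import Optional

TAGS = {"en": "[EN]", "hi": "[HI]", "pa": "[PA]"}

def preprocess(sentence: str, lang: str) -> Optional[str]:
    """Prefix a sentence with its language tag; drop it if outside 5-50 words."""
    words = sentence.split()
    if not 5 <= len(words) <= 50:
        return None  # length filter from the list above
    return f"{TAGS[lang]} {' '.join(words)}"

print(preprocess("The weather today is pleasant across the northern plains", "en"))
# [EN] The weather today is pleasant across the northern plains
```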

## Performance

| Metric | Value | Notes |
|--------|-------|-------|
| **Final Loss** | 1.206 | Cross-entropy loss |
| **Perplexity** | 3.34 | exp(1.206) ≈ 3.34 |
| **Top-1 Accuracy** | ~25% | Next token prediction |
| **Top-5 Accuracy** | ~60% | Next token prediction |
| **Language ID Accuracy** | 95% | With explicit tags |
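
Perplexity here is just the exponential of the mean cross-entropy loss, so the relationship in the table can be checked in one line:

```python
import math

loss = 1.206           # final cross-entropy from the table
print(math.exp(loss))  # ≈ 3.34
```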

## Usage

### Quick Start

```python
from transformers import GPT2LMHeadModel
import sentencepiece as spm
import torch

# Load model and tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("multilingual_spm.model")
model = GPT2LMHeadModel.from_pretrained("PredictiveManish/Trimurti-LM")

# Generate text
prompt = "[EN] The weather is"
input_ids = tokenizer.encode(prompt)
input_tensor = torch.tensor([input_ids])

with torch.no_grad():
    output = model.generate(
        input_ids=input_tensor,
        max_length=50,
        temperature=0.7,
        do_sample=True,
        pad_token_id=0
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)
```
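
The snippet above assumes `multilingual_spm.model` is already on disk. If the tokenizer file is hosted in the model repo, it can be fetched first with `huggingface_hub`; the filename matches the one used above, and its location at the repo root is an assumption:

```python
from huggingface_hub import hf_hub_download

# Download the SentencePiece model before loading it with SentencePieceProcessor.
spm_path = hf_hub_download(
    repo_id="PredictiveManish/Trimurti-LM",
    filename="multilingual_spm.model",  # assumed to sit at the repo root
)
```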

## Citation
If you use Trimurti-LM in your work, please cite:

```bibtex
@software{trimurti_lm_2026,
  title = {Trimurti-LM: A 4.2M Parameter Multilingual Language Model},
  author = {Manish Tiwari},
  year = {2026},
  url = {https://huggingface.co/PredictiveManish/Trimurti-LM},
  note = {Trained from scratch on English, Hindi, and Punjabi with consumer hardware}
}
```


### Primary Dataset

```bibtex
@article{samanantar_2021,
  title = {Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
  author = {Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {10},
  year = {2022},
  url = {https://arxiv.org/abs/2104.05596}
}
```