---
license: apache-2.0
tags:
- multilingual
- text-generation
- indic-languages
- hindi
- punjabi
- small-model
pipeline_tag: text-generation
widget:
- text: "[EN] The weather today is"
  example_title: "English Generation"
- text: "[HI] आज का मौसम"
  example_title: "Hindi Generation"
- text: "[PA] ਅੱਜ ਦਾ ਮੌਸਮ"
  example_title: "Punjabi Generation"
language:
- en
- hi
- pa
datasets:
- ai4bharat/samanantar
- PredictiveManish/multilingual-corpus
library_name: transformers
---

# Trimurti-LM: A 4.2M Parameter Multilingual Language Model

## Model Description

**Trimurti-LM** is a small, efficient multilingual language model trained from scratch on English, Hindi, and Punjabi text. Named after the Hindu trinity (Brahma, Vishnu, and Shiva), it represents a three-fold capability: creating text, preserving meaning, and transforming across scripts.

**Key Features:**

- 🏗️ **Built from scratch** - No pre-trained weights used
- 🌐 **Multilingual** - Handles 3 languages with 3 different scripts (Latin, Devanagari, Gurmukhi)
- 💾 **Tiny footprint** - Only 4.2 million parameters
- ⚡ **Fast training** - 2.38 hours on a consumer GPU (GTX 1650, 4 GB)
- 🔤 **Smart tokenization** - Custom SentencePiece model with byte fallback for Indic scripts

## Model Specifications

| Aspect | Details |
|--------|---------|
| **Architecture** | GPT-2-style decoder-only Transformer |
| **Parameters** | 4,672,000 |
| **Hidden Size** | 256 |
| **Layers** | 4 |
| **Attention Heads** | 8 |
| **Context Length** | 128 tokens |
| **Vocabulary** | 8,000 tokens (SentencePiece) |
| **Training Steps** | 5,000 |
| **Training Time** | 2.38 hours |
| **Hardware** | NVIDIA GTX 1650 (4 GB VRAM) |

## Training Data

The model was trained on a balanced multilingual corpus:

- **English**: 150,000 sentences
- **Hindi**: 150,000 sentences
- **Punjabi**: 150,000 sentences

**Sources:**

- Primary: AI4Bharat Samanantar dataset (filtered and processed)
- Secondary: Custom curated multilingual corpus

**Data Processing:**

- Language tagging: `[EN]`, `[HI]`, `[PA]` prefixes
- Length filtering: 5-50 words per sentence
- Script validation for each language
- Deduplication and cleaning

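The processing steps above (script validation aside) can be sketched as follows. The helper name is hypothetical; the actual pipeline is not published with the model.

```python
def preprocess(sentences, lang_tag):
    """Tag, length-filter, and deduplicate raw sentences for one language."""
    seen, cleaned = set(), []
    for s in sentences:
        s = " ".join(s.split())              # normalise whitespace
        if not 5 <= len(s.split()) <= 50:    # length filtering: 5-50 words
            continue
        if s in seen:                        # deduplication
            continue
        seen.add(s)
        cleaned.append(f"[{lang_tag}] {s}")  # language tagging
    return cleaned

result = preprocess(["The weather today is pleasant and mild", "Hi"], "EN")
print(result)  # → ['[EN] The weather today is pleasant and mild']
```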
## Performance

| Metric | Value | Notes |
|--------|-------|-------|
| **Final Loss** | 1.206 | Cross-entropy loss |
| **Perplexity** | 3.34 | e^1.206 ≈ 3.34 |
| **Top-1 Accuracy** | ~25% | Next-token prediction |
| **Top-5 Accuracy** | ~60% | Next-token prediction |
| **Language ID Accuracy** | 95% | With explicit tags |
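
The perplexity figure is simply the exponential of the mean cross-entropy loss (in nats per token), which can be checked directly:

```python
import math

final_loss = 1.206                    # mean cross-entropy loss (nats/token)
perplexity = math.exp(final_loss)
print(round(perplexity, 2))           # → 3.34
```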

## Usage

### Quick Start

```python
from transformers import GPT2LMHeadModel
import sentencepiece as spm
import torch

# Load the model and the SentencePiece tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("multilingual_spm.model")
model = GPT2LMHeadModel.from_pretrained("PredictiveManish/Trimurti-LM")

# Generate text from a language-tagged prompt
prompt = "[EN] The weather is"
input_ids = tokenizer.encode(prompt)
input_tensor = torch.tensor([input_ids])

with torch.no_grad():
    output = model.generate(
        input_ids=input_tensor,
        max_length=50,
        temperature=0.7,
        do_sample=True,
        pad_token_id=0,
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)
```
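
Because every training sentence is prefixed with a language tag, prompts must carry the matching `[EN]`/`[HI]`/`[PA]` tag. A small helper (hypothetical, not shipped with the model) makes this harder to get wrong:

```python
LANG_TAGS = {"en": "EN", "hi": "HI", "pa": "PA"}

def tag_prompt(text, lang):
    """Prefix a prompt with the language tag the model was trained on."""
    if lang not in LANG_TAGS:
        raise ValueError(f"unsupported language: {lang!r}")
    return f"[{LANG_TAGS[lang]}] {text}"

print(tag_prompt("The weather is", "en"))  # → [EN] The weather is
```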

## Citation

If you use Trimurti-LM in your work, please cite:

```bibtex
@software{trimurti_lm_2026,
  title  = {Trimurti-LM: A 4.2M Parameter Multilingual Language Model},
  author = {Manish Tiwari},
  year   = {2026},
  url    = {https://huggingface.co/PredictiveManish/Trimurti-LM},
  note   = {Trained from scratch on English, Hindi, and Punjabi with consumer hardware}
}
```

### Primary Dataset

```bibtex
@article{samanantar_2021,
  title   = {Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
  author  = {Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  journal = {Transactions of the Association for Computational Linguistics},
  year    = {2022},
  url     = {https://arxiv.org/abs/2104.05596}
}
```

---