Commit d550d77 by PredictiveManish · verified · 1 parent: a98d412

Update README.md

Files changed (1)
  1. README.md +128 -0
README.md CHANGED
@@ -1,4 +1,120 @@
  ---
  datasets:
  - ai4bharat/samanantar
  - PredictiveManish/multilingual-corpus
@@ -9,4 +125,16 @@ language:
  metrics:
  - accuracy
  pipeline_tag: text-generation
  ---

  ---
+ ---
+ license: apache-2.0
+ tags:
+ - multilingual
+ - text-generation
+ - indic-languages
+ - hindi
+ - punjabi
+ - small-model
+ pipeline_tag: text-generation
+ widget:
+ - text: "[EN] The weather today is"
+   example_title: "English Generation"
+ - text: "[HI] आज का मौसम"
+   example_title: "Hindi Generation"
+ - text: "[PA] ਅੱਜ ਦਾ ਮੌਸਮ"
+   example_title: "Punjabi Generation"
+ language:
+ - en
+ - hi
+ - pa
+ datasets:
+ - ai4bharat/samanantar
+ - PredictiveManish/multilingual-corpus
+ library_name: transformers
+ ---
+
+ # Trimurti-LM: A 4.2M Parameter Multilingual Language Model
+
+ ## Model Description
+
+ **Trimurti-LM** is a small, efficient multilingual language model trained from scratch on English, Hindi, and Punjabi text. Named after the Hindu trinity (Brahma-Vishnu-Shiva), it represents the three-fold capability of creating text, preserving meaning, and transforming across scripts.
+
+ **Key Features:**
+ - 🏗️ **Built from scratch** - No pre-trained weights used
+ - 🌐 **Multilingual** - Handles 3 languages in 3 different scripts
+ - 💾 **Tiny footprint** - Only 4.2 million parameters
+ - ⚡ **Fast training** - 2.38 hours on a consumer GPU (GTX 1650, 4GB)
+ - 🔤 **Smart tokenization** - Custom SentencePiece tokenizer with byte fallback for Indic scripts
+
44
+ | Aspect | Details |
45
+ |--------|---------|
46
+ | **Architecture** | GPT-2 style decoder-only Transformer |
47
+ | **Parameters** | 4,672,000 (4.2M) |
48
+ | **Hidden Size** | 256 |
49
+ | **Layers** | 4 |
50
+ | **Attention Heads** | 8 |
51
+ | **Context Length** | 128 tokens |
52
+ | **Vocabulary** | 8000 tokens (SentencePiece) |
53
+ | **Training Steps** | 5000 |
54
+ | **Training Time** | 2.38 hours |
55
+ | **Hardware** | NVIDIA GTX 1650 (4GB VRAM) |
56
+
57
+ ## Training Data
58
+
59
+ The model was trained on a balanced multilingual corpus:
60
+ - **English**: 150,000 sentences
61
+ - **Hindi**: 150,000 sentences
62
+ - **Punjabi**: 150,000 sentences
63
+
64
+ **Sources:**
65
+ - Primary: AI4Bharat Samanantar dataset (filtered and processed)
66
+ - Secondary: Custom curated multilingual corpus
67
+
68
+ **Data Processing:**
69
+ - Language tagging: `[EN]`, `[HI]`, `[PA]` prefixes
70
+ - Length filtering: 5-50 words per sentence
71
+ - Script validation for each language
72
+ - Deduplication and cleaning
73
+
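The language tagging and length filtering steps listed under Data Processing could look roughly like the sketch below. This is an illustration only, not the author's preprocessing script: the function name `tag_and_filter` and its parameters are hypothetical, words are counted by whitespace splitting (an assumption), and script validation and deduplication are left out.

```python
def tag_and_filter(sentences, lang_tag, min_words=5, max_words=50):
    """Keep sentences of min_words..max_words words and prefix each with a language tag."""
    kept = []
    for sentence in sentences:
        text = sentence.strip()
        n_words = len(text.split())  # whitespace word count (assumed definition of "word")
        if min_words <= n_words <= max_words:
            kept.append(f"[{lang_tag}] {text}")
    return kept


samples = [
    "Too short.",                                # 2 words, dropped
    "The weather today is pleasant and sunny.",  # 7 words, kept
]
tagged = tag_and_filter(samples, "EN")
print(tagged)
```

The same function would be called once per language with the matching tag (`"HI"`, `"PA"`), so that every training sentence starts with the prefix used in the prompts.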
+ ## Performance
+
+ | Metric | Value | Notes |
+ |--------|-------|-------|
+ | **Final Loss** | 1.206 | Cross-entropy loss |
+ | **Perplexity** | 3.34 | e^1.206 ≈ 3.34 |
+ | **Top-1 Accuracy** | ~25% | Next token prediction |
+ | **Top-5 Accuracy** | ~60% | Next token prediction |
+ | **Language ID Accuracy** | 95% | With explicit tags |
+
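Perplexity is the exponential of the mean cross-entropy loss, assuming the loss is measured in nats (the default for PyTorch's cross-entropy). A quick check of the relationship, with illustrative variable names:

```python
import math

final_loss = 1.206                 # final cross-entropy loss from the table (assumed to be in nats)
perplexity = math.exp(final_loss)  # perplexity = e^loss

print(f"perplexity = {perplexity:.2f}")
```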
+ ## Usage
+
+ ### Quick Start
+
+ ```python
+ from transformers import GPT2LMHeadModel
+ import sentencepiece as spm
+ import torch
+
+ # Load the SentencePiece tokenizer (expects multilingual_spm.model in the working directory)
+ tokenizer = spm.SentencePieceProcessor()
+ tokenizer.load("multilingual_spm.model")
+
+ # Load the model weights and switch to inference mode
+ model = GPT2LMHeadModel.from_pretrained("PredictiveManish/Trimurti-LM")
+ model.eval()
+
+ # Encode a prompt, prefixed with a language tag ([EN], [HI], or [PA])
+ prompt = "[EN] The weather is"
+ input_ids = tokenizer.encode(prompt)
+ input_tensor = torch.tensor([input_ids])
+
+ # Sample up to 50 tokens in total (prompt included)
+ with torch.no_grad():
+     output = model.generate(
+         input_ids=input_tensor,
+         max_length=50,
+         temperature=0.7,
+         do_sample=True,
+         pad_token_id=0
+     )
+
+ generated = tokenizer.decode(output[0].tolist())
+ print(generated)
+ ```
+
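The Quick Start samples with `temperature=0.7`. As a rough illustration of what that setting does, the sketch below divides the logits by the temperature before the softmax, which sharpens the distribution when the temperature is below 1. The helper `softmax_with_temperature` and the logit values are made up for this example and are not part of the model or of `transformers`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]                         # made-up next-token scores
base = softmax_with_temperature(logits, 1.0)     # unscaled distribution
sharp = softmax_with_temperature(logits, 0.7)    # more mass on the top token

print([round(p, 3) for p in base])
print([round(p, 3) for p in sharp])
```

Lower temperatures make the output more repetitive but more coherent, which tends to matter for a model this small; values above 1 flatten the distribution instead.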
  datasets:
  - ai4bharat/samanantar
  - PredictiveManish/multilingual-corpus

  metrics:
  - accuracy
  pipeline_tag: text-generation
+
+
+ Citation (surely you're not going to use this, but still, in case you're in search of the worst models):
+ ```
+ @software{trimurti_lm_2024,
+   title = {Trimurti-LM: A 4.2M Parameter Multilingual Language Model},
+   author = {Manish},
+   year = {2024},
+   url = {https://huggingface.co/PredictiveManish/Trimurti-LM},
+   note = {Trained from scratch on English, Hindi, and Punjabi with consumer hardware}
+ }
+ ```
  ---