abirmaheshwari committed
Commit e5fb94e · verified · 1 Parent(s): 086c624

Upload 6 files


![abirhinv1](https://cdn-uploads.huggingface.co/production/uploads/6626d5bd866dbe78089d3b23/LlUU2RvTjr4y3hDHc6NI-.png)

Files changed (6)
  1. LICENSE +6 -0
  2. README.MD +138 -0
  3. config.json +16 -0
  4. model.safetensors +3 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +9 -0
LICENSE ADDED
@@ -0,0 +1,6 @@
+ @misc{abirhinv1,
+ author = {Abir Maheshwari},
+ title = {ABIRHINv1: Pure Hindi Model from Scratch},
+ year = {2026},
+ url = {https://huggingface.co/AbirMaheshwari/ABIRHINv1}
+ }
README.MD ADDED
@@ -0,0 +1,138 @@
+ # ABIRHINv1
+
+ **Pure Hindi Language Model – Built Entirely From Scratch**
+
+ **Version 1** – Created by Abir Maheshwari in February 2026 using only the Google Colab free tier (T4 GPU). ≈100 million parameters.
+
+ ### About the Model
+
+ ABIRHINv1 is the second member of the ABIR Indic SLM Family (after Marathi).
+
+ It is a **decoder-only causal language model** built **100% from scratch**:
+
+ - Random weight initialization (no pretrained checkpoints or base models)
+ - Custom architecture using PyTorch `nn.TransformerDecoder` layers
+ - Custom tokenizer trained only on Hindi data (byte-level BPE from zero – no inheritance from any existing tokenizer)
+ - Trained exclusively on Hindi text (IndicCorpV2), Romanized Hindi (Bhasha-Abhijnaanam), creator-personality data, and translation pairs
+
+ It generates fluent Hindi (Devanagari), understands Romanized input, and performs basic English → Hindi translation.
+
+ ### Purpose & Motive
+
+ **Purpose**: A tiny, offline Hindi AI for millions in North India – no internet required, runs on low-end devices.
+
+ **Motive**: Show that anyone can build Indic models from scratch, empowering Hindi speakers in the AI era.
+
+ ### Target Audience
+
+ - Hindi families & kids
+ - North India users (daily chat, news, forms)
+ - Students, writers, teachers
+ - Offline developers
+
+ ### Capabilities (Version 1)
+
+ - Fluent Hindi generation
+ - Romanized understanding ("Main kya karun?")
+ - English → Hindi translation
+ - Knows its creator: **Abir Maheshwari from Mumbai**
+ - ~400–500 MB size – fast offline
+
+ ### Use Cases
+
+ 1. Family Hindi chatbot
+ 2. Stories & poems
+ 3. Writing help
+ 4. Quick translation
+ 5. Offline learning
+
+ ### Creator Information
+
+ **Created by**: Abir Maheshwari (Mumbai, Maharashtra, India)
+
+ **Writer • Programmer • Entrepreneur • Artist**
+
+ **Follow me:**
+
+ - X / Twitter: [@AbirMaheshwari](https://x.com/AbirMaheshwari)
+ - Instagram: [@anantraga31](https://instagram.com/anantraga31)
+ - LinkedIn: [Abir Maheshwari](https://linkedin.com/in/abirmaheshwari)
+
+ **Model says**: "मैं ABIRHINv1 हूँ। मेरे निर्माता अभीर महेश्वरी हैं।" ("I am ABIRHINv1. My creator is Abir Maheshwari.")
+
+ ### Technical Details
+
+ - Architecture: Custom decoder-only (10 layers, 640-dim hidden size, 10 heads, GELU, learnable positional embeddings)
+ - Parameters: ≈100 million
+ - From scratch: Yes
+ - Tokenizer: Byte-level BPE from zero (32k vocab)
+ - Dataset: IndicCorpV2 (hin_Deva), Bhasha-Abhijnaanam Hindi, custom pairs
+ - Compute: Colab free T4 (~1–2 hours)
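The Technical Details in this README are complete enough to sketch the architecture. The README names PyTorch `nn.TransformerDecoder` layers; the minimal sketch below instead uses `nn.TransformerEncoderLayer` with a causal mask, which gives the same decoder-only self-attention pattern without requiring an encoder memory input. Class and variable names here are illustrative, not the repository's actual implementation.

```python
import torch
import torch.nn as nn


class TinyHindiLM(nn.Module):
    """Decoder-only LM matching the stated spec: 10 layers, hidden size 640,
    10 heads, FFN 2560 (from config.json), GELU, learnable positional
    embeddings, 512-token context, 32k vocab."""

    def __init__(self, vocab_size=32000, d_model=640, n_heads=10,
                 n_layers=10, d_ff=2560, max_len=512, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learnable positions
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            dropout=dropout, activation="gelu", batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids):
        seq_len = ids.size(1)
        pos = torch.arange(seq_len, device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)
        # Causal mask: each position attends only to itself and the past.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.blocks(x, mask=mask)
        return self.lm_head(self.norm(x))


model = TinyHindiLM()
n_params = sum(p.numel() for p in model.parameters())
```

With these settings the count lands around 90M; the exact figure depends on details such as biases and weight tying, and the uploaded float32 checkpoint (~428 MB) implies roughly 107M, so both are in the ballpark of the stated ≈100M.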
+
+ ### Limitations
+
+ Small model – basic fluency, short context, no real-time knowledge.
+
+ ### How to Use
+
+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline("text-generation", model="AbirMaheshwari/ABIRHINv1")
+ # Prompt: "Who is my creator?"
+ print(pipe("मेरे निर्माता कौन हैं?", max_new_tokens=80)[0]["generated_text"])
+ ```
+
+ Note: config.json declares the custom model type `abir-slm`, so loading through `transformers` may additionally require `trust_remote_code=True`.
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+ "architectures": [
+ "ABIRForCausalLM"
+ ],
+ "dropout": 0.1,
+ "dtype": "float32",
+ "hidden_size": 640,
+ "intermediate_size": 2560,
+ "max_position_embeddings": 512,
+ "model_type": "abir-slm",
+ "num_heads": 10,
+ "num_layers": 10,
+ "transformers_version": "5.0.0",
+ "use_cache": false,
+ "vocab_size": 32000
+ }
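As a sanity check on the README's "≈100 million parameters" claim, the sizes in this config imply roughly the following count. This is a back-of-envelope estimate assuming a standard transformer block with an untied output head; biases and layer norms (a fraction of a percent) are ignored.

```python
# Back-of-envelope parameter count from the config.json values above.
cfg = {
    "hidden_size": 640,
    "intermediate_size": 2560,
    "num_layers": 10,
    "max_position_embeddings": 512,
    "vocab_size": 32000,
}

d, ff = cfg["hidden_size"], cfg["intermediate_size"]
# Token embeddings plus learnable positional embeddings.
embeddings = cfg["vocab_size"] * d + cfg["max_position_embeddings"] * d
# Per block: 4 attention projections (Q, K, V, output) + 2 FFN matrices.
per_layer = 4 * d * d + 2 * d * ff
# Untied LM head projecting back to the vocabulary.
total = embeddings + cfg["num_layers"] * per_layer + d * cfg["vocab_size"]
print(f"~{total / 1e6:.0f}M parameters")  # ~90M, in line with the stated ≈100M
```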
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:db193b7f8dd126b7345e46c25df6d6a1f26c3c5c205fd57721b75b4a9190105f
+ size 427800504
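The pointer's `size` field allows another quick check: with float32 weights (4 bytes each, per config.json's `"dtype": "float32"`), the checkpoint's byte count maps almost directly to a parameter count.

```python
# Parameter count implied by the safetensors file size above, assuming
# float32 weights (4 bytes each) and ignoring the small safetensors header.
size_bytes = 427_800_504
approx_params = size_bytes // 4
print(f"~{approx_params / 1e6:.0f}M parameters")  # ~107M
```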
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "backend": "tokenizers",
+ "bos_token": "<bos>",
+ "eos_token": "<eos>",
+ "model_max_length": 1000000000000000019884624838656,
+ "pad_token": "<pad>",
+ "tokenizer_class": "TokenizersBackend",
+ "unk_token": "<unk>"
+ }
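The enormous `model_max_length` is not a typo: it is the `transformers` "no limit configured" sentinel (exactly `int(1e30)`). The usable context is really bounded by `max_position_embeddings` (512) in config.json, so callers should clamp before truncating or padding. A minimal sketch, with the values copied from the two files above:

```python
# model_max_length is the transformers "unset" sentinel; clamp it to the
# model's real context window from config.json before using it as a limit.
tokenizer_cfg = {"model_max_length": 1000000000000000019884624838656}
model_cfg = {"max_position_embeddings": 512}

effective_context = min(tokenizer_cfg["model_max_length"],
                        model_cfg["max_position_embeddings"])
print(effective_context)  # 512
```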