Badnyal committed on
Commit 7cf7d0a · verified · 1 Parent(s): e90c82a

Update README.md

Files changed (1): README.md +250 −2
README.md CHANGED
@@ -1,3 +1,251 @@
- # NE-LID fastText model
-
- Model files uploaded. Model card coming next.
---
language:
- as
- brx
- en
- grt
- hi
- kha
- trp
- mni
- lus
- njz
- njo
tags:
- language-identification
- fasttext
- northeast-india
- low-resource
- multilingual
license: cc-by-4.0
metrics:
- accuracy
- f1
library_name: fasttext
pipeline_tag: text-classification
model-index:
- name: NE-LID
  results:
  - task:
      type: text-classification
      name: Language Identification
    metrics:
    - type: accuracy
      value: 99.09
      name: Test Accuracy
    - type: f1
      value: 99
      name: Macro F1-Score
---
# NE-LID: Northeast Language Identification

![License](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)
![Accuracy](https://img.shields.io/badge/Accuracy-99.09%25-brightgreen)

NE-LID is a **sentence-level language identification model** for low-resource languages of **Northeast India**, trained using a **character n-gram fastText classifier**.

The model achieves **near-ceiling accuracy (99.1%)** and is designed to be **fast, robust, and reproducible**, especially for script-diverse and low-resource settings.

---

## 🌐 Supported Languages (11)

| Language | Family | Script |
|----------|--------|--------|
| Assamese | Indo-Aryan | Bengali-Assamese |
| Bodo | Tibeto-Burman | Devanagari |
| English | Germanic | Latin |
| Garo | Tibeto-Burman | Latin |
| Hindi | Indo-Aryan | Devanagari |
| Khasi | Austroasiatic | Latin |
| Kokborok | Tibeto-Burman | Latin |
| Meitei | Tibeto-Burman | Bengali |
| Mizo | Tibeto-Burman | Latin |
| Naga | Tibeto-Burman | Latin |
| Nyishi | Tibeto-Burman | Latin |
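
If downstream code needs the ISO 639 codes from the metadata above rather than language names, a small lookup table can sit next to the model. This is a sketch under one loud assumption: only the `khasi` label string is confirmed by the usage example further down, and the other label strings are assumed to follow the same lowercase-name pattern.

```python
# Hypothetical mapping from predicted label strings to the ISO 639 codes
# declared in the model card metadata. Only "khasi" is confirmed by the
# example output below; the other label strings are assumed.
LABEL_TO_ISO = {
    "assamese": "as", "bodo": "brx", "english": "en", "garo": "grt",
    "hindi": "hi", "khasi": "kha", "kokborok": "trp", "meitei": "mni",
    "mizo": "lus", "naga": "njo", "nyishi": "njz",
}
```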

---

## 📊 Model Details

- **Model type**: fastText supervised classifier
- **Architecture**: Character n-grams (2–5)
- **Task**: Sentence-level Language Identification (LID)
- **Training data**: 22,000 sentences (2,000 per language)
- **Train / Dev / Test split**: 70% / 15% / 15% (stratified)
- **Evaluation accuracy**: **99.09%** (macro-F1: 0.99)
- **Model size**: ~10 MB
- **Inference speed**: <5 ms per sentence (see the timing sketch below)
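
The latency figure can be sanity-checked with a quick micro-benchmark. This is a minimal sketch, assuming `ne_lid.bin` sits in the working directory; absolute numbers will vary with hardware.

```python
import time

import fasttext

model = fasttext.load_model("ne_lid.bin")
sentence = "Ki paidbah shnong ki la ia shim bynta ha ka jingïalang"

model.predict(sentence)  # warm-up call before timing
n = 1000
start = time.perf_counter()
for _ in range(n):
    model.predict(sentence)
elapsed_ms = (time.perf_counter() - start) * 1000 / n
print(f"~{elapsed_ms:.3f} ms per sentence")
```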

---

## 🎯 Why fastText?

Extensive experiments show that **character-level models outperform transformer-based language models** (e.g., NE-BERT, XLM-R) for Northeast Indian LID.

**Key findings:**
- Transformer models (NE-BERT, XLM-R) achieved only 9–37% accuracy on challenging samples
- fastText maintained 99%+ accuracy even on script-diverse, low-resource languages
- Character n-grams capture orthographic patterns better than subword tokenization for these languages

This model therefore prioritizes:
- ✅ Script awareness
- ✅ Orthographic cues
- ✅ Low-resource robustness

---

## 📈 Performance

| Language | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Assamese | 1.00 | 1.00 | 1.00 | 300 |
| Bodo | 0.99 | 0.98 | 0.99 | 300 |
| English | 0.96 | 0.99 | 0.98 | 300 |
| Garo | 0.99 | 1.00 | 1.00 | 300 |
| Hindi | 0.96 | 0.97 | 0.97 | 300 |
| Khasi | 1.00 | 0.99 | 0.99 | 300 |
| Kokborok | 1.00 | 0.99 | 1.00 | 300 |
| Meitei | 1.00 | 0.99 | 1.00 | 300 |
| Mizo | 0.99 | 0.99 | 0.99 | 300 |
| Naga | 1.00 | 1.00 | 1.00 | 300 |
| Nyishi | 1.00 | 0.99 | 0.99 | 300 |
| **Overall** | **0.99** | **0.99** | **0.99** | **3,300** |

**Test Accuracy: 99.09%**

---

## 🚀 Installation

```bash
pip install fasttext
```

---

## 💻 Usage

### Basic Usage (Python)

```python
import fasttext

# Load the model
model = fasttext.load_model("ne_lid.bin")

# Predict the language of a single sentence
text = "Ki paidbah shnong ki la ia shim bynta ha ka jingïalang"
labels, probs = model.predict(text)

print(f"Language: {labels[0].replace('__label__', '')}")
print(f"Confidence: {probs[0]:.4f}")
```

**Output:**
```
Language: khasi
Confidence: 0.9999
```

### Batch Prediction

```python
texts = [
    "Ka sngi ka lieh",
    "আজি মই বজাৰলৈ গৈছিলোঁ",
    "Mizo tawng hi a ṭha hle"
]

# predict() on a list returns one tuple of labels and one array of
# probabilities per input sentence, so iterate over them in parallel
all_labels, all_probs = model.predict(texts)
for text, labels, probs in zip(texts, all_labels, all_probs):
    lang = labels[0].replace('__label__', '')
    print(f"{text[:30]:30} → {lang:10} ({probs[0]:.3f})")
```

### Get Top-K Predictions

```python
# Get the top 3 language predictions for a single sentence
labels, probs = model.predict(text, k=3)

for label, prob in zip(labels, probs):
    lang = label.replace('__label__', '')
    print(f"{lang}: {prob:.4f}")
```

---

## ⚠️ Limitations

- **Designed for monolingual sentences** – not optimized for code-mixed text
- **Sentence-level only** – not designed for word-level or document-level LID
- **Performance may degrade** on extremely short inputs (≤2 tokens); a confidence-threshold fallback, sketched below, can help
- **English/Hindi confusion** at 96–97% (expected due to loanwords and script overlap)
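
Since very short and code-mixed inputs are where mistakes concentrate, one common mitigation is to reject predictions below a confidence cutoff. This is a minimal sketch, not part of the released model; the 0.90 threshold is a hypothetical value to tune on your own data.

```python
# Hedged sketch: fall back to "unknown" when the classifier is not
# confident, rather than trusting a weak guess on a 2-token input.
THRESHOLD = 0.90  # hypothetical cutoff; tune on held-out data

labels, probs = model.predict("Ka sngi", k=1)
if probs[0] >= THRESHOLD:
    print(labels[0].replace("__label__", ""))
else:
    print("unknown")
```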

---

## 📦 Model Files

- `ne_lid.bin` - Main fastText model (binary format)
- `ne_lid.ftz` - Compressed model (optional, for smaller deployments; see the quantization sketch below)
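
For reference, a `.ftz` like the one shipped here can be produced from the `.bin` with fastText's built-in quantization, and it loads through the same API. This is a sketch of the general recipe, not necessarily the exact settings used for this release.

```python
import fasttext

model = fasttext.load_model("ne_lid.bin")

# Product-quantize the model to shrink it on disk; the arguments here
# are library defaults, not confirmed settings for the released .ftz
model.quantize(retrain=False)
model.save_model("ne_lid.ftz")

# The compressed model is loaded exactly like the .bin
small = fasttext.load_model("ne_lid.ftz")
print(small.predict("Ka sngi ka lieh"))
```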

---

## 🔬 Training Details

**Data Sources:**
- Training corpus derived from the NE-BERT dataset
- 2,000 sentences per language, stratified by length and script
- Balanced across language families (Austroasiatic, Tibeto-Burman, Indo-Aryan)

**Hyperparameters:**
- Learning rate: 0.1
- Epochs: 25
- Word n-grams: 1–3
- Character n-grams: 2–5
- Loss function: Softmax
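
In fastText's Python API these hyperparameters map onto `train_supervised` as sketched below. The file name `train.txt` is hypothetical; it stands for a file in fastText's supervised format, one sentence per line prefixed with its `__label__<language>` tag.

```python
import fasttext

# Sketch of the training call implied by the hyperparameters above;
# "train.txt" is a hypothetical path, not a file shipped with the model
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.1,          # learning rate
    epoch=25,        # training epochs
    wordNgrams=3,    # word n-grams up to length 3
    minn=2,          # minimum character n-gram length
    maxn=5,          # maximum character n-gram length
    loss="softmax",  # plain softmax loss
)
model.save_model("ne_lid.bin")
```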

---

## 📄 License

This model is released under **Creative Commons Attribution 4.0 International (CC BY 4.0)**.

You are free to:
- ✅ Share — copy and redistribute the material
- ✅ Adapt — remix, transform, and build upon the material

Under the following terms:
- 📌 Attribution — you must give appropriate credit to MWire Labs

---

## 📚 Citation

If you use NE-LID in your research or applications, please cite:

```bibtex
@misc{mwirelabs2025nelid,
  title={NE-LID: Northeast Language Identification},
  author={MWire Labs},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/MWirelabs/ne-lid}}
}
```

---

## 🏢 About MWire Labs

**MWire Labs** is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.

**Repository:** [MWirelabs/ne-lid](https://huggingface.co/MWirelabs/ne-lid)
**Contact:** [MWire Labs](https://mwirelabs.com)

---

## 🙏 Acknowledgments

We thank the open-source community and the contributors to the NE-BERT corpus that made this work possible.

---

**Last Updated:** January 2025
**Version:** 1.0.0