Update README.md
README.md
@@ -66,6 +66,7 @@ Evaluated on a 5k-sentence dev set:
 | **Overall** | **0.962** | **0.967** | **0.964** |
 
 ⚠️ Low scores for **TOURIST / FAUNA** due to very few training examples - performance will improve with more labeled data.
+Note: The current evaluation set does not include enough examples of **NAMES**, so that category is not reported in the table. Training data did include a small gazetteer of Khasi and regional names (~81 entries), but more labeled examples are needed for meaningful evaluation.
 
 ---
 
@@ -80,6 +81,12 @@ Evaluated on a 5k-sentence dev set:
 * **Optimizer**: AdamW
 * **Framework**: HuggingFace Transformers Trainer API
 
+### 📦 Dataset Size
+- Train set: ~20,000 sentences
+- Dev set: ~5,000 sentences
+- Sources: Gazetteers (districts, tribes, flora/fauna, festivals, tourist sites, names), news articles, tourism/cultural descriptions
+
+
 ### 🔧 Environment
 
 * **Transformers**: 4.44.2
@@ -125,6 +132,13 @@ You are free to use, share, and adapt the model for non-commercial purposes with
 
 ---
 
+### Data Licenses
+- Gazetteers of villages and tribes: compiled by MWirelabs (open reference use).
+- Festivals, tourist sites, and names: curated by MWirelabs team.
+Please ensure attribution when reusing any derived dataset.
+
+---
+
 ## Citation
 
 If you use this model in your research, please cite:
@@ -141,6 +155,19 @@ If you use this model in your research, please cite:
 
 ---
 
+## ⚠️ Limitations
+- Low support for **TOURIST** and **FAUNA** classes (few examples).
+- **NAMES** entity class trained but not evaluated due to lack of dev set coverage.
+- Possible confusion between **TRIBES** and **PLACES** where names overlap (e.g., Garo).
+- Model optimized for Northeast India texts; performance outside this domain may degrade.
+
+## 🔮 Future Work
+- Add more gold-labeled examples for underrepresented classes (Names, Fauna, Tourist).
+- Explore active learning to identify low-confidence predictions for manual annotation.
+- Expand coverage of festivals and indigenous knowledge domains.
+
+---
+
 ## 🏢 About
 
 This model is developed by **MWirelabs**, pioneering AI solutions for the rich cultural and linguistic diversity of **Northeast India**.