Badnyal committed on
Commit
dd5dd7a
·
verified ·
1 Parent(s): 572aeff

Update README.md

Files changed (1)
  1. README.md +27 -0
README.md CHANGED
@@ -66,6 +66,7 @@ Evaluated on a 5k-sentence dev set:
66
  | **Overall** | **0.962** | **0.967** | **0.964** |
67
 
68
  โš ๏ธ Low scores for **TOURIST / FAUNA** due to very few training examples โ€” performance will improve with more labeled data.
 
69
 
70
  ---
71
 
@@ -80,6 +81,12 @@ Evaluated on a 5k-sentence dev set:
80
  * **Optimizer**: AdamW
81
  * **Framework**: HuggingFace Transformers Trainer API
82
 
 
 
 
 
 
 
83
  ### 🔧 Environment
84
 
85
  * **Transformers**: 4.44.2
@@ -125,6 +132,13 @@ You are free to use, share, and adapt the model for non-commercial purposes with
125
 
126
  ---
127
 
 
 
 
 
 
 
 
128
  ## 📖 Citation
129
 
130
  If you use this model in your research, please cite:
@@ -141,6 +155,19 @@ If you use this model in your research, please cite:
141
 
142
  ---
143
 
 
 
 
 
 
 
 
 
 
 
 
 
 
144
  ## ๐Ÿข About
145
 
146
  This model is developed by **MWirelabs**, pioneering AI solutions for the rich cultural and linguistic diversity of **Northeast India**.
 
66
  | **Overall** | **0.962** | **0.967** | **0.964** |
67
 
68
  โš ๏ธ Low scores for **TOURIST / FAUNA** due to very few training examples โ€” performance will improve with more labeled data.
69
+ Note: The current evaluation set does not include enough examples of **NAMES**, so that category is not reported in the table. Training data did include a small gazetteer of Khasi and regional names (~81 entries), but more labeled examples are needed for meaningful evaluation.
70
 
71
  ---
72
 
 
81
  * **Optimizer**: AdamW
82
  * **Framework**: HuggingFace Transformers Trainer API
83
 
84
+ ### 📦 Dataset Size
85
+ - Train set: ~20,000 sentences
86
+ - Dev set: ~5,000 sentences
87
+ - Sources: Gazetteers (districts, tribes, flora/fauna, festivals, tourist sites, names), news articles, tourism/cultural descriptions
88
+
89
+
90
  ### 🔧 Environment
91
 
92
  * **Transformers**: 4.44.2
 
132
 
133
  ---
134
 
135
+ ### 🗂 Data Licenses
136
+ - Gazetteers of villages and tribes: compiled by MWirelabs (open reference use).
137
+ - Festivals, tourist sites, and names: curated by MWirelabs team.
138
+ Please ensure attribution when reusing any derived dataset.
139
+
140
+ ---
141
+
142
  ## 📖 Citation
143
 
144
  If you use this model in your research, please cite:
 
155
 
156
  ---
157
 
158
+ ## โš ๏ธ Limitations
159
+ - Low support for **TOURIST** and **FAUNA** classes (few examples).
160
+ - **NAMES** entity class trained but not evaluated due to lack of dev set coverage.
161
+ - Possible confusion between **TRIBES** and **PLACES** where names overlap (e.g., Garo).
162
+ - Model optimized for Northeast India texts; performance outside this domain may degrade.
163
+
164
+ ## 🔮 Future Work
165
+ - Add more gold-labeled examples for underrepresented classes (Names, Fauna, Tourist).
166
+ - Explore active learning to identify low-confidence predictions for manual annotation.
167
+ - Expand coverage of festivals and indigenous knowledge domains.
168
+
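The active-learning item under Future Work could be approached with least-confidence sampling. The sketch below is only an illustration under stated assumptions, not the authors' pipeline: it assumes per-token label probabilities are already available for each sentence (for example, a softmax over the model's logits), and the function names, the `threshold` of 0.6, and the `limit` of 100 are hypothetical choices rather than anything specified in the README.

```python
from typing import Dict, List, Sequence, Tuple


def sentence_confidence(token_probs: Sequence[Sequence[float]]) -> float:
    """Return the lowest top-label probability across a sentence's tokens.

    Each inner sequence is assumed to be one token's probability
    distribution over entity labels. An empty sentence scores 1.0.
    """
    return min((max(probs) for probs in token_probs), default=1.0)


def select_for_annotation(
    predictions: Dict[str, Sequence[Sequence[float]]],
    threshold: float = 0.6,  # hypothetical cut-off, tune on held-out data
    limit: int = 100,        # hypothetical annotation budget per round
) -> List[Tuple[str, float]]:
    """Pick the sentences whose weakest token prediction falls below threshold.

    Returns (sentence_id, confidence) pairs, least confident first,
    truncated to the annotation budget.
    """
    scored = [(sid, sentence_confidence(p)) for sid, p in predictions.items()]
    uncertain = [(sid, conf) for sid, conf in scored if conf < threshold]
    uncertain.sort(key=lambda pair: pair[1])
    return uncertain[:limit]


if __name__ == "__main__":
    # Toy probabilities for three sentences; real values would come from the model.
    toy = {
        "s1": [[0.90, 0.10], [0.55, 0.45]],
        "s2": [[0.99, 0.01]],
        "s3": [[0.40, 0.35, 0.25]],
    }
    for sentence_id, confidence in select_for_annotation(toy):
        print(sentence_id, confidence)
```

Scoring a sentence by its weakest token tends to surface sentences containing a single ambiguous span, such as a name that could be either a tribe or a place, which is where manual labels are likely to help most.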
169
+ ---
170
+
171
  ## ๐Ÿข About
172
 
173
  This model is developed by **MWirelabs**, pioneering AI solutions for the rich cultural and linguistic diversity of **Northeast India**.