oliviermills committed (verified)
Commit baf78f2 · 1 Parent(s): 6f900e2

Upload README.md with huggingface_hub

Files changed (1):
  README.md +107 -164

README.md CHANGED
@@ -1,201 +1,144 @@
  ---
  tags:
  - setfit
  - sentence-transformers
  - text-classification
- - generated_from_setfit_trainer
- widget:
- - text: Gaddafi cuts off water to Libya's capital
- - text: Grenade blast in water tank leaves 40 families without water in Potrerito,
-   Valle del Cauca, Colombia
- - text: Silvan Dam construction site attacked
- - text: in the afternoon, US forces destroy (likely through airstrikes) 2 suspected
-   Houthi patrol boats in an unidentified area in the South Red Sea while Houthi
-   media reported 3 air raids on As Salif coastal district (coded to As Salif Port)
-   (Al Hudaydah). Casualties unknown.
- - text: a group of Fulani men clashed with and killed a suspected Fulani bull thief
-   in the Goure Kele district of Sakabansi (Nikki, Borgou). He was found dead in
-   his house after being struck with a machete during the clash by one of the members
-   of the group, who then fled.
  metrics:
  - accuracy
- pipeline_tag: text-classification
- library_name: setfit
- inference: false
- base_model: BAAI/bge-small-en-v1.5
  ---

- # SetFit with BAAI/bge-small-en-v1.5

- This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) as the Sentence Transformer embedding model. A OneVsRestClassifier instance is used for classification.

- The model has been trained using an efficient few-shot learning technique that involves:

- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.

- ## Model Details

- ### Model Description
- - **Model Type:** SetFit
- - **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- - **Classification head:** a OneVsRestClassifier instance
- - **Maximum Sequence Length:** 512 tokens
- - **Number of Classes:** 3 classes
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->

- ### Model Sources

- - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
- - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
- - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)

- ## Uses

- ### Direct Use for Inference

- First install the SetFit library:

- ```bash
- pip install setfit
- ```

- Then you can load this model and run inference.

  ```python
  from setfit import SetFitModel

- # Download from the 🤗 Hub
  model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
- # Run inference
- preds = model("Silvan Dam construction site attacked")
  ```

- <!--
- ### Downstream Use
-
- *List how someone could finetune this model on their own dataset.*
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Training Set Metrics
- | Training set | Min | Median  | Max |
- |:-------------|:----|:--------|:----|
- | Word count   | 4   | 25.9533 | 236 |
-
- ### Training Hyperparameters
- - batch_size: (32, 32)
- - num_epochs: (1, 1)
- - max_steps: -1
- - sampling_strategy: undersampling
- - body_learning_rate: (2e-05, 1e-05)
- - head_learning_rate: 0.01
- - loss: CosineSimilarityLoss
- - distance_metric: cosine_distance
- - margin: 0.25
- - end_to_end: False
- - use_amp: False
- - warmup_proportion: 0.1
- - l2_weight: 0.01
- - seed: 42
- - eval_max_steps: -1
- - load_best_model_at_end: True
-
- ### Training Results
- | Epoch  | Step | Training Loss | Validation Loss |
- |:------:|:----:|:-------------:|:---------------:|
- | 0.0007 | 1    | 0.2168        | -               |
- | 0.0339 | 50   | 0.2108        | -               |
- | 0.0679 | 100  | 0.1126        | -               |
- | 0.1018 | 150  | 0.0719        | -               |
- | 0.1358 | 200  | 0.0616        | -               |
- | 0.1697 | 250  | 0.0518        | -               |
- | 0.2037 | 300  | 0.0454        | -               |
- | 0.2376 | 350  | 0.0393        | -               |
- | 0.2716 | 400  | 0.0324        | -               |
- | 0.3055 | 450  | 0.0265        | -               |
- | 0.3394 | 500  | 0.0279        | -               |
- | 0.3734 | 550  | 0.0231        | -               |
- | 0.4073 | 600  | 0.0231        | -               |
- | 0.4413 | 650  | 0.0228        | -               |
- | 0.4752 | 700  | 0.0272        | -               |
- | 0.5092 | 750  | 0.0216        | -               |
- | 0.5431 | 800  | 0.0186        | -               |
- | 0.5771 | 850  | 0.0195        | -               |
- | 0.6110 | 900  | 0.0174        | -               |
- | 0.6449 | 950  | 0.0163        | -               |
- | 0.6789 | 1000 | 0.0174        | -               |
- | 0.7128 | 1050 | 0.0148        | -               |
- | 0.7468 | 1100 | 0.0167        | -               |
- | 0.7807 | 1150 | 0.0158        | -               |
- | 0.8147 | 1200 | 0.0146        | -               |
- | 0.8486 | 1250 | 0.0146        | -               |
- | 0.8826 | 1300 | 0.0145        | -               |
- | 0.9165 | 1350 | 0.0138        | -               |
- | 0.9504 | 1400 | 0.0142        | -               |
- | 0.9844 | 1450 | 0.0130        | -               |
- | 1.0    | 1473 | -             | 0.0577          |
-
- ### Framework Versions
- - Python: 3.12.12
- - SetFit: 1.1.3
- - Sentence Transformers: 5.1.2
- - Transformers: 4.57.3
- - PyTorch: 2.9.1+cu128
- - Datasets: 4.4.1
- - Tokenizers: 0.22.1

  ## Citation

- ### BibTeX
  ```bibtex
- @article{https://doi.org/10.48550/arxiv.2209.11055,
-   doi = {10.48550/ARXIV.2209.11055},
-   url = {https://arxiv.org/abs/2209.11055},
-   author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
-   keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
-   title = {Efficient Few-Shot Learning Without Prompts},
-   publisher = {arXiv},
-   year = {2022},
-   copyright = {Creative Commons Attribution 4.0 International}
  }
  ```

- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
  ---
+ license: cc-by-nc-4.0
+ library_name: setfit
  tags:
  - setfit
  - sentence-transformers
  - text-classification
+ - multi-label
+ - water-conflict
  metrics:
+ - f1
  - accuracy
+ language:
+ - en
  ---

+ # Water Conflict Multi-Label Classifier

+ > **Note:** This is experimental research in support of the Pacific Institute's [Water Conflict Chronology](https://www.worldwater.org/water-conflict/) project, which tracks water-related conflicts spanning over 4,500 years of human history.

+ ## License & Attribution

+ Copyright © 2025 Baobab Tech

+ This work is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](http://creativecommons.org/licenses/by-nc/4.0/).

+ **You are free to:**
+ - **Share** — copy and redistribute the material in any medium or format
+ - **Adapt** — remix, transform, and build upon the material

+ **Under the following terms:**
+ - **Attribution** — You must give appropriate credit to Baobab Tech, provide a link to the license, and indicate if changes were made.
+ - **NonCommercial** — You may not use the material for commercial purposes.

+ For commercial licensing inquiries, please contact Baobab Tech.

+ ## Model Description

+ This model classifies news headlines about water-related conflicts into three categories:
+ - **Trigger**: water resources as a trigger of conflict
+ - **Casualty**: water infrastructure as a casualty or target of conflict
+ - **Weapon**: water used as a weapon or tool of conflict

+ These categories align with the Pacific Institute's Water Conflict Chronology framework for understanding how water intersects with security and conflict.

+ ## Model Details

+ - **Base model**: BAAI/bge-small-en-v1.5
+ - **Architecture**: SetFit with a One-vs-Rest multi-label strategy
+ - **Training approach**: few-shot learning (SetFit reaches peak performance with small samples)
+ - **Training samples**: 600 (sampled from a pool of 4,468)
+ - **Test samples**: 789 (held out, never seen during training)
+ - **Training time**: ~2-5 minutes on an A10G GPU

+ ## Usage

  ```python
  from setfit import SetFitModel

  model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
+
+ headlines = [
+     "Taliban attack workers at the Kajaki Dam in Afghanistan",
+     "New water treatment plant opens in California",
+ ]
+
+ predictions = model.predict(headlines)
+ print(predictions)
  ```
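`model.predict` on a multi-label SetFit model returns one binary label vector per input. Below is a minimal decoding sketch, assuming the label order Trigger, Casualty, Weapon; that order is an assumption here and should be verified against the model's saved config:

```python
# Assumed label order -- verify against the model's config before relying on it.
LABELS = ["Trigger", "Casualty", "Weapon"]

def decode(row):
    """Map one binary multi-label vector to its active label names."""
    return [name for name, flag in zip(LABELS, row) if flag]

# Example binary vectors shaped like model.predict() output for two headlines:
rows = [[0, 1, 0], [0, 0, 0]]
decoded = [decode(r) for r in rows]
print(decoded)  # [['Casualty'], []]
```

A single headline can activate several labels at once (e.g. both Casualty and Weapon), which is the point of the one-vs-rest setup: each label gets its own independent binary decision.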

+ ## Evaluation Results

+ Evaluated on a held-out test set of 789 samples (15% of the total data, stratified by label combination).

+ ### Overall Performance

+ | Metric | Score |
+ |--------|-------|
+ | Exact Match Accuracy | 0.9024 |
+ | Hamming Loss | 0.0469 |
+ | F1 (micro) | 0.8754 |
+ | F1 (macro) | 0.8134 |
+ | F1 (samples) | 0.4647 |
+
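The two headline metrics in the table measure different things: exact match accuracy scores whole label vectors (all three bits must be right), while Hamming loss scores individual label bits. A small self-contained illustration on toy matrices (not the model's actual predictions):

```python
# Toy true/predicted multi-label matrices (rows = samples, columns = labels).
y_true = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [0, 1, 0]]

# Exact match: a sample counts only if every label bit is correct.
exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hamming loss: fraction of individual label bits that are wrong.
n_bits = len(y_true) * len(y_true[0])
wrong = sum(tb != pb for t, p in zip(y_true, y_pred) for tb, pb in zip(t, p))
hamming = wrong / n_bits

print(exact_match, hamming)  # 1/3 of samples match exactly; 2/9 of bits are wrong
```

This is why the two numbers can diverge: one wrong bit in a vector costs the whole sample under exact match, but only one ninth of the error budget here under Hamming loss.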
+ ### Per-Label Performance

+ | Label | Precision | Recall | F1 | Support |
+ |----------|-----------|--------|--------|---------|
+ | Trigger | 0.9623 | 0.8844 | 0.9217 | 173 |
+ | Casualty | 0.8819 | 0.8970 | 0.8894 | 233 |
+ | Weapon | 0.7568 | 0.5385 | 0.6292 | 52 |
+
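Macro F1 averages the three per-label F1 scores with equal weight, so the small, harder Weapon class (support 52, F1 0.6292) pulls it well below micro F1. A quick arithmetic check using the per-label figures from the table above:

```python
# Per-label F1 scores as reported in the per-label table.
f1_per_label = {"Trigger": 0.9217, "Casualty": 0.8894, "Weapon": 0.6292}

# Macro F1: unweighted mean over labels -- every class counts equally,
# regardless of how many test samples carry it.
macro_f1 = sum(f1_per_label.values()) / len(f1_per_label)
print(round(macro_f1, 4))  # 0.8134, matching the reported macro F1
```

Micro F1, by contrast, pools true/false positives and negatives across all labels before computing F1, so the frequent Trigger and Casualty classes dominate it.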
+ ### Training Details

+ - **Training samples**: 600 examples
+ - **Test samples**: 789 examples (held out before sampling)
+ - **Base model**: BAAI/bge-small-en-v1.5 (33.4M params)
+ - **Batch size**: 32
+ - **Epochs**: 1
+ - **Sampling strategy**: undersampling (balances positive/negative pairs)
+
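SetFit fine-tunes the embedding body on sentence pairs, and "undersampling" here means the majority pair type (usually negatives, i.e. pairs with different labels) is cut down to the minority's size so the contrastive batches stay balanced. A simplified, library-free sketch of that idea; `make_balanced_pairs` is illustrative only, and SetFit's real sampler in `setfit.sampler` differs in detail:

```python
import random
from itertools import combinations

def make_balanced_pairs(texts_with_labels, seed=42):
    """Build (a, b, same_label) pairs, then undersample the majority pair type."""
    pos, neg = [], []
    for (ta, la), (tb, lb) in combinations(texts_with_labels, 2):
        (pos if la == lb else neg).append((ta, tb, la == lb))
    rng = random.Random(seed)
    k = min(len(pos), len(neg))  # undersample: keep equal counts of each type
    return rng.sample(pos, k) + rng.sample(neg, k)

data = [("dam attacked", "casualty"), ("canal shelled", "casualty"),
        ("water cut off", "weapon"), ("pipeline bombed", "casualty")]
pairs = make_balanced_pairs(data)
print(len(pairs))  # 6 pairs: 3 positive + 3 negative
```

This balance is what lets contrastive fine-tuning work from only 600 labeled examples: the number of usable pairs grows quadratically with the sample count.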
+ ## Data Sources

+ ### Positive Examples (Water Conflict Headlines)
+ Pacific Institute (2025). *Water Conflict Chronology*. Pacific Institute, Oakland, CA.
+ https://www.worldwater.org/water-conflict/

+ ### Negative Examples (Non-Water-Conflict Headlines)
+ Armed Conflict Location & Event Data Project (ACLED).
+ https://acleddata.com/

+ ## About This Project

+ This model is part of experimental research supporting the Pacific Institute's Water Conflict Chronology project. The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history.

+ Learn more: https://www.worldwater.org/water-conflict/


  ## Citation

+ If you use this model in your work, please cite:
+
  ```bibtex
+ @misc{waterconflict2025,
+   title={Water Conflict Multi-Label Classifier},
+   author={{Experimental Research Supporting Pacific Institute Water Conflict Chronology}},
+   year={2025},
+   howpublished={\url{https://huggingface.co/baobabtech/water-conflict-classifier}},
+   note={Training data from Pacific Institute Water Conflict Chronology and ACLED}
  }
  ```

+ Please also cite the Pacific Institute's Water Conflict Chronology:

+ ```bibtex
+ @misc{pacificinstitute2025,
+   title={Water Conflict Chronology},
+   author={{Pacific Institute}},
+   year={2025},
+   address={Oakland, CA},
+   url={https://www.worldwater.org/water-conflict/},
+   note={Accessed: [access date]}
+ }
+ ```

+ **Recommended citation format:**
+ Pacific Institute (2025). *Water Conflict Chronology*. Pacific Institute, Oakland, CA. https://www.worldwater.org/water-conflict/. Accessed: [access date].