oliviermills committed on
Commit 41c491d · verified · 1 Parent(s): f0a8b65

Upload README.md with huggingface_hub

Files changed (1):
  README.md (+169 -168)
README.md CHANGED
@@ -1,203 +1,204 @@
  ---
- language: en
  license: cc-by-nc-4.0
  tags:
  - setfit
  - sentence-transformers
  - text-classification
- - generated_from_setfit_trainer
- widget:
- - text: Gaddafi cuts of water to Libya's capital
- - text: Grenade blast in water tank leaves 40 families without water in Potrerito,
-     Valle del Cauca, Colombia
- - text: Silvan Dam construction site attacked
- - text: in the afternoon, US forces destroy (likely through airstrikes) 2 suspected
-     Houthi patrol boats in an unidentified area in the South Red Sea while Houthi
-     media reported 3 air raids on As Salif coastal district (coded to As Salif Port)
-     (Al Hudaydah). Casaulties unknown.
- - text: a group of Fulani men clashed with and killed a suspected Fulani bull thief
-     in the Goure Kele district of Sakabansi (Nikki, Borgou). He was found dead in
-     his house after being struck with a machete during the clash by one of the members
-     of the group, who then fled.
  metrics:
  - accuracy
- pipeline_tag: text-classification
- library_name: setfit
- inference: false
- base_model: BAAI/bge-small-en-v1.5
  ---

- # SetFit with BAAI/bge-small-en-v1.5

- This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) as the Sentence Transformer embedding model. A OneVsRestClassifier instance is used for classification.

- The model has been trained using an efficient few-shot learning technique that involves:

- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.

- ## Model Details

- ### Model Description
- - **Model Type:** SetFit
- - **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- - **Classification head:** a OneVsRestClassifier instance
- - **Maximum Sequence Length:** 512 tokens
- - **Number of Classes:** 3 classes
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
- - **Language:** en
- - **License:** cc-by-nc-4.0

- ### Model Sources

- - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
- - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
- - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)

- ## Uses

- ### Direct Use for Inference

- First install the SetFit library:

- ```bash
- pip install setfit
- ```
-
- Then you can load this model and run inference.

  ```python
  from setfit import SetFitModel

- # Download from the 🤗 Hub
  model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
- # Run inference
- preds = model("Silvan Dam construction site attacked")
  ```

- <!--
- ### Downstream Use
-
- *List how someone could finetune this model on their own dataset.*
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Training Set Metrics
- | Training set | Min | Median | Max |
- |:-------------|:----|:--------|:----|
- | Word count | 4 | 25.9533 | 236 |
-
- ### Training Hyperparameters
- - batch_size: (32, 32)
- - num_epochs: (1, 1)
- - max_steps: -1
- - sampling_strategy: undersampling
- - body_learning_rate: (2e-05, 1e-05)
- - head_learning_rate: 0.01
- - loss: CosineSimilarityLoss
- - distance_metric: cosine_distance
- - margin: 0.25
- - end_to_end: False
- - use_amp: False
- - warmup_proportion: 0.1
- - l2_weight: 0.01
- - seed: 42
- - eval_max_steps: -1
- - load_best_model_at_end: True
-
- ### Training Results
- | Epoch | Step | Training Loss | Validation Loss |
- |:------:|:----:|:-------------:|:---------------:|
- | 0.0007 | 1 | 0.2168 | - |
- | 0.0339 | 50 | 0.2108 | - |
- | 0.0679 | 100 | 0.1126 | - |
- | 0.1018 | 150 | 0.0719 | - |
- | 0.1358 | 200 | 0.0616 | - |
- | 0.1697 | 250 | 0.0518 | - |
- | 0.2037 | 300 | 0.0454 | - |
- | 0.2376 | 350 | 0.0393 | - |
- | 0.2716 | 400 | 0.0324 | - |
- | 0.3055 | 450 | 0.0265 | - |
- | 0.3394 | 500 | 0.0279 | - |
- | 0.3734 | 550 | 0.0231 | - |
- | 0.4073 | 600 | 0.0231 | - |
- | 0.4413 | 650 | 0.0228 | - |
- | 0.4752 | 700 | 0.0272 | - |
- | 0.5092 | 750 | 0.0216 | - |
- | 0.5431 | 800 | 0.0186 | - |
- | 0.5771 | 850 | 0.0195 | - |
- | 0.6110 | 900 | 0.0174 | - |
- | 0.6449 | 950 | 0.0163 | - |
- | 0.6789 | 1000 | 0.0174 | - |
- | 0.7128 | 1050 | 0.0148 | - |
- | 0.7468 | 1100 | 0.0167 | - |
- | 0.7807 | 1150 | 0.0158 | - |
- | 0.8147 | 1200 | 0.0146 | - |
- | 0.8486 | 1250 | 0.0146 | - |
- | 0.8826 | 1300 | 0.0145 | - |
- | 0.9165 | 1350 | 0.0138 | - |
- | 0.9504 | 1400 | 0.0142 | - |
- | 0.9844 | 1450 | 0.013 | - |
- | 1.0 | 1473 | - | 0.0577 |
-
- ### Framework Versions
- - Python: 3.12.12
- - SetFit: 1.1.3
- - Sentence Transformers: 5.1.2
- - Transformers: 4.57.3
- - PyTorch: 2.9.1+cu128
- - Datasets: 4.4.1
- - Tokenizers: 0.22.1
-
- ## Citation
-
- ### BibTeX
- ```bibtex
- @article{https://doi.org/10.48550/arxiv.2209.11055,
-     doi = {10.48550/ARXIV.2209.11055},
-     url = {https://arxiv.org/abs/2209.11055},
-     author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
-     keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
-     title = {Efficient Few-Shot Learning Without Prompts},
-     publisher = {arXiv},
-     year = {2022},
-     copyright = {Creative Commons Attribution 4.0 International}
- }
  ```

- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->

- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->

- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
  ---
  license: cc-by-nc-4.0
+ library_name: setfit
  tags:
  - setfit
  - sentence-transformers
  - text-classification
+ - multi-label
+ - water-conflict
  metrics:
+ - f1
  - accuracy
+ language:
+ - en
+ widget:
+ - text: "Taliban attack workers at the Kajaki Dam in Afghanistan"
+ - text: "Violent protests erupt over dam construction in Sudan"
+ - text: "New water treatment plant opens in California"
+ - text: "ISIS cuts off water supply to villages in Syria"
+ - text: "Government announces new irrigation subsidies"
  ---

+ # Water Conflict Multi-Label Classifier
+
+ ## 🔬 Experimental Research
+
+ > **Note:** This is experimental research in support of the Pacific Institute's [Water Conflict Chronology](https://www.worldwater.org/water-conflict/) project, which tracks water-related conflicts spanning over 4,500 years of human history.
+
+ This model is designed to assist researchers in classifying water-related conflict events. The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history.
+
+ ## 📋 Model Description
+
+ This SetFit-based model classifies news headlines about water-related conflicts into three categories:
+
+ - **Trigger**: Water resource as a conflict trigger
+ - **Casualty**: Water infrastructure as a casualty/target
+ - **Weapon**: Water used as a weapon/tool
+
+ These categories align with the Pacific Institute's Water Conflict Chronology framework for understanding how water intersects with security and conflict.
+
+ ## 🏗️ Model Details
+
+ - **Base Model**: BAAI/bge-small-en-v1.5 (33.4M parameters)
+ - **Architecture**: SetFit with One-vs-Rest multi-label strategy
+ - **Training Approach**: Optimized for few-shot learning (SetFit reaches near-peak performance from small labeled sets)
+ - **Training samples**: 600 (sampled from a 4,468-example training pool)
+ - **Test samples**: 789 (held out, never seen during training)
+ - **Training time**: ~2-5 minutes on an A10G GPU
+ - **Model size**: ~130MB
+ - **Inference speed**: ~5-10ms per headline on CPU
+
+ ## 💻 Usage
+
+ ### Quick Start
+
  ```python
  from setfit import SetFitModel

+ # Load the model
  model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
+
+ # Predict on headlines
+ headlines = [
+     "Taliban attack workers at the Kajaki Dam in Afghanistan",
+     "New water treatment plant opens in California"
+ ]
+
+ predictions = model.predict(headlines)
+ print(predictions)
+ # Output: [[1, 1, 0], [0, 0, 0]]
+ # Format: [Trigger, Casualty, Weapon]
  ```
+
+ ### Interpreting Results
+
+ The model returns a list of binary predictions for each label:
+
+ ```python
+ label_names = ['Trigger', 'Casualty', 'Weapon']
+
+ for headline, pred in zip(headlines, predictions):
+     labels = [label_names[i] for i, val in enumerate(pred) if val == 1]
+     print(f"Headline: {headline}")
+     print(f"Labels: {', '.join(labels) if labels else 'None'}")
+     print()
+ ```
+
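The default binary output corresponds to a fixed 0.5 cut-off per label. When downstream filtering needs a different precision/recall balance, a custom threshold over per-label probabilities can be used instead. A minimal sketch, assuming the scikit-learn classification head exposes probabilities via `model.predict_proba`; the probability array below is hypothetical, not real model output:

```python
label_names = ['Trigger', 'Casualty', 'Weapon']

def labels_above(probs, threshold=0.5):
    """Return the label names whose probability clears the threshold."""
    return [name for name, p in zip(label_names, probs) if p >= threshold]

# probs = model.predict_proba(["ISIS cuts off water supply to villages in Syria"])[0]
probs = [0.91, 0.62, 0.08]  # hypothetical per-label probabilities for one headline

print(labels_above(probs))       # ['Trigger', 'Casualty']
print(labels_above(probs, 0.8))  # ['Trigger']
```

Raising the threshold trades recall for precision, which may help with the weaker `Weapon` label discussed in the evaluation below.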
+ ### Batch Processing
+
+ ```python
+ import pandas as pd
+
+ # Load your data
+ df = pd.read_csv("your_headlines.csv")
+
+ # Predict in batches
+ predictions = model.predict(df['headline'].tolist())
+
+ # Add predictions to the dataframe
+ df['trigger'] = [p[0] for p in predictions]
+ df['casualty'] = [p[1] for p in predictions]
+ df['weapon'] = [p[2] for p in predictions]
  ```
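For inputs too large to pass to the model in a single call, the headline list can be fed through chunk by chunk. A stdlib-only sketch of the chunking step, with `model.predict` assumed to be loaded as in Quick Start and a placeholder headline list standing in for real data:

```python
def chunked(items, size=256):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

headlines = [f"headline {i}" for i in range(1000)]  # placeholder data

batches = list(chunked(headlines, 256))
print(len(batches), len(batches[-1]))  # 4 232

# predictions = [p for batch in chunked(headlines, 256) for p in model.predict(batch)]
```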

+ ### Example Outputs
+
+ | Headline | Trigger | Casualty | Weapon |
+ |----------|---------|----------|--------|
+ | "ISIS militants blow up water pipeline in Iraq" | ✓ | ✓ | ✓ |
+ | "New water treatment plant opens in California" | ✗ | ✗ | ✗ |
+ | "Protests erupt over dam construction in Ethiopia" | ✓ | ✗ | ✗ |
+
+ ## Evaluation Results
+
+ Evaluated on a held-out test set of 789 samples (15% of total data, stratified by label combinations).
+
+ ### Overall Performance
+
+ | Metric | Score |
+ |--------|-------|
+ | Exact Match Accuracy | 0.9024 |
+ | Hamming Loss | 0.0469 |
+ | F1 (micro) | 0.8754 |
+ | F1 (macro) | 0.8134 |
+ | F1 (samples) | 0.4647 |
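For reference, the two headline metrics above can be written out directly: exact match counts a sample as correct only when all three labels agree, while Hamming loss counts individual label bits that disagree. A toy sketch on hypothetical 4-sample predictions (not the actual test set):

```python
# Toy multi-hot vectors, 3 labels per sample: [Trigger, Casualty, Weapon]
y_true = [[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 1]]
y_pred = [[1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0]]

# Exact match: the whole 3-label vector must agree.
exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hamming loss: fraction of individual label bits that disagree.
bits_wrong = sum(tb != pb for t, p in zip(y_true, y_pred) for tb, pb in zip(t, p))
hamming = bits_wrong / (len(y_true) * 3)

print(exact_match, hamming)  # 0.5 0.16666666666666666
```

The low F1 (samples) relative to F1 (micro) is typical when many test samples carry no positive labels at all, since per-sample F1 is zero for an empty label set.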

+ ### Per-Label Performance
+
+ | Label | Precision | Recall | F1 | Support |
+ |-------|-----------|--------|-----|---------|
+ | Trigger | 0.9623 | 0.8844 | 0.9217 | 173 |
+ | Casualty | 0.8819 | 0.8970 | 0.8894 | 233 |
+ | Weapon | 0.7568 | 0.5385 | 0.6292 | 52 |
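Each per-label F1 is the harmonic mean of the precision and recall in the same row, so the table can be sanity-checked directly (tiny differences in the last digit are expected, since the reported values come from raw counts rather than the rounded P/R shown):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

trigger_f1 = f1(0.9623, 0.8844)   # ~0.9217
casualty_f1 = f1(0.8819, 0.8970)  # ~0.8894
weapon_f1 = f1(0.7568, 0.5385)    # ~0.6293
```

The `Weapon` label's F1 is dragged down mainly by recall (0.5385) and its small support (52 examples).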

+ ### Training Details
+
+ - **Training samples**: 600 examples
+ - **Test samples**: 789 examples (held out before sampling)
+ - **Base model**: BAAI/bge-small-en-v1.5 (33.4M params)
+ - **Batch size**: 32
+ - **Epochs**: 1
+ - **Sampling strategy**: undersampling (balances positive/negative pairs)
+
+ ## 📊 Data Sources
+
+ ### Positive Examples (Water Conflict Headlines)
+ Pacific Institute (2025). *Water Conflict Chronology*. Pacific Institute, Oakland, CA.
+ https://www.worldwater.org/water-conflict/
+
+ ### Negative Examples (Non-Water Conflict Headlines)
+ Armed Conflict Location & Event Data Project (ACLED).
+ https://acleddata.com/
+
+ ## 🌍 About This Project
+
+ This model is part of experimental research supporting the Pacific Institute's Water Conflict Chronology, the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history.
+
+ Learn more: https://www.worldwater.org/water-conflict/
+
+ ## 📜 License
+
+ Copyright © 2025 Baobab Tech
+
+ This work is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](http://creativecommons.org/licenses/by-nc/4.0/).
+
+ **You are free to:**
+ - **Share** — copy and redistribute the material in any medium or format
+ - **Adapt** — remix, transform, and build upon the material
+
+ **Under the following terms:**
+ - **Attribution** — You must give appropriate credit to Baobab Tech, provide a link to the license, and indicate if changes were made
+ - **NonCommercial** — You may not use the material for commercial purposes
+
+ For commercial licensing inquiries, please contact Baobab Tech.
+
+ ## 📝 Citation
+
+ If you use this model in your work, please cite:
+
+ ```bibtex
+ @misc{waterconflict2025,
+   title={Water Conflict Multi-Label Classifier},
+   author={Experimental Research Supporting Pacific Institute Water Conflict Chronology},
+   year={2025},
+   howpublished={\url{https://huggingface.co/baobabtech/water-conflict-classifier}},
+   note={Training data from Pacific Institute Water Conflict Chronology and ACLED}
+ }
+ ```
+
+ Please also cite the Pacific Institute's Water Conflict Chronology:
+
+ ```bibtex
+ @misc{pacificinstitute2025,
+   title={Water Conflict Chronology},
+   author={Pacific Institute},
+   year={2025},
+   address={Oakland, CA},
+   url={https://www.worldwater.org/water-conflict/},
+   note={Accessed: [access date]}
+ }
+ ```
+
+ **Recommended citation format:**
+ Pacific Institute (2025). *Water Conflict Chronology*. Pacific Institute, Oakland, CA. https://www.worldwater.org/water-conflict/. Accessed: [access date].