oliviermills committed
Commit 78e4134 · verified · 1 Parent(s): 6505182

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +234 -159

README.md CHANGED
@@ -1,197 +1,272 @@
  ---
- language: en
  license: cc-by-nc-4.0
  tags:
  - setfit
  - sentence-transformers
  - text-classification
- - generated_from_setfit_trainer
- widget:
- - text: Scientists use satellite data to track changes in groundwater levels across
-     continent
- - text: in Aye Chan Thar San Pya village (Tatkon township, Oke Ta Ra district, Nay
-     Pyi Taw), the Myanmar military arrived in the village at around 3 am to conscript
-     a man who was selected for military service through the draw. Subsequently, the
-     military arrested his 15-year-old brother as a substitute as the man had escaped
-     to Thailand.
- - text: 600 people, including members of the Constitution Cooperation Centre, Zenroren,
-     Peace Boat, Anti-War Committee of 1000, Christian Peace Network, Women's Democratic
-     Club, and Fudanren, Diet members form the JCP and SDPJ, protested in front of
-     the Diet in Tokyo - Chiyoda to call for an investigation into Prime Minister Shigeru
-     Ishiba's voucher distribution scandal and demanded a halt to large-scale military
-     expansion. Protesters also advocated for the introduction of optional separate
-     surnames for married couples and called for full implementation of the current
-     constitution. The Don't Allow War! Don't Break Article 9! Total Action Executive
-     Committee and NO! To Article 9 Amendment - Citizens Action Shiga organized the
-     event.
- - text: One person killed in explosion of water well mined with an IED by JNIM militants
-     in Sorga, Burkina Faso
- - text: Water wells are poisoned during the US Civil War
  metrics:
  - accuracy
- pipeline_tag: text-classification
- library_name: setfit
- inference: false
- base_model: sentence-transformers/all-MiniLM-L6-v2
  ---

- # SetFit with sentence-transformers/all-MiniLM-L6-v2

- This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) as the Sentence Transformer embedding model. A OneVsRestClassifier instance is used for classification.

- The model has been trained using an efficient few-shot learning technique that involves:

- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.

- ## Model Details

- ### Model Description
- - **Model Type:** SetFit
- - **Sentence Transformer body:** [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- - **Classification head:** a OneVsRestClassifier instance
- - **Maximum Sequence Length:** 256 tokens
- - **Number of Classes:** 3 classes
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
- - **Language:** en
- - **License:** cc-by-nc-4.0

- ### Model Sources

- - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
- - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
- - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)

- ## Uses

- ### Direct Use for Inference

- First install the SetFit library:

  ```bash
- pip install setfit
  ```

- Then you can load this model and run inference.

  ```python
  from setfit import SetFitModel

- # Download from the 🤗 Hub
  model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier-minilm")
- # Run inference
- preds = model("Water wells are poisoned during the US Civil War")
  ```

- <!--
- ### Downstream Use
-
- *List how someone could finetune this model on their own dataset.*
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Training Set Metrics
- | Training set | Min | Median  | Max |
- |:-------------|:----|:--------|:----|
- | Word count   | 4   | 24.2225 | 154 |
-
- ### Training Hyperparameters
- - batch_size: (64, 64)
- - num_epochs: (1, 1)
- - max_steps: -1
- - sampling_strategy: undersampling
- - num_iterations: 20
- - body_learning_rate: (2e-05, 1e-05)
- - head_learning_rate: 0.01
- - loss: CosineSimilarityLoss
- - distance_metric: cosine_distance
- - margin: 0.25
- - end_to_end: False
- - use_amp: False
- - warmup_proportion: 0.1
- - l2_weight: 0.01
- - seed: 42
- - eval_max_steps: -1
- - load_best_model_at_end: True
-
- ### Training Results
- | Epoch  | Step | Training Loss | Validation Loss |
- |:------:|:----:|:-------------:|:---------------:|
- | 0.0013 | 1    | 0.2372        | -               |
- | 0.0667 | 50   | 0.198         | -               |
- | 0.1333 | 100  | 0.0912        | -               |
- | 0.2    | 150  | 0.0688        | -               |
- | 0.2667 | 200  | 0.0586        | -               |
- | 0.3333 | 250  | 0.0585        | -               |
- | 0.4    | 300  | 0.0566        | -               |
- | 0.4667 | 350  | 0.0504        | -               |
- | 0.5333 | 400  | 0.0469        | -               |
- | 0.6    | 450  | 0.0458        | -               |
- | 0.6667 | 500  | 0.0479        | -               |
- | 0.7333 | 550  | 0.0416        | -               |
- | 0.8    | 600  | 0.0393        | -               |
- | 0.8667 | 650  | 0.0404        | -               |
- | 0.9333 | 700  | 0.0384        | -               |
- | 1.0    | 750  | 0.0383        | 0.1005          |
-
- ### Framework Versions
- - Python: 3.12.12
- - SetFit: 1.1.3
- - Sentence Transformers: 5.1.2
- - Transformers: 4.57.3
- - PyTorch: 2.9.1+cu128
- - Datasets: 4.4.1
- - Tokenizers: 0.22.1
-
- ## Citation
-
- ### BibTeX
- ```bibtex
- @article{https://doi.org/10.48550/arxiv.2209.11055,
-     doi = {10.48550/ARXIV.2209.11055},
-     url = {https://arxiv.org/abs/2209.11055},
-     author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
-     keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
-     title = {Efficient Few-Shot Learning Without Prompts},
-     publisher = {arXiv},
-     year = {2022},
-     copyright = {Creative Commons Attribution 4.0 International}
- }
  ```

- <!--
- ## Glossary

- *Clearly define terms in order to be accessible across audiences.*
- -->

- <!--
- ## Model Card Authors

- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->

- <!--
- ## Model Card Contact

- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
 
  ---
  license: cc-by-nc-4.0
+ library_name: setfit
  tags:
  - setfit
  - sentence-transformers
  - text-classification
+ - multi-label
+ - water-conflict
  metrics:
+ - f1
  - accuracy
+ language:
+ - en
+ widget:
+ - text: "Taliban attack workers at the Kajaki Dam in Afghanistan"
+ - text: "Violent protests erupt over dam construction in Sudan"
+ - text: "New water treatment plant opens in California"
+ - text: "ISIS cuts off water supply to villages in Syria"
+ - text: "Government announces new irrigation subsidies"
  ---

+ # Water Conflict Multi-Label Classifier
+
+ ## 🔬 Experimental Research
+
+ > **Note:** This experimental research draws on the Pacific Institute's [Water Conflict Chronology](https://www.worldwater.org/water-conflict/), which tracks water-related conflicts spanning over 4,500 years of human history. The work is conducted independently and is not affiliated with the Pacific Institute.
+
+ This model is designed to assist researchers in classifying water-related conflict events.
+
+ The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history. This is not a commercial product and is not intended for commercial use.
+
+ ## 🌱 Frugal AI: Training with Limited Data
+
+ This classifier demonstrates an intentional approach to building AI systems with **limited data** using [SetFit](https://huggingface.co/docs/setfit/en/index), a framework for few-shot learning with sentence transformers. Rather than defaulting to massive language models (GPT, Claude, or 100B+ parameter models) for simple classification tasks, we fine-tune small, efficient models (e.g., BAAI/bge-small-en-v1.5 with ~33M parameters) on a focused dataset.
+
+ **Why this matters:** The industry has normalized using trillion-parameter models to classify headlines, answer simple questions, or categorize text: tasks that don't require world knowledge, reasoning, or generative capabilities. This is computationally wasteful and environmentally costly. A properly fine-tuned small model can achieve comparable or better accuracy while using a fraction of the compute resources.
+
+ **Our approach:**
+ - Train on ~600 examples (few-shot learning with SetFit)
+ - Deploy small models (e.g., ~33M parameters) instead of 100B-1T parameter alternatives
+ - Achieve specialized task performance without the overhead of general-purpose LLMs
+ - Reduce inference costs and latency by orders of magnitude
+
+ This is not about avoiding large models altogether; they're invaluable for complex reasoning tasks. But for targeted classification problems with labeled data, fine-tuning remains the professional, responsible choice.
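SetFit's two-stage recipe (embed each sentence, then fit a lightweight classification head) can be sketched with scikit-learn. This is an illustrative sketch only, not the SetFit API: random vectors stand in for the fine-tuned MiniLM sentence embeddings, and the multi-hot targets are hand-made.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
# Stand-ins for sentence embeddings (384-dim, like all-MiniLM-L6-v2 outputs)
X = rng.normal(size=(16, 384))
# Hand-made multi-hot targets over [Trigger, Casualty, Weapon];
# each column contains both classes so every per-label classifier can fit
y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]] * 4)

# One binary logistic-regression head per label (One-vs-Rest)
head = OneVsRestClassifier(LogisticRegression(max_iter=1000))
head.fit(X, y)

preds = head.predict(X[:2])
print(preds.shape)  # one binary decision per label for each input
```

Because the head is cheap to train, most of the compute budget goes into the one-off contrastive fine-tuning of the small embedding body.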

+ ## 📋 Model Description
+
+ This SetFit-based model classifies news headlines about water-related conflicts into three categories:
+
+ - **Trigger**: Water resource as a conflict trigger
+ - **Casualty**: Water infrastructure as a casualty/target
+ - **Weapon**: Water used as a weapon/tool
+
+ These categories align with the Pacific Institute's Water Conflict Chronology framework for understanding how water intersects with security and conflict.
+
+ ## 🏗️ Model Details
+
+ - **Base Model**: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
+ - **Architecture**: SetFit with a One-vs-Rest multi-label strategy
+ - **Training Approach**: Optimized for few-shot learning (SetFit reaches peak performance with small samples)
+ - **Training samples**: 1,200 (sampled from a total training pool of 2,937)
+ - **Test samples**: 519 (held out, never seen during training)
+ - **Training time**: ~2-5 minutes on an A10G GPU
+ - **Model size**: 23M parameters, ~90MB
+ - **Inference speed**: ~5-10ms per headline on CPU
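As a rough sanity check (our arithmetic, not a figure from the card), the ~90MB checkpoint size is about what 23M float32 parameters should occupy:

```python
params = 23_000_000      # approximate parameter count of all-MiniLM-L6-v2
bytes_per_param = 4      # float32 weights
size_mb = params * bytes_per_param / 1e6
print(round(size_mb))    # ~92 MB, consistent with the ~90MB checkpoint
```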

+ ## 💻 Usage
+
+ ### Installation
+
+ The training code is published as a Python package on PyPI:

  ```bash
+ pip install water-conflict-classifier
  ```

+ **Package includes:**
+ - Data preprocessing utilities
+ - Training logic (SetFit multi-label)
+ - Evaluation metrics
+ - Model card generation
+
+ - **Source code:** https://github.com/baobabtech/waterconflict/tree/main/classifier
+ - **PyPI:** https://pypi.org/project/water-conflict-classifier/
+
+ ### Quick Start

  ```python
  from setfit import SetFitModel

+ # Load the trained model from the HF Hub
  model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier-minilm")
+
+ # Predict on headlines
+ headlines = [
+     "Taliban attack workers at the Kajaki Dam in Afghanistan",
+     "New water treatment plant opens in California"
+ ]
+
+ predictions = model.predict(headlines)
+ print(predictions)
+ # Output: [[1, 1, 0], [0, 0, 0]]
+ # Format: [Trigger, Casualty, Weapon]
  ```

+ ### Interpreting Results
+
+ The model returns a list of binary predictions for each label:
+
+ ```python
+ label_names = ['Trigger', 'Casualty', 'Weapon']
+
+ for headline, pred in zip(headlines, predictions):
+     labels = [label_names[i] for i, val in enumerate(pred) if val == 1]
+     print(f"Headline: {headline}")
+     print(f"Labels: {', '.join(labels) if labels else 'None'}")
+     print()
+ ```
+
+ ### Batch Processing
+
+ ```python
+ import pandas as pd
+
+ # Load your data
+ df = pd.read_csv("your_headlines.csv")
+
+ # Predict in batches
+ predictions = model.predict(df['headline'].tolist())
+
+ # Add predictions to the dataframe
+ df['trigger'] = [p[0] for p in predictions]
+ df['casualty'] = [p[1] for p in predictions]
+ df['weapon'] = [p[2] for p in predictions]
+ ```
+
+ ### Example Outputs
+
+ | Headline | Trigger | Casualty | Weapon |
+ |----------|---------|----------|--------|
+ | "ISIS militants blow up water pipeline in Iraq" | ✓ | ✓ | ✓ |
+ | "New water treatment plant opens in California" | ✗ | ✗ | ✗ |
+ | "Protests erupt over dam construction in Ethiopia" | ✓ | ✗ | ✗ |
+
+ ## Evaluation Results
+
+ Evaluated on a held-out test set of 519 samples (15% of the total data, stratified by label combination).
+
+ ### Overall Performance
+
+ | Metric | Score |
+ |--------|-------|
+ | Exact Match Accuracy | 0.8382 |
+ | Hamming Loss | 0.0745 |
+ | F1 (micro) | 0.8731 |
+ | F1 (macro) | 0.7857 |
+ | F1 (samples) | 0.7039 |
+
+ ### Per-Label Performance
+
+ | Label | Precision | Recall | F1 | Support |
+ |-------|-----------|--------|-----|---------|
+ | Trigger | 0.8914 | 0.9017 | 0.8966 | 173 |
+ | Casualty | 0.8835 | 0.9442 | 0.9129 | 233 |
+ | Weapon | 0.7187 | 0.4423 | 0.5476 | 52 |
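These metrics can all be computed with scikit-learn. A minimal sketch on a toy set of four label vectors (illustrative values, not the actual test set):

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, f1_score

# Toy ground truth and predictions over [Trigger, Casualty, Weapon]
y_true = np.array([[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 1]])
y_pred = np.array([[1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0]])

exact_match = accuracy_score(y_true, y_pred)   # every label must match per row
hamming = hamming_loss(y_true, y_pred)         # fraction of wrong label slots
f1_micro = f1_score(y_true, y_pred, average="micro")
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(exact_match, hamming, f1_micro, f1_macro)
```

Note how the two accuracy-style metrics diverge: two of the four rows have a single wrong slot, so exact match drops to 0.5 while Hamming loss is only 2/12, the same pattern (0.8382 vs. 0.0745) seen in the table above.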

+ ### Training Details
+
+ - **Training samples**: 1,200 examples
+ - **Test samples**: 519 examples (held out before sampling)
+ - **Base model**: sentence-transformers/all-MiniLM-L6-v2 (23M params)
+ - **Batch size**: 64
+ - **Epochs**: 1
+ - **Iterations**: 20 (contrastive pair generation)
+ - **Sampling strategy**: undersampling (balances positive/negative pairs)
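The "iterations" and "undersampling" settings refer to SetFit's contrastive pair generation: labeled sentences are paired up, same-label pairs become positives and mixed-label pairs negatives, and the larger group is undersampled so the two are balanced. A stdlib-only sketch of the idea (a simplified single-label view; the function and names are illustrative, not the SetFit API):

```python
import random
from itertools import combinations

def generate_pairs(examples, seed=42):
    """Pair labeled texts: same label -> positive (1), different -> negative (0),
    then undersample the larger group so positives and negatives are balanced."""
    pos, neg = [], []
    for (t1, l1), (t2, l2) in combinations(examples, 2):
        (pos if l1 == l2 else neg).append((t1, t2, 1 if l1 == l2 else 0))
    rng = random.Random(seed)
    k = min(len(pos), len(neg))
    return rng.sample(pos, k) + rng.sample(neg, k)

examples = [
    ("Protests erupt over dam construction", "trigger"),
    ("Clashes break out over irrigation water", "trigger"),
    ("Militants blow up a water pipeline", "casualty"),
    ("Shelling destroys a treatment plant", "casualty"),
]
pairs = generate_pairs(examples)
print(len(pairs))  # 2 positives + 2 of the 4 negatives
```

Pairing is why SetFit stretches small datasets so far: n labeled sentences yield on the order of n² contrastive training pairs for the embedding body.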

+ ### 📈 Experiment Tracking
+
+ All training runs are automatically tracked in a public dataset for experiment comparison:
+
+ - **Evals Dataset**: [baobabtech/water-conflict-classifier-evals](https://huggingface.co/datasets/baobabtech/water-conflict-classifier-evals)
+ - **Tracked Metrics**: F1 scores, accuracy, per-label performance, and all hyperparameters
+ - **Compare Experiments**: View how different configurations (sample size, epochs, batch size) affect performance
+ - **Reproducibility**: Full training configs logged for each version
+
+ You can explore past experiments and compare model performance across versions using the evals dataset.
+
+ ## 📊 Data Sources
+
+ ### Positive Examples (Water Conflict Headlines)
+ Pacific Institute (2025). *Water Conflict Chronology*. Pacific Institute, Oakland, CA.
+ https://www.worldwater.org/water-conflict/
+
+ ### Negative Examples (Non-Water-Conflict Headlines)
+ Armed Conflict Location & Event Data Project (ACLED).
+ https://acleddata.com/
+
+ **Note:** Training negatives include synthetic "hard negatives": peaceful water-related news (e.g., "New desalination plant opens", "Water conservation conference") to prevent false positives on non-conflict water topics.
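The mix of sources can be pictured as plain training rows; the label order `[Trigger, Casualty, Weapon]` follows the card, while the specific rows and their labels here are illustrative only:

```python
# Positives from the Water Conflict Chronology (multi-hot labels, illustrative)
positives = [
    {"text": "ISIS cuts off water supply to villages in Syria", "labels": [0, 0, 1]},
    {"text": "Protests erupt over dam construction in Ethiopia", "labels": [1, 0, 0]},
]

# Negatives from ACLED: conflict news that is not about water
acled_negatives = [
    {"text": "Armed group ambushes convoy on a highway", "labels": [0, 0, 0]},
]

# Synthetic hard negatives: water-related but peaceful, so all labels are 0
hard_negatives = [
    {"text": "New desalination plant opens", "labels": [0, 0, 0]},
    {"text": "Water conservation conference announced", "labels": [0, 0, 0]},
]

train_rows = positives + acled_negatives + hard_negatives
print(len(train_rows))
```

The hard negatives force the model to separate water *conflict* from water *news*, rather than keying on the word "water" alone.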

+ ## 🌍 About This Project
+
+ This model is part of independent experimental research drawing on the Pacific Institute's Water Conflict Chronology. The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history.
+
+ **Project Links:**
+ - Pacific Institute Water Conflict Chronology: https://www.worldwater.org/water-conflict/
+ - Python Package (PyPI): https://pypi.org/project/water-conflict-classifier/
+ - Source Code: https://github.com/baobabtech/waterconflict
+ - Model Hub: https://huggingface.co/baobabtech/water-conflict-classifier-minilm
+
+ ### Training Your Own Model
+
+ You can train your own version using the published package:
+
+ ```bash
+ # Install the package
+ pip install water-conflict-classifier
+
+ # Or install from source for development
+ git clone https://github.com/baobabtech/waterconflict.git
+ cd waterconflict/classifier
+ pip install -e .
+
+ # Train locally
+ python train_setfit_headline_classifier.py
  ```

+ For cloud training on HuggingFace Jobs infrastructure, see the scripts folder in the repository.
+
+ ## 📜 License
+
+ Copyright © 2025 Baobab Tech
+
+ This work is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](http://creativecommons.org/licenses/by-nc/4.0/).
+
+ **You are free to:**
+ - **Share**: copy and redistribute the material in any medium or format
+ - **Adapt**: remix, transform, and build upon the material
+
+ **Under the following terms:**
+ - **Attribution**: You must give appropriate credit to Baobab Tech, provide a link to the license, and indicate if changes were made
+ - **NonCommercial**: You may not use the material for commercial purposes
+
+ ## 📝 Citation
+
+ If you use this model in your work, please cite:
+
+ ```bibtex
+ @misc{waterconflict2025,
+     title={Water Conflict Multi-Label Classifier},
+     author={Independent Experimental Research Drawing on Pacific Institute Water Conflict Chronology},
+     year={2025},
+     howpublished={\url{https://huggingface.co/baobabtech/water-conflict-classifier-minilm}},
+     note={Training data from Pacific Institute Water Conflict Chronology and ACLED}
+ }
+ ```
+
+ Please also cite the Pacific Institute's Water Conflict Chronology:
+
+ ```bibtex
+ @misc{pacificinstitute2025,
+     title={Water Conflict Chronology},
+     author={Pacific Institute},
+     year={2025},
+     address={Oakland, CA},
+     url={https://www.worldwater.org/water-conflict/},
+     note={Accessed: [access date]}
+ }
+ ```