oliviermills committed on
Commit 41c491d · verified · 1 Parent(s): f0a8b65

Upload README.md with huggingface_hub

Files changed (1):
  README.md (+169 -168)
README.md CHANGED
@@ -1,203 +1,204 @@
  ---
- language: en
  license: cc-by-nc-4.0
  tags:
  - setfit
  - sentence-transformers
  - text-classification
- - generated_from_setfit_trainer
- widget:
- - text: Gaddafi cuts of water to Libya's capital
- - text: Grenade blast in water tank leaves 40 families without water in Potrerito,
-     Valle del Cauca, Colombia
- - text: Silvan Dam construction site attacked
- - text: in the afternoon, US forces destroy (likely through airstrikes) 2 suspected
-     Houthi patrol boats in an unidentified area in the South Red Sea while Houthi
-     media reported 3 air raids on As Salif coastal district (coded to As Salif Port)
-     (Al Hudaydah). Casaulties unknown.
- - text: a group of Fulani men clashed with and killed a suspected Fulani bull thief
-     in the Goure Kele district of Sakabansi (Nikki, Borgou). He was found dead in
-     his house after being struck with a machete during the clash by one of the members
-     of the group, who then fled.
  metrics:
  - accuracy
- pipeline_tag: text-classification
- library_name: setfit
- inference: false
- base_model: BAAI/bge-small-en-v1.5
  ---

- # SetFit with BAAI/bge-small-en-v1.5

- This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) as the Sentence Transformer embedding model. A OneVsRestClassifier instance is used for classification.

- The model has been trained using an efficient few-shot learning technique that involves:

- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.

- ## Model Details

- ### Model Description
- - **Model Type:** SetFit
- - **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- - **Classification head:** a OneVsRestClassifier instance
- - **Maximum Sequence Length:** 512 tokens
- - **Number of Classes:** 3 classes
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
- - **Language:** en
- - **License:** cc-by-nc-4.0

- ### Model Sources

- - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
- - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
- - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)

- ## Uses

- ### Direct Use for Inference

- First install the SetFit library:

- ```bash
- pip install setfit
- ```
-
- Then you can load this model and run inference.

  ```python
  from setfit import SetFitModel

- # Download from the 🤗 Hub
  model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
- # Run inference
- preds = model("Silvan Dam construction site attacked")
  ```

- <!--
- ### Downstream Use
-
- *List how someone could finetune this model on their own dataset.*
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Training Set Metrics
- | Training set | Min | Median | Max |
- |:-------------|:----|:--------|:----|
- | Word count | 4 | 25.9533 | 236 |
-
- ### Training Hyperparameters
- - batch_size: (32, 32)
- - num_epochs: (1, 1)
- - max_steps: -1
- - sampling_strategy: undersampling
- - body_learning_rate: (2e-05, 1e-05)
- - head_learning_rate: 0.01
- - loss: CosineSimilarityLoss
- - distance_metric: cosine_distance
- - margin: 0.25
- - end_to_end: False
- - use_amp: False
- - warmup_proportion: 0.1
- - l2_weight: 0.01
- - seed: 42
- - eval_max_steps: -1
- - load_best_model_at_end: True
-
- ### Training Results
- | Epoch | Step | Training Loss | Validation Loss |
- |:------:|:----:|:-------------:|:---------------:|
- | 0.0007 | 1 | 0.2168 | - |
- | 0.0339 | 50 | 0.2108 | - |
- | 0.0679 | 100 | 0.1126 | - |
- | 0.1018 | 150 | 0.0719 | - |
- | 0.1358 | 200 | 0.0616 | - |
- | 0.1697 | 250 | 0.0518 | - |
- | 0.2037 | 300 | 0.0454 | - |
- | 0.2376 | 350 | 0.0393 | - |
- | 0.2716 | 400 | 0.0324 | - |
- | 0.3055 | 450 | 0.0265 | - |
- | 0.3394 | 500 | 0.0279 | - |
- | 0.3734 | 550 | 0.0231 | - |
- | 0.4073 | 600 | 0.0231 | - |
- | 0.4413 | 650 | 0.0228 | - |
- | 0.4752 | 700 | 0.0272 | - |
- | 0.5092 | 750 | 0.0216 | - |
- | 0.5431 | 800 | 0.0186 | - |
- | 0.5771 | 850 | 0.0195 | - |
- | 0.6110 | 900 | 0.0174 | - |
- | 0.6449 | 950 | 0.0163 | - |
- | 0.6789 | 1000 | 0.0174 | - |
- | 0.7128 | 1050 | 0.0148 | - |
- | 0.7468 | 1100 | 0.0167 | - |
- | 0.7807 | 1150 | 0.0158 | - |
- | 0.8147 | 1200 | 0.0146 | - |
- | 0.8486 | 1250 | 0.0146 | - |
- | 0.8826 | 1300 | 0.0145 | - |
- | 0.9165 | 1350 | 0.0138 | - |
- | 0.9504 | 1400 | 0.0142 | - |
- | 0.9844 | 1450 | 0.013 | - |
- | 1.0 | 1473 | - | 0.0577 |
-
- ### Framework Versions
- - Python: 3.12.12
- - SetFit: 1.1.3
- - Sentence Transformers: 5.1.2
- - Transformers: 4.57.3
- - PyTorch: 2.9.1+cu128
- - Datasets: 4.4.1
- - Tokenizers: 0.22.1
-
- ## Citation
-
- ### BibTeX
- ```bibtex
- @article{https://doi.org/10.48550/arxiv.2209.11055,
-     doi = {10.48550/ARXIV.2209.11055},
-     url = {https://arxiv.org/abs/2209.11055},
-     author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
-     keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
-     title = {Efficient Few-Shot Learning Without Prompts},
-     publisher = {arXiv},
-     year = {2022},
-     copyright = {Creative Commons Attribution 4.0 International}
- }
  ```

- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->

- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->

- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
  ---
  license: cc-by-nc-4.0
+ library_name: setfit
  tags:
  - setfit
  - sentence-transformers
  - text-classification
+ - multi-label
+ - water-conflict
  metrics:
+ - f1
  - accuracy
+ language:
+ - en
+ widget:
+ - text: "Taliban attack workers at the Kajaki Dam in Afghanistan"
+ - text: "Violent protests erupt over dam construction in Sudan"
+ - text: "New water treatment plant opens in California"
+ - text: "ISIS cuts off water supply to villages in Syria"
+ - text: "Government announces new irrigation subsidies"
  ---

+ # Water Conflict Multi-Label Classifier
+
+ ## 🔬 Experimental Research
+
+ > **Note:** This is experimental research in support of the Pacific Institute's [Water Conflict Chronology](https://www.worldwater.org/water-conflict/) project, which tracks water-related conflicts spanning over 4,500 years of human history.
+
+ This model is designed to assist researchers in classifying water-related conflict events. The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history.
+
+ ## 📋 Model Description
+
+ This SetFit-based model classifies news headlines about water-related conflicts into three categories:
+
+ - **Trigger**: Water resource as a conflict trigger
+ - **Casualty**: Water infrastructure as a casualty/target
+ - **Weapon**: Water used as a weapon/tool
+
+ These categories align with the Pacific Institute's Water Conflict Chronology framework for understanding how water intersects with security and conflict.
+
+ ## 🏗️ Model Details
+
+ - **Base Model**: BAAI/bge-small-en-v1.5 (33.4M parameters)
+ - **Architecture**: SetFit with One-vs-Rest multi-label strategy
+ - **Training Approach**: Optimized for few-shot learning (SetFit reaches near-peak performance from small labeled sets)
+ - **Training samples**: 600 (sampled from a 4,468-example training pool)
+ - **Test samples**: 789 (held out, never seen during training)
+ - **Training time**: ~2-5 minutes on an A10G GPU
+ - **Model size**: ~130MB
+ - **Inference speed**: ~5-10ms per headline on CPU
+
+ ## 💻 Usage
+
+ ### Quick Start
+
  ```python
  from setfit import SetFitModel

+ # Load the model
  model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
+
+ # Predict on headlines
+ headlines = [
+     "Taliban attack workers at the Kajaki Dam in Afghanistan",
+     "New water treatment plant opens in California"
+ ]
+
+ predictions = model.predict(headlines)
+ print(predictions)
+ # Output: [[1, 1, 0], [0, 0, 0]]
+ # Format: [Trigger, Casualty, Weapon]
  ```
+
+ ### Interpreting Results
+
+ The model returns a list of binary predictions for each label:
+
+ ```python
+ label_names = ['Trigger', 'Casualty', 'Weapon']
+
+ for headline, pred in zip(headlines, predictions):
+     labels = [label_names[i] for i, val in enumerate(pred) if val == 1]
+     print(f"Headline: {headline}")
+     print(f"Labels: {', '.join(labels) if labels else 'None'}")
+     print()
+ ```
+
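The default binary output corresponds to a fixed 0.5 cut-off per label. When downstream filtering needs a different precision/recall balance, a custom threshold over per-label probabilities can be used instead. A minimal sketch, assuming the scikit-learn classification head exposes probabilities via `model.predict_proba`; the probability array below is hypothetical, not real model output:

```python
label_names = ['Trigger', 'Casualty', 'Weapon']

def labels_above(probs, threshold=0.5):
    """Return the label names whose probability clears the threshold."""
    return [name for name, p in zip(label_names, probs) if p >= threshold]

# probs = model.predict_proba(["ISIS cuts off water supply to villages in Syria"])[0]
probs = [0.91, 0.62, 0.08]  # hypothetical per-label probabilities for one headline

print(labels_above(probs))       # ['Trigger', 'Casualty']
print(labels_above(probs, 0.8))  # ['Trigger']
```

Raising the threshold trades recall for precision, which may help with the weaker `Weapon` label discussed in the evaluation below.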
+ ### Batch Processing
+
+ ```python
+ import pandas as pd
+
+ # Load your data
+ df = pd.read_csv("your_headlines.csv")
+
+ # Predict in batches
+ predictions = model.predict(df['headline'].tolist())
+
+ # Add predictions to the dataframe
+ df['trigger'] = [p[0] for p in predictions]
+ df['casualty'] = [p[1] for p in predictions]
+ df['weapon'] = [p[2] for p in predictions]
  ```
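For inputs too large to pass to the model in a single call, the headline list can be fed through chunk by chunk. A stdlib-only sketch of the chunking step, with `model.predict` assumed to be loaded as in Quick Start and a placeholder headline list standing in for real data:

```python
def chunked(items, size=256):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

headlines = [f"headline {i}" for i in range(1000)]  # placeholder data

batches = list(chunked(headlines, 256))
print(len(batches), len(batches[-1]))  # 4 232

# predictions = [p for batch in chunked(headlines, 256) for p in model.predict(batch)]
```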

+ ### Example Outputs
+
+ | Headline | Trigger | Casualty | Weapon |
+ |----------|---------|----------|--------|
+ | "ISIS militants blow up water pipeline in Iraq" | ✓ | ✓ | ✓ |
+ | "New water treatment plant opens in California" | ✗ | ✗ | ✗ |
+ | "Protests erupt over dam construction in Ethiopia" | ✓ | ✗ | ✗ |
+
+ ## Evaluation Results
+
+ Evaluated on a held-out test set of 789 samples (15% of total data, stratified by label combinations).
+
+ ### Overall Performance
+
+ | Metric | Score |
+ |--------|-------|
+ | Exact Match Accuracy | 0.9024 |
+ | Hamming Loss | 0.0469 |
+ | F1 (micro) | 0.8754 |
+ | F1 (macro) | 0.8134 |
+ | F1 (samples) | 0.4647 |
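For reference, the two headline metrics above can be written out directly: exact match counts a sample as correct only when all three labels agree, while Hamming loss counts individual label bits that disagree. A toy sketch on hypothetical 4-sample predictions (not the actual test set):

```python
# Toy multi-hot vectors, 3 labels per sample: [Trigger, Casualty, Weapon]
y_true = [[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 1]]
y_pred = [[1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0]]

# Exact match: the whole 3-label vector must agree.
exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hamming loss: fraction of individual label bits that disagree.
bits_wrong = sum(tb != pb for t, p in zip(y_true, y_pred) for tb, pb in zip(t, p))
hamming = bits_wrong / (len(y_true) * 3)

print(exact_match, hamming)  # 0.5 0.16666666666666666
```

The low F1 (samples) relative to F1 (micro) is typical when many test samples carry no positive labels at all, since per-sample F1 is zero for an empty label set.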

+ ### Per-Label Performance
+
+ | Label | Precision | Recall | F1 | Support |
+ |-------|-----------|--------|-----|---------|
+ | Trigger | 0.9623 | 0.8844 | 0.9217 | 173 |
+ | Casualty | 0.8819 | 0.8970 | 0.8894 | 233 |
+ | Weapon | 0.7568 | 0.5385 | 0.6292 | 52 |
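Each per-label F1 is the harmonic mean of the precision and recall in the same row, so the table can be sanity-checked directly (tiny differences in the last digit are expected, since the reported values come from raw counts rather than the rounded P/R shown):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

trigger_f1 = f1(0.9623, 0.8844)   # ~0.9217
casualty_f1 = f1(0.8819, 0.8970)  # ~0.8894
weapon_f1 = f1(0.7568, 0.5385)    # ~0.6293
```

The `Weapon` label's F1 is dragged down mainly by recall (0.5385) and its small support (52 examples).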

+ ### Training Details
+
+ - **Training samples**: 600 examples
+ - **Test samples**: 789 examples (held out before sampling)
+ - **Base model**: BAAI/bge-small-en-v1.5 (33.4M params)
+ - **Batch size**: 32
+ - **Epochs**: 1
+ - **Sampling strategy**: undersampling (balances positive/negative pairs)
+
+ ## 📊 Data Sources
+
+ ### Positive Examples (Water Conflict Headlines)
+ Pacific Institute (2025). *Water Conflict Chronology*. Pacific Institute, Oakland, CA.
+ https://www.worldwater.org/water-conflict/
+
+ ### Negative Examples (Non-Water Conflict Headlines)
+ Armed Conflict Location & Event Data Project (ACLED).
+ https://acleddata.com/
+
+ ## 🌍 About This Project
+
+ This model is part of experimental research supporting the Pacific Institute's Water Conflict Chronology, the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history.
+
+ Learn more: https://www.worldwater.org/water-conflict/
+
+ ## 📜 License
+
+ Copyright © 2025 Baobab Tech
+
+ This work is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](http://creativecommons.org/licenses/by-nc/4.0/).
+
+ **You are free to:**
+ - **Share** — copy and redistribute the material in any medium or format
+ - **Adapt** — remix, transform, and build upon the material
+
+ **Under the following terms:**
+ - **Attribution** — You must give appropriate credit to Baobab Tech, provide a link to the license, and indicate if changes were made
+ - **NonCommercial** — You may not use the material for commercial purposes
+
+ For commercial licensing inquiries, please contact Baobab Tech.
+
+ ## 📝 Citation
+
+ If you use this model in your work, please cite:
+
+ ```bibtex
+ @misc{waterconflict2025,
+   title={Water Conflict Multi-Label Classifier},
+   author={Experimental Research Supporting Pacific Institute Water Conflict Chronology},
+   year={2025},
+   howpublished={\url{https://huggingface.co/baobabtech/water-conflict-classifier}},
+   note={Training data from Pacific Institute Water Conflict Chronology and ACLED}
+ }
+ ```
+
+ Please also cite the Pacific Institute's Water Conflict Chronology:
+
+ ```bibtex
+ @misc{pacificinstitute2025,
+   title={Water Conflict Chronology},
+   author={Pacific Institute},
+   year={2025},
+   address={Oakland, CA},
+   url={https://www.worldwater.org/water-conflict/},
+   note={Accessed: [access date]}
+ }
+ ```
+
+ **Recommended citation format:**
+ Pacific Institute (2025). *Water Conflict Chronology*. Pacific Institute, Oakland, CA. https://www.worldwater.org/water-conflict/. Accessed: [access date].