@@ -1,290 +1,268 @@
 ---
-language: en
 license: cc-by-nc-4.0
 tags:
 - setfit
 - sentence-transformers
 - text-classification
-- generated_from_setfit_trainer
-widget:
-- text: Israeli forces destroy water pump in Nablus, West Bank, cutting water supply
-    to over 20,000 Palestinians in multiple villages
-- text: Chinese man killed for speaking out against displacement of communities by
-    the Three Gorges Dam
-- text: Protests over water cuts turn violent in Tunisia
-- text: National leader Dilma Ferreira Silva, working for policy reform to support
-    people affected by dams, is murdered in Brazil
-- text: Water reservoir sustains minor damages from bombing in Colombia
 metrics:
 - accuracy
-pipeline_tag: text-classification
-library_name: setfit
-inference: false
-base_model: BAAI/bge-small-en-v1.5
 ---
-# SetFit with BAAI/bge-small-en-v1.5
-This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) as the Sentence Transformer embedding model. A OneVsRestClassifier instance is used for classification.
-The model has been trained using an efficient few-shot learning technique that involves:
-1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
-2. Training a classification head with features from the fine-tuned Sentence Transformer.
-## Model Details
-### Model Description
-- **Model Type:** SetFit
-- **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
-- **Classification head:** a OneVsRestClassifier instance
-- **Maximum Sequence Length:** 512 tokens
-- **Number of Classes:** 3 classes
-<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
-- **Language:** en
-- **License:** cc-by-nc-4.0
-### Model Sources
-- **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
-- **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
-- **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
-## Uses
-### Direct Use for Inference
-First install the SetFit library:
-```bash
-pip install setfit
-```
-Then you can load this model and run inference.
 ```python
 from setfit import SetFitModel
-# Download from the 🤗 Hub
 model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
-# Run inference
-preds = model("Protests over water cuts turn violent in Tunisia")
 ```
-<!--
-### Downstream Use
-*List how someone could finetune this model on their own dataset.*
--->
-<!--
-### Out-of-Scope Use
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-<!--
-## Bias, Risks and Limitations
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-<!--
-### Recommendations
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
-## Training Details
-### Training Set Metrics
-| Training set | Min | Median  | Max |
-|:-------------|:----|:--------|:----|
-| Word count   | 3   | 16.3692 | 154 |
-### Training Hyperparameters
-- batch_size: (32, 32)
-- num_epochs: (4, 4)
-- max_steps: -1
-- sampling_strategy: oversampling
-- num_iterations: 20
-- body_learning_rate: (2e-05, 2e-05)
-- head_learning_rate: 0.01
-- loss: CosineSimilarityLoss
-- distance_metric: cosine_distance
-- margin: 0.25
-- end_to_end: False
-- use_amp: False
-- warmup_proportion: 0.1
-- l2_weight: 0.01
-- seed: 42
-- eval_max_steps: -1
-- load_best_model_at_end: True
-### Training Results
-| Epoch  | Step | Training Loss | Validation Loss |
-|:------:|:----:|:-------------:|:---------------:|
-| 0.0007 | 1    | 0.2228        | -               |
-| 0.0333 | 50   | 0.236         | -               |
-| 0.0667 | 100  | 0.2308        | -               |
-| 0.1    | 150  | 0.2186        | -               |
-| 0.1333 | 200  | 0.203         | -               |
-| 0.1667 | 250  | 0.1836        | -               |
-| 0.2    | 300  | 0.159         | -               |
-| 0.2333 | 350  | 0.1373        | -               |
-| 0.2667 | 400  | 0.1265        | -               |
-| 0.3    | 450  | 0.111         | -               |
-| 0.3333 | 500  | 0.1045        | -               |
-| 0.3667 | 550  | 0.0906        | -               |
-| 0.4    | 600  | 0.0848        | -               |
-| 0.4333 | 650  | 0.0829        | -               |
-| 0.4667 | 700  | 0.0706        | -               |
-| 0.5    | 750  | 0.0631        | -               |
-| 0.5333 | 800  | 0.0625        | -               |
-| 0.5667 | 850  | 0.0572        | -               |
-| 0.6    | 900  | 0.0553        | -               |
-| 0.6333 | 950  | 0.0499        | -               |
-| 0.6667 | 1000 | 0.0533        | -               |
-| 0.7    | 1050 | 0.044         | -               |
-| 0.7333 | 1100 | 0.0486        | -               |
-| 0.7667 | 1150 | 0.045         | -               |
-| 0.8    | 1200 | 0.0411        | -               |
-| 0.8333 | 1250 | 0.0464        | -               |
-| 0.8667 | 1300 | 0.0414        | -               |
-| 0.9    | 1350 | 0.0378        | -               |
-| 0.9333 | 1400 | 0.0379        | -               |
-| 0.9667 | 1450 | 0.0408        | -               |
-| 1.0    | 1500 | 0.0356        | 0.1011          |
-| 1.0333 | 1550 | 0.0338        | -               |
-| 1.0667 | 1600 | 0.0304        | -               |
-| 1.1    | 1650 | 0.0339        | -               |
-| 1.1333 | 1700 | 0.0319        | -               |
-| 1.1667 | 1750 | 0.0331        | -               |
-| 1.2    | 1800 | 0.0307        | -               |
-| 1.2333 | 1850 | 0.0349        | -               |
-| 1.2667 | 1900 | 0.0341        | -               |
-| 1.3    | 1950 | 0.032         | -               |
-| 1.3333 | 2000 | 0.0353        | -               |
-| 1.3667 | 2050 | 0.0312        | -               |
-| 1.4    | 2100 | 0.0313        | -               |
-| 1.4333 | 2150 | 0.0288        | -               |
-| 1.4667 | 2200 | 0.0308        | -               |
-| 1.5    | 2250 | 0.0269        | -               |
-| 1.5333 | 2300 | 0.0292        | -               |
-| 1.5667 | 2350 | 0.0299        | -               |
-| 1.6    | 2400 | 0.0291        | -               |
-| 1.6333 | 2450 | 0.0286        | -               |
-| 1.6667 | 2500 | 0.0283        | -               |
-| 1.7    | 2550 | 0.0299        | -               |
-| 1.7333 | 2600 | 0.0283        | -               |
-| 1.7667 | 2650 | 0.027         | -               |
-| 1.8    | 2700 | 0.0303        | -               |
-| 1.8333 | 2750 | 0.0293        | -               |
-| 1.8667 | 2800 | 0.0281        | -               |
-| 1.9    | 2850 | 0.0288        | -               |
-| 1.9333 | 2900 | 0.0285        | -               |
-| 1.9667 | 2950 | 0.0266        | -               |
-| 2.0    | 3000 | 0.0276        | 0.0950          |
-| 2.0333 | 3050 | 0.0283        | -               |
-| 2.0667 | 3100 | 0.0282        | -               |
-| 2.1    | 3150 | 0.0275        | -               |
-| 2.1333 | 3200 | 0.0263        | -               |
-| 2.1667 | 3250 | 0.025         | -               |
-| 2.2    | 3300 | 0.0256        | -               |
-| 2.2333 | 3350 | 0.0259        | -               |
-| 2.2667 | 3400 | 0.0255        | -               |
-| 2.3    | 3450 | 0.0253        | -               |
-| 2.3333 | 3500 | 0.0261        | -               |
-| 2.3667 | 3550 | 0.0272        | -               |
-| 2.4    | 3600 | 0.0253        | -               |
-| 2.4333 | 3650 | 0.0235        | -               |
-| 2.4667 | 3700 | 0.0264        | -               |
-| 2.5    | 3750 | 0.0267        | -               |
-| 2.5333 | 3800 | 0.0248        | -               |
-| 2.5667 | 3850 | 0.026         | -               |
-| 2.6    | 3900 | 0.0239        | -               |
-| 2.6333 | 3950 | 0.0264        | -               |
-| 2.6667 | 4000 | 0.0243        | -               |
-| 2.7    | 4050 | 0.0224        | -               |
-| 2.7333 | 4100 | 0.0244        | -               |
-| 2.7667 | 4150 | 0.026         | -               |
-| 2.8    | 4200 | 0.0242        | -               |
-| 2.8333 | 4250 | 0.0244        | -               |
-| 2.8667 | 4300 | 0.0238        | -               |
-| 2.9    | 4350 | 0.0263        | -               |
-| 2.9333 | 4400 | 0.0249        | -               |
-| 2.9667 | 4450 | 0.0246        | -               |
-| 3.0    | 4500 | 0.0273        | 0.0951          |
-| 3.0333 | 4550 | 0.0245        | -               |
-| 3.0667 | 4600 | 0.0255        | -               |
-| 3.1    | 4650 | 0.0262        | -               |
-| 3.1333 | 4700 | 0.0236        | -               |
-| 3.1667 | 4750 | 0.022         | -               |
-| 3.2    | 4800 | 0.0224        | -               |
-| 3.2333 | 4850 | 0.0246        | -               |
-| 3.2667 | 4900 | 0.0231        | -               |
-| 3.3    | 4950 | 0.0247        | -               |
-| 3.3333 | 5000 | 0.0251        | -               |
-| 3.3667 | 5050 | 0.0245        | -               |
-| 3.4    | 5100 | 0.0248        | -               |
-| 3.4333 | 5150 | 0.0245        | -               |
-| 3.4667 | 5200 | 0.0232        | -               |
-| 3.5    | 5250 | 0.0245        | -               |
-| 3.5333 | 5300 | 0.022         | -               |
-| 3.5667 | 5350 | 0.0244        | -               |
-| 3.6    | 5400 | 0.0258        | -               |
-| 3.6333 | 5450 | 0.023         | -               |
-| 3.6667 | 5500 | 0.0232        | -               |
-| 3.7    | 5550 | 0.0241        | -               |
-| 3.7333 | 5600 | 0.0229        | -               |
-| 3.7667 | 5650 | 0.0241        | -               |
-| 3.8    | 5700 | 0.0229        | -               |
-| 3.8333 | 5750 | 0.0239        | -               |
-| 3.8667 | 5800 | 0.023         | -               |
-| 3.9    | 5850 | 0.0241        | -               |
-| 3.9333 | 5900 | 0.0232        | -               |
-| 3.9667 | 5950 | 0.0253        | -               |
-| 4.0    | 6000 | 0.0241        | 0.0939          |
-### Framework Versions
-- Python: 3.12.12
-- SetFit: 1.1.3
-- Sentence Transformers: 5.1.2
-- Transformers: 4.57.3
-- PyTorch: 2.9.1+cu128
-- Datasets: 4.4.1
-- Tokenizers: 0.22.1
-## Citation
-### BibTeX
-```bibtex
-@article{https://doi.org/10.48550/arxiv.2209.11055,
-    doi = {10.48550/ARXIV.2209.11055},
-    url = {https://arxiv.org/abs/2209.11055},
-    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
-    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
-    title = {Efficient Few-Shot Learning Without Prompts},
-    publisher = {arXiv},
-    year = {2022},
-    copyright = {Creative Commons Attribution 4.0 International}
-}
 ```
-<!--
-## Glossary
-*Clearly define terms in order to be accessible across audiences.*
--->
-<!--
-## Model Card Authors
-*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
--->
-<!--
-## Model Card Contact
-*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
--->

 ---
 license: cc-by-nc-4.0
+library_name: setfit
 tags:
 - setfit
 - sentence-transformers
 - text-classification
+- multi-label
+- water-conflict
 metrics:
+- f1
 - accuracy
+language:
+- en
+widget:
+- text: "Military attack workers at the Kajaki Dam in Afghanistan"
+- text: "Violent protests erupt over dam construction in Sudan"
+- text: "New water treatment plant opens in California"
+- text: "Armed groups cut off water supply to villages in Syria"
+- text: "Government announces new irrigation subsidies"
 ---
+# Water Conflict Multi-Label Classifier
+## 🔬 Experimental Research
+> This experimental research draws on the Pacific Institute's [Water Conflict Chronology](https://www.worldwater.org/water-conflict/), which tracks water-related conflicts spanning over 4,500 years of human history. The work is conducted independently and is not affiliated with the Pacific Institute.
+This model helps researchers classify water-related conflict events at scale, using small models that can classify hundreds of headlines per second.
+The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history. This is not a commercial product and is not intended for commercial use.
+## 📋 Model Description
+This SetFit-based model classifies news headlines about water-related conflicts into three categories:
+- **Trigger**: Water resource as a conflict trigger
+- **Casualty**: Water infrastructure as a casualty/target
+- **Weapon**: Water used as a weapon/tool
+These categories align with the Pacific Institute's Water Conflict Chronology framework for understanding how water intersects with security and conflict.
+## 🏗️ Model Details
+- **Base Model**: [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
+- **Architecture**: SetFit with One-vs-Rest multi-label strategy
+- **Training Approach**: Few-shot contrastive fine-tuning (SetFit is designed to perform well with small labeled samples)
+- **Training samples**: 1200 examples
+- **Test samples**: 519 (held-out, never seen during training)
+- **Training time**: ~2-5 minutes on an A10G GPU
+- **Model size**: 33M parameters, ~133MB
+- **Inference speed**: ~5-10ms per headline on CPU
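The size and speed figures above can be sanity-checked with simple arithmetic. The sketch below assumes the weights are stored as 32-bit floats (4 bytes per parameter) and that headlines are processed one at a time; neither assumption is stated in this card.

```python
# Back-of-the-envelope check of the model size and throughput figures.
# Assumption: weights are stored as float32 (4 bytes per parameter).
PARAMS = 33_000_000        # "33M parameters" from the list above
BYTES_PER_PARAM = 4        # float32 (assumed)

size_mb = PARAMS * BYTES_PER_PARAM / 1_000_000
print(f"Approximate weight size: {size_mb:.0f} MB")

# Assumption: one headline at a time on CPU, no batching.
throughputs = {}
for latency_ms in (5, 10):  # "~5-10ms per headline"
    throughputs[latency_ms] = 1000 / latency_ms
    print(f"{latency_ms} ms/headline -> {throughputs[latency_ms]:.0f} headlines/second")
```

Under these assumptions the weights come to roughly 132 MB, close to the ~133MB listed, and sequential CPU inference lands at 100 to 200 headlines per second.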
+## 💻 Usage
+### Quick Start
 ```python
 from setfit import SetFitModel
+# Load the trained model from HF Hub
 model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
+# Predict on headlines
+headlines = [
+    "Military attack workers at the Kajaki Dam in Afghanistan",
+    "New water treatment plant opens in California"
+]
+predictions = model.predict(headlines)
+print(predictions)
+# Output: [[1, 1, 0], [0, 0, 0]]
+# Format: [Trigger, Casualty, Weapon]
 ```
+### Interpreting Results
+For each headline, the model returns one binary prediction per label, in the order Trigger, Casualty, Weapon:
+```python
+label_names = ['Trigger', 'Casualty', 'Weapon']
+for headline, pred in zip(headlines, predictions):
+    labels = [label_names[i] for i, val in enumerate(pred) if val == 1]
+    print(f"Headline: {headline}")
+    print(f"Labels: {', '.join(labels) if labels else 'None'}")
+    print()
+```
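The same decoding step can be wrapped in a small reusable function. This is a minimal sketch: `decode_labels` is a hypothetical helper (not part of SetFit or of this package), and it assumes each prediction row holds exactly three 0/1 values in the label order used by this card.

```python
from typing import Iterable, List, Sequence

# Label order used throughout this card.
LABEL_NAMES = ["Trigger", "Casualty", "Weapon"]


def decode_labels(pred: Iterable, label_names: Sequence[str] = LABEL_NAMES) -> List[str]:
    """Return the names of the labels predicted as 1 for a single headline."""
    # int() also accepts NumPy and PyTorch scalar values, so rows taken
    # from an array or tensor are handled the same way as plain lists.
    values = [int(v) for v in pred]
    if len(values) != len(label_names):
        raise ValueError(f"expected {len(label_names)} values, got {len(values)}")
    return [name for name, v in zip(label_names, values) if v == 1]


print(decode_labels([1, 1, 0]))
print(decode_labels([0, 0, 0]))
```

An empty list means that none of the three conflict categories was predicted for the headline.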
+### Batch Processing
+```python
+import pandas as pd
+# Load your data
+df = pd.read_csv("your_headlines.csv")
+# Predict in batches
+predictions = model.predict(df['headline'].tolist())
+# Add predictions to dataframe
+df['trigger'] = [p[0] for p in predictions]
+df['casualty'] = [p[1] for p in predictions]
+df['weapon'] = [p[2] for p in predictions]
+```
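For files too large to convert to a single list comfortably, predictions can be produced chunk by chunk. The sketch below is one possible approach, not the package's own method: `predict_in_chunks` and `fake_predict` are hypothetical names, and `fake_predict` is a stand-in so the example runs without downloading the model. In real use, `model.predict` would be passed as `predict_fn`.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def predict_in_chunks(
    texts: Sequence[str],
    predict_fn: Callable[[List[str]], Iterable],
    chunk_size: int = 1000,
) -> Iterator[List[int]]:
    """Yield one [trigger, casualty, weapon] row per input text."""
    for start in range(0, len(texts), chunk_size):
        chunk = list(texts[start:start + chunk_size])
        for row in predict_fn(chunk):
            # Convert array or tensor values to plain Python ints.
            yield [int(v) for v in row]


def fake_predict(batch: List[str]) -> List[List[int]]:
    """Stand-in predictor for illustration only; NOT the real classifier."""
    return [[1, 0, 0] if "water" in text.lower() else [0, 0, 0] for text in batch]


rows = list(predict_in_chunks(["Water riots", "Sports news", "Dam attacked"],
                              fake_predict, chunk_size=2))
print(rows)
```

Because the function yields rows lazily, results can be written out as they arrive instead of being held in memory all at once.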
+### Example Outputs
+| Headline | Trigger | Casualty | Weapon |
+|----------|---------|----------|--------|
+| "Armed groups blow up water pipeline in Iraq" | ✓ | ✓ | ✓ |
+| "New water treatment plant opens in California" | ✗ | ✗ | ✗ |
+| "Protests erupt over dam construction in Ethiopia" | ✓ | ✗ | ✗ |
+## 📈 Evaluation Results
+Evaluated on a held-out test set of 519 samples (30% of total data, stratified by label combinations).
+### Overall Performance
+| Metric | Score |
+|--------|-------|
+| Exact Match Accuracy | 0.8227 |
+| Hamming Loss | 0.0796 |
+| F1 (micro) | 0.8700 |
+| F1 (macro) | 0.8221 |
+| F1 (samples) | 0.7090 |
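For readers unfamiliar with multi-label metrics, the sketch below shows how exact match accuracy, Hamming loss, and micro-averaged F1 are defined. It uses a made-up four-headline toy set, not the actual test data, and plain Python rather than the evaluation code used for this card.

```python
# Toy multi-label data (illustrative only): columns are Trigger, Casualty, Weapon.
y_true = [[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 1]]
y_pred = [[1, 1, 0], [0, 0, 0], [1, 0, 1], [0, 1, 0]]


def exact_match_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are wrong."""
    wrong = sum(tv != pv for t, p in zip(y_true, y_pred) for tv, pv in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives pooled over all labels."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        for tv, pv in zip(t, p):
            tp += int(tv == 1 and pv == 1)
            fp += int(tv == 0 and pv == 1)
            fn += int(tv == 1 and pv == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


print(exact_match_accuracy(y_true, y_pred))
print(hamming_loss(y_true, y_pred))
print(micro_f1(y_true, y_pred))
```

Exact match is the strictest of the three: a headline with one wrong label out of three counts as fully wrong, whereas Hamming loss only penalizes the single incorrect slot.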
+### Per-Label Performance
+| Label | Precision | Recall | F1 | Support |
+|-------|-----------|--------|-----|---------|
+| Trigger | 0.8750 | 0.8851 | 0.8800 | 174 |
+| Casualty | 0.8902 | 0.9399 | 0.9144 | 233 |
+| Weapon | 0.5753 | 0.8077 | 0.6720 | 52 |
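The F1 column follows from precision and recall via F1 = 2PR / (P + R), and the macro F1 in the Overall Performance table is the unweighted mean of the three per-label F1 scores. The short check below recomputes both from the rounded values in the table, so small differences in the last digit are possible.

```python
# Precision and recall copied from the per-label table above.
per_label = {
    "Trigger": (0.8750, 0.8851),
    "Casualty": (0.8902, 0.9399),
    "Weapon": (0.5753, 0.8077),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


f1 = {label: f1_score(p, r) for label, (p, r) in per_label.items()}
macro_f1 = sum(f1.values()) / len(f1)

for label, score in f1.items():
    print(f"{label}: {score:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
```

The Weapon label has the lowest precision and the smallest support (52), which is what pulls the macro average below the micro average.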
+### Training Details
+- **Training samples**: 1200 examples
+- **Test samples**: 519 examples (held-out before sampling)
+- **Base model**: BAAI/bge-small-en-v1.5 (33M params)
+- **Batch size**: 32
+- **Epochs**: 4
+- **Iterations**: 20 (contrastive pair generation)
+- **Sampling strategy**: oversampling (balances positive/negative pairs)
+- **Training Dataset**: [baobabtech/water-conflict-training-data](https://huggingface.co/datasets/baobabtech/water-conflict-training-data) (version: d2.0)
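The settings above map onto SetFit's `TrainingArguments` roughly as sketched below. This is an illustrative outline, not the package's actual training script: `train_sketch` is a hypothetical function, and it assumes the datasets expose a `text` column and a multi-hot `label` column, which is not confirmed by this card.

```python
# Hyperparameters as listed in this card.
HYPERPARAMS = {
    "batch_size": 32,
    "num_epochs": 4,
    "num_iterations": 20,
    "sampling_strategy": "oversampling",
}


def train_sketch(train_dataset, eval_dataset):
    """Fine-tune a multi-label SetFit model (requires `pip install setfit`)."""
    # Imported lazily so this file can be read and imported without setfit installed.
    from setfit import SetFitModel, Trainer, TrainingArguments

    # One-vs-rest multi-label head on top of the base sentence transformer.
    model = SetFitModel.from_pretrained(
        "BAAI/bge-small-en-v1.5",
        multi_target_strategy="one-vs-rest",
    )
    args = TrainingArguments(**HYPERPARAMS)
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,  # assumed columns: "text", "label"
        eval_dataset=eval_dataset,
    )
    trainer.train()
    return model
```

Calling `train_sketch` downloads the base model and runs contrastive fine-tuning, so it is best run on a machine with a GPU.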
+### 📈 Experiment Tracking
+All training runs are automatically tracked in a public dataset for experiment comparison:
+- **Evals Dataset**: [baobabtech/water-conflict-classifier-evals](https://huggingface.co/datasets/baobabtech/water-conflict-classifier-evals)
+- **Tracked Metrics**: F1 scores, accuracy, per-label performance, and all hyperparameters
+- **Compare Experiments**: View how different configurations (sample size, epochs, batch size) affect performance
+- **Reproducibility**: Full training configs logged for each version
+You can explore past experiments and compare model performance across versions using the evals dataset.
+## 📊 Data Sources
+### Positive Examples (Water Conflict Headlines)
+Pacific Institute (2025). *Water Conflict Chronology*. Pacific Institute, Oakland, CA.
+https://www.worldwater.org/water-conflict/
+### Negative Examples (Non-Water Conflict Headlines)
+Armed Conflict Location & Event Data Project (ACLED).
+https://acleddata.com/
+**Note:** Training negatives include synthetic "hard negatives": peaceful water-related news (e.g., "New desalination plant opens", "Water conservation conference"), added to prevent false positives on non-conflict water topics.
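To make the positive/negative split concrete, the sketch below shows one way such examples could be encoded as multi-hot label vectors. The layout and the `to_multi_hot` helper are assumptions for illustration; the published dataset may use different column names or encodings.

```python
# Label order used by this card.
LABELS = ["Trigger", "Casualty", "Weapon"]


def to_multi_hot(tags):
    """Encode a set of label names as a 0/1 vector in LABELS order."""
    return [1 if label in tags else 0 for label in LABELS]


examples = [
    # Positive example (labels taken from the Example Outputs table).
    {"text": "Protests erupt over dam construction in Ethiopia",
     "label": to_multi_hot({"Trigger"})},
    # Hard negatives: water-related but peaceful, so every label is 0.
    {"text": "New desalination plant opens", "label": to_multi_hot(set())},
    {"text": "Water conservation conference", "label": to_multi_hot(set())},
]

for example in examples:
    print(example["label"], example["text"])
```

Negatives of both kinds share the all-zero vector, so the classifier learns that mentioning water alone is not enough to trigger any label.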
+## 🌍 About This Project
+This model is part of independent experimental research drawing on the Pacific Institute's Water Conflict Chronology. The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history.
+**Project Links:**
+- Pacific Institute Water Conflict Chronology: https://www.worldwater.org/water-conflict/
+- Python Package (PyPI): https://pypi.org/project/water-conflict-classifier/
+- Source Code: https://github.com/baobabtech/waterconflict
+- Model Hub: https://huggingface.co/baobabtech/water-conflict-classifier
+## 🌱 Frugal AI: Training with Limited Data
+This classifier demonstrates an intentional approach to building AI systems with **limited data** using [SetFit](https://huggingface.co/docs/setfit/en/index) - a framework for few-shot learning with sentence transformers. Rather than defaulting to massive language models (GPT, Claude, or 100B+ parameter models) for simple classification tasks, we fine-tune small, efficient models (e.g., BAAI/bge-small-en-v1.5 with ~33M parameters) on a focused dataset.
+**Why this matters:** The industry has normalized using trillion-parameter models to classify headlines, answer simple questions, or categorize text - tasks that don't require world knowledge, reasoning, or generative capabilities. This is computationally wasteful and environmentally costly. A properly fine-tuned small model can achieve comparable or better accuracy while using a fraction of the compute resources.
+**Our approach:**
+- Train on ~1,200 examples (few-shot learning with SetFit)
+- Deploy small parameter models (e.g., ~33M params) vs. 100B-1T parameter alternatives
+- Achieve specialized task performance without the overhead of general-purpose LLMs
+- Reduce inference costs and latency by orders of magnitude
+This is not about avoiding large models altogether - they're invaluable for complex reasoning tasks. But for targeted classification problems with labeled data, fine-tuning remains the professional, responsible choice.
+### 🏋🏽‍♀️ Training Your Own Model
+You can train your own version using the [published package](https://pypi.org/project/water-conflict-classifier/).
+**Package includes:**
+- Data preprocessing utilities
+- Training logic (SetFit multi-label)
+- Evaluation metrics
+- Model card generation
+**Source code:** https://github.com/baobabtech/waterconflict/tree/main/classifier
+**PyPI:** https://pypi.org/project/water-conflict-classifier/
+```bash
+# Install package
+pip install water-conflict-classifier
+# Or install from source for development
+git clone https://github.com/baobabtech/waterconflict.git
+cd waterconflict/classifier
+pip install -e .
+# Train locally
+python train_setfit_headline_classifier.py
 ```
+For cloud training on HuggingFace Jobs infrastructure, see the scripts folder in the repository.
+## 📜 License
+Copyright © 2025 Baobab Tech
+This work is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](http://creativecommons.org/licenses/by-nc/4.0/).
+**You are free to:**
+- **Share** — copy and redistribute the material in any medium or format
+- **Adapt** — remix, transform, and build upon the material
+**Under the following terms:**
+- **Attribution** — You must give appropriate credit to Baobab Tech, provide a link to the license, and indicate if changes were made
+- **NonCommercial** — You may not use the material for commercial purposes
+## 📝 Citation
+If you use this model in your work, please cite:
+```bibtex
+@misc{waterconflict2025,
+  title={Water Conflict Multi-Label Classifier},
+  author={{Independent Experimental Research Drawing on Pacific Institute Water Conflict Chronology}},
+  year={2025},
+  howpublished={\url{https://huggingface.co/baobabtech/water-conflict-classifier}},
+  note={Training data from Pacific Institute Water Conflict Chronology and ACLED}
+}
+```
+Please also cite the Pacific Institute's Water Conflict Chronology:
+```bibtex
+@misc{pacificinstitute2025,
+  title={Water Conflict Chronology},
+  author={{Pacific Institute}},
+  year={2025},
+  address={Oakland, CA},
+  url={https://www.worldwater.org/water-conflict/},
+  note={Accessed: [access date]}
+}
+```