baobabtech/water-conflict-classifier

@@ -1,268 +1,185 @@
 ---
 license: cc-by-nc-4.0
-library_name: setfit
 tags:
 - setfit
 - sentence-transformers
 - text-classification
-- multi-label
-- water-conflict
 metrics:
-- f1
 - accuracy
-language:
-- en
-widget:
-- text: "Military attack workers at the Kajaki Dam in Afghanistan"
-- text: "Violent protests erupt over dam construction in Sudan"
-- text: "New water treatment plant opens in California"
-- text: "Armed groups cut off water supply to villages in Syria"
-- text: "Government announces new irrigation subsidies"
 ---
-# Water Conflict Multi-Label Classifier
-## 🔬 Experimental Research
-> This experimental research draws on Pacific Institute's [Water Conflict Chronology](https://www.worldwater.org/water-conflict/), which tracks water-related conflicts spanning over 4,500 years of human history. The work is conducted independently and is not affiliated with Pacific Institute.
-This model is designed to assist researchers in classifying water-related conflict events at scale using tiny/small models that can classify 100s of headlines per second.
-The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history. This is not a commercial product and is not intended for commercial use.
-## 📋 Model Description
-This SetFit-based model classifies news headlines about water-related conflicts into three categories:
-- **Trigger**: Water resource as a conflict trigger
-- **Casualty**: Water infrastructure as a casualty/target
-- **Weapon**: Water used as a weapon/tool
-These categories align with the Pacific Institute's Water Conflict Chronology framework for understanding how water intersects with security and conflict.
-## 🏗️ Model Details
-- **Base Model**: [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
-- **Architecture**: SetFit with One-vs-Rest multi-label strategy
-- **Training Approach**: Few-shot learning optimized (SetFit reaches peak performance with small samples)
-- **Training samples**: 1200 examples
-- **Test samples**: 519 (held-out, never seen during training)
-- **Training time**: ~2-5 minutes on A10G GPU
-- **Model size**: 33M Parameters, ~133MB
-- **Inference speed**: ~5-10ms per headline on CPU
-## 💻 Usage
-### Quick Start
 ```python
 from setfit import SetFitModel
-# Load the trained model from HF Hub
 model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
-# Predict on headlines
-headlines = [
-    "Military attack workers at the Kajaki Dam in Afghanistan",
-    "New water treatment plant opens in California"
-]
-predictions = model.predict(headlines)
-print(predictions)
-# Output: [[1, 1, 0], [0, 0, 0]]
-# Format: [Trigger, Casualty, Weapon]
-```
-### Interpreting Results
-The model returns a list of binary predictions for each label:
-```python
-label_names = ['Trigger', 'Casualty', 'Weapon']
-for headline, pred in zip(headlines, predictions):
-    labels = [label_names[i] for i, val in enumerate(pred) if val == 1]
-    print(f"Headline: {headline}")
-    print(f"Labels: {', '.join(labels) if labels else 'None'}")
-    print()
 ```
-### Batch Processing
-```python
-import pandas as pd
-# Load your data
-df = pd.read_csv("your_headlines.csv")
-# Predict in batches
-predictions = model.predict(df['headline'].tolist())
-# Add predictions to dataframe
-df['trigger'] = [p[0] for p in predictions]
-df['casualty'] = [p[1] for p in predictions]
-df['weapon'] = [p[2] for p in predictions]
-```
-### Example Outputs
-| Headline | Trigger | Casualty | Weapon |
-|----------|---------|----------|--------|
-| "Armed groups blow up water pipeline in Iraq" | ✓ | ✓ | ✓ |
-| "New water treatment plant opens in California" | ✗ | ✗ | ✗ |
-| "Protests erupt over dam construction in Ethiopia" | ✓ | ✗ | ✗ |
-## 📈 Evaluation Results
-Evaluated on a held-out test set of 519 samples (30% of total data, stratified by label combinations).
-### Overall Performance
-| Metric | Score |
-|--------|-------|
-| Exact Match Accuracy | 0.8092 |
-| Hamming Loss | 0.0899 |
-| F1 (micro) | 0.8523 |
-| F1 (macro) | 0.7983 |
-| F1 (samples) | 0.6993 |
-### Per-Label Performance
-| Label | Precision | Recall | F1 | Support |
-|-------|-----------|--------|-----|---------|
-| Trigger | 0.9030 | 0.8563 | 0.8791 | 174 |
-| Casualty | 0.8807 | 0.9185 | 0.8992 | 233 |
-| Weapon | 0.5062 | 0.7885 | 0.6165 | 52 |
-### Training Details
-- **Training samples**: 1200 examples
-- **Test samples**: 519 examples (held-out before sampling)
-- **Base model**: BAAI/bge-small-en-v1.5 (33M params)
-- **Batch size**: 64
-- **Epochs**: 2
-- **Iterations**: 20 (contrastive pair generation)
-- **Sampling strategy**: undersampling (balances positive/negative pairs)
-- **Training Dataset**: [baobabtech/water-conflict-training-data](https://huggingface.co/datasets/baobabtech/water-conflict-training-data) (version: d2.0)
-### 📈 Experiment Tracking
-All training runs are automatically tracked in a public dataset for experiment comparison:
-- **Evals Dataset**: [baobabtech/water-conflict-classifier-evals](https://huggingface.co/datasets/baobabtech/water-conflict-classifier-evals)
-- **Tracked Metrics**: F1 scores, accuracy, per-label performance, and all hyperparameters
-- **Compare Experiments**: View how different configurations (sample size, epochs, batch size) affect performance
-- **Reproducibility**: Full training configs logged for each version
-You can explore past experiments and compare model performance across versions using the evals dataset.
-## 📊 Data Sources
-### Positive Examples (Water Conflict Headlines)
-Pacific Institute (2025). *Water Conflict Chronology*. Pacific Institute, Oakland, CA.
-https://www.worldwater.org/water-conflict/
-### Negative Examples (Non-Water Conflict Headlines)
-Armed Conflict Location & Event Data Project (ACLED).
-https://acleddata.com/
-**Note:** Training negatives include synthetic "hard negatives" - peaceful water-related news (e.g., "New desalination plant opens", "Water conservation conference") to prevent false positives on non-conflict water topics.
-## 🌍 About This Project
-This model is part of independent experimental research drawing on the Pacific Institute's Water Conflict Chronology. The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history.
-**Project Links:**
-- Pacific Institute Water Conflict Chronology: https://www.worldwater.org/water-conflict/
-- Python Package (PyPI): https://pypi.org/project/water-conflict-classifier/
-- Source Code: https://github.com/baobabtech/waterconflict
-- Model Hub: https://huggingface.co/{model_repo}
-## 🌱 Frugal AI: Training with Limited Data
-This classifier demonstrates an intentional approach to building AI systems with **limited data** using [SetFit](https://huggingface.co/docs/setfit/en/index) - a framework for few-shot learning with sentence transformers. Rather than defaulting to massive language models (GPT, Claude, or 100B+ parameter models) for simple classification tasks, we fine-tune small, efficient models (e.g., BAAI/bge-small-en-v1.5 with ~33M parameters) on a focused dataset.
-**Why this matters:** The industry has normalized using trillion-parameter models to classify headlines, answer simple questions, or categorize text - tasks that don't require world knowledge, reasoning, or generative capabilities. This is computationally wasteful and environmentally costly. A properly fine-tuned small model can achieve comparable or better accuracy while using a fraction of the compute resources.
-**Our approach:**
-- Train on ~600 examples (few-shot learning with SetFit)
-- Deploy small parameter models (e.g., ~33M params) vs. 100B-1T parameter alternatives
-- Achieve specialized task performance without the overhead of general-purpose LLMs
-- Reduce inference costs and latency by orders of magnitude
-This is not about avoiding large models altogether - they're invaluable for complex reasoning tasks. But for targeted classification problems with labeled data, fine-tuning remains the professional, responsible choice.
-### 🏋🏽‍♀️ Training Your Own Model
-You can train your own version using the [published package](https://pypi.org/project/water-conflict-classifier/).
-**Package includes:**
-- Data preprocessing utilities
-- Training logic (SetFit multi-label)
-- Evaluation metrics
-- Model card generation
-**Source code:** https://github.com/baobabtech/waterconflict/tree/main/classifier
-**PyPI:** https://pypi.org/project/water-conflict-classifier/
-```bash
-# Install package
-pip install water-conflict-classifier
-# Or install from source for development
-git clone https://github.com/baobabtech/waterconflict.git
-cd waterconflict/classifier
-pip install -e .
-# Train locally
-python train_setfit_headline_classifier.py
 ```
-For cloud training on HuggingFace Jobs infrastructure, see the scripts folder in the repository.
-## 📜 License
-Copyright © 2025 Baobab Tech
-This work is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](http://creativecommons.org/licenses/by-nc/4.0/).
-**You are free to:**
-- **Share** — copy and redistribute the material in any medium or format
-- **Adapt** — remix, transform, and build upon the material
-**Under the following terms:**
-- **Attribution** — You must give appropriate credit to Baobab Tech, provide a link to the license, and indicate if changes were made
-- **NonCommercial** — You may not use the material for commercial purposes
-## 📝 Citation
-If you use this model in your work, please cite:
-```bibtex
-@misc{{waterconflict2025,
-  title={{Water Conflict Multi-Label Classifier}},
-  author={{Independent Experimental Research Drawing on Pacific Institute Water Conflict Chronology}},
-  year={{2025}},
-  howpublished={{\url{{https://huggingface.co/{model_repo}}}}},
-  note={{Training data from Pacific Institute Water Conflict Chronology and ACLED}}
-}}
-```
-Please also cite the Pacific Institute's Water Conflict Chronology:
-```bibtex
-@misc{{pacificinstitute2025,
-  title={{Water Conflict Chronology}},
-  author={{Pacific Institute}},
-  year={{2025}},
-  address={{Oakland, CA}},
-  url={{https://www.worldwater.org/water-conflict/}},
-  note={{Accessed: [access date]}}
-}}
-```

 ---
+language: en
 license: cc-by-nc-4.0
 tags:
 - setfit
 - sentence-transformers
 - text-classification
+- generated_from_setfit_trainer
+widget:
+- text: Israeli forces destroy water pump in Nablus, West Bank, cutting water supply
+    to over 20,000 Palestinians in multiple villages
+- text: Chinese man killed for speaking out against displacement of communities by
+    the Three Gorges Dam
+- text: Protests over water cuts turn violent in Tunisia
+- text: National leader Dilma Ferreira Silva, working for policy reform to support
+    people affected by dams, is murdered in Brazil
+- text: Water reservoir sustains minor damages from bombing in Colombia
 metrics:
 - accuracy
+pipeline_tag: text-classification
+library_name: setfit
+inference: false
+base_model: BAAI/bge-small-en-v1.5
 ---
+# SetFit with BAAI/bge-small-en-v1.5
+This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) as the Sentence Transformer embedding model. A OneVsRestClassifier instance is used for classification.
+The model has been trained using an efficient few-shot learning technique that involves:
+1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+2. Training a classification head with features from the fine-tuned Sentence Transformer.
+## Model Details
+### Model Description
+- **Model Type:** SetFit
+- **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
+- **Classification head:** a OneVsRestClassifier instance
+- **Maximum Sequence Length:** 512 tokens
+- **Number of Classes:** 3 classes
+<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
+- **Language:** en
+- **License:** cc-by-nc-4.0
+### Model Sources
+- **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
+- **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
+- **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
+## Uses
+### Direct Use for Inference
+First install the SetFit library:
+```bash
+pip install setfit
+```
+Then you can load this model and run inference.
 ```python
 from setfit import SetFitModel
+# Download from the 🤗 Hub
 model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")
+# Run inference
+preds = model("Protests over water cuts turn violent in Tunisia")
 ```
+<!--
+### Downstream Use
+*List how someone could finetune this model on their own dataset.*
+-->
+<!--
+### Out-of-Scope Use
+*List how the model may foreseeably be misused and address what users ought not to do with the model.*
+-->
+<!--
+## Bias, Risks and Limitations
+*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+-->
+<!--
+### Recommendations
+*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+-->
+## Training Details
+### Training Set Metrics
+| Training set | Min | Median  | Max |
+|:-------------|:----|:--------|:----|
+| Word count   | 3   | 16.3692 | 154 |
+### Training Hyperparameters
+- batch_size: (64, 64)
+- num_epochs: (1, 1)
+- max_steps: -1
+- sampling_strategy: undersampling
+- num_iterations: 20
+- body_learning_rate: (2e-05, 2e-05)
+- head_learning_rate: 0.01
+- loss: CosineSimilarityLoss
+- distance_metric: cosine_distance
+- margin: 0.25
+- end_to_end: False
+- use_amp: False
+- warmup_proportion: 0.1
+- l2_weight: 0.01
+- seed: 42
+- eval_max_steps: -1
+- load_best_model_at_end: True
+### Training Results
+| Epoch  | Step | Training Loss | Validation Loss |
+|:------:|:----:|:-------------:|:---------------:|
+| 0.0013 | 1    | 0.2353        | -               |
+| 0.0667 | 50   | 0.2291        | -               |
+| 0.1333 | 100  | 0.1807        | -               |
+| 0.2    | 150  | 0.1317        | -               |
+| 0.2667 | 200  | 0.1064        | -               |
+| 0.3333 | 250  | 0.0919        | -               |
+| 0.4    | 300  | 0.0808        | -               |
+| 0.4667 | 350  | 0.0745        | -               |
+| 0.5333 | 400  | 0.0665        | -               |
+| 0.6    | 450  | 0.0622        | -               |
+| 0.6667 | 500  | 0.0578        | -               |
+| 0.7333 | 550  | 0.0546        | -               |
+| 0.8    | 600  | 0.0523        | -               |
+| 0.8667 | 650  | 0.053         | -               |
+| 0.9333 | 700  | 0.0492        | -               |
+| 1.0    | 750  | 0.0505        | 0.0997          |
+### Framework Versions
+- Python: 3.12.12
+- SetFit: 1.1.3
+- Sentence Transformers: 5.1.2
+- Transformers: 4.57.3
+- PyTorch: 2.9.1+cu128
+- Datasets: 4.4.1
+- Tokenizers: 0.22.1
+## Citation
+### BibTeX
+```bibtex
+@article{https://doi.org/10.48550/arxiv.2209.11055,
+    doi = {10.48550/ARXIV.2209.11055},
+    url = {https://arxiv.org/abs/2209.11055},
+    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+    title = {Efficient Few-Shot Learning Without Prompts},
+    publisher = {arXiv},
+    year = {2022},
+    copyright = {Creative Commons Attribution 4.0 International}
+}
 ```
+<!--
+## Glossary
+*Clearly define terms in order to be accessible across audiences.*
+-->
+<!--
+## Model Card Authors
+*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+-->
+<!--
+## Model Card Contact
+*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+-->
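
The step counts in the Training Results table can be cross-checked against the logged hyperparameters. The sketch below is a back-of-the-envelope estimate, not SetFit's own code: it assumes the contrastive phase builds `num_iterations` positive and `num_iterations` negative pairs per training example, and that the 1,200-example training set named in the removed card still applies to this run. The function name `contrastive_steps` is hypothetical.

```python
import math


def contrastive_steps(num_samples: int, num_iterations: int, batch_size: int) -> int:
    """Estimate optimizer steps per epoch for the contrastive fine-tuning phase.

    Assumption: every training example contributes `num_iterations` positive
    pairs and `num_iterations` negative pairs (2 * num_iterations in total).
    """
    num_pairs = num_samples * num_iterations * 2
    return math.ceil(num_pairs / batch_size)


# Values taken from the diff: num_iterations=20, batch_size=64.
# num_samples=1200 comes from the removed card and is an assumption here.
steps = contrastive_steps(num_samples=1200, num_iterations=20, batch_size=64)
print(steps)  # 750

# The first logged row (step 1) should then sit at roughly 1/750 of an epoch.
print(round(1 / steps, 4))  # 0.0013
```

Under these assumptions the estimate agrees with the table, where step 750 closes epoch 1.0 and step 1 is logged at epoch 0.0013.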

config_setfit.json CHANGED

@@ -1,8 +1,8 @@
 {
-  "normalize_embeddings": false,
   "labels": [
     "Trigger",
     "Casualty",
     "Weapon"
-  ]
 }

 {
   "labels": [
     "Trigger",
     "Casualty",
     "Weapon"
+  ],
+  "normalize_embeddings": false
 }
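
The `labels` order in this config is what gives a multi-label prediction its meaning. As a minimal sketch, assuming the model returns one 0/1 flag per label in config order (the removed card described the format as `[Trigger, Casualty, Weapon]`), a prediction row can be decoded as follows. The helper `decode_prediction` is hypothetical and is not part of SetFit.

```python
import json

# Contents of config_setfit.json as shown in the diff above.
CONFIG_JSON = """
{
  "labels": ["Trigger", "Casualty", "Weapon"],
  "normalize_embeddings": false
}
"""


def decode_prediction(pred_row, labels):
    """Return the label names whose flag is set in a multi-hot prediction row."""
    if len(pred_row) != len(labels):
        raise ValueError("prediction length does not match number of labels")
    return [name for name, flag in zip(labels, pred_row) if int(flag) == 1]


labels = json.loads(CONFIG_JSON)["labels"]

# Example rows in the format shown in the removed card's Quick Start section.
print(decode_prediction([1, 1, 0], labels))  # ['Trigger', 'Casualty']
print(decode_prediction([0, 0, 0], labels))  # []
```

Reading the label names from the config rather than hard-coding them keeps the decoding correct if the label order ever changes.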

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9f675d6abd580aff11b20c655971d1f1d956f72d47aaf98928c1face94e95b56
 size 133462128

 version https://git-lfs.github.com/spec/v1
+oid sha256:e2e8e508225135db2b8aa14e148509f45d49310d4ea4357573c79b0ec6ade4d2
 size 133462128

model_head.pkl CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0d98c8e5416b64be61f0c4623cef9e74e93eaf99a30f7772d5ff92e699bb8667
 size 11236

 version https://git-lfs.github.com/spec/v1
+oid sha256:dccc0e876439de20b04d9efb2f76f1441a1b548b5edc8a61d6d0174ca20aafb1
 size 11236
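
The two binary files above are stored as Git LFS pointers, so only the `oid` changes while `size` stays the same. The sketch below shows one way to check a downloaded file against such a pointer using only the standard library. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the payload in the demo is a stand-in, not the real `model_head.pkl`.

```python
import hashlib

# New-side pointer for model_head.pkl, copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:dccc0e876439de20b04d9efb2f76f1441a1b548b5edc8a61d6d0174ca20aafb1
size 11236
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that raw file bytes have the size and SHA-256 digest the pointer records."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])  # sha256 11236

# Stand-in payload: build a matching pointer for it, then verify.
payload = b"stand-in bytes, not the real model head"
demo_pointer = {
    "version": pointer["version"],
    "algo": "sha256",
    "digest": hashlib.sha256(payload).hexdigest(),
    "size": len(payload),
}
print(matches_pointer(payload, demo_pointer))  # True
print(matches_pointer(payload, pointer))       # False
```

To verify a real download, read the file in binary mode and pass its bytes to `matches_pointer` together with the parsed pointer.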