	---
	license: cc-by-nc-4.0
	library_name: setfit
	tags:
	- setfit
	- sentence-transformers
	- text-classification
	- multi-label
	- water-conflict
	metrics:
	- f1
	- accuracy
	language:
	- en
	widget:
	- text: "Military attack workers at the Kajaki Dam in Afghanistan"
	- text: "Violent protests erupt over dam construction in Sudan"
	- text: "New water treatment plant opens in California"
	- text: "Armed groups cut off water supply to villages in Syria"
	- text: "Government announces new irrigation subsidies"
	---

	# Water Conflict Multi-Label Classifier

	## 🔬 Experimental Research

	> This experimental research draws on Pacific Institute's [Water Conflict Chronology](https://www.worldwater.org/water-conflict/), which tracks water-related conflicts spanning over 4,500 years of human history. The work is conducted independently and is not affiliated with Pacific Institute.

	This model is designed to help researchers classify water-related conflict events at scale, using small models that can process hundreds of headlines per second.

	The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history. This is not a commercial product and is not intended for commercial use.

	## 📋 Model Description

	This SetFit-based model classifies news headlines about water-related conflicts into three categories:

	- Trigger: Water resource as a conflict trigger
	- Casualty: Water infrastructure as a casualty/target
	- Weapon: Water used as a weapon/tool

	These categories align with the Pacific Institute's Water Conflict Chronology framework for understanding how water intersects with security and conflict.

	## 🏗️ Model Details

	- Base Model: [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
	- Architecture: SetFit with One-vs-Rest multi-label strategy
	- Training Approach: Few-shot contrastive fine-tuning (SetFit is designed to perform well with small labeled datasets)
	- Training samples: 1200 examples
	- Test samples: 519 (held-out, never seen during training)
	- Training time: ~2-5 minutes on A10G GPU
	- Model size: 33M Parameters, ~133MB
	- Inference speed: ~5-10ms per headline on CPU
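
As a rough sanity check on the size and speed figures above, the arithmetic below relates parameter count to file size and per-headline latency to throughput. It assumes 32-bit (4-byte) weights and one-at-a-time inference on a single CPU thread; neither assumption is stated in this card.

```python
# Back-of-the-envelope check of the size and speed figures listed above.
# Assumption: weights are stored as float32 (4 bytes per parameter).
params = 33_000_000
bytes_per_param = 4
size_mb = params * bytes_per_param / 1_000_000
print(f"Approximate weight size: {size_mb:.0f} MB")

# Assumption: headlines are classified one at a time on a single CPU thread.
throughput = {}
for latency_ms in (5, 10):
    throughput[latency_ms] = 1000 / latency_ms
    print(f"{latency_ms} ms per headline -> {throughput[latency_ms]:.0f} headlines/s")
```

Under these assumptions the weights come to roughly 132 MB, and the latency range corresponds to about 100 to 200 headlines per second before any batching.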

	## 💻 Usage

	### Quick Start

	```python
	from setfit import SetFitModel

	# Load the trained model from HF Hub
	model = SetFitModel.from_pretrained("baobabtech/water-conflict-classifier")

	# Predict on headlines
	headlines = [
	    "Military attack workers at the Kajaki Dam in Afghanistan",
	    "New water treatment plant opens in California",
	]

	predictions = model.predict(headlines)
	print(predictions)
	# Output: [[1, 1, 0], [0, 0, 0]]
	# Format: [Trigger, Casualty, Weapon]
	```

	### Interpreting Results

	For each headline, the model returns one binary value per label, in the order Trigger, Casualty, Weapon:

	```python
	label_names = ['Trigger', 'Casualty', 'Weapon']

	for headline, pred in zip(headlines, predictions):
	    # Keep the names of the labels whose prediction is 1
	    labels = [label_names[i] for i, val in enumerate(pred) if val == 1]
	    print(f"Headline: {headline}")
	    print(f"Labels: {', '.join(labels) if labels else 'None'}")
	    print()
	```

	### Batch Processing

	```python
	import pandas as pd

	# Load your data
	df = pd.read_csv("your_headlines.csv")

	# Predict in batches
	predictions = model.predict(df['headline'].tolist())

	# Add predictions to dataframe
	df['trigger'] = [p[0] for p in predictions]
	df['casualty'] = [p[1] for p in predictions]
	df['weapon'] = [p[2] for p in predictions]
	```
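
For very large files, one option is to feed the model fixed-size chunks so that memory use stays bounded and partial results can be saved along the way. The sketch below is illustrative only: `iter_chunks`, `predict_in_chunks`, and `fake_predict` are hypothetical helpers, not part of SetFit or this package, and the stub stands in for `model.predict` so the example runs without downloading the model.

```python
from typing import Callable, Iterator, List, Sequence


def iter_chunks(items: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `chunk_size` items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def predict_in_chunks(
    predict_fn: Callable[[List[str]], Sequence[Sequence[int]]],
    texts: Sequence[str],
    chunk_size: int = 256,
) -> List[List[int]]:
    """Call `predict_fn` on one chunk at a time and collect plain int lists."""
    results: List[List[int]] = []
    for chunk in iter_chunks(texts, chunk_size):
        for row in predict_fn(list(chunk)):
            results.append([int(v) for v in row])
    return results


# Hypothetical stand-in for model.predict: flags "Trigger" when "dam" appears.
def fake_predict(batch: List[str]) -> List[List[int]]:
    return [[int("dam" in text.lower()), 0, 0] for text in batch]


demo = predict_in_chunks(
    fake_predict,
    ["Dam attacked", "Plant opens", "Dam protest"],
    chunk_size=2,
)
print(demo)  # [[1, 0, 0], [0, 0, 0], [1, 0, 0]]
```

In real use, `model.predict` would be passed in place of `fake_predict`.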

	### Example Outputs

	| Headline | Trigger | Casualty | Weapon |
	|----------|---------|----------|--------|
	| "Armed groups blow up water pipeline in Iraq" | ✓ | ✓ | ✓ |
	| "New water treatment plant opens in California" | ✗ | ✗ | ✗ |
	| "Protests erupt over dam construction in Ethiopia" | ✓ | ✗ | ✗ |

	## 📈 Evaluation Results

	Evaluated on a held-out test set of 519 samples (30% of total data, stratified by label combinations).
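
Stratifying by label combination means each distinct label vector (for example `[1, 1, 0]`) is split separately, so rare combinations appear in both training and test sets in similar proportions. The following is a minimal sketch of that idea, not the project's actual split code: `stratified_split`, the seed, and the toy labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_indices, test_indices), stratified by full label combination."""
    # Group example indices by their complete label vector.
    groups = defaultdict(list)
    for idx, row in enumerate(labels):
        groups[tuple(row)].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for combo in sorted(groups):
        members = groups[combo]
        rng.shuffle(members)
        # Take the same fraction from every label combination.
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data: 10 Trigger-only, 10 Casualty-only, 20 negative examples.
toy_labels = [[1, 0, 0]] * 10 + [[0, 1, 0]] * 10 + [[0, 0, 0]] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))  # 28 12
```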

	### Overall Performance

	| Metric | Score |
	|--------|-------|
	| Exact Match Accuracy | 0.8015 |
	| Hamming Loss | 0.0906 |
	| F1 (micro) | 0.8530 |
	| F1 (macro) | 0.7987 |
	| F1 (samples) | 0.7028 |
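
To make the multi-label metrics above concrete, here is a small sketch that computes three of them on toy data. It uses the standard definitions (exact match counts a headline as correct only if all three labels match; Hamming loss is the fraction of individual label slots that are wrong; micro F1 pools true and false positives across labels). The toy labels are invented for illustration and are not taken from the test set.

```python
def exact_match(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all labels and samples."""
    pairs = [(ti, pi) for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p)]
    tp = sum(ti == 1 and pi == 1 for ti, pi in pairs)
    fp = sum(ti == 0 and pi == 1 for ti, pi in pairs)
    fn = sum(ti == 1 and pi == 0 for ti, pi in pairs)
    return 2 * tp / (2 * tp + fp + fn)


# Toy example, columns are [Trigger, Casualty, Weapon].
y_true = [[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 1]]
y_pred = [[1, 1, 0], [0, 0, 0], [1, 0, 1], [0, 1, 0]]

print(f"Exact match:  {exact_match(y_true, y_pred):.4f}")   # 0.5000
print(f"Hamming loss: {hamming_loss(y_true, y_pred):.4f}")  # 0.1667
print(f"F1 (micro):   {micro_f1(y_true, y_pred):.4f}")      # 0.8000
```

Two of the four toy headlines have a single wrong label each, which is why exact match drops to 0.5 while Hamming loss stays low.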

	### Per-Label Performance

	| Label | Precision | Recall | F1 | Support |
	|-------|-----------|--------|-----|---------|
	| Trigger | 0.8844 | 0.8793 | 0.8818 | 174 |
	| Casualty | 0.8843 | 0.9185 | 0.9011 | 233 |
	| Weapon | 0.4941 | 0.8077 | 0.6131 | 52 |
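
The per-label F1 values follow from precision and recall via F1 = 2PR / (P + R), and macro F1 is their unweighted mean. The short check below recomputes both from the rounded figures in the table, so small differences in the last decimal place would be expected.

```python
# (precision, recall, reported F1) copied from the per-label table above.
reported = {
    "Trigger": (0.8844, 0.8793, 0.8818),
    "Casualty": (0.8843, 0.9185, 0.9011),
    "Weapon": (0.4941, 0.8077, 0.6131),
}
reported_macro_f1 = 0.7987  # from the overall performance table


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {}
for label, (precision, recall, f1_reported) in reported.items():
    recomputed[label] = f1_score(precision, recall)
    print(f"{label}: recomputed F1 = {recomputed[label]:.4f} (reported {f1_reported})")

# Macro F1 is the unweighted mean over the three labels.
macro_f1 = sum(recomputed.values()) / len(recomputed)
print(f"Macro F1 = {macro_f1:.4f} (reported {reported_macro_f1})")
```

The Weapon label pulls the macro average down: its recall is high, but a precision of 0.4941 means roughly half of its predicted positives are false positives, so Weapon predictions may warrant manual review.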

	### Training Details

	- Training samples: 1200 examples
	- Test samples: 519 examples (held-out before sampling)
	- Base model: BAAI/bge-small-en-v1.5 (33M params)
	- Batch size: 64
	- Epochs: 1
	- Iterations: 20 (contrastive pair generation)
	- Sampling strategy: undersampling (balances positive/negative pairs)
	- Training Dataset: [baobabtech/water-conflict-training-data](https://huggingface.co/datasets/baobabtech/water-conflict-training-data) (version: d2.0)


	### 📈 Experiment Tracking

	All training runs are automatically tracked in a public dataset for experiment comparison:

	- Evals Dataset: [baobabtech/water-conflict-classifier-evals](https://huggingface.co/datasets/baobabtech/water-conflict-classifier-evals)
	- Tracked Metrics: F1 scores, accuracy, per-label performance, and all hyperparameters
	- Compare Experiments: View how different configurations (sample size, epochs, batch size) affect performance
	- Reproducibility: Full training configs logged for each version

	You can explore past experiments and compare model performance across versions using the evals dataset.


	## 📊 Data Sources

	### Positive Examples (Water Conflict Headlines)
	Pacific Institute (2025). Water Conflict Chronology. Pacific Institute, Oakland, CA.
	https://www.worldwater.org/water-conflict/

	### Negative Examples (Non-Water Conflict Headlines)
	Armed Conflict Location & Event Data Project (ACLED).
	https://acleddata.com/

	Note: Training negatives include synthetic "hard negatives" - peaceful water-related news (e.g., "New desalination plant opens", "Water conservation conference") to prevent false positives on non-conflict water topics.

	## 🌍 About This Project

	This model is part of independent experimental research drawing on the Pacific Institute's Water Conflict Chronology. The Pacific Institute maintains the world's most comprehensive open-source record of water-related conflicts, documenting over 2,700 events across 4,500 years of history.

	Project Links:
	- Pacific Institute Water Conflict Chronology: https://www.worldwater.org/water-conflict/
	- Python Package (PyPI): https://pypi.org/project/water-conflict-classifier/
	- Source Code: https://github.com/baobabtech/waterconflict
	- Model Hub: https://huggingface.co/baobabtech/water-conflict-classifier


	## 🌱 Frugal AI: Training with Limited Data

	This classifier demonstrates an intentional approach to building AI systems with limited data using [SetFit](https://huggingface.co/docs/setfit/en/index) - a framework for few-shot learning with sentence transformers. Rather than defaulting to massive language models (GPT, Claude, or 100B+ parameter models) for simple classification tasks, we fine-tune small, efficient models (e.g., BAAI/bge-small-en-v1.5 with ~33M parameters) on a focused dataset.

	Why this matters: The industry has normalized using trillion-parameter models to classify headlines, answer simple questions, or categorize text - tasks that don't require world knowledge, reasoning, or generative capabilities. This is computationally wasteful and environmentally costly. A properly fine-tuned small model can achieve comparable or better accuracy while using a fraction of the compute resources.

	Our approach:
	- Train on ~1,200 examples (few-shot learning with SetFit)
	- Deploy small parameter models (e.g., ~33M params) vs. 100B-1T parameter alternatives
	- Achieve specialized task performance without the overhead of general-purpose LLMs
	- Reduce inference costs and latency by orders of magnitude

	This is not about avoiding large models altogether - they're invaluable for complex reasoning tasks. But for targeted classification problems with labeled data, fine-tuning remains the professional, responsible choice.


	### 🏋🏽‍♀️ Training Your Own Model

	You can train your own version using the [published package](https://pypi.org/project/water-conflict-classifier/).

	Package includes:
	- Data preprocessing utilities
	- Training logic (SetFit multi-label)
	- Evaluation metrics
	- Model card generation

	- Source code: https://github.com/baobabtech/waterconflict/tree/main/classifier
	- PyPI: https://pypi.org/project/water-conflict-classifier/

	```bash
	# Install package
	pip install water-conflict-classifier

	# Or install from source for development
	git clone https://github.com/baobabtech/waterconflict.git
	cd waterconflict/classifier
	pip install -e .

	# Train locally
	python train_setfit_headline_classifier.py
	```

	For cloud training on HuggingFace Jobs infrastructure, see the scripts folder in the repository.

	## 📜 License

	Copyright © 2025 Baobab Tech

	This work is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](http://creativecommons.org/licenses/by-nc/4.0/).

	You are free to:
	- Share — copy and redistribute the material in any medium or format
	- Adapt — remix, transform, and build upon the material

	Under the following terms:
	- Attribution — You must give appropriate credit to Baobab Tech, provide a link to the license, and indicate if changes were made
	- NonCommercial — You may not use the material for commercial purposes


	## 📝 Citation

	If you use this model in your work, please cite:

	```bibtex
	@misc{waterconflict2025,
	  title={Water Conflict Multi-Label Classifier},
	  author={{Independent Experimental Research Drawing on Pacific Institute Water Conflict Chronology}},
	  year={2025},
	  howpublished={\url{https://huggingface.co/baobabtech/water-conflict-classifier}},
	  note={Training data from Pacific Institute Water Conflict Chronology and ACLED}
	}
	```

	Please also cite the Pacific Institute's Water Conflict Chronology:

	```bibtex
	@misc{pacificinstitute2025,
	  title={Water Conflict Chronology},
	  author={{Pacific Institute}},
	  year={2025},
	  address={Oakland, CA},
	  url={https://www.worldwater.org/water-conflict/},
	  note={Accessed: [access date]}
	}
	```