---
license: mit
tags:
- multilabel-classification
- multilingual
- twitter
- violence-prediction
datasets:
- m2im/multilingual-twitter-collective-violence-dataset
language:
- multilingual
---

# Model Card for m2im/labse_finetuned_twitter

This model is a fine-tuned version of LaBSE (Language-agnostic BERT Sentence Embedding), adapted to detect collective violence signals in multilingual Twitter discourse. It was developed as part of a research project on early-warning systems for conflict prediction.

## Model Details

### Model Description

- **Developed by:** Dr. Milton Mendieta and Dr. Timothy Warren
- **Funded by:** Coalition for Open-Source Defense Analysis (CODA) Lab, Department of Defense Analysis, Naval Postgraduate School (NPS)
- **Shared by:** Dr. Milton Mendieta and Dr. Timothy Warren
- **Model type:** Transformer-based sentence encoder fine-tuned for multilabel classification
- **Language(s):** Originally pre-trained on 109 languages (LaBSE), then fine-tuned on 68 languages from X (formerly Twitter, 2014 onward), including tweets tagged with the undetermined (`und`) language code
- **License:** MIT
- **Finetuned from model:** [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE)

### Model Sources

- **Repository:** [https://github.com/m2im/violence_prediction](https://github.com/m2im/violence_prediction)
- **Paper:** TBD

## Uses

### Direct Use

This model is intended to classify tweets in multiple languages into predefined categories related to proximity to collective violence events.

### Downstream Use

The model may be embedded into conflict early-warning systems, government monitoring platforms, or research pipelines analyzing social unrest.

### Out-of-Scope Use

- General-purpose sentiment analysis
- Legal, health, or financial decision-making
- Use in low-resource languages not covered by the training data

## Bias, Risks, and Limitations

- **Geographic bias**: The model was primarily trained on short-duration violent events around the world, which limits its applicability to long-running conflicts (e.g., Russia-Ukraine) or high-noise environments (e.g., Washington, D.C.).
- **Temporal bias**: Performance degrades in pre-violence scenarios, especially at larger spatial scales (50 km), where signals are weaker and often masked by noise.
- **Sample size sensitivity**: The model underperforms when fewer than 5,000 observations are available per label, reducing reliability in low-data settings.
- **Spatial ambiguity**: Frequent misclassification between `pre7geo50` and `post7geo50` labels highlights the model's difficulty distinguishing temporal contexts at broader spatial radii.
- **Language coverage limitations**: While fine-tuned on 67 languages, performance may vary for underrepresented or informal language variants.

## Recommendations

- **Use with short-term events**: For best results, apply the model to short-term events with geographically concentrated discourse, matching the training data distribution.
- **Avoid low-sample inference**: Do not deploy the model in scenarios where fewer than 5,000 labeled observations are available per class.
- **Limit reliance on large-radius labels**: Interpret predictions at 50 km radii with caution; they tend to capture noisy or irrelevant information.
- **Contextual validation**: Evaluate model performance on local data before broader deployment, especially in unfamiliar regions or languages.
- **Consider post-processing**: Incorporate ensemble methods or threshold adjustments to improve label differentiation in ambiguous cases.
- **Prefer batch predictions**: Avoid relying on predictions for isolated tweets; predictions aggregated over batches of tweets are more reliable.

## How to Get Started with the Model

```python
from transformers import pipeline
import html
import re

def clean_tweet(example):
    """Normalize a tweet before classification: strip newlines, HTML
    entities, @mentions, URLs, and the leading retweet marker."""
    tweet = example['text']
    tweet = tweet.replace("\n", " ")
    tweet = html.unescape(tweet)
    tweet = re.sub(r"@[A-Za-z0-9_:]+", "", tweet)  # @mentions
    tweet = re.sub(r"http\S+", "", tweet)          # URLs
    tweet = re.sub(r"^RT ", "", tweet)             # retweet marker
    return {'text': tweet.strip()}

pipe = pipeline(
    "text-classification",
    model="m2im/labse_finetuned_twitter",
    tokenizer="m2im/labse_finetuned_twitter",
    top_k=None,  # return a score for every label
)

# "Protest in Quito over economic measures."
example = {"text": "Protesta en Quito por medidas económicas."}
cleaned = clean_tweet(example)
print(pipe(cleaned["text"]))
```
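
With `top_k=None`, the pipeline returns one score per label, so turning the output into multilabel predictions means thresholding each score independently. A minimal sketch (the helper name `decode_multilabel` and the 0.5 threshold are illustrative, not part of the model repo):

```python
def decode_multilabel(scores, threshold=0.5):
    """Keep every label whose score clears the threshold."""
    return sorted(d["label"] for d in scores if d["score"] >= threshold)

# Dummy pipeline-style output for one tweet (scores are made up):
scores = [
    {"label": "post7geo10", "score": 0.81},
    {"label": "post7geo30", "score": 0.64},
    {"label": "post7geo50", "score": 0.47},
    {"label": "pre7geo10", "score": 0.12},
    {"label": "pre7geo30", "score": 0.09},
    {"label": "pre7geo50", "score": 0.22},
]
print(decode_multilabel(scores))  # ['post7geo10', 'post7geo30']
```

Per-label thresholds (rather than a single global one) are a natural extension, in line with the threshold-adjustment recommendation above.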

## Training Details

### Training Data

- Dataset: [m2im/multilingual-twitter-collective-violence-dataset](https://huggingface.co/datasets/m2im/multilingual-twitter-collective-violence-dataset)
- Labels: the 6 most informative of the 40 available:
  - `pre7geo10`, `pre7geo30`, `pre7geo50`
  - `post7geo10`, `post7geo30`, `post7geo50`

### Training Procedure

- Text preprocessing using tweet normalization (removal of mentions, URLs, etc.)
- Tokenization with the LaBSE tokenizer
- Multi-label head using `BCEWithLogitsLoss`
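
In Transformers, the `BCEWithLogitsLoss` head comes from setting `problem_type="multi_label_classification"` on the model config. A minimal sketch with a deliberately tiny BERT config (the actual model starts from the full `setu4993/LaBSE` checkpoint, not this toy):

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny illustrative config; only num_labels and problem_type mirror the
# real setup with the setu4993/LaBSE checkpoint.
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
    num_labels=6,                               # six spatio-temporal labels
    problem_type="multi_label_classification",  # selects BCEWithLogitsLoss
)
model = BertForSequenceClassification(config)

input_ids = torch.randint(0, 100, (2, 8))          # two dummy "tweets"
labels = torch.tensor([[1., 0., 0., 1., 0., 0.],   # multi-hot label vectors
                       [0., 1., 0., 0., 1., 0.]])
out = model(input_ids=input_ids, labels=labels)
print(out.logits.shape)  # torch.Size([2, 6])
```

Passing float multi-hot `labels` makes the forward call return the BCE loss directly, which is what end-to-end fine-tuning optimizes.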

#### Training Hyperparameters

- Model checkpoint: `setu4993/LaBSE`
- Head class: `AutoModelForSequenceClassification`
- Optimizer: AdamW
- Batch size (train/validation): 1024
- Epochs: 20
- Learning rate: 5e-5
- Learning rate scheduler: cosine
- Weight decay: 0.1
- Max sequence length: 32
- Precision: mixed fp16
- Random seed: 42
- Saving strategy: save the best model only when the ROC-AUC score improves on the validation set

## Evaluation

### Testing Data, Factors & Metrics

- **Dataset**: Held-out portion of the multilingual Twitter collective violence dataset, including over 275,000 tweets labeled across six spatio-temporal categories (`pre7geo10`, `pre7geo30`, `pre7geo50`, `post7geo10`, `post7geo30`, `post7geo50`).
- **Metrics**:
  - **ROC-AUC** (Receiver Operating Characteristic, Area Under the Curve): Evaluates the model's ability to distinguish between classes across all thresholds.
  - **Macro F1**: Harmonic mean of precision and recall, averaged equally across all classes.
  - **Micro F1**: Harmonic mean of precision and recall, aggregated globally across all predictions.
  - **Precision** and **Recall**: Standard classification metrics to assess false positive and false negative trade-offs.
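
These metrics can be computed with scikit-learn. A minimal sketch on made-up data (three labels instead of six, tiny arrays; the numbers below are not the model's scores): ROC-AUC uses the raw probabilities, while the F1 variants need binarized predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

# Dummy multilabel ground truth (rows = tweets, columns = labels) and
# model scores in [0, 1].
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.7],
                    [0.1, 0.8, 0.3],
                    [0.25, 0.7, 0.2],
                    [0.3, 0.1, 0.8]])
y_pred = (y_score >= 0.5).astype(int)  # binarize for the F1 metrics

print("ROC-AUC (macro):", roc_auc_score(y_true, y_score, average="macro"))
print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
print("F1 (micro):", f1_score(y_true, y_pred, average="micro"))
```

Macro averaging weights each label equally, which matters here because the six spatio-temporal labels need not be balanced.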

### Results

- Classical ML models (Random Forest, SVM, Bagging, Boosting, and Decision Trees) were trained on LaBSE-generated sentence embeddings. The best-performing classical model, Random Forest, achieved a **macro F1 score of approximately 0.61**, indicating that embeddings alone provide meaningful but limited discrimination for the multilabel classification task.
- In contrast, the **fine-tuned LaBSE model**, trained end-to-end with a classification head, outperformed all classical baselines, achieving a **ROC-AUC score of 0.7238** on the validation set and **0.6988** on the test set.
- These results demonstrate the value of supervised fine-tuning over frozen embeddings with classical classifiers, particularly in tasks involving subtle multilingual and spatio-temporal signal detection.

## Model Examination

- Embedding analysis used a two-stage dimensionality reduction: Principal Component Analysis (PCA) reduced the 768-dimensional LaBSE sentence embeddings to 50 dimensions, followed by Uniform Manifold Approximation and Projection (UMAP) down to 2 dimensions for visualization.
- The resulting 2D projections revealed coherent clustering of sentence embeddings by label, particularly in post-violence scenarios and at smaller spatial scales (10 km), indicating that the model captures latent structure related to spatio-temporal patterns of collective violence.
- Examination of classification performance across labels further confirmed that the model is most reliable when predicting post-violence instances near the epicenter of an event, while its ability to detect pre-violence signals, especially at broader spatial radii (50 km), is weaker and more prone to noise.
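
The two-stage reduction can be sketched as follows, using synthetic stand-in embeddings (the real inputs are LaBSE sentence embeddings). The UMAP stage is left as a comment since it needs the third-party `umap-learn` package:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real LaBSE sentence embeddings: 100 tweets x 768 dims.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 768))

# Stage 1: PCA down to 50 dimensions.
reduced = PCA(n_components=50, random_state=42).fit_transform(embeddings)
print(reduced.shape)  # (100, 50)

# Stage 2 (requires umap-learn, not run here): project to 2-D for plotting.
# import umap
# coords = umap.UMAP(n_components=2, random_state=42).fit_transform(reduced)
```

Running PCA first is a common way to denoise and speed up UMAP on high-dimensional embeddings.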

## Environmental Impact

- **Hardware Type:** 16 NVIDIA Tesla V100 GPUs
- **Hours used:** ~10 hours
- **Cloud Provider:** University research computing cluster
- **Compute Region:** North America
- **Carbon Emitted:** Not formally calculated

## Technical Specifications

### Model Architecture and Objective

- Transformer encoder (BERT-based)
- Objective: Multilabel binary classification with sentence embeddings

### Compute Infrastructure

- **Hardware:** One server with 16 × NVIDIA V100 GPUs and one server with 3 TB of RAM, both available at the CODA Lab.
- **Software:** PyTorch 2.0, Hugging Face Transformers 4.x, KV-Swarm (an in-memory database also hosted at the CODA Lab), and Weights & Biases for experiment tracking and model management

## Citation

**BibTeX:**

```bibtex
@misc{mendieta2025labseviolence,
  author       = {Mendieta, Milton and Warren, Timothy},
  title        = {Fine-Tuning Multilingual Language Models to Predict Collective Violence Using Twitter Data},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/m2im/labse_finetuned_twitter}},
  note         = {Research on multilingual NLP and conflict prediction}
}
```

**APA:**

Mendieta, M., & Warren, T. (2025). *Fine-tuning multilingual language models to predict collective violence using Twitter data* [Model]. Hugging Face. https://huggingface.co/m2im/labse_finetuned_twitter

## Model Card Authors

Dr. Milton Mendieta and Dr. Timothy Warren

## Model Card Contact

mvmendie@espol.edu.ec