+---
+license: mit
+tags:
+- multilabel-classification
+- multilingual
+- twitter
+- violence-prediction
+datasets:
+- m2im/multilingual-twitter-collective-violence-dataset
+language:
+- multilingual
+---
+# Model Card for m2im/smaller_labse_finetuned_twitter
+This model is a fine-tuned version of smaller-LaBSE (a distilled variant of LaBSE), specifically adapted to detect collective violence signals in multilingual Twitter discourse. It was developed as part of a research project focused on early-warning systems for conflict prediction.
+## Model Details
+### Model Description
+- **Developed by:** Dr. Milton Mendieta and Dr. Timothy Warren
+- **Funded by:** Coalition for Open-Source Defense Analysis (CODA) Lab, Department of Defense Analysis, Naval Postgraduate School (NPS)
+- **Shared by:** Dr. Milton Mendieta and Dr. Timothy Warren
+- **Model type:** Transformer-based sentence encoder fine-tuned for multilabel classification
+- **Language(s):** The base model, the smaller Language-agnostic BERT Sentence Encoder (smaller-LaBSE), is a distilled version of the original LaBSE and was initially trained on 15 languages. It was then fine-tuned on multilingual social media data from X (formerly Twitter) posted from 2014 onward, covering 68 language categories, one of which is the undefined `und` category.
+- **License:** MIT
+- **Finetuned from model:** [setu4993/smaller-LaBSE](https://huggingface.co/setu4993/smaller-LaBSE)
+### Model Sources
+- **Repository:** [https://github.com/m2im/violence_prediction](https://github.com/m2im/violence_prediction)
+- **Paper:** TBD
+## Uses
+### Direct Use
+This model is intended to classify tweets in multiple languages into predefined categories related to proximity to collective violence events.
+### Downstream Use
+The model may be embedded into conflict early-warning systems, government monitoring platforms, or research pipelines analyzing social unrest.
+### Out-of-Scope Use
+- General-purpose sentiment analysis
+- Legal, health, or financial decision-making
+- Use in low-resource languages not covered by training data
+## Bias, Risks, and Limitations
+- **Geographic bias**: The model was primarily trained on short-duration violent events around the world, which limits its applicability to long-running conflicts (e.g., Russia-Ukraine) or high-noise environments (e.g., Washington, D.C.).
+- **Temporal bias**: Performance degrades in pre-violence scenarios, especially at larger spatial scales (50 km), where signals are weaker and often masked by noise.
+- **Sample size sensitivity**: The model underperforms when fewer than 5,000 observations are available per label, reducing reliability in low-data settings.
+- **Spatial ambiguity**: Frequent misclassification between `pre7geo50` and `post7geo50` labels highlights the model’s challenge in distinguishing temporal contexts at broader spatial radii.
+- **Language coverage limitations**: Although the model was fine-tuned on 67 languages (68 categories when the undefined `und` category is counted), performance may vary for underrepresented or informal language variants.
+## Recommendations
+- **Use with short-term events**: For best results, apply the model to short-term events with geographically concentrated discourse, aligning with the training data distribution.
+- **Avoid low-sample inference**: Do not deploy the model in scenarios where fewer than 5,000 labeled observations are available per class.
+- **Limit reliance on large-radius labels**: Exercise caution when interpreting predictions at 50 km radii, which tend to capture noisy or irrelevant information.
+- **Contextual validation**: Evaluate model performance on local data before broader deployment, especially in unfamiliar regions or languages.
+- **Consider post-processing**: Incorporate ensemble methods or threshold adjustments to improve label differentiation in ambiguous cases.
+- **Batch predictions**: Avoid relying on predictions for isolated tweets; predictions aggregated over batches of tweets are more reliable.
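The batch-prediction and threshold-adjustment recommendations above could be combined roughly as follows. This is a minimal sketch, not the authors' post-processing: the function name `aggregate_batch`, the per-label thresholds, and the example scores are hypothetical, and the input is assumed to be per-tweet label scores in the shape that a `text-classification` pipeline returns with `top_k=None` (one list of `{"label", "score"}` dicts per tweet).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-label decision thresholds; tune them on local validation data.
THRESHOLDS = {"pre7geo50": 0.6, "post7geo50": 0.6}
DEFAULT_THRESHOLD = 0.5


def aggregate_batch(batch_scores):
    """Average per-label scores over a batch of tweets, then threshold them.

    batch_scores: one list per tweet of {"label": str, "score": float} dicts.
    Returns {label: (mean_score, is_positive)}.
    """
    per_label = defaultdict(list)
    for tweet_scores in batch_scores:
        for item in tweet_scores:
            per_label[item["label"]].append(item["score"])

    result = {}
    for label, scores in per_label.items():
        avg = mean(scores)
        result[label] = (avg, avg >= THRESHOLDS.get(label, DEFAULT_THRESHOLD))
    return result


# Made-up scores for three tweets and two labels.
batch = [
    [{"label": "post7geo10", "score": 0.9}, {"label": "pre7geo50", "score": 0.7}],
    [{"label": "post7geo10", "score": 0.6}, {"label": "pre7geo50", "score": 0.4}],
    [{"label": "post7geo10", "score": 0.6}, {"label": "pre7geo50", "score": 0.4}],
]
summary = aggregate_batch(batch)
```

Averaging before thresholding means a single noisy tweet cannot flip the decision for the batch, and the stricter hypothetical threshold on the 50 km labels reflects the caution advised for large-radius predictions.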
+## How to Get Started with the Model
+```python
+import html
+import re
+
+from transformers import pipeline
+
+
+def clean_tweet(example):
+    """Normalize a tweet before classification."""
+    tweet = example["text"]
+    tweet = tweet.replace("\n", " ")                # flatten line breaks
+    tweet = html.unescape(tweet)                    # decode entities such as &amp;
+    tweet = re.sub(r"@[A-Za-z0-9_:]+", "", tweet)   # drop @mentions
+    tweet = re.sub(r"http\S+", "", tweet)           # drop URLs
+    tweet = re.sub("RT ", "", tweet)                # drop retweet markers
+    return {"text": tweet.strip()}
+
+
+# top_k=None returns a score for every label rather than only the top one.
+pipe = pipeline(
+    "text-classification",
+    model="m2im/smaller_labse_finetuned_twitter",
+    tokenizer="m2im/smaller_labse_finetuned_twitter",
+    top_k=None,
+)
+
+example = {"text": "Protesta en Quito por medidas económicas."}
+cleaned = clean_tweet(example)
+print(pipe(cleaned["text"]))
+```
+## Training Details
+### Training Data
+- Dataset: [m2im/multilingual-twitter-collective-violence-dataset](https://huggingface.co/datasets/m2im/multilingual-twitter-collective-violence-dataset)
+- Labels: the 6 most informative of the 40 available:
+  - `pre7geo10`, `pre7geo30`, `pre7geo50`
+  - `post7geo10`, `post7geo30`, `post7geo50`
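Each label name encodes a temporal side and a spatial radius. The helper below is a small sketch for splitting a label into those parts. `parse_label` is a hypothetical name, and reading the `7` as a window length in days is an assumption, since this card only states the pre/post distinction and the radii in km.

```python
import re

LABEL_PATTERN = re.compile(r"^(pre|post)(\d+)geo(\d+)$")


def parse_label(label):
    """Split a label such as 'pre7geo10' into its components."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognized label: {label!r}")
    phase, window, radius = match.groups()
    return {
        "phase": phase,            # 'pre' or 'post' relative to the event
        "window": int(window),     # assumed to be days; not confirmed in this card
        "radius_km": int(radius),  # spatial radius in km
    }


parsed = parse_label("post7geo30")
```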
+### Training Procedure
+- Text preprocessing using tweet normalization (removal of mentions, URLs, etc.)
+- Tokenization with smaller-LaBSE tokenizer
+- Multi-label head using `BCEWithLogitsLoss`
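For reference, the quantity that `BCEWithLogitsLoss` computes for one tweet can be written out in plain Python. The sketch below assumes the default mean reduction and no class weighting. The logits and the multi-hot target are made-up values, and `bce_with_logits` is an illustrative helper rather than the training code.

```python
import math

LABELS = ["pre7geo10", "pre7geo30", "pre7geo50",
          "post7geo10", "post7geo30", "post7geo50"]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))].
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Multi-hot target: one tweet can carry several labels at once.
targets = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
logits = [-2.0, -1.0, 0.0, 2.0, 1.0, 0.0]
loss = bce_with_logits(logits, targets)
```

Because each label gets its own sigmoid and its own binary term, the six outputs are scored independently, which is what allows several labels to be active for the same tweet.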
+#### Training Hyperparameters
+- Model checkpoints: `setu4993/smaller-LaBSE`
+- Head class: `AutoModelForSequenceClassification`
+- Optimizer: AdamW
+- Batch size (train/validation): 1024
+- Epochs: 20
+- Learning rate: 5e-5
+- Learning rate scheduler: Cosine
+- Weight decay: 0.1
+- Max sequence length: 32
+- Precision: Mixed fp16
+- Random seed: 42
+- Saving strategy: Save the best model only when the ROC-AUC score improves on the validation set
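To make the scheduler entry concrete, the sketch below evaluates a cosine decay from the listed peak learning rate of 5e-5 down to zero. It assumes no warmup and a single half-cosine cycle, neither of which is stated in this card, and the total step count is a made-up number.

```python
import math

PEAK_LR = 5e-5      # learning rate listed above
TOTAL_STEPS = 1000  # hypothetical; depends on dataset size, batch size, and epochs


def cosine_lr(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR):
    """Learning rate at `step` under half-cosine decay with no warmup (assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at the start, quarter points, and end of training.
schedule = [cosine_lr(s) for s in (0, 250, 500, 750, 1000)]
```

Under these assumptions the rate starts at the peak, passes through half the peak at the midpoint, and reaches zero at the final step.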
+## Evaluation
+### Testing Data, Factors & Metrics
+- **Dataset**: Held-out portion of the multilingual Twitter collective violence dataset, including over 275,000 tweets labeled across six spatio-temporal categories (`pre7geo10`, `pre7geo30`, `pre7geo50`, `post7geo10`, `post7geo30`, `post7geo50`).
+- **Metrics**:
+  - **ROC-AUC** (Receiver Operating Characteristic - Area Under Curve): Evaluates the model’s ability to distinguish between classes across all thresholds.
+  - **Macro F1**: Harmonic mean of precision and recall, averaged equally across all classes.
+  - **Micro F1**: Harmonic mean of precision and recall, aggregated globally across all predictions.
+  - **Precision** and **Recall**: Standard classification metrics to assess false positive and false negative trade-offs.
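The difference between macro and micro averaging is easiest to see on a toy multilabel example. The sketch below computes both from per-label counts. The two labels and the predictions are invented for illustration and are unrelated to the reported results.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when all three counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """y_true, y_pred: lists of equal-length 0/1 rows, one column per label."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))

    macro = sum(f1(*c) for c in counts) / n_labels  # average of per-label F1
    totals = [sum(col) for col in zip(*counts)]      # pool counts across labels
    micro = f1(*totals)
    return macro, micro


# Label 0 is frequent and always predicted correctly; label 1 is rare and missed once.
y_true = [[1, 0], [1, 0], [1, 1], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 1], [1, 0]]
macro, micro = macro_micro_f1(y_true, y_pred)
```

In this toy case the single miss on the rare label pulls macro F1 down more than micro F1, which is why macro F1 is the stricter summary when labels are imbalanced.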
+### Results
+- Classical ML models (Random Forest, SVM, Bagging, Boosting, and Decision Trees) were trained on smaller-LaBSE-generated sentence embeddings. The best-performing classical model, Random Forest, achieved a **macro F1 score of approximately 0.61**, indicating that embeddings alone provide meaningful but limited discrimination for the multilabel classification task.
+- In contrast, the **fine-tuned smaller-LaBSE model**, trained end-to-end with a classification head, outperformed all baseline classical models by achieving a **ROC-AUC score of 0.7246** on the validation set.
+- These results demonstrate the value of supervised fine-tuning over using frozen embeddings with classical classifiers, particularly in tasks involving subtle multilingual and spatio-temporal signal detection.
+## Model Examination
+- Embedding analysis was conducted using a two-stage dimensionality reduction process: Principal Component Analysis (PCA) reduced the 768-dimensional smaller-LaBSE sentence embeddings to 50 dimensions, followed by Uniform Manifold Approximation and Projection (UMAP) to reduce to 2 dimensions for visualization.
+- The resulting 2D projections revealed coherent clustering of sentence embeddings by label, particularly in post-violence scenarios and at smaller spatial scales (10 km), indicating that the model effectively captures latent structure related to spatio-temporal patterns of collective violence.
+- Examination of classification performance across labels further confirmed that the model is most reliable when predicting post-violence instances near the epicenter of an event, while its ability to detect pre-violence signals, especially at broader spatial radii (50 km), is weaker and more prone to noise.
+## Environmental Impact
+- **Hardware Type:** 16 NVIDIA Tesla V100 GPUs
+- **Hours used:** ~10 hours
+- **Cloud Provider:** University research computing cluster
+- **Compute Region:** North America
+- **Carbon Emitted:** Not formally calculated
+## Technical Specifications
+### Model Architecture and Objective
+- Transformer encoder (BERT-based)
+- Objective: Multilabel binary classification with sentence embeddings
+### Compute Infrastructure
+- **Hardware:** One server with 16 × V100 GPUs and one server with 3 TB of RAM, both available at the CODA Lab.
+- **Software:** PyTorch 2.0, Hugging Face Transformers 4.x, KV-Swarm (an in-memory database also hosted at the CODA Lab), and Weights & Biases for experiment tracking and model management.
+## Citation
+**BibTeX:**
+```bibtex
+@misc{mendieta2025labseviolence,
+  author       = {Milton Mendieta and Timothy Warren},
+  title        = {Fine-Tuning Multilingual Language Models to Predict Collective Violence Using Twitter Data},
+  year         = {2025},
+  publisher    = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/m2im/smaller_labse_finetuned_twitter}},
+  note         = {Research on multilingual NLP and conflict prediction}
+}
+```
+**APA:**
+Mendieta, M., & Warren, T. (2025). *Fine-tuning multilingual language models to predict collective violence using Twitter data* [Model]. Hugging Face. https://huggingface.co/m2im/smaller_labse_finetuned_twitter
+## Model Card Authors
+Dr. Milton Mendieta and Dr. Timothy Warren
+## Model Card Contact
+mvmendie@espol.edu.ec