---
license: mit
tags:
- multilabel-classification
- multilingual
- twitter
- violence-prediction
datasets:
- m2im/multilingual-twitter-collective-violence-dataset
language:
- multilingual
---

# Model Card for m2im/labse_finetuned_twitter

This model is a fine-tuned version of LaBSE (Language-agnostic BERT Sentence Embedding), adapted to detect collective violence signals in multilingual Twitter discourse. It was developed as part of a research project on early-warning systems for conflict prediction.

## Model Details

### Model Description

- **Developed by:** Dr. Milton Mendieta and Dr. Timothy Warren
- **Funded by:** Coalition for Open-Source Defense Analysis (CODA) Lab, Department of Defense Analysis, Naval Postgraduate School (NPS)
- **Shared by:** Dr. Milton Mendieta and Dr. Timothy Warren
- **Model type:** Transformer-based sentence encoder fine-tuned for multilabel classification
- **Language(s):** Pre-trained on 109 languages (LaBSE), then fine-tuned on tweets in 68 languages collected from X (formerly Twitter) from 2014 onward, including the undefined (`und`) language tag
- **License:** MIT
- **Finetuned from model:** [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE)

### Model Sources

- **Repository:** [https://github.com/m2im/violence_prediction](https://github.com/m2im/violence_prediction)
- **Paper:** TBD

## Uses

### Direct Use

This model is intended to classify tweets in multiple languages into predefined categories related to proximity to collective violence events.

### Downstream Use

The model may be embedded into conflict early-warning systems, government monitoring platforms, or research pipelines analyzing social unrest.

### Out-of-Scope Use

- General-purpose sentiment analysis
- Legal, health, or financial decision-making
- Use in low-resource languages not covered by the training data

## Bias, Risks, and Limitations

- **Geographic bias**: The model was trained primarily on short-duration violent events around the world, which limits its applicability to long-running conflicts (e.g., Russia-Ukraine) or high-noise environments (e.g., Washington, D.C.).
- **Temporal bias**: Performance degrades in pre-violence scenarios, especially at larger spatial scales (50 km), where signals are weaker and often masked by noise.
- **Sample size sensitivity**: The model underperforms when fewer than 5,000 observations are available per label, reducing reliability in low-data settings.
- **Spatial ambiguity**: Frequent misclassification between the `pre7geo50` and `post7geo50` labels highlights the model's difficulty distinguishing temporal context at broader spatial radii.
- **Language coverage limitations**: Although fine-tuned on 68 languages, performance may vary for underrepresented or informal language variants.

## Recommendations

- **Use with short-term events**: For best results, apply the model to short-term events with geographically concentrated discourse, matching the training data distribution.
- **Avoid low-sample inference**: Do not deploy the model in scenarios where fewer than 5,000 labeled observations are available per class.
- **Limit reliance on large-radius labels**: Exercise caution when interpreting predictions at 50 km radii, which tend to capture noisy or irrelevant information.
- **Contextual validation**: Evaluate model performance on local data before broader deployment, especially in unfamiliar regions or languages.
- **Consider post-processing**: Incorporate ensemble methods or threshold adjustments to improve label differentiation in ambiguous cases.
- **Batch predictions**: Avoid inference on isolated tweets; batched predictions are more reliable.
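
The threshold-adjustment recommendation can be illustrated as follows. This is a minimal sketch; the per-label cutoffs are hypothetical values for illustration, not tuned thresholds from the paper:

```python
import numpy as np

# Sigmoid scores for one tweet across the six labels (illustrative values only).
labels = ["pre7geo10", "pre7geo30", "pre7geo50",
          "post7geo10", "post7geo30", "post7geo50"]
scores = np.array([0.62, 0.41, 0.55, 0.12, 0.48, 0.57])

# Hypothetical per-label thresholds: stricter cutoffs for the noisier 50 km labels.
thresholds = np.array([0.5, 0.5, 0.7, 0.5, 0.5, 0.7])

predicted = [l for l, s, t in zip(labels, scores, thresholds) if s >= t]
print(predicted)  # ["pre7geo10"]
```

Raising the cutoff on the 50 km labels trades recall for precision exactly where the model is noisiest.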

## How to Get Started with the Model

```python
import html
import re

from transformers import pipeline

def clean_tweet(example):
    """Normalize a tweet: strip newlines, HTML entities, mentions, URLs, and RT markers."""
    tweet = example["text"]
    tweet = tweet.replace("\n", " ")
    tweet = html.unescape(tweet)
    tweet = re.sub(r"@[A-Za-z0-9_:]+", "", tweet)  # remove @mentions
    tweet = re.sub(r"http\S+", "", tweet)          # remove URLs
    tweet = re.sub(r"RT ", "", tweet)              # remove retweet markers
    return {"text": tweet.strip()}

pipe = pipeline(
    "text-classification",
    model="m2im/labse_finetuned_twitter",
    tokenizer="m2im/labse_finetuned_twitter",
    top_k=None,  # return scores for all labels
)

example = {"text": "Protesta en Quito por medidas económicas."}  # "Protest in Quito over economic measures."
cleaned = clean_tweet(example)
print(pipe(cleaned["text"]))
```

## Training Details

### Training Data

- Dataset: [m2im/multilingual-twitter-collective-violence-dataset](https://huggingface.co/datasets/m2im/multilingual-twitter-collective-violence-dataset)
- Labels: the 6 most informative of the 40 available:
  - `pre7geo10`, `pre7geo30`, `pre7geo50`
  - `post7geo10`, `post7geo30`, `post7geo50`

### Training Procedure

- Text preprocessing using tweet normalization (removal of mentions, URLs, etc.)
- Tokenization with the LaBSE tokenizer
- Multi-label classification head trained with `BCEWithLogitsLoss`

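The multi-label setup above can be sketched as follows. This is a minimal illustration, not the authors' training script: a tiny stand-in `BertConfig` is used so the example runs without downloading LaBSE weights, and setting `problem_type="multi_label_classification"` is what makes the Transformers sequence-classification head use `BCEWithLogitsLoss` internally:

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

labels = ["pre7geo10", "pre7geo30", "pre7geo50",
          "post7geo10", "post7geo30", "post7geo50"]

# Tiny stand-in config; the real model is loaded from "setu4993/LaBSE" instead.
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
    num_labels=len(labels),
    problem_type="multi_label_classification",  # head uses BCEWithLogitsLoss
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)
model = BertForSequenceClassification(config)

input_ids = torch.tensor([[1, 2, 3, 4]])
targets = torch.zeros(1, len(labels))   # multi-hot float targets, one column per label
targets[0, 0] = 1.0                     # e.g. within 10 km and 7 days before an event
out = model(input_ids=input_ids, labels=targets)
print(out.logits.shape, out.loss)       # logits are (batch, 6); loss is BCE-with-logits
```

Note that the targets are float multi-hot vectors rather than a single class index, since each tweet can carry several spatio-temporal labels at once.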
#### Training Hyperparameters

- Model checkpoint: `setu4993/LaBSE`
- Head class: `AutoModelForSequenceClassification`
- Optimizer: AdamW
- Batch size (train/validation): 1024
- Epochs: 20
- Learning rate: 5e-5
- Learning rate scheduler: cosine
- Weight decay: 0.1
- Max sequence length: 32
- Precision: mixed fp16
- Random seed: 42
- Saving strategy: save the best model only when the validation ROC-AUC improves

## Evaluation

### Testing Data, Factors & Metrics

- **Dataset**: Held-out portion of the multilingual Twitter collective violence dataset, comprising over 275,000 tweets labeled across six spatio-temporal categories (`pre7geo10`, `pre7geo30`, `pre7geo50`, `post7geo10`, `post7geo30`, `post7geo50`).
- **Metrics**:
  - **ROC-AUC** (area under the receiver operating characteristic curve): Evaluates the model's ability to distinguish between classes across all thresholds.
  - **Macro F1**: Harmonic mean of precision and recall, averaged equally across all classes.
  - **Micro F1**: Harmonic mean of precision and recall, aggregated globally across all predictions.
  - **Precision** and **Recall**: Standard classification metrics to assess false-positive and false-negative trade-offs.

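For a multilabel task, these metrics can be computed with scikit-learn as sketched below; the arrays are toy values standing in for real labels and model scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

# Toy ground truth and sigmoid scores for 4 tweets x 6 labels (illustrative only).
y_true = np.array([[1, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1, 0],
                   [0, 0, 0, 1, 0, 0],
                   [1, 0, 1, 0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.1, 0.3, 0.2, 0.1],
                    [0.1, 0.8, 0.2, 0.1, 0.7, 0.3],
                    [0.2, 0.1, 0.3, 0.9, 0.1, 0.2],
                    [0.7, 0.3, 0.6, 0.2, 0.1, 0.8]])
y_pred = (y_score >= 0.5).astype(int)  # default 0.5 threshold per label

print("ROC-AUC (macro):", roc_auc_score(y_true, y_score, average="macro"))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Micro F1:", f1_score(y_true, y_pred, average="micro"))
print("Precision (micro):", precision_score(y_true, y_pred, average="micro"))
print("Recall (micro):", recall_score(y_true, y_pred, average="micro"))
```

ROC-AUC is computed from the continuous scores, while the F1, precision, and recall figures depend on the chosen threshold.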
### Results

- Classical ML models (Random Forest, SVM, Bagging, Boosting, and Decision Trees) were trained on LaBSE-generated sentence embeddings. The best-performing classical model, Random Forest, achieved a **macro F1 score of approximately 0.61**, indicating that embeddings alone provide meaningful but limited discrimination for the multilabel classification task.
- In contrast, the **fine-tuned LaBSE model**, trained end-to-end with a classification head, outperformed all classical baselines, achieving a **ROC-AUC score of 0.7238** on the validation set and **0.6988** on the test set.
- These results demonstrate the value of supervised fine-tuning over frozen embeddings with classical classifiers, particularly for tasks involving subtle multilingual and spatio-temporal signal detection.

## Model Examination

- Embedding analysis used a two-stage dimensionality reduction: Principal Component Analysis (PCA) reduced the 768-dimensional LaBSE sentence embeddings to 50 dimensions, followed by Uniform Manifold Approximation and Projection (UMAP) down to 2 dimensions for visualization.
- The resulting 2D projections revealed coherent clustering of sentence embeddings by label, particularly in post-violence scenarios and at smaller spatial scales (10 km), indicating that the model captures latent structure related to spatio-temporal patterns of collective violence.
- Examination of per-label classification performance further confirmed that the model is most reliable when predicting post-violence instances near the epicenter of an event, while its ability to detect pre-violence signals, especially at broader spatial radii (50 km), is weaker and more prone to noise.

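The two-stage reduction can be sketched with scikit-learn as follows; random vectors stand in for the real LaBSE embeddings, and the UMAP stage (from the separate `umap-learn` package) is shown as a comment:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(1000, 768))  # stand-in for LaBSE sentence embeddings

# Stage 1: PCA from 768 down to 50 dimensions.
reduced = PCA(n_components=50, random_state=42).fit_transform(embeddings)
print(reduced.shape)  # (1000, 50)

# Stage 2 (requires umap-learn):
# import umap
# coords_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(reduced)
```

Running PCA first denoises the embeddings and makes the UMAP neighbor search substantially cheaper than applying it to the raw 768-dimensional vectors.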
## Environmental Impact

- **Hardware Type:** 16 NVIDIA Tesla V100 GPUs
- **Hours used:** ~10 hours
- **Cloud Provider:** University research computing cluster
- **Compute Region:** North America
- **Carbon Emitted:** Not formally calculated

## Technical Specifications

### Model Architecture and Objective

- Transformer encoder (BERT-based)
- Objective: Multilabel binary classification over sentence embeddings

### Compute Infrastructure

- **Hardware:** One server with 16 × V100 GPUs and one server with 3 TB of RAM, both available at the CODA Lab.
- **Software:** PyTorch 2.0, Hugging Face Transformers 4.x, KV-Swarm (an in-memory database also hosted at the CODA Lab), and Weights & Biases for experiment tracking and model management

## Citation

**BibTeX:**

```bibtex
@misc{mendieta2025labseviolence,
  author = {Milton Mendieta and Timothy Warren},
  title = {Fine-Tuning Multilingual Language Models to Predict Collective Violence Using Twitter Data},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/m2im/labse_finetuned_twitter}},
  note = {Research on multilingual NLP and conflict prediction}
}
```

**APA:**

Mendieta, M., & Warren, T. (2025). *Fine-tuning multilingual language models to predict collective violence using Twitter data* [Model]. Hugging Face. https://huggingface.co/m2im/labse_finetuned_twitter

## Model Card Authors

Dr. Milton Mendieta and Dr. Timothy Warren

## Model Card Contact

mvmendie@espol.edu.ec