|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- group-detection |
|
|
- political-science |
|
|
- multilingual |
|
|
- multilabel-classification |
|
|
- deberta |
|
|
- group-appeals |
|
|
language: |
|
|
- en |
|
|
- de |
|
|
- nl |
|
|
- da |
|
|
- fr |
|
|
- es |
|
|
- it |
|
|
- sv |
|
|
base_model: microsoft/mdeberta-v3-base |
|
|
--- |
|
|
|
|
|
# Model Card for mDeBERTa Group Detection |
|
|
|
|
|
A multilingual classification model fine-tuned to classify social group tokens in political text into meaningful social group categories.
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This model is a fine-tuned mDeBERTa-v3-base that performs multilabel classification of social group tokens mentioned in political text, assigning them to meaningful social group categories. The model can assign a token to multiple group categories simultaneously to support intersectionality, and was trained on political manifesto data.
|
|
|
|
|
- **Developed by:** Will Horne, Alona O. Dolinsky and Lena Maria Huber |
|
|
- **Model type:** Multilabel Sequence Classification |
|
|
- **Language(s) (NLP):** English and German (training data); evaluated additionally on Dutch, Danish, French, Spanish, Italian, and Swedish
|
|
- **Finetuned from model:** microsoft/mdeberta-v3-base |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** rwillh11/mdeberta_groups_2.0 |
|
|
- **Base Model:** [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
The model is designed for researchers analyzing political discourse to automatically classify **social group tokens or phrases** into meaningful social group categories. It takes individual group mentions (e.g., "workers", "students", "citizens") as input and outputs predictions for 44 different group categories: |
|
|
|
|
|
- Adults, Caregivers, Children, Citizens, Civil servants, Consumers |
|
|
- Crime victims, Criminals, Education professionals, Elderly people |
|
|
- Employees and workers, Employers and business owners, Ethnic and national communities |
|
|
- Families, Farmers, Health professionals, Homeless people, Homeowners and landowners |
|
|
- Investors and stakeholders, Landlords, Law enforcement personnel, LGBTQI |
|
|
- Lower class, Manual and service workers, Men, Middle class, Migrants and refugees |
|
|
- Military personnel, Patients, People with disabilities, Politicians, Religious communities |
|
|
- Road users, Rural communities, Sociocultural professionals, Students, Taxpayers |
|
|
- Tenants, Unemployed, Upper class, White collar workers, Women, Young people |
|
|
- and a residual category of "Other" |
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
This model can be integrated into larger political text analysis pipelines for: |
|
|
- **Step 2 of group analysis**: After extracting group mentions from text, classify them into meaningful categories |
|
|
- Political manifestos analysis and group categorization |
|
|
- Comparative political research across countries and languages |
|
|
- Social group representation studies with consistent categorization |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
This model should not be used for: |
|
|
- **Detecting group mentions within full text** (this model classifies pre-identified group tokens) |
|
|
- General entity recognition or named entity recognition tasks |
|
|
- Processing full sentences or paragraphs directly |
|
|
- Real-time social media monitoring without human oversight |
|
|
- Making decisions about individuals or groups |
|
|
- Content moderation without additional validation |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
### Technical Limitations |
|
|
- Trained specifically on political manifesto text; performance may vary on other text types |
|
|
- Limited to 44 predefined group categories |
|
|
- Multilabel predictions may have dependencies between group categories |
|
|
|
|
|
### Bias Considerations |
|
|
- Training data consists of political manifestos from specific countries and time periods |
|
|
- May reflect biases present in political discourse of training data |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
Users should be aware that this model: |
|
|
- Is designed for research purposes in political science |
|
|
- Should be validated on specific domains before deployment |
|
|
- May require human oversight for sensitive applications |
|
|
- Performance may vary across different types of groups and political contexts |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
### Recommended Usage (Pipeline) |
|
|
|
|
|
```python |
|
|
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_repo = "rwillh11/mdeberta_groups_2.0" |
|
|
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base") |
|
|
model = AutoModelForSequenceClassification.from_pretrained(model_repo) |
|
|
|
|
|
# Create pipeline for multilabel classification |
|
|
classifier = pipeline( |
|
|
"text-classification", |
|
|
model=model, |
|
|
tokenizer=tokenizer, |
|
|
    top_k=None,  # return scores for all labels (replaces the deprecated return_all_scores=True)
    function_to_apply="sigmoid",  # independent per-label probabilities for multilabel output
|
|
    device=0  # 0 = first GPU; use -1 to run on CPU
|
|
) |
|
|
|
|
|
# Example usage - classify group tokens/phrases |
|
|
group_tokens = ["students", "workers", "teachers", "citizens", "elderly people"] |
|
|
|
|
|
# Get predictions |
|
|
predictions = classifier(group_tokens) |
|
|
|
|
|
# Process results with 0.5 threshold |
|
|
for token, prediction in zip(group_tokens, predictions): |
|
|
predicted_labels = [label_score['label'] for label_score in prediction if label_score['score'] > 0.5] |
|
|
print(f"'{token}' → {predicted_labels}") |
|
|
``` |
|
|
|
|
|
### Manual Implementation |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
import torch |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_name = "rwillh11/mdeberta_groups_2.0" |
|
|
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base") |
|
|
model = AutoModelForSequenceClassification.from_pretrained(model_name) |
|
|
|
|
|
# Example group tokens |
|
|
group_tokens = ["workers", "citizens", "students"] |
|
|
|
|
|
for token in group_tokens: |
|
|
# Tokenize |
|
|
inputs = tokenizer(token, return_tensors="pt", truncation=True, max_length=128) |
|
|
|
|
|
# Get predictions |
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs) |
|
|
predictions = torch.sigmoid(outputs.logits) |
|
|
|
|
|
# Apply threshold (0.5) to get binary predictions |
|
|
binary_predictions = (predictions > 0.5).cpu().numpy() |
|
|
|
|
|
    # Map predicted indices to human-readable labels via the model config
    predicted_indices = [i for i, pred in enumerate(binary_predictions[0]) if pred]
    predicted_labels = [model.config.id2label[i] for i in predicted_indices]
    print(f"'{token}' predicted categories: {predicted_labels}")
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on political manifesto data containing: |
|
|
- **Languages:** English and German |
|
|
- **Text Type:** Political manifesto sentences and group mentions |
|
|
- **Labels:** Multiple social group categories (multilabel classification) |
|
|
- **Source:** `final_group_train.csv` |
|
|
- **Training Size:** 2,454 examples (80% split) |
|
|
- **Validation Size:** 614 examples (20% split) |
|
|
- **Data processing:** MultiLabelBinarizer for one-hot encoding of group labels |
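
The one-hot encoding step can be sketched as follows; the example labels are hypothetical but drawn from the category list above:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each example can carry several group labels at once (multilabel)
example_labels = [
    ["Students", "Young people"],   # one token, two categories
    ["Employees and workers"],      # one token, one category
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(example_labels)

print(mlb.classes_)  # alphabetically sorted label vocabulary
print(y)             # one row per example, one column per label
```

Each row of `y` is the binary target vector used for training; in the full setup the vocabulary spans all 44 categories.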
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing |
|
|
- Texts tokenized using mDeBERTa tokenizer with max length 128 |
|
|
- Multilabel binarization using scikit-learn's MultiLabelBinarizer |
|
|
- Each text can have multiple group labels simultaneously |
|
|
|
|
|
#### Training Hyperparameters (Optimal from Optuna) |
|
|
- **Training regime:** Mixed precision training with gradient accumulation |
|
|
- **Optimizer:** AdamW |
|
|
- **Learning rate:** 1.9432557585419205e-05 (optimized via Optuna) |
|
|
- **Weight decay:** 0.11740203810285466 (optimized via Optuna) |
|
|
- **Warmup ratio:** 0.018423412349675528 (optimized via Optuna) |
|
|
- **Epochs:** 30 |
|
|
- **Batch size:** 8 (train and eval) |
|
|
- **Gradient accumulation steps:** 2 |
|
|
- **Trials:** 7 Optuna trials for hyperparameter optimization |
|
|
- **Metric for selection:** F1 Score |
|
|
- **Seed:** 42 (partially deterministic training; only the Transformers seed was set)
|
|
- **Pruning:** MedianPruner with 5 warmup steps |
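
The search setup can be sketched as below. The objective here is a stand-in (in the real setup it would fine-tune the model and return validation F1); the search-space bounds are illustrative assumptions, not the actual ranges used:

```python
import optuna

def objective(trial):
    # Hyperparameters mirroring those tuned for this model
    lr = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    wd = trial.suggest_float("weight_decay", 0.0, 0.3)
    warmup = trial.suggest_float("warmup_ratio", 0.0, 0.1)
    # Placeholder score standing in for the validation F1 of a trained model
    return -abs(lr - 2e-5)

study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5),
)
study.optimize(objective, n_trials=7)
print(study.best_params)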
|
|
|
|
|
#### Training Infrastructure |
|
|
- **Hardware:** CUDA-enabled GPU (Google Colab) |
|
|
- **Framework:** Transformers, PyTorch |
|
|
- **Hyperparameter optimization:** Optuna with MedianPruner |
|
|
- **Trial pruning:** MedianPruner with 5 warmup steps (stops unpromising trials early)
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
- 20% holdout from original dataset |
|
|
- Multilingual political manifesto sentences with group annotations |
|
|
|
|
|
#### Factors |
|
|
The model was evaluated across: |
|
|
- **Languages:** English and German text |
|
|
- **Group categories:** 44 different social group types |
|
|
- **Multilabel performance:** Ability to predict multiple groups per text |
|
|
|
|
|
#### Metrics |
|
|
Primary metrics used for evaluation: |
|
|
- **F1 Score:** Primary optimization metric for multilabel classification |
|
|
- **Accuracy:** Overall prediction accuracy |
|
|
- **Precision:** Precision across all labels |
|
|
- **Recall:** Recall across all labels |
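
For multilabel outputs, these metrics are computed over the full label matrix; a minimal sketch with toy predictions:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy multilabel matrix: rows are group tokens, columns are categories
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],   # misses one true label (a false negative)
                   [0, 1, 0]])

precision = precision_score(y_true, y_pred, average="micro")
recall = recall_score(y_true, y_pred, average="micro")
f1 = f1_score(y_true, y_pred, average="micro")
```

Micro-averaging pools true positives, false positives, and false negatives across all labels before computing each score, which is the convention used in the per-language results below.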
|
|
|
|
|
### Results |
|
|
|
|
|
**Best Model Performance (Trial 4, Epoch 27):** |
|
|
- **Accuracy:** 0.9942 |
|
|
- **F1 Score:** 0.8537 |
|
|
- **Precision:** 0.8633 |
|
|
- **Recall:** 0.8443 |
|
|
|
|
|
The model demonstrates strong performance in multilabel group detection with consistent results across hyperparameter trials and excellent convergence during training. |
|
|
|
|
|
Additional validation on held-out sets returns the following micro-averaged metrics, excluding the residual category "Other":
|
|
|
|
|
**English** |
|
|
- **Precision:** 0.894 |
|
|
- **Recall:** 0.868 |
|
|
- **F1 Micro:** 0.881 |
|
|
|
|
|
**German (using texts translated from English)** |
|
|
- **Precision:** 0.853 |
|
|
- **Recall:** 0.823 |
|
|
- **F1 Micro:** 0.838 |
|
|
|
|
|
**Dutch (using texts translated from English)** |
|
|
- **Precision:** 0.833 |
|
|
- **Recall:** 0.789 |
|
|
- **F1 Micro:** 0.817 |
|
|
|
|
|
**Danish (using texts translated from English)** |
|
|
- **Precision:** 0.845 |
|
|
- **Recall:** 0.789 |
|
|
- **F1 Micro:** 0.816 |
|
|
|
|
|
**Spanish (using texts translated from English)** |
|
|
- **Precision:** 0.838 |
|
|
- **Recall:** 0.792 |
|
|
- **F1 Micro:** 0.815 |
|
|
|
|
|
**French (using texts translated from English)** |
|
|
- **Precision:** 0.841 |
|
|
- **Recall:** 0.802 |
|
|
- **F1 Micro:** 0.821 |
|
|
|
|
|
**Italian (using texts translated from English)** |
|
|
- **Precision:** 0.837 |
|
|
- **Recall:** 0.788 |
|
|
- **F1 Micro:** 0.811 |
|
|
|
|
|
**Swedish (using texts translated from English)** |
|
|
- **Precision:** 0.837 |
|
|
- **Recall:** 0.774 |
|
|
- **F1 Micro:** 0.804 |
|
|
|
|
|
## Model Examination |
|
|
|
|
|
The model uses a standard multilabel classification approach: |
|
|
- Sigmoid activation for independent probability prediction per group |
|
|
- Binary cross-entropy loss for multilabel training |
|
|
- Threshold of 0.5 for binary predictions |
|
|
- Supports detection of multiple groups simultaneously in a single text |
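
The prediction and loss mechanics above can be sketched with toy logits (three hypothetical categories, not the model's real label space):

```python
import torch

# Toy logits for one token over three hypothetical group categories
logits = torch.tensor([[2.0, -1.0, 0.3]])

# Sigmoid gives an independent probability per category
probs = torch.sigmoid(logits)

# Thresholding at 0.5 yields the multilabel prediction
preds = (probs > 0.5).int()

# Training uses binary cross-entropy computed on the raw logits
targets = torch.tensor([[1.0, 0.0, 1.0]])
loss = torch.nn.BCEWithLogitsLoss()(logits, targets)
```

Because each category gets its own sigmoid, any number of labels can fire at once, unlike a softmax classifier where probabilities compete.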
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
Training involved hyperparameter optimization with 7 Optuna trials of up to 30 epochs each; pruned trials stopped early.
|
|
|
|
|
- **Hardware Type:** CUDA-enabled GPU (Google Colab) |
|
|
- **Hours used:** Approximately 4.5 hours per completed trial (~27 hours total across 6 completed trials)
|
|
- **Cloud Provider:** Google Colab |
|
|
- **Compute Region:** Variable |
|
|
- **Carbon Emitted:** Not precisely measured |
|
|
- **Training Date:** February 24, 2025 |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
- **Base Architecture:** mDeBERTa-v3-base (278M parameters) |
|
|
- **Task:** Multilabel sequence classification for group detection |
|
|
- **Input:** Political text (max length 128 tokens) |
|
|
- **Output:** 44-dimensional binary vector indicating group presence
|
|
- **Objective:** Binary cross-entropy loss with F1 score optimization |
|
|
- **Activation:** Sigmoid for independent probability prediction per group |
|
|
- **Threshold:** 0.5 for binary predictions |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
#### Hardware |
|
|
- GPU-accelerated training (CUDA) |
|
|
- Mixed precision training support |
|
|
|
|
|
#### Software |
|
|
- Transformers library |
|
|
- PyTorch framework |
|
|
- Optuna for hyperparameter optimization |
|
|
- scikit-learn for metrics and multilabel encoding |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
**BibTeX:** |
|
|
```bibtex |
|
|
@misc{mdeberta_groups_detection, |
|
|
title={mDeBERTa Group Detection Model for Political Text Analysis}, |
|
|
author={Will Horne and Alona O. Dolinsky and Lena Maria Huber}, |
|
|
year={2024}, |
|
|
note={Multilingual model for detecting social groups in political discourse} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
Research team studying group appeals in political discourse. |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions about this model, please contact the research team through appropriate academic channels. |