---
library_name: transformers
tags:
- group-detection
- political-science
- multilingual
- multilabel-classification
- deberta
- group-appeals
language:
- en
- de
- nl
- da
- fr
- es
- it
- sv
base_model: microsoft/mdeberta-v3-base
---
# Model Card for mDeBERTa Group Detection
A multilingual classification model fine-tuned to assign social group tokens in political text to meaningful social group categories.
## Model Details
### Model Description
This model is a fine-tuned mDeBERTa-v3-base that performs multilabel classification, assigning social group tokens mentioned in political text to meaningful social group categories. A token can be assigned to multiple categories simultaneously, which supports intersectional group mentions. The model was trained on political manifesto data.
- **Developed by:** Will Horne, Alona O. Dolinsky and Lena Maria Huber
- **Model type:** Multilabel Sequence Classification
- **Language(s) (NLP):** English and German (training data); evaluated on Danish, Dutch, French, Italian, Spanish, and Swedish translations
- **Finetuned from model:** microsoft/mdeberta-v3-base
### Model Sources
- **Repository:** rwillh11/mdeberta_groups_2.0
- **Base Model:** [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base)
## Uses
### Direct Use
The model is designed for researchers analyzing political discourse to automatically classify **social group tokens or phrases** into meaningful social group categories. It takes individual group mentions (e.g., "workers", "students", "citizens") as input and outputs predictions for 44 different group categories:
- Adults, Caregivers, Children, Citizens, Civil servants, Consumers
- Crime victims, Criminals, Education professionals, Elderly people
- Employees and workers, Employers and business owners, Ethnic and national communities
- Families, Farmers, Health professionals, Homeless people, Homeowners and landowners
- Investors and stakeholders, Landlords, Law enforcement personnel, LGBTQI
- Lower class, Manual and service workers, Men, Middle class, Migrants and refugees
- Military personnel, Patients, People with disabilities, Politicians, Religious communities
- Road users, Rural communities, Sociocultural professionals, Students, Taxpayers
- Tenants, Unemployed, Upper class, White collar workers, Women, Young people
- and a residual category of "Other"
### Downstream Use
This model can be integrated into larger political text analysis pipelines for:
- **Step 2 of group analysis**: After extracting group mentions from text, classify them into meaningful categories
- Political manifestos analysis and group categorization
- Comparative political research across countries and languages
- Social group representation studies with consistent categorization
### Out-of-Scope Use
This model should not be used for:
- **Detecting group mentions within full text** (this model classifies pre-identified group tokens)
- General entity recognition or named entity recognition tasks
- Processing full sentences or paragraphs directly
- Real-time social media monitoring without human oversight
- Making decisions about individuals or groups
- Content moderation without additional validation
## Bias, Risks, and Limitations
### Technical Limitations
- Trained specifically on political manifesto text; performance may vary on other text types
- Limited to 44 predefined group categories
- Multilabel predictions may have dependencies between group categories
### Bias Considerations
- Training data consists of political manifestos from specific countries and time periods
- May reflect biases present in political discourse of training data
### Recommendations
Users should be aware that this model:
- Is designed for research purposes in political science
- Should be validated on specific domains before deployment
- May require human oversight for sensitive applications
- Performance may vary across different types of groups and political contexts
## How to Get Started with the Model
### Recommended Usage (Pipeline)
```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_repo = "rwillh11/mdeberta_groups_2.0"
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(model_repo)

# Create pipeline for multilabel classification.
# top_k=None returns scores for all labels (replaces the deprecated return_all_scores=True).
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,
    device=0,  # use GPU if available; set device=-1 for CPU
)

# Example usage - classify group tokens/phrases
group_tokens = ["students", "workers", "teachers", "citizens", "elderly people"]

# Get predictions
predictions = classifier(group_tokens)

# Keep every label whose score clears the 0.5 threshold (multilabel output)
for token, prediction in zip(group_tokens, predictions):
    predicted_labels = [item["label"] for item in prediction if item["score"] > 0.5]
    print(f"'{token}' → {predicted_labels}")
```
### Manual Implementation
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "rwillh11/mdeberta_groups_2.0"
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example group tokens
group_tokens = ["workers", "citizens", "students"]

for token in group_tokens:
    # Tokenize
    inputs = tokenizer(token, return_tensors="pt", truncation=True, max_length=128)

    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.sigmoid(outputs.logits)

    # Apply threshold (0.5) to get binary predictions
    binary_predictions = (predictions > 0.5).cpu().numpy()

    # Get predicted label indices
    predicted_indices = [i for i, pred in enumerate(binary_predictions[0]) if pred]
    print(f"'{token}' predicted categories: {predicted_indices}")
```
## Training Details
### Training Data
The model was trained on political manifesto data containing:
- **Languages:** English and German
- **Text Type:** Political manifesto sentences and group mentions
- **Labels:** Multiple social group categories (multilabel classification)
- **Source:** `final_group_train.csv`
- **Training Size:** 2,454 examples (80% split)
- **Validation Size:** 614 examples (20% split)
- **Data processing:** MultiLabelBinarizer for one-hot encoding of group labels
### Training Procedure
#### Preprocessing
- Texts tokenized using mDeBERTa tokenizer with max length 128
- Multilabel binarization using scikit-learn's MultiLabelBinarizer
- Each text can have multiple group labels simultaneously
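The label-encoding step can be sketched with scikit-learn's `MultiLabelBinarizer`. The category names and pairings below are illustrative, not taken from the training data:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each example may carry several group labels at once (multilabel)
example_labels = [
    ["Students", "Young people"],   # e.g. the token "students"
    ["Employees and workers"],      # e.g. the token "workers"
    ["Citizens"],                   # e.g. the token "citizens"
]

mlb = MultiLabelBinarizer()
one_hot = mlb.fit_transform(example_labels)

# Classes are sorted alphabetically; each row is a binary vector over them
print(mlb.classes_)
print(one_hot)  # rows with multiple 1s encode intersectional mentions
```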
#### Training Hyperparameters (Optimal from Optuna)
- **Training regime:** Mixed precision training with gradient accumulation
- **Optimizer:** AdamW
- **Learning rate:** 1.9432557585419205e-05 (optimized via Optuna)
- **Weight decay:** 0.11740203810285466 (optimized via Optuna)
- **Warmup ratio:** 0.018423412349675528 (optimized via Optuna)
- **Epochs:** 30
- **Batch size:** 8 (train and eval)
- **Gradient accumulation steps:** 2
- **Trials:** 7 Optuna trials for hyperparameter optimization
- **Metric for selection:** F1 Score
- **Seed:** 42 (partial deterministic training - only Transformers seed set)
- **Pruning:** MedianPruner with 5 warmup steps
#### Training Infrastructure
- **Hardware:** CUDA-enabled GPU (Google Colab)
- **Framework:** Transformers, PyTorch
- **Hyperparameter optimization:** Optuna with MedianPruner
- **Early stopping:** MedianPruner with 5 warmup steps
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- 20% holdout from original dataset
- Multilingual political manifesto sentences with group annotations
#### Factors
The model was evaluated across:
- **Languages:** English and German text
- **Group categories:** 44 different social group types
- **Multilabel performance:** Ability to predict multiple groups per text
#### Metrics
Primary metrics used for evaluation:
- **F1 Score:** Primary optimization metric for multilabel classification
- **Accuracy:** Overall prediction accuracy
- **Precision:** Precision across all labels
- **Recall:** Recall across all labels
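For reference, micro-averaged precision, recall, and F1 over a multilabel indicator matrix can be computed with scikit-learn as below. The values are toy data, not the model's actual outputs:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground truth and thresholded predictions (rows = examples, cols = labels)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Micro-averaging pools true/false positives across all label columns
p = precision_score(y_true, y_pred, average="micro")   # 3 TP / 3 predicted → 1.0
r = recall_score(y_true, y_pred, average="micro")      # 3 TP / 5 actual → 0.6
f1 = f1_score(y_true, y_pred, average="micro")         # harmonic mean → 0.75
print(p, r, f1)
```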
### Results
**Best Model Performance (Trial 4, Epoch 27):**
- **Accuracy:** 0.9942
- **F1 Score:** 0.8537
- **Precision:** 0.8633
- **Recall:** 0.8443
The model demonstrates strong performance in multilabel group classification, with consistent results across hyperparameter trials and stable convergence during training.
Additional validation on held-out sets returned the following micro-averaged metrics, excluding the residual category "Other":
**English**
- **Precision:** 0.894
- **Recall:** 0.868
- **F1 Micro:** 0.881
**German (using texts translated from English)**
- **Precision:** 0.853
- **Recall:** 0.823
- **F1 Micro:** 0.838
**Dutch (using texts translated from English)**
- **Precision:** 0.833
- **Recall:** 0.789
- **F1 Micro:** 0.817
**Danish (using texts translated from English)**
- **Precision:** 0.845
- **Recall:** 0.789
- **F1 Micro:** 0.816
**Spanish (using texts translated from English)**
- **Precision:** 0.838
- **Recall:** 0.792
- **F1 Micro:** 0.815
**French (using texts translated from English)**
- **Precision:** 0.841
- **Recall:** 0.802
- **F1 Micro:** 0.821
**Italian (using texts translated from English)**
- **Precision:** 0.837
- **Recall:** 0.788
- **F1 Micro:** 0.811
**Swedish (using texts translated from English)**
- **Precision:** 0.837
- **Recall:** 0.774
- **F1 Micro:** 0.804
## Model Examination
The model uses a standard multilabel classification approach:
- Sigmoid activation for independent probability prediction per group
- Binary cross-entropy loss for multilabel training
- Threshold of 0.5 for binary predictions
- Supports detection of multiple groups simultaneously in a single text
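This recipe corresponds to the standard multilabel setup in PyTorch: binary cross-entropy on raw logits during training, and independent sigmoid probabilities thresholded at 0.5 at inference. A minimal sketch with toy logits (3 of the 44 labels shown):

```python
import torch
import torch.nn as nn

# Toy logits for one example over 3 labels, with its multilabel target
logits = torch.tensor([[2.0, -1.0, 0.3]])
targets = torch.tensor([[1.0, 0.0, 1.0]])

# Training objective: binary cross-entropy on raw logits (numerically stable)
loss = nn.BCEWithLogitsLoss()(logits, targets)

# Inference: independent sigmoid per label, thresholded at 0.5
probs = torch.sigmoid(logits)
preds = (probs > 0.5).int()
print(preds)  # tensor([[1, 0, 1]]) - multiple labels active at once
```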
## Environmental Impact
Training involved hyperparameter optimization with 7 trials, each training for 30 epochs.
- **Hardware Type:** CUDA-enabled GPU (Google Colab)
- **Hours used:** Approximately 4.5 hours per completed trial, ~27 hours total across 6 completed trials
- **Cloud Provider:** Google Colab
- **Compute Region:** Variable
- **Carbon Emitted:** Not precisely measured
- **Training Date:** February 24, 2025
## Technical Specifications
### Model Architecture and Objective
- **Base Architecture:** mDeBERTa-v3-base (278M parameters)
- **Task:** Multilabel sequence classification for group detection
- **Input:** Political text (max length 128 tokens)
- **Output:** Multi-dimensional binary vector for group presence
- **Objective:** Binary cross-entropy loss with F1 score optimization
- **Activation:** Sigmoid for independent probability prediction per group
- **Threshold:** 0.5 for binary predictions
### Compute Infrastructure
#### Hardware
- GPU-accelerated training (CUDA)
- Mixed precision training support
#### Software
- Transformers library
- PyTorch framework
- Optuna for hyperparameter optimization
- scikit-learn for metrics and multilabel encoding
## Citation
If you use this model in your research, please cite:
**BibTeX:**
```bibtex
@misc{mdeberta_groups_detection,
title={mDeBERTa Group Detection Model for Political Text Analysis},
author={Will Horne and Alona O. Dolinsky and Lena Maria Huber},
year={2024},
note={Multilingual model for detecting social groups in political discourse}
}
```
## Model Card Authors
Research team studying group appeals in political discourse.
## Model Card Contact
For questions about this model, please contact the research team through appropriate academic channels.