Model Card for mDeBERTa Token Classification
A multilingual token classification model fine-tuned for detecting and extracting group mentions in political text.
Model Details
Model Description
This model is a fine-tuned mDeBERTa-v3-base that performs token-level classification to identify group mentions in political text. The model processes text units without additional context and predicts BIO (Begin-Inside-Outside) tags for token-level group mention detection.
- Developed by: Will Horne, Alona O. Dolinsky and Lena Maria Huber
- Model type: Token Classification
- Language(s) (NLP): English, German (multilingual)
- Finetuned from model: microsoft/mdeberta-v3-base
Model Sources
- Repository: rwillh11/mdberta-token-bilingual-noContext_Enhanced
- Base Model: microsoft/mdeberta-v3-base
Uses
Direct Use
The model is designed for researchers analyzing political discourse to identify specific group mentions in political text, trained and validated using party manifestos. It takes natural sentences as input and identifies spans of text that refer to specific target groups.
Downstream Use
This model can be integrated into larger political text analysis pipelines for:
- Political manifestos analysis
- Group appeals detection in political communication
- Comparative political research across countries and languages
- Preprocessing step for stance detection models
Out-of-Scope Use
This model should not be used for:
- General named entity recognition (not group-specific)
- Real-time social media monitoring without human oversight
- Making decisions about individuals or groups
- Content moderation without additional validation
Bias, Risks, and Limitations
Technical Limitations
- Trained specifically on political manifesto text; performance may vary on other text types
- Sentences are processed without surrounding context, which may lose nuance present in full paragraphs
- Detects only a single mention type (group mention vs. non-group mention); the BIO scheme does not distinguish between categories of groups
Bias Considerations
- Training data consists of political manifestos from specific countries and time periods
- May reflect biases present in political discourse of training data
- Group detection may vary across different political contexts and group types
How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "rwillh11/mdberta-token-bilingual-noContext_Enhanced"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

def extract_group_mentions(text):
    # Tokenize with offset mapping so predictions can be projected
    # back onto character spans in the input text
    encoded = tokenizer(text, return_offsets_mapping=True, truncation=True,
                        max_length=512, return_tensors="pt")

    input_ids = encoded["input_ids"].to(device)
    attention_mask = encoded["attention_mask"].to(device)
    offset_mapping = encoded["offset_mapping"][0]

    # Get predictions
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    predictions = torch.argmax(outputs.logits, dim=-1)[0]

    # Convert token-level predictions to character spans, merging
    # consecutive non-"O" tokens into a single mention
    group_mentions = []
    current_start = current_end = None
    for pred, (start_char, end_char) in zip(predictions, offset_mapping):
        if start_char == end_char:  # skip special tokens
            continue
        label = model.config.id2label[pred.item()]
        if label != "O":  # token is part of a group mention
            if current_start is None:
                current_start = start_char.item()
            current_end = end_char.item()
        elif current_start is not None:
            group_mentions.append(
                (current_start, current_end, text[current_start:current_end]))
            current_start = None

    # Handle a mention that runs to the end of the sequence
    if current_start is not None:
        group_mentions.append(
            (current_start, current_end, text[current_start:current_end]))
    return group_mentions

# Example usage
text = "We will increase funding for schools to better support students and teachers."
mentions = extract_group_mentions(text)
print(f"Group mentions found: {mentions}")
# Example (exact spans depend on the model's predictions), e.g.:
# [(55, 63, 'students'), (68, 76, 'teachers')]
```
Training Details
Training Data
The model was trained on political manifesto data containing:
- Languages: English and German (bilingual)
- Text Type: Political manifesto sentences at natural sentence level
- Labels: BIO token classification (B-GROUP, I-GROUP, O)
- Groups: Various political target groups (citizens, specific demographics, etc.)
- Total Dataset: 7,589 examples
- German: 4,753 examples (62.6%)
- English: 2,836 examples (37.4%)
- Training Split: 80/20 deterministic split (seed=42)
- Training Set: ~6,071 examples (German: ~3,802, English: ~2,269)
- Validation Set: ~1,518 examples (German: ~951, English: ~567)
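To illustrate the BIO scheme above, here is a toy example of how tags map to mention spans. The sentence and the decoding helper are illustrative, not taken from the training data:

```python
# Illustrative BIO tagging for one whitespace-tokenized sentence
tokens = ["More", "support", "for", "young", "families", "and", "pensioners", "."]
tags   = ["O",    "O",       "O",  "B-GROUP", "I-GROUP", "O",  "B-GROUP",    "O"]

def bio_to_spans(tokens, tags):
    """Collect (start_idx, end_idx, text) spans from BIO tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-GROUP":                     # a new mention begins
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "O" and start is not None:   # a mention just ended
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    if start is not None:                        # mention runs to sentence end
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans

print(bio_to_spans(tokens, tags))
# [(3, 5, 'young families'), (6, 7, 'pensioners')]
```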
Training Hyperparameters
- Training regime: Mixed precision training
- Optimizer: AdamW with weight decay
- Learning rate: 2.92e-05 (optimized via Optuna)
- Weight decay: 0.282 (optimized via Optuna)
- Warmup ratio: 0.098 (optimized via Optuna)
- Epochs: 10 per trial
- Batch size: 16 (train and eval)
- Trials: 20 total hyperparameter optimization trials
- Metric for selection: F1 Macro
- Seed: 42 (deterministic training)
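The training script itself is not published; as a non-authoritative sketch, the reported settings map onto the standard Hugging Face `TrainingArguments` fields as follows (`output_dir` and the evaluation/save strategies are assumptions):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the reported settings; the authors'
# actual script and Optuna search space are not published.
args = TrainingArguments(
    output_dir="mdeberta-group-mentions",  # assumed name
    learning_rate=2.92e-5,                 # Optuna-selected
    weight_decay=0.282,                    # Optuna-selected
    warmup_ratio=0.098,                    # Optuna-selected
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True,                             # mixed precision training
    seed=42,                               # deterministic training
    metric_for_best_model="f1_macro",      # selection metric
    load_best_model_at_end=True,
    eval_strategy="epoch",                 # assumed; name varies by version
    save_strategy="epoch",
)
```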
Training Infrastructure
- Hardware: CUDA-enabled GPU
- Framework: Transformers, PyTorch
- Hyperparameter optimization: Optuna
- Deterministic training: All random seeds fixed
Evaluation
Testing Data, Factors & Metrics
Testing Data
- 20% holdout from original dataset (~1,518 examples)
- Multilingual political manifesto sentences (62.6% German, 37.4% English)
- Evaluated across entity classes and both languages
Factors
The model was evaluated across:
- Languages: English and German text
- Additional validation on held-out sets: English, German, Dutch, Danish, Spanish, French, Italian, Norwegian, Swedish
- Group types: Various political target groups
Results
Best Model Performance (Trial 10, Epoch 7):
- Accuracy: 0.976
- Balanced Accuracy: 0.961
- Precision: 0.953
- Recall: 0.961
- F1 Macro: 0.957
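For clarity, macro-averaged precision, recall, and F1 are the unweighted means of the per-class scores. A minimal pure-Python sketch, equivalent in spirit to scikit-learn's `precision_recall_fscore_support` with `average='macro'` (the toy labels below are illustrative):

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, F1 over the label set in y_true."""
    labels = sorted(set(y_true))
    precs, recs, f1s = [], [], []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Toy token-level labels (0 = O, 1 = group mention)
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1]
p, r, f = macro_prf(y_true, y_pred)
print(round(f, 3))
# 0.733
```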
Additional validation on held-out sets returned the following metrics:
English
- Precision: 0.93
- Recall: 0.912
- F1 Macro: 0.921
German (using spaCy to accommodate relevant morphological differences)
- Precision: 0.889
- Recall: 0.868
- F1 Macro: 0.878
Danish (using texts translated from English)
- Precision: 0.836
- Recall: 0.818
- F1 Macro: 0.827
Spanish (using texts translated from English)
- Precision: 0.907
- Recall: 0.889
- F1 Macro: 0.898
Dutch (using texts translated from English)
- Precision: 0.875
- Recall: 0.891
- F1 Macro: 0.883
French (using texts translated from English)
- Precision: 0.877
- Recall: 0.875
- F1 Macro: 0.876
Italian (using texts translated from English)
- Precision: 0.916
- Recall: 0.899
- F1 Macro: 0.908
Swedish (using texts translated from English)
- Precision: 0.867
- Recall: 0.853
- F1 Macro: 0.860
The model identifies group mentions at the token level with strong precision and recall in both training languages, and transfers well to the additional held-out languages.
Environmental Impact
Training involved hyperparameter optimization with 20 trials, each training for 10 epochs.
- Hardware Type: CUDA-enabled GPU
- Hours used: Estimated 15-20 hours (including hyperparameter search)
- Cloud Provider: Google Colab
Technical Specifications
Model Architecture and Objective
- Base Architecture: mDeBERTa-v3-base (278M parameters)
- Task: Token Classification for Named Entity Recognition
- Input: Political sentence tokens
- Objective: Cross-entropy loss with F1 Macro optimization
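As an illustration of the objective, token-level cross-entropy with masked positions (the Hugging Face convention uses label -100 for padding and special tokens) can be sketched in pure Python; the logits and labels below are toy values:

```python
import math

def token_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over non-ignored token positions.
    logits: list of per-token score lists; labels: per-token class ids."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:  # padding / special tokens
            continue
        m = max(scores)            # numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[label])  # -log softmax[label]
    return sum(losses) / len(losses)

# Toy example: 3 classes (O, B-GROUP, I-GROUP), one ignored position
logits = [[2.0, 0.1, -1.0], [0.2, 1.5, 0.3], [0.0, 0.0, 0.0]]
labels = [0, 1, -100]
loss = token_cross_entropy(logits, labels)
print(round(loss, 4))
# 0.3176
```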
Software
- Transformers library
- PyTorch framework
- Optuna for hyperparameter optimization
- scikit-learn for metrics
Citation
If you use this model in your research, please cite:
BibTeX:
```bibtex
@misc{mdberta_token_nocontext_enhanced,
  title={mDeBERTa Token Classification Model for Political Group Appeals Detection},
  author={Horne, Will and Dolinsky, Alona and Huber, Lena Maria},
  year={2025},
  url={https://huggingface.co/rwillh11/mdberta-token-bilingual-noContext_Enhanced}
}
```
Model Card Authors
Will Horne is an Assistant Professor of Political Science at Clemson University. Alona Dolinsky is a Research Associate at the Department of Communication Science at Vrije Universiteit Amsterdam. Lena Maria Huber is a Post-doctoral Research Fellow at the Mannheim Centre for European Social Research (MZES) at the University of Mannheim.
Model Card Contact
For questions about this model, please open an issue in the repository or contact the lead author at rwhorne@clemson.edu.