---
library_name: transformers
tags:
- group-detection
- political-science
- multilingual
- multilabel-classification
- deberta
- group-appeals
language:
- en
- de
- nl
- da
- fr
- es
- it
- sv
base_model: microsoft/mdeberta-v3-base
---

# Model Card for mDeBERTa Group Detection

A multilingual group classification model fine-tuned to classify social group tokens into meaningful social group categories in political text.

## Model Details

### Model Description

This model is a fine-tuned mDeBERTa-v3-base that performs multilabel classification to assign social group tokens mentioned in political text to meaningful social group categories. The model can assign a token to multiple group categories simultaneously to support intersectionality, and was trained on political manifesto data.

- **Developed by:** Will Horne, Alona O. Dolinsky and Lena Maria Huber
- **Model type:** Multilabel Sequence Classification
- **Language(s) (NLP):** English, German (multilingual)
- **Finetuned from model:** microsoft/mdeberta-v3-base

### Model Sources

- **Repository:** rwillh11/mdeberta_groups_2.0
- **Base Model:** [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base)

## Uses

### Direct Use

The model is designed for researchers analyzing political discourse to automatically classify **social group tokens or phrases** into meaningful social group categories.
It takes individual group mentions (e.g., "workers", "students", "citizens") as input and outputs predictions for 44 different group categories:

- Adults, Caregivers, Children, Citizens, Civil servants, Consumers
- Crime victims, Criminals, Education professionals, Elderly people
- Employees and workers, Employers and business owners, Ethnic and national communities
- Families, Farmers, Health professionals, Homeless people, Homeowners and landowners
- Investors and stakeholders, Landlords, Law enforcement personnel, LGBTQI
- Lower class, Manual and service workers, Men, Middle class, Migrants and refugees
- Military personnel, Patients, People with disabilities, Politicians, Religious communities
- Road users, Rural communities, Sociocultural professionals, Students, Taxpayers
- Tenants, Unemployed, Upper class, White collar workers, Women, Young people
- and a residual category of "Other"

### Downstream Use

This model can be integrated into larger political text analysis pipelines for:

- **Step 2 of group analysis**: after extracting group mentions from text, classify them into meaningful categories
- Analysis of political manifestos and group categorization
- Comparative political research across countries and languages
- Studies of social group representation with consistent categorization

### Out-of-Scope Use

This model should not be used for:

- **Detecting group mentions within full text** (this model classifies pre-identified group tokens)
- General entity recognition or named entity recognition tasks
- Processing full sentences or paragraphs directly
- Real-time social media monitoring without human oversight
- Making decisions about individuals or groups
- Content moderation without additional validation

## Bias, Risks, and Limitations

### Technical Limitations

- Trained specifically on political manifesto text; performance may vary on other text types
- Limited to 44 predefined group categories
- Multilabel predictions may have dependencies between group categories

### Bias Considerations

- Training data consists of political manifestos from specific countries and time periods
- May reflect biases present in the political discourse of the training data

### Recommendations

Users should be aware that this model:

- Is designed for research purposes in political science
- Should be validated on specific domains before deployment
- May require human oversight for sensitive applications
- May vary in performance across different types of groups and political contexts

## How to Get Started with the Model

### Recommended Usage (Pipeline)

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_repo = "rwillh11/mdeberta_groups_2.0"
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(model_repo)

# Create pipeline for multilabel classification
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,  # return scores for every label (replaces the deprecated return_all_scores=True)
    device=0     # GPU; set device=-1 to run on CPU
)

# Example usage - classify group tokens/phrases
group_tokens = ["students", "workers", "teachers", "citizens", "elderly people"]

# Get predictions
predictions = classifier(group_tokens)

# Process results with a 0.5 threshold
for token, prediction in zip(group_tokens, predictions):
    predicted_labels = [
        label_score["label"]
        for label_score in prediction
        if label_score["score"] > 0.5
    ]
    print(f"'{token}' → {predicted_labels}")
```

### Manual Implementation

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "rwillh11/mdeberta_groups_2.0"
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example group tokens
group_tokens = ["workers", "citizens", "students"]

for token in group_tokens:
    # Tokenize
    inputs = tokenizer(token, return_tensors="pt", truncation=True, max_length=128)

    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.sigmoid(outputs.logits)

    # Apply threshold (0.5) to get binary predictions
    binary_predictions = (predictions > 0.5).cpu().numpy()

    # Map predicted indices to label names via the model config
    predicted_labels = [
        model.config.id2label[i]
        for i, pred in enumerate(binary_predictions[0])
        if pred
    ]
    print(f"'{token}' predicted categories: {predicted_labels}")
```

## Training Details

### Training Data

The model was trained on political manifesto data containing:

- **Languages:** English and German
- **Text Type:** Political manifesto sentences and group mentions
- **Labels:** Multiple social group categories (multilabel classification)
- **Source:** `final_group_train.csv`
- **Training Size:** 2,454 examples (80% split)
- **Validation Size:** 614 examples (20% split)
- **Data processing:** MultiLabelBinarizer for one-hot encoding of group labels

### Training Procedure

#### Preprocessing

- Texts tokenized using the mDeBERTa tokenizer with max length 128
- Multilabel binarization using scikit-learn's MultiLabelBinarizer
- Each text can have multiple group labels simultaneously

#### Training Hyperparameters (Optimal from Optuna)

- **Training regime:** Mixed precision training with gradient accumulation
- **Optimizer:** AdamW
- **Learning rate:** 1.9432557585419205e-05 (optimized via Optuna)
- **Weight decay:** 0.11740203810285466 (optimized via Optuna)
- **Warmup ratio:** 0.018423412349675528 (optimized via Optuna)
- **Epochs:** 30
- **Batch size:** 8 (train and eval)
- **Gradient accumulation steps:** 2
- **Trials:** 7 Optuna trials for hyperparameter optimization
- **Metric for selection:** F1 score
- **Seed:** 42 (partially deterministic training; only the Transformers seed was set)
- **Pruning:** MedianPruner with 5 warmup steps

#### Training Infrastructure

- **Hardware:** CUDA-enabled GPU (Google Colab)
- **Framework:** Transformers, PyTorch
- **Hyperparameter optimization:** Optuna with MedianPruner
- **Early stopping:** MedianPruner with 5 warmup steps

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 20% holdout from the original dataset
- Multilingual political manifesto sentences with group annotations

#### Factors

The model was evaluated across:

- **Languages:** English and German text
- **Group categories:** 44 different social group types
- **Multilabel performance:** ability to predict multiple groups per text

#### Metrics

Primary metrics used for evaluation:

- **F1 Score:** primary optimization metric for multilabel classification
- **Accuracy:** overall prediction accuracy
- **Precision:** precision across all labels
- **Recall:** recall across all labels

### Results

**Best Model Performance (Trial 4, Epoch 27):**

- **Accuracy:** 0.9942
- **F1 Score:** 0.8537
- **Precision:** 0.8633
- **Recall:** 0.8443

The model demonstrates strong performance in multilabel group detection, with consistent results across hyperparameter trials and good convergence during training.
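For reference, micro-averaged precision, recall, and F1 of the kind reported here can be computed with scikit-learn once predictions have been thresholded. This is a minimal sketch, not the authors' evaluation script: the two matrices are toy stand-ins for the model's thresholded sigmoid outputs and the MultiLabelBinarizer-encoded gold labels.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy multilabel matrices: rows = examples, columns = group categories.
# In practice, y_pred comes from (sigmoid(logits) > 0.5) and y_true from
# the MultiLabelBinarizer one-hot encoding described above.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])

# Micro-averaging pools true/false positives and negatives over all labels
precision = precision_score(y_true, y_pred, average="micro")
recall = recall_score(y_true, y_pred, average="micro")
f1 = f1_score(y_true, y_pred, average="micro")
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Micro-averaging weights every (example, label) decision equally, so frequent categories dominate the score; macro-averaging would instead weight each of the 44 categories equally.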
Additional validation on held-out sets returned the following micro-averaged metrics, excluding the residual category "Other":

**English**
- **Precision:** 0.894
- **Recall:** 0.868
- **F1 Micro:** 0.881

**German (using texts translated from English)**
- **Precision:** 0.853
- **Recall:** 0.823
- **F1 Micro:** 0.838

**Dutch (using texts translated from English)**
- **Precision:** 0.833
- **Recall:** 0.789
- **F1 Micro:** 0.817

**Danish (using texts translated from English)**
- **Precision:** 0.845
- **Recall:** 0.789
- **F1 Micro:** 0.816

**Spanish (using texts translated from English)**
- **Precision:** 0.838
- **Recall:** 0.792
- **F1 Micro:** 0.815

**French (using texts translated from English)**
- **Precision:** 0.841
- **Recall:** 0.802
- **F1 Micro:** 0.821

**Italian (using texts translated from English)**
- **Precision:** 0.837
- **Recall:** 0.788
- **F1 Micro:** 0.811

**Swedish (using texts translated from English)**
- **Precision:** 0.837
- **Recall:** 0.774
- **F1 Micro:** 0.804

## Model Examination

The model uses a standard multilabel classification approach:

- Sigmoid activation for independent per-group probability prediction
- Binary cross-entropy loss for multilabel training
- Threshold of 0.5 for binary predictions
- Supports detection of multiple groups simultaneously in a single text

## Environmental Impact

Training involved hyperparameter optimization with 7 trials, each training for 30 epochs.
- **Hardware Type:** CUDA-enabled GPU (Google Colab)
- **Hours used:** Approximately 4.5 hours per completed trial (6 completed trials, ~27 GPU-hours total)
- **Cloud Provider:** Google Colab
- **Compute Region:** Variable
- **Carbon Emitted:** Not precisely measured
- **Training Date:** February 24, 2025

## Technical Specifications

### Model Architecture and Objective

- **Base Architecture:** mDeBERTa-v3-base (278M parameters)
- **Task:** Multilabel sequence classification for group detection
- **Input:** Political text (max length 128 tokens)
- **Output:** Multi-dimensional binary vector indicating group presence
- **Objective:** Binary cross-entropy loss with F1 score optimization
- **Activation:** Sigmoid for independent per-group probability prediction
- **Threshold:** 0.5 for binary predictions

### Compute Infrastructure

#### Hardware

- GPU-accelerated training (CUDA)
- Mixed precision training support

#### Software

- Transformers library
- PyTorch framework
- Optuna for hyperparameter optimization
- scikit-learn for metrics and multilabel encoding

## Citation

If you use this model in your research, please cite:

**BibTeX:**

```bibtex
@misc{mdeberta_groups_detection,
  title={mDeBERTa Group Detection Model for Political Text Analysis},
  author={Will Horne and Alona O. Dolinsky and Lena Maria Huber},
  year={2024},
  note={Multilingual model for detecting social groups in political discourse}
}
```

## Model Card Authors

Research team studying group appeals in political discourse.

## Model Card Contact

For questions about this model, please contact the research team through appropriate academic channels.