---
language: en
license: mit
pipeline_tag: text-classification
tags:
- text-classification
- xlm-roberta
- survey-classification
- European Social Survey
datasets:
- benjaminBeuster/ess_classification
metrics:
- accuracy
- f1
- precision
- recall
base_model: FacebookAI/xlm-roberta-base
widget:
- text: "What is your age?"
  example_title: "Demographics Question"
- text: "How satisfied are you with the healthcare system?"
  example_title: "Health Question"
- text: "Trust in country's parliament"
  example_title: "Politics Question"
- text: "What is your highest level of education?"
  example_title: "Education Question"
- text: "How often do you pray?"
  example_title: "Religious Question"
---
|
|
|
|
|
# XLM-RoBERTa-Base for ESS Variable Classification |
|
|
|
|
|
A fine-tuned XLM-RoBERTa-Base model that classifies European Social Survey (ESS) variables into 19 subject categories.
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a fine-tuned version of [`FacebookAI/xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base) on the ESS variable classification dataset. It classifies survey questions and variables into predefined subject categories. |
|
|
|
|
|
- **Base Model**: XLM-RoBERTa-Base (~279M parameters)
|
|
- **Task**: Multi-class text classification (19 categories) |
|
|
- **Language**: English |
|
|
- **Dataset**: European Social Survey variables |
|
|
|
|
|
## Performance |
|
|
|
|
|
Evaluated on the held-out test set:
|
|
|
|
|
- **Accuracy**: 0.8381 |
|
|
- **Precision** (weighted): 0.7858 |
|
|
- **Recall** (weighted): 0.8381 |
|
|
- **F1-Score** (weighted): 0.7959 |
|
|
- **Test samples**: 105 |
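The weighted metrics above average per-class scores by class support, so frequent categories count more. A minimal pure-Python sketch of weighted F1 on toy labels (not the actual test set):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 averaged by each class's support in y_true."""
    classes = set(y_true) | set(y_pred)
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * support[c] / total
    return score

# Toy 3-class example with one misclassification
print(round(weighted_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]), 4))  # → 0.7867
```

This matches what `sklearn.metrics.f1_score(..., average="weighted")` computes on the same inputs.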
|
|
|
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed to automatically classify survey variables and questions from social science research into subject categories. It can be used for: |
|
|
|
|
|
- Organizing large survey datasets |
|
|
- Automating metadata generation |
|
|
- Subject classification of research questions |
|
|
- Data cataloging and discovery |
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on the [`benjaminBeuster/ess_classification`](https://huggingface.co/datasets/benjaminBeuster/ess_classification) dataset, which contains survey variables extracted from European Social Survey DDI XML files. |
|
|
|
|
|
## Label Mapping |
|
|
|
|
|
The model predicts one of 19 subject categories: |
|
|
|
|
|
| Code | Category | |
|
|
|------|----------| |
|
|
| 0 | DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES) | |
|
|
| 1 | ECONOMICS | |
|
|
| 2 | EDUCATION | |
|
|
| 3 | HEALTH | |
|
|
| 4 | HISTORY | |
|
|
| 5 | HOUSING AND LAND USE | |
|
|
| 6 | LABOUR AND EMPLOYMENT | |
|
|
| 7 | LAW, CRIME AND LEGAL SYSTEMS | |
|
|
| 8 | MEDIA, COMMUNICATION AND LANGUAGE | |
|
|
| 9 | NATURAL ENVIRONMENT | |
|
|
| 10 | OTHER | |
|
|
| 11 | POLITICS | |
|
|
| 12 | PSYCHOLOGY | |
|
|
| 13 | SCIENCE AND TECHNOLOGY | |
|
|
| 14 | SOCIAL STRATIFICATION AND GROUPINGS | |
|
|
| 15 | SOCIAL WELFARE POLICY AND SYSTEMS | |
|
|
| 16 | SOCIETY AND CULTURE | |
|
|
| 17 | TRADE, INDUSTRY AND MARKETS | |
|
|
| 18 | TRANSPORT AND TRAVEL | |
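The same mapping is exposed at inference time via `model.config.id2label`. As a plain Python dict (codes from the table above), useful for decoding raw logits without loading the model config:

```python
# Label mapping from the table above; model.config.id2label should agree with this.
ID2LABEL = {
    0: "DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES)",
    1: "ECONOMICS",
    2: "EDUCATION",
    3: "HEALTH",
    4: "HISTORY",
    5: "HOUSING AND LAND USE",
    6: "LABOUR AND EMPLOYMENT",
    7: "LAW, CRIME AND LEGAL SYSTEMS",
    8: "MEDIA, COMMUNICATION AND LANGUAGE",
    9: "NATURAL ENVIRONMENT",
    10: "OTHER",
    11: "POLITICS",
    12: "PSYCHOLOGY",
    13: "SCIENCE AND TECHNOLOGY",
    14: "SOCIAL STRATIFICATION AND GROUPINGS",
    15: "SOCIAL WELFARE POLICY AND SYSTEMS",
    16: "SOCIETY AND CULTURE",
    17: "TRADE, INDUSTRY AND MARKETS",
    18: "TRANSPORT AND TRAVEL",
}

print(ID2LABEL[11])  # → POLITICS
```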
|
|
|
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Classification |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_name = "benjaminBeuster/xlm-roberta-base-ess-classification" |
|
|
model = AutoModelForSequenceClassification.from_pretrained(model_name) |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
|
|
|
# Create classification pipeline |
|
|
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer) |
|
|
|
|
|
# Classify a survey question |
|
|
text = "Trust in country's parliament. Using this card, please tell me on a score of 0-10 how much you personally trust each of the institutions I read out." |
|
|
result = classifier(text) |
|
|
|
|
|
print(f"Category: {result[0]['label']}") |
|
|
print(f"Confidence: {result[0]['score']:.4f}") |
|
|
``` |
|
|
|
|
|
### Batch Classification |
|
|
|
|
|
```python |
|
|
# Classify multiple questions |
|
|
questions = [ |
|
|
"How often pray apart from at religious services", |
|
|
"Highest level of education completed", |
|
|
"Trust in politicians" |
|
|
] |
|
|
|
|
|
results = classifier(questions) |
|
|
for question, result in zip(questions, results): |
|
|
print(f"{question[:50]}: {result['label']} ({result['score']:.2f})") |
|
|
``` |
|
|
|
|
|
### Manual Prediction |
|
|
|
|
|
```python
import torch

# Reuses `model` and `tokenizer` loaded in the basic example above
text = "Trust in country's parliament"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
|
|
|
|
|
# Get predictions |
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs) |
|
|
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) |
|
|
predicted_class = torch.argmax(predictions, dim=-1).item() |
|
|
|
|
|
# Get label name |
|
|
label = model.config.id2label[predicted_class] |
|
|
confidence = predictions[0][predicted_class].item() |
|
|
|
|
|
print(f"Predicted: {label} (confidence: {confidence:.4f})") |
|
|
``` |
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Training Hyperparameters |
|
|
|
|
|
- **Learning rate**: 2e-05 |
|
|
- **Batch size**: 8 |
|
|
- **Epochs**: 5 |
|
|
- **Weight decay**: 0.01 |
|
|
- **Warmup ratio**: 0.1 |
|
|
- **Max sequence length**: 256 |
|
|
- **Optimizer**: AdamW |
|
|
- **LR scheduler**: Linear with warmup |
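The linear-with-warmup schedule ramps the learning rate from 0 to the base rate over the first 10% of steps, then decays it linearly to 0. A hypothetical minimal sketch of the shape (mirroring what `transformers.get_linear_schedule_with_warmup` produces; step counts are illustrative):

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` for a linear schedule with warmup."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 100  # illustrative total step count, not the actual training run
print(linear_warmup_lr(5, total))    # → 1e-05 (mid-warmup)
print(linear_warmup_lr(10, total))   # → 2e-05 (warmup complete, peak LR)
print(linear_warmup_lr(100, total))  # → 0.0 (end of training)
```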
|
|
|
|
|
### Training Details |
|
|
|
|
|
The model was fine-tuned using the Hugging Face Transformers library with the following setup: |
|
|
|
|
|
- Early stopping with patience of 2 epochs |
|
|
- Evaluation on validation set after each epoch |
|
|
- Best model selection based on validation loss |
|
|
- Mixed precision training (fp16/bf16 where supported) |
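The early-stopping rule above halts training once the validation loss fails to improve for two consecutive epochs. A minimal sketch of that logic with made-up loss values (the Trainer's `EarlyStoppingCallback` implements the real version):

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 0-indexed epoch at which training stops, or None if it never does.

    Stops after `patience` consecutive epochs without a new best validation loss.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Illustrative losses: best at epoch 1, then two non-improving epochs
print(early_stop_epoch([0.9, 0.7, 0.72, 0.71]))  # → 3
```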
|
|
|
|
|
## Limitations and Bias |
|
|
|
|
|
- The model was fine-tuned on a small dataset (50 training samples), which may limit generalization
|
|
- Performance may vary on survey questions outside the European Social Survey domain |
|
|
- The model may inherit biases present in the training data |
|
|
- English-language surveys are the primary focus, though the base model supports 100 languages |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{xlm-roberta-ess-classifier, |
|
|
author = {Benjamin Beuster}, |
|
|
  title = {XLM-RoBERTa-Base for ESS Variable Classification},
|
|
year = {2025}, |
|
|
publisher = {Hugging Face}, |
|
|
url = {https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
Benjamin Beuster |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions or feedback, please open an issue on the [model repository](https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification). |
|
|
|