---
language: en
license: mit
pipeline_tag: text-classification
tags:
- text-classification
- xlm-roberta
- survey-classification
- European Social Survey
datasets:
- benjaminBeuster/ess_classification
metrics:
- accuracy
- f1
- precision
- recall
base_model: FacebookAI/xlm-roberta-base
widget:
- text: "What is your age?"
example_title: "Demographics Question"
- text: "How satisfied are you with the healthcare system?"
example_title: "Health Question"
- text: "Trust in country's parliament"
example_title: "Politics Question"
- text: "What is your highest level of education?"
example_title: "Education Question"
- text: "How often do you pray?"
example_title: "Religious Question"
---
# XLM-RoBERTa-Base for ESS Variable Classification
A fine-tuned XLM-RoBERTa-Base model that classifies European Social Survey (ESS) variables into 19 subject categories.
## Model Description
This model is a fine-tuned version of [`FacebookAI/xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base) on the ESS variable classification dataset. It classifies survey questions and variables into predefined subject categories.
- **Base Model**: XLM-RoBERTa-Base (125M parameters)
- **Task**: Multi-class text classification (19 categories)
- **Language**: English
- **Dataset**: European Social Survey variables
## Performance
Evaluated on the test set:
- **Accuracy**: 0.8381
- **Precision** (weighted): 0.7858
- **Recall** (weighted): 0.8381
- **F1-Score** (weighted): 0.7959
- **Test samples**: 105
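The weighted scores above can be reproduced approximately with a short evaluation script. The sketch below assumes the dataset exposes a `test` split with `text` and `label` columns; those names are assumptions, not something documented here.
```python
# Sketch: re-evaluate the classifier on the test split.
# Split name ("test") and column names ("text", "label") are assumptions.
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="benjaminBeuster/xlm-roberta-base-ess-classification",
)
ds = load_dataset("benjaminBeuster/ess_classification", split="test")

y_true = ds["label"]
preds = classifier(ds["text"], truncation=True)
y_pred = [classifier.model.config.label2id[p["label"]] for p in preds]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```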
## Intended Use
This model is designed to automatically classify survey variables and questions from social science research into subject categories. It can be used for:
- Organizing large survey datasets
- Automating metadata generation
- Subject classification of research questions
- Data cataloging and discovery
## Training Data
The model was trained on the [`benjaminBeuster/ess_classification`](https://huggingface.co/datasets/benjaminBeuster/ess_classification) dataset, which contains survey variables extracted from European Social Survey DDI XML files.
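To inspect the data directly, it can be loaded with the `datasets` library; the split name used below is an assumption and may differ from how the dataset is actually published.
```python
from datasets import load_dataset

# Load the ESS variable classification dataset and inspect its structure
dataset = load_dataset("benjaminBeuster/ess_classification")
print(dataset)                     # available splits and row counts
print(dataset["train"].features)   # column names and label schema (assumed "train" split)
print(dataset["train"][0])         # a single example
```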
## Label Mapping
The model predicts one of 19 subject categories:
| Code | Category |
|------|----------|
| 0 | DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES) |
| 1 | ECONOMICS |
| 2 | EDUCATION |
| 3 | HEALTH |
| 4 | HISTORY |
| 5 | HOUSING AND LAND USE |
| 6 | LABOUR AND EMPLOYMENT |
| 7 | LAW, CRIME AND LEGAL SYSTEMS |
| 8 | MEDIA, COMMUNICATION AND LANGUAGE |
| 9 | NATURAL ENVIRONMENT |
| 10 | OTHER |
| 11 | POLITICS |
| 12 | PSYCHOLOGY |
| 13 | SCIENCE AND TECHNOLOGY |
| 14 | SOCIAL STRATIFICATION AND GROUPINGS |
| 15 | SOCIAL WELFARE POLICY AND SYSTEMS |
| 16 | SOCIETY AND CULTURE |
| 17 | TRADE, INDUSTRY AND MARKETS |
| 18 | TRANSPORT AND TRAVEL |
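The same mapping ships with the model configuration, so it can be read programmatically instead of being hard-coded (a minimal sketch):
```python
from transformers import AutoConfig

# The id2label mapping in the config mirrors the table above
config = AutoConfig.from_pretrained("benjaminBeuster/xlm-roberta-base-ess-classification")
for idx, name in sorted(config.id2label.items(), key=lambda kv: int(kv[0])):
    print(idx, name)
```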
## Usage
### Basic Classification
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
# Load model and tokenizer
model_name = "benjaminBeuster/xlm-roberta-base-ess-classification"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Create classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
# Classify a survey question
text = "Trust in country's parliament. Using this card, please tell me on a score of 0-10 how much you personally trust each of the institutions I read out."
result = classifier(text)
print(f"Category: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")
```
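To see the runner-up categories as well, the pipeline accepts a `top_k` argument at call time (in recent `transformers` releases):
```python
# Return the three highest-scoring categories instead of only the best one
for candidate in classifier(text, top_k=3):
    print(f"{candidate['label']}: {candidate['score']:.4f}")
```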
### Batch Classification
```python
# Classify multiple questions
questions = [
    "How often pray apart from at religious services",
    "Highest level of education completed",
    "Trust in politicians",
]
results = classifier(questions)
for question, result in zip(questions, results):
    print(f"{question[:50]}: {result['label']} ({result['score']:.2f})")
```
### Manual Prediction
```python
import torch
# Tokenize input (reuses `model`, `tokenizer`, and `text` from the examples above)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1).item()
# Get label name
label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()
print(f"Predicted: {label} (confidence: {confidence:.4f})")
```
## Training Procedure
### Training Hyperparameters
- **Learning rate**: 2e-05
- **Batch size**: 8
- **Epochs**: 5
- **Weight decay**: 0.01
- **Warmup ratio**: 0.1
- **Max sequence length**: 256
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup
### Training Details
The model was fine-tuned using the Hugging Face Transformers library with the following setup:
- Early stopping with patience of 2 epochs
- Evaluation on validation set after each epoch
- Best model selection based on validation loss
- Mixed precision training (fp16/bf16 where supported)
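The original training script is not published here, so the sketch below is only an approximation of how the hyperparameters and details above map onto the `Trainer` API; the dataset split and column names are assumptions.
```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Base model with a 19-class classification head
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/xlm-roberta-base", num_labels=19
)

# Assumed split names ("train"/"validation") and text column ("text")
dataset = load_dataset("benjaminBeuster/ess_classification")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
)

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-ess-classification",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    eval_strategy="epoch",            # older releases call this evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,      # best checkpoint selected by validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=True,                        # mixed precision where supported
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```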
## Limitations and Bias
- The model is trained on a relatively small dataset (50 samples), which may limit generalization
- Performance may vary on survey questions outside the European Social Survey domain
- The model may inherit biases present in the training data
- English-language surveys are the primary focus, though the base model supports 100 languages
## Citation
If you use this model, please cite:
```bibtex
@misc{xlm-roberta-ess-classifier,
author = {Benjamin Beuster},
title = {XLM-RoBERTa-Base for ESS Variable Classification},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification}
}
```
## Model Card Authors
Benjamin Beuster
## Model Card Contact
For questions or feedback, please open an issue on the [model repository](https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification).