---
language: en
license: mit
pipeline_tag: text-classification
tags:
- text-classification
- xlm-roberta
- survey-classification
- European Social Survey
datasets:
- benjaminBeuster/ess_classification
metrics:
- accuracy
- f1
- precision
- recall
base_model: FacebookAI/xlm-roberta-base
widget:
- text: "What is your age?"
example_title: "Demographics Question"
- text: "How satisfied are you with the healthcare system?"
example_title: "Health Question"
- text: "Trust in country's parliament"
example_title: "Politics Question"
- text: "What is your highest level of education?"
example_title: "Education Question"
- text: "How often do you pray?"
example_title: "Religious Question"
---
# XLM-RoBERTa-Base for ESS Variable Classification
Fine-tuned XLM-RoBERTa-Base model for classifying European Social Survey variables into 19 subject categories.
## Model Description
This model is a fine-tuned version of [`FacebookAI/xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base) on the ESS variable classification dataset. It classifies survey questions and variables into predefined subject categories.
- **Base Model**: XLM-RoBERTa-Base (125M parameters)
- **Task**: Multi-class text classification (19 categories)
- **Language**: English
- **Dataset**: European Social Survey variables
## Performance
Evaluated on the held-out test set:
- **Accuracy**: 0.8381
- **Precision** (weighted): 0.7858
- **Recall** (weighted): 0.8381
- **F1-Score** (weighted): 0.7959
- **Test samples**: 105
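Note that with weighted averaging on a single-label task, recall equals accuracy by construction (each class's recall is weighted by its support, so the weighted sum reduces to the overall fraction correct), which is why the two figures above coincide. As a sketch of how such weighted metrics are computed, assuming scikit-learn and using made-up label ids rather than the actual test predictions:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative true/predicted class ids -- NOT the real test set.
y_true = [0, 2, 3, 11, 11, 16]
y_pred = [0, 2, 3, 11, 16, 16]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```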
## Intended Use
This model is designed to automatically classify survey variables and questions from social science research into subject categories. It can be used for:
- Organizing large survey datasets
- Automating metadata generation
- Subject classification of research questions
- Data cataloging and discovery
## Training Data
The model was trained on the [`benjaminBeuster/ess_classification`](https://huggingface.co/datasets/benjaminBeuster/ess_classification) dataset, which contains survey variables extracted from European Social Survey DDI XML files.
## Label Mapping
The model predicts one of 19 subject categories:
| Code | Category |
|------|----------|
| 0 | DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES) |
| 1 | ECONOMICS |
| 2 | EDUCATION |
| 3 | HEALTH |
| 4 | HISTORY |
| 5 | HOUSING AND LAND USE |
| 6 | LABOUR AND EMPLOYMENT |
| 7 | LAW, CRIME AND LEGAL SYSTEMS |
| 8 | MEDIA, COMMUNICATION AND LANGUAGE |
| 9 | NATURAL ENVIRONMENT |
| 10 | OTHER |
| 11 | POLITICS |
| 12 | PSYCHOLOGY |
| 13 | SCIENCE AND TECHNOLOGY |
| 14 | SOCIAL STRATIFICATION AND GROUPINGS |
| 15 | SOCIAL WELFARE POLICY AND SYSTEMS |
| 16 | SOCIETY AND CULTURE |
| 17 | TRADE, INDUSTRY AND MARKETS |
| 18 | TRANSPORT AND TRAVEL |
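The published checkpoint is assumed to carry the same mapping in `model.config.id2label`; written out as a plain dict, the table can also be used offline, with a reverse `label2id` for lookups by name:

```python
# Label mapping from the table above, as a plain dict
# (assumed to match the checkpoint's model.config.id2label).
id2label = {
    0: "DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES)",
    1: "ECONOMICS",
    2: "EDUCATION",
    3: "HEALTH",
    4: "HISTORY",
    5: "HOUSING AND LAND USE",
    6: "LABOUR AND EMPLOYMENT",
    7: "LAW, CRIME AND LEGAL SYSTEMS",
    8: "MEDIA, COMMUNICATION AND LANGUAGE",
    9: "NATURAL ENVIRONMENT",
    10: "OTHER",
    11: "POLITICS",
    12: "PSYCHOLOGY",
    13: "SCIENCE AND TECHNOLOGY",
    14: "SOCIAL STRATIFICATION AND GROUPINGS",
    15: "SOCIAL WELFARE POLICY AND SYSTEMS",
    16: "SOCIETY AND CULTURE",
    17: "TRADE, INDUSTRY AND MARKETS",
    18: "TRANSPORT AND TRAVEL",
}

# Reverse mapping for looking up a class id by name.
label2id = {name: code for code, name in id2label.items()}
```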
## Usage
### Basic Classification
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
# Load model and tokenizer
model_name = "benjaminBeuster/xlm-roberta-base-ess-classification"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Create classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
# Classify a survey question
text = "Trust in country's parliament. Using this card, please tell me on a score of 0-10 how much you personally trust each of the institutions I read out."
result = classifier(text)
print(f"Category: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")
```
### Batch Classification
```python
# Classify multiple questions
questions = [
    "How often pray apart from at religious services",
    "Highest level of education completed",
    "Trust in politicians",
]
results = classifier(questions)
for question, result in zip(questions, results):
    print(f"{question[:50]}: {result['label']} ({result['score']:.2f})")
```
### Manual Prediction
```python
import torch
# Tokenize input (reuses `text`, `model`, and `tokenizer` from the basic example)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1).item()
# Get label name
label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()
print(f"Predicted: {label} (confidence: {confidence:.4f})")
```
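For borderline questions a single top label may not be enough; the same logits can be ranked to give a top-k view. A minimal sketch in pure Python with made-up logit values (with a real model you would take `outputs.logits[0].tolist()` instead):

```python
import math

# Hypothetical logits for the 19 categories -- illustrative values,
# not real model outputs.
logits = [0.1] * 19
logits[11] = 4.2  # POLITICS
logits[15] = 2.1  # SOCIAL WELFARE POLICY AND SYSTEMS
logits[16] = 1.3  # SOCIETY AND CULTURE

# Softmax: exponentiate and normalise so the scores sum to 1.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Rank the classes by probability and keep the top 3.
top3 = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:3]
for idx, p in top3:
    print(f"class {idx}: {p:.4f}")
```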
## Training Procedure
### Training Hyperparameters
- **Learning rate**: 2e-05
- **Batch size**: 8
- **Epochs**: 5
- **Weight decay**: 0.01
- **Warmup ratio**: 0.1
- **Max sequence length**: 256
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup
### Training Details
The model was fine-tuned using the Hugging Face Transformers library with the following setup:
- Early stopping with patience of 2 epochs
- Evaluation on validation set after each epoch
- Best model selection based on validation loss
- Mixed precision training (fp16/bf16 where supported)
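The setup above translates roughly into the following `TrainingArguments`; this is a sketch rather than the exact training script, and argument names follow the Hugging Face `Trainer` API, which may differ slightly across `transformers` versions:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-ess-classification",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    eval_strategy="epoch",          # "evaluation_strategy" in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=True,                      # or bf16=True where supported
)

# Early stopping with patience of 2 evaluation rounds would be passed
# to the Trainer as:
# callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
```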
## Limitations and Bias
- The model is trained on a relatively small dataset (50 samples), which may limit generalization
- Performance may vary on survey questions outside the European Social Survey domain
- The model may inherit biases present in the training data
- English-language surveys are the primary focus, though the base model supports 100 languages
## Citation
If you use this model, please cite:
```bibtex
@misc{xlm-roberta-ess-classifier,
  author    = {Benjamin Beuster},
  title     = {XLM-RoBERTa-Base for ESS Variable Classification},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification}
}
```
## Model Card Authors
Benjamin Beuster
## Model Card Contact
For questions or feedback, please open an issue on the [model repository](https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification).