---
language: en
license: mit
pipeline_tag: text-classification
tags:
- text-classification
- xlm-roberta
- survey-classification
- European Social Survey
datasets:
- benjaminBeuster/ess_classification
metrics:
- accuracy
- f1
- precision
- recall
base_model: FacebookAI/xlm-roberta-base
widget:
- text: "What is your age?"
  example_title: "Demographics Question"
- text: "How satisfied are you with the healthcare system?"
  example_title: "Health Question"
- text: "Trust in country's parliament"
  example_title: "Politics Question"
- text: "What is your highest level of education?"
  example_title: "Education Question"
- text: "How often do you pray?"
  example_title: "Religious Question"
---

# XLM-RoBERTa-Base for ESS Variable Classification

Fine-tuned XLM-RoBERTa-Base model for classifying European Social Survey (ESS) variables into 19 subject categories.

## Model Description

This model is a fine-tuned version of [`FacebookAI/xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base) on the ESS variable classification dataset. It classifies survey questions and variables into predefined subject categories.

- **Base Model**: XLM-RoBERTa-Base (~270M parameters)
- **Task**: Multi-class text classification (19 categories)
- **Language**: English
- **Dataset**: European Social Survey variables

## Performance

Evaluated on the test set:

- **Accuracy**: 0.8381
- **Precision** (weighted): 0.7858
- **Recall** (weighted): 0.8381
- **F1-score** (weighted): 0.7959
- **Test samples**: 105

## Intended Use

This model is designed to automatically classify survey variables and questions from social science research into subject categories.
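The weighted averages reported under Performance weight each class's score by its share of the test set, equivalent to scikit-learn's `average='weighted'`. A minimal pure-Python sketch of weighted F1 (the `weighted_f1` helper and the toy labels are illustrative, not part of the actual evaluation code):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (support[c] / total) * f1
    return score
```

With 19 categories of uneven size, support-weighted averaging makes the headline score reflect how the model performs on the categories that actually dominate the test set.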
It can be used for:

- Organizing large survey datasets
- Automating metadata generation
- Subject classification of research questions
- Data cataloging and discovery

## Training Data

The model was trained on the [`benjaminBeuster/ess_classification`](https://huggingface.co/datasets/benjaminBeuster/ess_classification) dataset, which contains survey variables extracted from European Social Survey DDI XML files.

## Label Mapping

The model predicts one of 19 subject categories:

| Code | Category |
|------|----------|
| 0 | DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES) |
| 1 | ECONOMICS |
| 2 | EDUCATION |
| 3 | HEALTH |
| 4 | HISTORY |
| 5 | HOUSING AND LAND USE |
| 6 | LABOUR AND EMPLOYMENT |
| 7 | LAW, CRIME AND LEGAL SYSTEMS |
| 8 | MEDIA, COMMUNICATION AND LANGUAGE |
| 9 | NATURAL ENVIRONMENT |
| 10 | OTHER |
| 11 | POLITICS |
| 12 | PSYCHOLOGY |
| 13 | SCIENCE AND TECHNOLOGY |
| 14 | SOCIAL STRATIFICATION AND GROUPINGS |
| 15 | SOCIAL WELFARE POLICY AND SYSTEMS |
| 16 | SOCIETY AND CULTURE |
| 17 | TRADE, INDUSTRY AND MARKETS |
| 18 | TRANSPORT AND TRAVEL |

## Usage

### Basic Classification

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "benjaminBeuster/xlm-roberta-base-ess-classification"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Classify a survey question
text = "Trust in country's parliament. Using this card, please tell me on a score of 0-10 how much you personally trust each of the institutions I read out."
result = classifier(text)

print(f"Category: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")
```

### Batch Classification

```python
# Classify multiple questions
questions = [
    "How often pray apart from at religious services",
    "Highest level of education completed",
    "Trust in politicians"
]

results = classifier(questions)
for question, result in zip(questions, results):
    print(f"{question[:50]}: {result['label']} ({result['score']:.2f})")
```

### Manual Prediction

```python
import torch

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

# Get label name
label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()

print(f"Predicted: {label} (confidence: {confidence:.4f})")
```

## Training Procedure

### Training Hyperparameters

- **Learning rate**: 2e-5
- **Batch size**: 8
- **Epochs**: 5
- **Weight decay**: 0.01
- **Warmup ratio**: 0.1
- **Max sequence length**: 256
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup

### Training Details

The model was fine-tuned using the Hugging Face Transformers library with the following setup:

- Early stopping with a patience of 2 epochs
- Evaluation on the validation set after each epoch
- Best model selected by validation loss
- Mixed precision training (fp16/bf16 where supported)

## Limitations and Bias

- The model was trained on a relatively small dataset (50 samples), which may limit generalization
- Performance may vary on survey questions outside the European Social Survey domain
- The model may inherit biases present in the training data
- English-language surveys are the primary focus, though the base model supports 100 languages

## Citation

If you use this model, please cite:

```bibtex
@misc{xlm-roberta-ess-classifier,
  author = {Benjamin Beuster},
  title = {XLM-RoBERTa-Base for ESS Variable Classification},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification}
}
```

## Model Card Authors

Benjamin Beuster

## Model Card Contact

For questions or feedback, please open an issue on the [model repository](https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification).
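For reference, the hyperparameters and early-stopping setup listed under Training Procedure roughly correspond to the following Transformers configuration. This is a sketch, not the published training script: `train_ds` and `val_ds` are placeholders for tokenized dataset splits, and the output directory name is an assumption.

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# 19-way classification head on top of the base encoder
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/xlm-roberta-base", num_labels=19
)

args = TrainingArguments(
    output_dir="xlm-roberta-base-ess-classification",  # assumed name
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,                  # linear schedule with warmup
    eval_strategy="epoch",             # "evaluation_strategy" on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss", # best model selected by validation loss
    greater_is_better=False,
    fp16=True,                         # mixed precision where supported
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: tokenized training split
    eval_dataset=val_ds,     # placeholder: tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```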