metadata
language: en
license: mit
pipeline_tag: text-classification
tags:
  - text-classification
  - xlm-roberta
  - survey-classification
  - European Social Survey
datasets:
  - benjaminBeuster/ess_classification
metrics:
  - accuracy
  - f1
  - precision
  - recall
base_model: FacebookAI/xlm-roberta-base
widget:
  - text: What is your age?
    example_title: Demographics Question
  - text: How satisfied are you with the healthcare system?
    example_title: Health Question
  - text: Trust in country's parliament
    example_title: Politics Question
  - text: What is your highest level of education?
    example_title: Education Question
  - text: How often do you pray?
    example_title: Religious Question

XLM-RoBERTa-Base for ESS Variable Classification

Fine-tuned XLM-RoBERTa-Base model for classifying European Social Survey variables into 19 subject categories.

Model Description

This model is a fine-tuned version of FacebookAI/xlm-roberta-base on the ESS variable classification dataset. It classifies survey questions and variables into predefined subject categories.

  • Base Model: XLM-RoBERTa-Base (roughly 278M parameters, most of them in the multilingual embedding table)
  • Task: Multi-class text classification (19 categories)
  • Language: English
  • Dataset: European Social Survey variables

Performance

Evaluated on the test set:

  • Accuracy: 0.8381
  • Precision (weighted): 0.7858
  • Recall (weighted): 0.8381
  • F1-Score (weighted): 0.7959
  • Test samples: 105
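
The weighted metrics above average each class's score by its share of the test samples. The sketch below is a minimal, standard-library illustration of that averaging, not the evaluation script used for this model; the function name `weighted_metrics` and the toy labels are assumptions made for the example.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    support = Counter(y_true)          # number of true samples per class
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_l = tp / predicted if predicted else 0.0   # class never predicted -> 0
        r_l = tp / count
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = count / n                           # class share of the test set
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only; they are not taken from the ESS test set.
y_true = [11, 11, 11, 3, 3, 2]
y_pred = [11, 11, 3, 3, 3, 11]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Support-weighted recall is always equal to accuracy, which is why both are reported as 0.8381. On 105 test samples that value corresponds to 88 correct predictions (88/105 ≈ 0.8381).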

Intended Use

This model is designed to automatically classify survey variables and questions from social science research into subject categories. It can be used for:

  • Organizing large survey datasets
  • Automating metadata generation
  • Subject classification of research questions
  • Data cataloging and discovery

Training Data

The model was trained on the benjaminBeuster/ess_classification dataset, which contains survey variables extracted from European Social Survey DDI XML files.

Label Mapping

The model predicts one of 19 subject categories:

Code Category
0 DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES)
1 ECONOMICS
2 EDUCATION
3 HEALTH
4 HISTORY
5 HOUSING AND LAND USE
6 LABOUR AND EMPLOYMENT
7 LAW, CRIME AND LEGAL SYSTEMS
8 MEDIA, COMMUNICATION AND LANGUAGE
9 NATURAL ENVIRONMENT
10 OTHER
11 POLITICS
12 PSYCHOLOGY
13 SCIENCE AND TECHNOLOGY
14 SOCIAL STRATIFICATION AND GROUPINGS
15 SOCIAL WELFARE POLICY AND SYSTEMS
16 SOCIETY AND CULTURE
17 TRADE, INDUSTRY AND MARKETS
18 TRANSPORT AND TRAVEL
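
For downstream code that stores only the numeric code, the table can be mirrored as a plain dictionary. This is a sketch transcribed from the table above; the names `ID2LABEL`, `LABEL2ID`, and `code_to_category` are illustrative, and the mapping shipped in the model's own configuration remains the authoritative one.

```python
# Code -> category, transcribed from the label table in this model card.
ID2LABEL = {
    0: "DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES)",
    1: "ECONOMICS",
    2: "EDUCATION",
    3: "HEALTH",
    4: "HISTORY",
    5: "HOUSING AND LAND USE",
    6: "LABOUR AND EMPLOYMENT",
    7: "LAW, CRIME AND LEGAL SYSTEMS",
    8: "MEDIA, COMMUNICATION AND LANGUAGE",
    9: "NATURAL ENVIRONMENT",
    10: "OTHER",
    11: "POLITICS",
    12: "PSYCHOLOGY",
    13: "SCIENCE AND TECHNOLOGY",
    14: "SOCIAL STRATIFICATION AND GROUPINGS",
    15: "SOCIAL WELFARE POLICY AND SYSTEMS",
    16: "SOCIETY AND CULTURE",
    17: "TRADE, INDUSTRY AND MARKETS",
    18: "TRANSPORT AND TRAVEL",
}

# Reverse lookup: category name -> code.
LABEL2ID = {label: code for code, label in ID2LABEL.items()}

def code_to_category(code: int) -> str:
    """Return the category name for a code, raising on unknown codes."""
    if code not in ID2LABEL:
        raise ValueError(f"Unknown category code: {code}")
    return ID2LABEL[code]

print(code_to_category(11))  # POLITICS
```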

Usage

Basic Classification

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "benjaminBeuster/xlm-roberta-base-ess-classification"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Classify a survey question
text = "Trust in country's parliament. Using this card, please tell me on a score of 0-10 how much you personally trust each of the institutions I read out."
result = classifier(text)

print(f"Category: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")

Batch Classification

# Classify multiple questions (reuses `classifier` from Basic Classification)
questions = [
    "How often pray apart from at religious services",
    "Highest level of education completed",
    "Trust in politicians"
]

results = classifier(questions)
for question, result in zip(questions, results):
    print(f"{question[:50]}: {result['label']} ({result['score']:.2f})")

Manual Prediction

import torch

# Tokenize input (reuses `model`, `tokenizer`, and `text` from Basic Classification)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

# Get label name
label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()

print(f"Predicted: {label} (confidence: {confidence:.4f})")
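
When the top prediction is uncertain, it can help to look at the runner-up categories as well. The sketch below shows the softmax-and-rank step in plain Python so it runs without PyTorch; the helper names `softmax` and `top_k`, the three-class mapping, and the logits are all made up for illustration. With the tensors from the manual example, `torch.topk(predictions, k)` should serve the same purpose.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Hypothetical three-class mapping and logits, for illustration only.
toy_id2label = {0: "EDUCATION", 1: "HEALTH", 2: "POLITICS"}
toy_logits = [0.5, 1.0, 3.0]

for label, prob in top_k(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.4f}")
```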

Training Procedure

Training Hyperparameters

  • Learning rate: 2e-05
  • Batch size: 8
  • Epochs: 5
  • Weight decay: 0.01
  • Warmup ratio: 0.1
  • Max sequence length: 256
  • Optimizer: AdamW
  • LR scheduler: Linear with warmup
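
To make the warmup ratio and linear schedule concrete, the sketch below computes the learning rate at each optimizer step using the usual linear-with-warmup rule: ramp up from zero over the first 10% of steps, then decay linearly back to zero. The function name `linear_warmup_lr` and the total of 100 steps are assumptions for the example; the real step count depends on the training set size, which is not restated here.

```python
import math

def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step under linear warmup and linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

total_steps = 100  # assumed value, for illustration only
schedule = [linear_warmup_lr(s, total_steps) for s in range(total_steps + 1)]

print(f"step 0:   {schedule[0]:.2e}")
print(f"step 10:  {schedule[10]:.2e}")   # end of warmup, peak learning rate
print(f"step 55:  {schedule[55]:.2e}")   # halfway through the decay
print(f"step 100: {schedule[100]:.2e}")
```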

Training Details

The model was fine-tuned using the Hugging Face Transformers library with the following setup:

  • Early stopping with patience of 2 epochs
  • Evaluation on validation set after each epoch
  • Best model selection based on validation loss
  • Mixed precision training (fp16/bf16 where supported)
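
The interaction between early stopping and best-model selection can be sketched in a few lines. This is a simplified stand-in for the training loop, not the code used to train this model; the function name `run_with_early_stopping` and the validation losses are invented for the example.

```python
def run_with_early_stopping(val_losses, patience=2):
    """Return (best_epoch, last_epoch_run) for a sequence of per-epoch losses.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch is the one kept.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Made-up validation losses for a 5-epoch run, for illustration only.
losses = [0.92, 0.71, 0.64, 0.66, 0.69]
best_epoch, last_epoch = run_with_early_stopping(losses)
print(f"best epoch: {best_epoch}, stopped after epoch: {last_epoch}")
```

With a 5-epoch budget and a patience of 2, the stop can only trigger early if the validation loss peaks by epoch 2; otherwise the run uses all five epochs and the best checkpoint is still the one with the lowest validation loss.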

Limitations and Bias

  • The model is trained on a relatively small dataset (50 samples), which may limit generalization
  • Performance may vary on survey questions outside the European Social Survey domain
  • The model may inherit biases present in the training data
  • The model was fine-tuned on English-language survey text; the base model was pretrained on 100 languages, but cross-lingual performance is not reported here

Citation

If you use this model, please cite:

@misc{xlm-roberta-ess-classifier,
  author = {Benjamin Beuster},
  title = {XLM-RoBERTa-Base for ESS Variable Classification},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification}
}

Model Card Authors

Benjamin Beuster

Model Card Contact

For questions or feedback, please open an issue on the model repository.