---
language: en
license: mit
pipeline_tag: text-classification
tags:
- text-classification
- xlm-roberta
- survey-classification
- European Social Survey
datasets:
- benjaminBeuster/ess_classification
metrics:
- accuracy
- f1
- precision
- recall
base_model: FacebookAI/xlm-roberta-base
widget:
- text: "What is your age?"
  example_title: "Demographics Question"
- text: "How satisfied are you with the healthcare system?"
  example_title: "Health Question"
- text: "Trust in country's parliament"
  example_title: "Politics Question"
- text: "What is your highest level of education?"
  example_title: "Education Question"
- text: "How often do you pray?"
  example_title: "Religious Question"
---

# XLM-RoBERTa-Base for ESS Variable Classification

Fine-tuned XLM-RoBERTa-Base model for classifying European Social Survey (ESS) variables into 19 subject categories.

## Model Description

This model is a fine-tuned version of [`FacebookAI/xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base) on the ESS variable classification dataset. It classifies survey questions and variables into predefined subject categories.

- **Base Model**: XLM-RoBERTa-Base (~270M parameters)
- **Task**: Multi-class text classification (19 categories)
- **Language**: English
- **Dataset**: European Social Survey variables

## Performance

Evaluated on the test set:

- **Accuracy**: 0.8381
- **Precision** (weighted): 0.7858
- **Recall** (weighted): 0.8381
- **F1-score** (weighted): 0.7959
- **Test samples**: 105

## Intended Use

This model is designed to automatically classify survey variables and questions from social science research into subject categories.
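The weighted averages reported under Performance weight each class's score by its share of the test set, equivalent to scikit-learn's `average='weighted'`. A minimal pure-Python sketch of weighted F1 (the `weighted_f1` helper and the toy labels are illustrative, not part of the actual evaluation code):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (support[c] / total) * f1
    return score
```

With 19 categories of uneven size, support-weighted averaging makes the headline score reflect how the model performs on the categories that actually dominate the test set.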
It can be used for:

- Organizing large survey datasets
- Automating metadata generation
- Subject classification of research questions
- Data cataloging and discovery

## Training Data

The model was trained on the [`benjaminBeuster/ess_classification`](https://huggingface.co/datasets/benjaminBeuster/ess_classification) dataset, which contains survey variables extracted from European Social Survey DDI XML files.

## Label Mapping

The model predicts one of 19 subject categories:

| Code | Category |
|------|----------|
| 0 | DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES) |
| 1 | ECONOMICS |
| 2 | EDUCATION |
| 3 | HEALTH |
| 4 | HISTORY |
| 5 | HOUSING AND LAND USE |
| 6 | LABOUR AND EMPLOYMENT |
| 7 | LAW, CRIME AND LEGAL SYSTEMS |
| 8 | MEDIA, COMMUNICATION AND LANGUAGE |
| 9 | NATURAL ENVIRONMENT |
| 10 | OTHER |
| 11 | POLITICS |
| 12 | PSYCHOLOGY |
| 13 | SCIENCE AND TECHNOLOGY |
| 14 | SOCIAL STRATIFICATION AND GROUPINGS |
| 15 | SOCIAL WELFARE POLICY AND SYSTEMS |
| 16 | SOCIETY AND CULTURE |
| 17 | TRADE, INDUSTRY AND MARKETS |
| 18 | TRANSPORT AND TRAVEL |

## Usage

### Basic Classification

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "benjaminBeuster/xlm-roberta-base-ess-classification"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Classify a survey question
text = "Trust in country's parliament. Using this card, please tell me on a score of 0-10 how much you personally trust each of the institutions I read out."
result = classifier(text)

print(f"Category: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")
```

### Batch Classification

```python
# Classify multiple questions
questions = [
    "How often pray apart from at religious services",
    "Highest level of education completed",
    "Trust in politicians"
]

results = classifier(questions)
for question, result in zip(questions, results):
    print(f"{question[:50]}: {result['label']} ({result['score']:.2f})")
```

### Manual Prediction

```python
import torch

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

# Get label name
label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()

print(f"Predicted: {label} (confidence: {confidence:.4f})")
```

## Training Procedure

### Training Hyperparameters

- **Learning rate**: 2e-5
- **Batch size**: 8
- **Epochs**: 5
- **Weight decay**: 0.01
- **Warmup ratio**: 0.1
- **Max sequence length**: 256
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup

### Training Details

The model was fine-tuned using the Hugging Face Transformers library with the following setup:

- Early stopping with a patience of 2 epochs
- Evaluation on the validation set after each epoch
- Best model selected by validation loss
- Mixed precision training (fp16/bf16 where supported)

## Limitations and Bias

- The model was trained on a relatively small dataset (50 samples), which may limit generalization
- Performance may vary on survey questions outside the European Social Survey domain
- The model may inherit biases present in the training data
- English-language surveys are the primary focus, though the base model supports 100 languages

## Citation

If you use this model, please cite:

```bibtex
@misc{xlm-roberta-ess-classifier,
  author = {Benjamin Beuster},
  title = {XLM-RoBERTa-Base for ESS Variable Classification},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification}
}
```

## Model Card Authors

Benjamin Beuster

## Model Card Contact

For questions or feedback, please open an issue on the [model repository](https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification).
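For reference, the hyperparameters and early-stopping setup listed under Training Procedure roughly correspond to the following Transformers configuration. This is a sketch, not the published training script: `train_ds` and `val_ds` are placeholders for tokenized dataset splits, and the output directory name is an assumption.

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# 19-way classification head on top of the base encoder
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/xlm-roberta-base", num_labels=19
)

args = TrainingArguments(
    output_dir="xlm-roberta-base-ess-classification",  # assumed name
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,                  # linear schedule with warmup
    eval_strategy="epoch",             # "evaluation_strategy" on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss", # best model selected by validation loss
    greater_is_better=False,
    fp16=True,                         # mixed precision where supported
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: tokenized training split
    eval_dataset=val_ds,     # placeholder: tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```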