---
language: en
license: mit
pipeline_tag: text-classification
tags:
- text-classification
- xlm-roberta
- survey-classification
- European Social Survey
datasets:
- benjaminBeuster/ess_classification
metrics:
- accuracy
- f1
- precision
- recall
base_model: FacebookAI/xlm-roberta-base
widget:
- text: "What is your age?"
  example_title: "Demographics Question"
- text: "How satisfied are you with the healthcare system?"
  example_title: "Health Question"
- text: "Trust in country's parliament"
  example_title: "Politics Question"
- text: "What is your highest level of education?"
  example_title: "Education Question"
- text: "How often do you pray?"
  example_title: "Religious Question"
---
|
|
|
|
|
# XLM-RoBERTa-Base for ESS Variable Classification |
|
|
|
|
|
A fine-tuned XLM-RoBERTa-Base model that classifies European Social Survey (ESS) variables into 19 subject categories.
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a fine-tuned version of [`FacebookAI/xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base) on the ESS variable classification dataset. It classifies survey questions and variables into predefined subject categories. |
|
|
|
|
|
- **Base Model**: XLM-RoBERTa-Base (~279M parameters)
|
|
- **Task**: Multi-class text classification (19 categories) |
|
|
- **Language**: English |
|
|
- **Dataset**: European Social Survey variables |
|
|
|
|
|
## Performance |
|
|
|
|
|
Evaluated on the held-out test set:
|
|
|
|
|
- **Accuracy**: 0.8381 |
|
|
- **Precision** (weighted): 0.7858 |
|
|
- **Recall** (weighted): 0.8381 |
|
|
- **F1-Score** (weighted): 0.7959 |
|
|
- **Test samples**: 105 |
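The weighted metrics above average per-class scores by class support, so frequent categories count more. A minimal pure-Python sketch of weighted F1 on toy labels (not the actual test set):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 averaged by each class's support in y_true."""
    classes = set(y_true) | set(y_pred)
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * support[c] / total
    return score

# Toy 3-class example with one misclassification
print(round(weighted_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]), 4))  # → 0.7867
```

This matches what `sklearn.metrics.f1_score(..., average="weighted")` computes on the same inputs.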
|
|
|
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed to automatically classify survey variables and questions from social science research into subject categories. It can be used for: |
|
|
|
|
|
- Organizing large survey datasets |
|
|
- Automating metadata generation |
|
|
- Subject classification of research questions |
|
|
- Data cataloging and discovery |
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on the [`benjaminBeuster/ess_classification`](https://huggingface.co/datasets/benjaminBeuster/ess_classification) dataset, which contains survey variables extracted from European Social Survey DDI XML files. |
|
|
|
|
|
## Label Mapping |
|
|
|
|
|
The model predicts one of 19 subject categories: |
|
|
|
|
|
| Code | Category | |
|
|
|------|----------| |
|
|
| 0 | DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES) | |
|
|
| 1 | ECONOMICS | |
|
|
| 2 | EDUCATION | |
|
|
| 3 | HEALTH | |
|
|
| 4 | HISTORY | |
|
|
| 5 | HOUSING AND LAND USE | |
|
|
| 6 | LABOUR AND EMPLOYMENT | |
|
|
| 7 | LAW, CRIME AND LEGAL SYSTEMS | |
|
|
| 8 | MEDIA, COMMUNICATION AND LANGUAGE | |
|
|
| 9 | NATURAL ENVIRONMENT | |
|
|
| 10 | OTHER | |
|
|
| 11 | POLITICS | |
|
|
| 12 | PSYCHOLOGY | |
|
|
| 13 | SCIENCE AND TECHNOLOGY | |
|
|
| 14 | SOCIAL STRATIFICATION AND GROUPINGS | |
|
|
| 15 | SOCIAL WELFARE POLICY AND SYSTEMS | |
|
|
| 16 | SOCIETY AND CULTURE | |
|
|
| 17 | TRADE, INDUSTRY AND MARKETS | |
|
|
| 18 | TRANSPORT AND TRAVEL | |
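The same mapping is exposed at inference time via `model.config.id2label`. As a plain Python dict (codes from the table above), useful for decoding raw logits without loading the model config:

```python
# Label mapping from the table above; model.config.id2label should agree with this.
ID2LABEL = {
    0: "DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES)",
    1: "ECONOMICS",
    2: "EDUCATION",
    3: "HEALTH",
    4: "HISTORY",
    5: "HOUSING AND LAND USE",
    6: "LABOUR AND EMPLOYMENT",
    7: "LAW, CRIME AND LEGAL SYSTEMS",
    8: "MEDIA, COMMUNICATION AND LANGUAGE",
    9: "NATURAL ENVIRONMENT",
    10: "OTHER",
    11: "POLITICS",
    12: "PSYCHOLOGY",
    13: "SCIENCE AND TECHNOLOGY",
    14: "SOCIAL STRATIFICATION AND GROUPINGS",
    15: "SOCIAL WELFARE POLICY AND SYSTEMS",
    16: "SOCIETY AND CULTURE",
    17: "TRADE, INDUSTRY AND MARKETS",
    18: "TRANSPORT AND TRAVEL",
}

print(ID2LABEL[11])  # → POLITICS
```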
|
|
|
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Classification |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_name = "benjaminBeuster/xlm-roberta-base-ess-classification" |
|
|
model = AutoModelForSequenceClassification.from_pretrained(model_name) |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
|
|
|
# Create classification pipeline |
|
|
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer) |
|
|
|
|
|
# Classify a survey question |
|
|
text = "Trust in country's parliament. Using this card, please tell me on a score of 0-10 how much you personally trust each of the institutions I read out." |
|
|
result = classifier(text) |
|
|
|
|
|
print(f"Category: {result[0]['label']}") |
|
|
print(f"Confidence: {result[0]['score']:.4f}") |
|
|
``` |
|
|
|
|
|
### Batch Classification |
|
|
|
|
|
```python |
|
|
# Classify multiple questions |
|
|
questions = [ |
|
|
"How often pray apart from at religious services", |
|
|
"Highest level of education completed", |
|
|
"Trust in politicians" |
|
|
] |
|
|
|
|
|
results = classifier(questions) |
|
|
for question, result in zip(questions, results): |
|
|
print(f"{question[:50]}: {result['label']} ({result['score']:.2f})") |
|
|
``` |
|
|
|
|
|
### Manual Prediction |
|
|
|
|
|
```python
import torch

# Reuses `model` and `tokenizer` loaded in the basic example above
text = "Trust in country's parliament"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
|
|
|
|
|
# Get predictions |
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs) |
|
|
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) |
|
|
predicted_class = torch.argmax(predictions, dim=-1).item() |
|
|
|
|
|
# Get label name |
|
|
label = model.config.id2label[predicted_class] |
|
|
confidence = predictions[0][predicted_class].item() |
|
|
|
|
|
print(f"Predicted: {label} (confidence: {confidence:.4f})") |
|
|
``` |
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Training Hyperparameters |
|
|
|
|
|
- **Learning rate**: 2e-05 |
|
|
- **Batch size**: 8 |
|
|
- **Epochs**: 5 |
|
|
- **Weight decay**: 0.01 |
|
|
- **Warmup ratio**: 0.1 |
|
|
- **Max sequence length**: 256 |
|
|
- **Optimizer**: AdamW |
|
|
- **LR scheduler**: Linear with warmup |
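The linear-with-warmup schedule ramps the learning rate from 0 to the base rate over the first 10% of steps, then decays it linearly to 0. A hypothetical minimal sketch of the shape (mirroring what `transformers.get_linear_schedule_with_warmup` produces; step counts are illustrative):

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` for a linear schedule with warmup."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 100  # illustrative total step count, not the actual training run
print(linear_warmup_lr(5, total))    # → 1e-05 (mid-warmup)
print(linear_warmup_lr(10, total))   # → 2e-05 (warmup complete, peak LR)
print(linear_warmup_lr(100, total))  # → 0.0 (end of training)
```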
|
|
|
|
|
### Training Details |
|
|
|
|
|
The model was fine-tuned using the Hugging Face Transformers library with the following setup: |
|
|
|
|
|
- Early stopping with patience of 2 epochs |
|
|
- Evaluation on validation set after each epoch |
|
|
- Best model selection based on validation loss |
|
|
- Mixed precision training (fp16/bf16 where supported) |
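The early-stopping rule above halts training once the validation loss fails to improve for two consecutive epochs. A minimal sketch of that logic with made-up loss values (the Trainer's `EarlyStoppingCallback` implements the real version):

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 0-indexed epoch at which training stops, or None if it never does.

    Stops after `patience` consecutive epochs without a new best validation loss.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Illustrative losses: best at epoch 1, then two non-improving epochs
print(early_stop_epoch([0.9, 0.7, 0.72, 0.71]))  # → 3
```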
|
|
|
|
|
## Limitations and Bias |
|
|
|
|
|
- The model was fine-tuned on a small dataset (50 training samples), which may limit generalization
|
|
- Performance may vary on survey questions outside the European Social Survey domain |
|
|
- The model may inherit biases present in the training data |
|
|
- English-language surveys are the primary focus, though the base model supports 100 languages |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{xlm-roberta-ess-classifier, |
|
|
author = {Benjamin Beuster}, |
|
|
  title = {XLM-RoBERTa-Base for ESS Variable Classification},
|
|
year = {2025}, |
|
|
publisher = {Hugging Face}, |
|
|
url = {https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
Benjamin Beuster |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions or feedback, please open an issue on the [model repository](https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification). |
|
|
|