# NFQA Multilingual Question Classifier
A multilingual question classification model that categorizes questions into 8 distinct types based on the Non-Factoid Question Answering (NFQA) taxonomy.
## Model Description
This model classifies questions across **49 languages** into **8 categories** of question types, enabling better understanding of user intent and question characteristics for information retrieval and question answering systems.
### Model Details
- **Model Type**: Multilingual Text Classification
- **Base Model**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
- **Languages**: 49 languages (European, Asian, and Middle Eastern languages)
- **Categories**: 8 NFQA question types
- **Parameters**: ~278M parameters
- **Training Date**: January 2026
- **License**: apache-2.0
### Developers
Developed by Ali Salman for research in multilingual question understanding and classification.
### Architecture
The model is based on XLM-RoBERTa (Cross-lingual Language Model - Robustly Optimized BERT Approach), a transformer-based multilingual encoder:
- **Base Architecture**: 12-layer transformer encoder
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Parameters**: ~278M
- **Vocabulary Size**: 250,000 tokens (SentencePiece)
- **Pre-training**: Trained on 2.5TB of CommonCrawl data in 100 languages
- **Fine-tuning**: Classification head with dropout (0.2) for 8-class NFQA classification
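The fine-tuning head described above can be sketched as a small PyTorch module. This is an illustrative reconstruction following the standard XLM-RoBERTa classification-head pattern (dense layer, tanh, dropout, output projection) with the sizes stated in this card, not the exact training code:

```python
import torch
import torch.nn as nn

class NFQAClassificationHead(nn.Module):
    """Illustrative sketch of the fine-tuning head: dense -> tanh -> dropout -> 8-way projection.
    Sizes follow the card (hidden size 768, dropout 0.2, 8 NFQA labels)."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 8, dropout: float = 0.2):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        # cls_embedding: (batch, hidden_size) pooled <s> token representation
        x = self.dropout(torch.tanh(self.dense(self.dropout(cls_embedding))))
        return self.out_proj(x)  # (batch, num_labels) logits

head = NFQAClassificationHead()
logits = head(torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 8])
```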
## Intended Use
### Primary Use Cases
- **Question Type Classification**: Automatically categorize user questions to route them to appropriate answering systems
- **Search Intent Understanding**: Enhance search engines by understanding the type of information users seek
- **Chatbot Development**: Improve conversational AI by identifying question types
- **FAQ Organization**: Automatically organize FAQ databases by question type
- **Content Recommendation**: Suggest relevant content based on question type
### Out-of-Scope Use
- This model is NOT designed for content moderation or filtering
- Should not be used as the sole decision-maker in high-stakes applications
- Not suitable for detecting malicious intent or harmful content
## Training Data
### Dataset
The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)**, a large-scale multilingual dataset for non-factoid question classification.
**Dataset Composition**:
- **Training**: 33,602 examples (70%)
- **Validation**: 6,979 examples (15%)
- **Test**: 7,696 examples (15%)
- **Total**: 48,277 balanced examples
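As a quick sanity check, the split sizes above add up to the stated total, and the fractions roughly match the nominal 70/15/15 split (the validation share is closer to 14.5%):

```python
splits = {"train": 33_602, "validation": 6_979, "test": 7_696}
total = sum(splits.values())
print(total)  # 48277

for name, n in splits.items():
    print(f"{name}: {n / total:.1%}")
# train: 69.6%
# validation: 14.5%
# test: 15.9%
```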
**Source Distribution**:
- 54% from WebFAQ dataset (annotated with LLM ensemble)
- 46% AI-generated to balance language-category combinations
**Key Features**:
- 392 unique (language, category) combinations
- Target of ~125 examples per combination
- Stratified sampling to ensure balanced representation
- Ensemble annotation using Llama 3.1, Gemma 2, and Qwen 2.5
For detailed information about dataset generation, annotation methodology, and data composition, see the [dataset page](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset).
### Languages Supported
**European Languages** (29): English (en), German (de), French (fr), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl), Polish (pl), Romanian (ro), Czech (cs), Slovak (sk), Bulgarian (bg), Croatian (hr), Serbian (sr), Slovenian (sl), Albanian (sq), Estonian (et), Latvian (lv), Lithuanian (lt), Danish (da), Norwegian (no), Swedish (sv), Finnish (fi), Icelandic (is), Greek (el), Turkish (tr), Ukrainian (uk), Russian (ru), Hungarian (hu)
**Asian Languages** (12): Chinese (zh), Japanese (ja), Korean (ko), Hindi (hi), Bengali (bn), Marathi (mr), Thai (th), Vietnamese (vi), Indonesian (id), Malay (ms), Tagalog/Filipino (tl), Urdu (ur)
**Middle Eastern Languages** (8): Arabic (ar), Persian/Farsi (fa), Hebrew (he), Georgian (ka), Azerbaijani (az), Kazakh (kk), Uzbek (uz)
## Classification Categories
The model classifies questions into 8 distinct categories:
### 1. NOT-A-QUESTION (Label 0)
Statements or phrases that are not actual questions.
**Examples:**
- "Price of dental treatment"
- "Best restaurants nearby"
- "Weather today"
### 2. FACTOID (Label 1)
Questions seeking factual, objective answers (who, what, when, where).
**Examples:**
- "What is the capital of France?"
- "When was the Eiffel Tower built?"
- "Who invented the telephone?"
### 3. DEBATE (Label 2)
Hypothetical, opinion-based, or debatable questions.
**Examples:**
- "Is artificial intelligence dangerous?"
- "Should we colonize Mars?"
- "Is remote work better than office work?"
### 4. EVIDENCE-BASED (Label 3)
Questions about definitions, features, or characteristics.
**Examples:**
- "What are the symptoms of flu?"
- "What features does this phone have?"
- "What is machine learning?"
### 5. INSTRUCTION (Label 4)
How-to questions requiring step-by-step procedural answers.
**Examples:**
- "How do I reset my password?"
- "How to bake chocolate chip cookies?"
- "How can I install Python on Windows?"
### 6. REASON (Label 5)
Why/how questions seeking explanations or reasoning.
**Examples:**
- "Why is the sky blue?"
- "How does photosynthesis work?"
- "Why do birds migrate?"
### 7. EXPERIENCE (Label 6)
Questions seeking personal experiences, recommendations, or advice.
**Examples:**
- "What's the best laptop for students?"
- "Has anyone tried this restaurant?"
- "Which hotel would you recommend?"
### 8. COMPARISON (Label 7)
Questions comparing two or more options.
**Examples:**
- "iPhone vs Android: which is better?"
- "What's the difference between RNA and DNA?"
- "Compare electric and gas cars"
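The eight categories above correspond to the following id-to-name mapping, which at inference time is also exposed as `model.config.id2label` on the loaded checkpoint:

```python
# Label ids exactly as documented in this card.
ID2LABEL = {
    0: "NOT-A-QUESTION",
    1: "FACTOID",
    2: "DEBATE",
    3: "EVIDENCE-BASED",
    4: "INSTRUCTION",
    5: "REASON",
    6: "EXPERIENCE",
    7: "COMPARISON",
}
LABEL2ID = {name: i for i, name in ID2LABEL.items()}

print(ID2LABEL[4])         # INSTRUCTION
print(LABEL2ID["REASON"])  # 5
```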
## Model Performance
### Test Set Results (7,696 examples)
- **Overall Accuracy**: 88.1%
- **Macro-Average F1**: 88.1%
- **Best Validation F1**: 88.1% (achieved at epoch 6)
### Per-Category Performance
| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| NOT-A-QUESTION | 0.96 | 0.92 | 0.94 | 950 |
| FACTOID | 0.84 | 0.79 | 0.81 | 980 |
| DEBATE | 0.90 | 0.95 | 0.92 | 916 |
| EVIDENCE-BASED | 0.86 | 0.92 | 0.89 | 950 |
| INSTRUCTION | 0.85 | 0.92 | 0.88 | 980 |
| REASON | 0.88 | 0.86 | 0.87 | 960 |
| EXPERIENCE | 0.82 | 0.76 | 0.79 | 980 |
| COMPARISON | 0.93 | 0.93 | 0.93 | 980 |
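The macro-averaged F1 can be reproduced from the per-category scores in the table; the small gap to the reported 88.1% comes from the per-class values being rounded to two decimals:

```python
# Per-category F1 scores as reported in the table above.
f1_scores = {
    "NOT-A-QUESTION": 0.94, "FACTOID": 0.81, "DEBATE": 0.92,
    "EVIDENCE-BASED": 0.89, "INSTRUCTION": 0.88, "REASON": 0.87,
    "EXPERIENCE": 0.79, "COMPARISON": 0.93,
}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(f"{macro_f1:.1%}")  # 87.9%
```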
### Key Observations
- **Strongest Performance**: NOT-A-QUESTION, COMPARISON, and DEBATE categories (F1 ≥ 0.92)
- **Good Performance**: EVIDENCE-BASED, INSTRUCTION, and REASON categories (F1 ≥ 0.87)
- **Moderate Performance**: FACTOID and EXPERIENCE categories (F1 ~ 0.79-0.81)
- The model generalizes well across all 49 languages with balanced test set distribution
### Confusion Matrix

The confusion matrix shows the model's prediction patterns across all 8 categories. The diagonal elements represent correct classifications, while off-diagonal elements show misclassifications between categories.
## Training Procedure
### Hardware
- **Training Device**: CUDA-enabled GPU (NVIDIA)
- **Training Duration**: 6 epochs to reach best validation performance
### Hyperparameters
```python
{
    "model_name": "xlm-roberta-base",
    "max_length": 128,              # Maximum sequence length
    "batch_size": 16,               # Training batch size
    "learning_rate": 2e-5,          # AdamW learning rate
    "num_epochs": 6,                # Total epochs trained
    "warmup_steps": 500,            # Linear warmup steps
    "weight_decay": 0.01,           # L2 regularization
    "dropout": 0.2,                 # Dropout probability
    "optimizer": "AdamW",           # Optimizer
    "scheduler": "linear_warmup",   # Learning rate scheduler
    "gradient_clipping": 1.0,       # Max gradient norm
    "random_seed": 42               # Reproducibility
}
```
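Under these hyperparameters, the linear-warmup schedule ramps the learning rate from 0 to 2e-5 over the first 500 steps and then decays it linearly back to 0 over the remaining steps. A minimal sketch (the 12,606 total steps are taken from the training process section below):

```python
def linear_warmup_lr(step: int, base_lr: float = 2e-5,
                     warmup_steps: int = 500, total_steps: int = 12_606) -> float:
    """Linear warmup followed by linear decay, as used with the AdamW optimizer."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

print(linear_warmup_lr(250))     # 1e-05  (halfway through warmup)
print(linear_warmup_lr(500))     # 2e-05  (peak)
print(linear_warmup_lr(12_606))  # 0.0    (end of training)
```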
### Training Process
1. **Data Preparation**: Pre-split balanced dataset from [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
- Training: 33,602 examples (70%)
- Validation: 6,979 examples (15%)
- Test: 7,696 examples (15%)
2. **Preprocessing**: Tokenization using XLM-RoBERTa tokenizer (max length: 128 tokens)
3. **Training Strategy**: Supervised fine-tuning with stratified train/val/test splits
- Stratified by (language, category) combinations to maintain balance
4. **Optimization**: AdamW optimizer with linear warmup and gradient clipping
- Total training steps: 12,606 (33,602 examples × 6 epochs ÷ 16 batch size)
- Warmup steps: 500
5. **Best Model Selection**: Model checkpoint with highest validation F1 score (epoch 6)
6. **Evaluation**: Comprehensive testing on held-out test set with per-category and per-language analysis
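The step count in point 4 follows from ceiling division of the training set by the batch size, assuming the last partial batch of each epoch is kept:

```python
import math

train_examples, batch_size, epochs = 33_602, 16, 6
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 2101 12606
```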
### Training Curves

The training curves show the model's learning progress across 6 epochs:
- **Left panel**: Training and validation loss over time
- **Middle panel**: Training and validation accuracy progression
- **Right panel**: Validation F1 score (macro average) with best checkpoint marked
The model improved steadily, reaching its best validation performance at the final epoch (6) with minimal overfitting.
## Usage
### Try it in Google Colab
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1cxgJVwKbmQFtTzRTeXVtpn7vKj-sdU1n?usp=sharing)
Test the model instantly in your browser without any setup! The Colab notebook includes examples in multiple languages and demonstrates all classification categories.
### Installation
```bash
pip install transformers torch
```
### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "AliSalman29/nfqa-multilingual-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example questions in different languages
questions = [
    "What is the capital of France?",      # English - FACTOID
    "¿Cómo hacer una tortilla española?",  # Spanish - INSTRUCTION
    "Warum ist der Himmel blau?",          # German - REASON
    "iPhone還是Android更好?",               # Chinese - COMPARISON
]

# Classify questions
for question in questions:
    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = predictions[0][predicted_class].item()

    # Get category name
    category = model.config.id2label[predicted_class]
    print(f"Question: {question}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.2%}\n")
```
### Output Example
```
Question: What is the capital of France?
Category: FACTOID
Confidence: 94.32%
Question: ¿Cómo hacer una tortilla española?
Category: INSTRUCTION
Confidence: 89.17%
Question: Warum ist der Himmel blau?
Category: REASON
Confidence: 85.63%
Question: iPhone還是Android更好?
Category: COMPARISON
Confidence: 91.24%
```
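The confidence values above are softmax probabilities over the eight class logits. A minimal pure-Python illustration of how a confidence score is derived from raw logits (the logit values here are made up for illustration):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [4.1, 0.3, -1.2, 0.8, 0.0, 1.5, -0.7, 0.2]  # hypothetical 8-class logits
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(probs[predicted], 2))
```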
### Batch Processing
```python
def classify_questions_batch(questions, model, tokenizer, batch_size=32):
    """Classify multiple questions efficiently"""
    model.eval()
    results = []

    for i in range(0, len(questions), batch_size):
        batch = questions[i:i+batch_size]

        # Tokenize batch
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            truncation=True,
            max_length=128,
            padding=True
        )

        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=-1)
            confidences = predictions[range(len(batch)), predicted_classes]

        # Store results
        for j, question in enumerate(batch):
            results.append({
                'question': question,
                'category': model.config.id2label[predicted_classes[j].item()],
                'label_id': predicted_classes[j].item(),
                'confidence': confidences[j].item()
            })

    return results

# Usage
questions = ["Question 1", "Question 2", ...]
results = classify_questions_batch(questions, model, tokenizer)
```
### Integration with Pipelines
```python
from transformers import pipeline

# Create classification pipeline
classifier = pipeline(
    "text-classification",
    model="AliSalman29/nfqa-multilingual-classifier",
    tokenizer="AliSalman29/nfqa-multilingual-classifier",
    device=0  # Use GPU if available (0), or -1 for CPU
)

# Classify single question
result = classifier("How do I learn Python?", truncation=True, max_length=128)
print(result)
# Output: [{'label': 'INSTRUCTION', 'score': 0.91}]

# Classify multiple questions
results = classifier(
    ["What is AI?", "Why do cats purr?", "Best pizza in town?"],
    truncation=True,
    max_length=128
)
for r in results:
    print(f"{r['label']}: {r['score']:.2%}")
```
## Limitations and Biases
### Known Limitations
1. **Language Imbalance**: While supporting 49 languages, the model may perform better on high-resource languages (English, Spanish, French) compared to low-resource languages
2. **Domain Specificity**: Trained primarily on FAQ-style questions; may not generalize perfectly to other question formats (e.g., academic questions, technical queries)
3. **Category Overlap**: Some questions may legitimately belong to multiple categories, but the model outputs a single prediction
4. **Short Questions**: Very short questions (1-2 words) may lack sufficient context for accurate classification
5. **Context Dependency**: The model analyzes questions in isolation without conversational context
### Potential Biases
- **Annotation Bias**: Labels are based on LLM ensemble predictions (Llama 3.1, Gemma 2, Qwen 2.5) rather than human annotations, which may introduce systematic biases from these underlying models
- **Training Data Bias**: The model inherits biases from the WebFAQ dataset and AI-generated examples
- **Language Representation**: While the dataset includes 49 languages, some language families may have different performance characteristics
- **Category Distribution**: The balanced dataset has similar representation across categories (~980 examples each in test set), which may differ from real-world distributions
- **Domain Specificity**: Trained primarily on FAQ-style and general questions; performance may vary on domain-specific questions
### Recommendations for Use
- Use confidence scores to identify uncertain predictions
- Consider ensemble approaches for critical applications
- Validate performance on your specific domain and languages before production deployment
- Implement human review for high-stakes decisions
- Monitor performance across different language groups in your application
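One way to act on the first recommendation above is a simple confidence gate that routes low-confidence predictions to human review. A sketch; the 0.5 threshold is an arbitrary starting point you should tune on your own data:

```python
def route_prediction(category: str, confidence: float, threshold: float = 0.5) -> dict:
    """Accept confident predictions; flag the rest for human review."""
    return {"category": category, "needs_review": confidence < threshold}

print(route_prediction("FACTOID", 0.94))     # confident -> accepted
print(route_prediction("EXPERIENCE", 0.41))  # uncertain -> flagged
```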
## Ethical Considerations
- **Transparency**: Users should be informed when interacting with automated classification systems
- **Privacy**: The model processes text locally and does not store or transmit user queries
- **Fairness**: Regular audits should be conducted to ensure equitable performance across languages and user groups
- **Accountability**: Human oversight is recommended for applications affecting user experience or decisions
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{nfqa-multilingual-2026,
  author       = {Ali Salman},
  title        = {NFQA Multilingual Question Classifier},
  year         = {2026},
  publisher    = {HuggingFace},
  journal      = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/AliSalman29/nfqa-multilingual-classifier}}
}
```
Please also cite the training dataset:
```bibtex
@dataset{nfqa_multilingual_dataset_2026,
  author       = {Ali Salman},
  title        = {NFQA Multilingual Dataset: A Large-Scale Dataset for Non-Factoid Question Classification},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset}}
}
```
## Related Resources
- **Training Dataset**: [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
- **WebFAQ Dataset**: [PaDaS-Lab/webfaq](https://huggingface.co/datasets/PaDaS-Lab/webfaq)
- **XLM-RoBERTa**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
## Model Card Contact
For questions, feedback, or issues:
- **GitHub Issues**: https://github.com/Ali-Salman29/nfqa-multilingual-classifier
- **Email**: salman.khuwaja29@gmail.com
- **Organization**: University of Passau
## Acknowledgments
- Training dataset: [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
- Source data: [WebFAQ Dataset](https://huggingface.co/datasets/PaDaS-Lab/webfaq)
- Built on the [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) foundation model by Meta AI
- Annotation and generation using Llama 3.1, Gemma 2, and Qwen 2.5
---
**Model Version**: 1.0
**Last Updated**: February 2026
**Status**: Production Ready