NFQA Multilingual Question Classifier
A multilingual question classification model that categorizes questions into 8 distinct types based on the Non-Factoid Question Answering (NFQA) taxonomy.
Model Description
This model classifies questions across 49 languages into 8 categories of question types, enabling better understanding of user intent and question characteristics for information retrieval and question answering systems.
Model Details
- Model Type: Multilingual Text Classification
- Base Model: xlm-roberta-base
- Languages: 49 languages (European, Asian, and Middle Eastern languages)
- Categories: 8 NFQA question types
- Parameters: ~278M parameters
- Training Date: January 2026
- License: apache-2.0
Developers
Developed by Ali Salman for research in multilingual question understanding and classification.
Architecture
The model is based on XLM-RoBERTa (Cross-lingual Language Model - Robustly Optimized BERT Approach), a transformer-based multilingual encoder:
- Base Architecture: 12-layer transformer encoder
- Hidden Size: 768
- Attention Heads: 12
- Parameters: ~278M
- Vocabulary Size: 250,000 tokens (SentencePiece)
- Pre-training: Trained on 2.5TB of CommonCrawl data in 100 languages
- Fine-tuning: Classification head with dropout (0.2) for 8-class NFQA classification
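As a rough sanity check on the ~278M figure, the embedding table dominates the parameter count. A back-of-the-envelope tally using the standard xlm-roberta-base shapes (the released tokenizer has 250,002 entries, which the card rounds to 250,000; biases and LayerNorm weights are omitted, so the exact total differs slightly):

```python
# Approximate parameter tally for xlm-roberta-base + 8-class head.
vocab_size = 250_002   # XLM-R SentencePiece vocabulary (card rounds to 250,000)
hidden = 768
layers = 12
ffn = 3072             # feed-forward inner size in the base config

embeddings = vocab_size * hidden                        # ~192M, the bulk of the model
per_layer = (4 * hidden * hidden) + (2 * hidden * ffn)  # attention + FFN weight matrices
encoder = layers * per_layer                            # ~85M
head = hidden * 8 + 8                                   # the 8-class classification head is tiny

total = embeddings + encoder + head
print(f"approximate total: {total / 1e6:.0f}M")  # ~277M, consistent with the stated ~278M
```

The point of the breakdown: roughly 70% of the parameters sit in the multilingual embedding table, which is why XLM-R is so much larger than an English-only BERT of the same depth.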
Intended Use
Primary Use Cases
- Question Type Classification: Automatically categorize user questions to route them to appropriate answering systems
- Search Intent Understanding: Enhance search engines by understanding the type of information users seek
- Chatbot Development: Improve conversational AI by identifying question types
- FAQ Organization: Automatically organize FAQ databases by question type
- Content Recommendation: Suggest relevant content based on question type
Out-of-Scope Use
- This model is NOT designed for content moderation or filtering
- Should not be used as the sole decision-maker in high-stakes applications
- Not suitable for detecting malicious intent or harmful content
Training Data
Dataset
The model was trained on the NFQA Multilingual Dataset, a large-scale multilingual dataset for non-factoid question classification.
Dataset Composition:
- Training: 33,602 examples (70%)
- Validation: 6,979 examples (15%)
- Test: 7,696 examples (15%)
- Total: 48,277 balanced examples
Source Distribution:
- 54% from the WebFAQ dataset (annotated with an LLM ensemble)
- 46% AI-generated to balance language-category combinations
Key Features:
- 392 unique (language, category) combinations
- Target of ~125 examples per combination
- Stratified sampling to ensure balanced representation
- Ensemble annotation using Llama 3.1, Gemma 2, and Qwen 2.5
For detailed information about dataset generation, annotation methodology, and data composition, please see the dataset page.
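The composition figures above are internally consistent, and the 392 combinations are exactly the cross product of languages and categories. A quick check using the numbers quoted in the card:

```python
# Consistency check for the dataset composition figures.
languages = 49
categories = 8
combinations = languages * categories   # 392 unique (language, category) pairs
target_per_combo = 125                  # stated target per combination
total_examples = 48_277                 # stated dataset total

avg_per_combo = total_examples / combinations
print(combinations, round(avg_per_combo, 1))  # 392 pairs, ~123 examples each on average
```

The slight shortfall against the ~125 target (about 123 on average) is expected, since not every language-category pair could be filled to exactly the target.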
Languages Supported
European Languages (29): English (en), German (de), French (fr), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl), Polish (pl), Romanian (ro), Czech (cs), Slovak (sk), Bulgarian (bg), Croatian (hr), Serbian (sr), Slovenian (sl), Albanian (sq), Estonian (et), Latvian (lv), Lithuanian (lt), Danish (da), Norwegian (no), Swedish (sv), Finnish (fi), Icelandic (is), Greek (el), Turkish (tr), Ukrainian (uk), Russian (ru), Hungarian (hu)
Asian Languages (12): Chinese (zh), Japanese (ja), Korean (ko), Hindi (hi), Bengali (bn), Marathi (mr), Thai (th), Vietnamese (vi), Indonesian (id), Malay (ms), Tagalog/Filipino (tl), Urdu (ur)
Middle Eastern Languages (8): Arabic (ar), Persian/Farsi (fa), Hebrew (he), Georgian (ka), Azerbaijani (az), Kazakh (kk), Uzbek (uz)
Classification Categories
The model classifies questions into 8 distinct categories:
1. NOT-A-QUESTION (Label 0)
Statements or phrases that are not actual questions.
Examples:
- "Price of dental treatment"
- "Best restaurants nearby"
- "Weather today"
2. FACTOID (Label 1)
Questions seeking factual, objective answers (who, what, when, where).
Examples:
- "What is the capital of France?"
- "When was the Eiffel Tower built?"
- "Who invented the telephone?"
3. DEBATE (Label 2)
Hypothetical, opinion-based, or debatable questions.
Examples:
- "Is artificial intelligence dangerous?"
- "Should we colonize Mars?"
- "Is remote work better than office work?"
4. EVIDENCE-BASED (Label 3)
Questions about definitions, features, or characteristics.
Examples:
- "What are the symptoms of flu?"
- "What features does this phone have?"
- "What is machine learning?"
5. INSTRUCTION (Label 4)
How-to questions requiring step-by-step procedural answers.
Examples:
- "How do I reset my password?"
- "How to bake chocolate chip cookies?"
- "How can I install Python on Windows?"
6. REASON (Label 5)
Why/how questions seeking explanations or reasoning.
Examples:
- "Why is the sky blue?"
- "How does photosynthesis work?"
- "Why do birds migrate?"
7. EXPERIENCE (Label 6)
Questions seeking personal experiences, recommendations, or advice.
Examples:
- "What's the best laptop for students?"
- "Has anyone tried this restaurant?"
- "Which hotel would you recommend?"
8. COMPARISON (Label 7)
Questions comparing two or more options.
Examples:
- "iPhone vs Android: which is better?"
- "What's the difference between RNA and DNA?"
- "Compare electric and gas cars"
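For convenience, the eight categories above as an id-to-name mapping. This mirrors the label list in this card and is assumed to match the `model.config.id2label` shipped with the checkpoint, which remains the canonical source:

```python
# Label mapping as documented in this card (canonical: model.config.id2label).
ID2LABEL = {
    0: "NOT-A-QUESTION",
    1: "FACTOID",
    2: "DEBATE",
    3: "EVIDENCE-BASED",
    4: "INSTRUCTION",
    5: "REASON",
    6: "EXPERIENCE",
    7: "COMPARISON",
}
LABEL2ID = {name: i for i, name in ID2LABEL.items()}
```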
Model Performance
Test Set Results (7,696 examples)
- Overall Accuracy: 88.1%
- Macro-Average F1: 88.1%
- Best Validation F1: 88.1% (achieved at epoch 6)
Per-Category Performance
| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| NOT-A-QUESTION | 0.96 | 0.92 | 0.94 | 950 |
| FACTOID | 0.84 | 0.79 | 0.81 | 980 |
| DEBATE | 0.90 | 0.95 | 0.92 | 916 |
| EVIDENCE-BASED | 0.86 | 0.92 | 0.89 | 950 |
| INSTRUCTION | 0.85 | 0.92 | 0.88 | 980 |
| REASON | 0.88 | 0.86 | 0.87 | 960 |
| EXPERIENCE | 0.82 | 0.76 | 0.79 | 980 |
| COMPARISON | 0.93 | 0.93 | 0.93 | 980 |
Key Observations
- Strongest Performance: NOT-A-QUESTION, COMPARISON, and DEBATE categories (F1 ≥ 0.92)
- Good Performance: EVIDENCE-BASED, INSTRUCTION, and REASON categories (F1 ≥ 0.87)
- Moderate Performance: FACTOID and EXPERIENCE categories (F1 ~ 0.79-0.81)
- The model generalizes well across all 49 languages with balanced test set distribution
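The per-category table is consistent with the headline figures. Recomputing from the rounded per-class values:

```python
# Recompute the macro-F1 and total support from the per-category table.
f1 = {
    "NOT-A-QUESTION": 0.94, "FACTOID": 0.81, "DEBATE": 0.92,
    "EVIDENCE-BASED": 0.89, "INSTRUCTION": 0.88, "REASON": 0.87,
    "EXPERIENCE": 0.79, "COMPARISON": 0.93,
}
support = [950, 980, 916, 950, 980, 960, 980, 980]

macro_f1 = sum(f1.values()) / len(f1)   # ~0.879, matching the reported 88.1% up to rounding
total_support = sum(support)            # should equal the 7,696 test examples
print(f"macro-F1 ≈ {macro_f1:.3f}, support = {total_support}")
```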
Confusion Matrix
The confusion matrix shows the model's prediction patterns across all 8 categories. The diagonal elements represent correct classifications, while off-diagonal elements show misclassifications between categories.
Training Procedure
Hardware
- Training Device: CUDA-enabled GPU (NVIDIA)
- Training Duration: 6 epochs (best checkpoint at the final epoch)
Hyperparameters
{
    "model_name": "xlm-roberta-base",
    "max_length": 128,            # Maximum sequence length
    "batch_size": 16,             # Training batch size
    "learning_rate": 2e-5,        # AdamW learning rate
    "num_epochs": 6,              # Total epochs trained
    "warmup_steps": 500,          # Linear warmup steps
    "weight_decay": 0.01,         # L2 regularization
    "dropout": 0.2,               # Dropout probability
    "optimizer": "AdamW",         # Optimizer
    "scheduler": "linear_warmup", # Learning rate scheduler
    "gradient_clipping": 1.0,     # Max gradient norm
    "random_seed": 42             # Reproducibility
}
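With these settings, the total number of optimizer steps follows directly from the training split size. A quick check, assuming the last partial batch of each epoch is kept:

```python
import math

# Step count implied by the hyperparameters and the training split.
train_examples = 33_602
batch_size = 16
num_epochs = 6

steps_per_epoch = math.ceil(train_examples / batch_size)  # partial final batch counts
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 12606
```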
Training Process
Data Preparation: Pre-split balanced dataset from NFQA Multilingual Dataset
- Training: 33,602 examples (70%)
- Validation: 6,979 examples (15%)
- Test: 7,696 examples (15%)
Preprocessing: Tokenization using XLM-RoBERTa tokenizer (max length: 128 tokens)
Training Strategy: Supervised fine-tuning with stratified train/val/test splits
- Stratified by (language, category) combinations to maintain balance
Optimization: AdamW optimizer with linear warmup and gradient clipping
- Total training steps: 12,606 (33,602 examples × 6 epochs ÷ 16 batch size)
- Warmup steps: 500
Best Model Selection: Model checkpoint with highest validation F1 score (epoch 6)
Evaluation: Comprehensive testing on held-out test set with per-category and per-language analysis
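The scheduler described above ramps the learning rate up linearly for the first 500 steps, then decays it linearly to zero. A minimal sketch, equivalent in shape to `get_linear_schedule_with_warmup` from transformers (the framework's internals may differ in details):

```python
def lr_at(step, base_lr=2e-5, warmup_steps=500, total_steps=12_606):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

# lr is 0 at step 0, peaks at the end of warmup, and returns to 0 at the last step
print(lr_at(0), lr_at(500), lr_at(12_606))
```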
Training Curves
The training curves show the model's learning progress across 6 epochs:
- Left panel: Training and validation loss over time
- Middle panel: Training and validation accuracy progression
- Right panel: Validation F1 score (macro average) with best checkpoint marked
The model trained stably across all 6 epochs, reaching its best validation F1 at the final epoch with minimal overfitting.
Usage
Try it in Google Colab
Test the model instantly in your browser without any setup! The Colab notebook includes examples in multiple languages and demonstrates all classification categories.
Installation
pip install transformers torch
Quick Start
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "AliSalman29/nfqa-multilingual-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# Example questions in different languages
questions = [
    "What is the capital of France?",      # English - FACTOID
    "¿Cómo hacer una tortilla española?",  # Spanish - INSTRUCTION
    "Warum ist der Himmel blau?",          # German - REASON
    "iPhone還是Android更好?",               # Chinese - COMPARISON
]

# Classify questions
for question in questions:
    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = predictions[0][predicted_class].item()

    # Get category name
    category = model.config.id2label[predicted_class]

    print(f"Question: {question}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.2%}\n")
Output Example
Question: What is the capital of France?
Category: FACTOID
Confidence: 94.32%
Question: ¿Cómo hacer una tortilla española?
Category: INSTRUCTION
Confidence: 89.17%
Question: Warum ist der Himmel blau?
Category: REASON
Confidence: 85.63%
Question: iPhone還是Android更好?
Category: COMPARISON
Confidence: 91.24%
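The confidence figures above are simply the softmax probability of the predicted class. In plain Python, without torch, for a hypothetical 8-class logits vector:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for an 8-class prediction (not real model output)
logits = [0.1, 3.2, -0.5, 0.4, 0.2, 0.0, -1.1, 0.3]
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # class with highest probability
print(pred, round(probs[pred], 4))
```

The gap between the top probability and the rest is what the recommendations below suggest monitoring: a flat distribution signals an uncertain prediction even when the arg-max is well defined.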
Batch Processing
def classify_questions_batch(questions, model, tokenizer, batch_size=32):
    """Classify multiple questions efficiently."""
    model.eval()
    results = []
    for i in range(0, len(questions), batch_size):
        batch = questions[i:i + batch_size]

        # Tokenize batch
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            truncation=True,
            max_length=128,
            padding=True,
        )

        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=-1)
            confidences = predictions[range(len(batch)), predicted_classes]

        # Store results
        for j, question in enumerate(batch):
            results.append({
                "question": question,
                "category": model.config.id2label[predicted_classes[j].item()],
                "label_id": predicted_classes[j].item(),
                "confidence": confidences[j].item(),
            })
    return results

# Usage
questions = ["Question 1", "Question 2", ...]
results = classify_questions_batch(questions, model, tokenizer)
Integration with Pipelines
from transformers import pipeline

# Create classification pipeline
classifier = pipeline(
    "text-classification",
    model="AliSalman29/nfqa-multilingual-classifier",
    tokenizer="AliSalman29/nfqa-multilingual-classifier",
    device=0,  # Use GPU if available (0), or -1 for CPU
)

# Classify a single question
result = classifier("How do I learn Python?", truncation=True, max_length=128)
print(result)
# Output: [{'label': 'INSTRUCTION', 'score': 0.91}]

# Classify multiple questions
results = classifier(
    ["What is AI?", "Why do cats purr?", "Best pizza in town?"],
    truncation=True,
    max_length=128,
)
for r in results:
    print(f"{r['label']}: {r['score']:.2%}")
Limitations and Biases
Known Limitations
- Language Imbalance: While supporting 49 languages, the model may perform better on high-resource languages (English, Spanish, French) compared to low-resource languages
- Domain Specificity: Trained primarily on FAQ-style questions; may not generalize perfectly to other question formats (e.g., academic questions, technical queries)
- Category Overlap: Some questions may legitimately belong to multiple categories, but the model outputs a single prediction
- Short Questions: Very short questions (1-2 words) may lack sufficient context for accurate classification
- Context Dependency: The model analyzes questions in isolation without conversational context
Potential Biases
- Annotation Bias: Labels are based on LLM ensemble predictions (Llama 3.1, Gemma 2, Qwen 2.5) rather than human annotations, which may introduce systematic biases from these underlying models
- Training Data Bias: The model inherits biases from the WebFAQ dataset and AI-generated examples
- Language Representation: While the dataset includes 49 languages, some language families may have different performance characteristics
- Category Distribution: The balanced dataset has similar representation across categories (~980 examples each in test set), which may differ from real-world distributions
- Domain Specificity: Trained primarily on FAQ-style and general questions; performance may vary on domain-specific questions
Recommendations for Use
- Use confidence scores to identify uncertain predictions
- Consider ensemble approaches for critical applications
- Validate performance on your specific domain and languages before production deployment
- Implement human review for high-stakes decisions
- Monitor performance across different language groups in your application
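The first two recommendations above can be combined into a simple gate: accept confident predictions automatically and route the rest to human review. A minimal sketch (the 0.7 threshold is an arbitrary example value, it should be tuned on your own validation data):

```python
def route_prediction(category, confidence, threshold=0.7):
    """Accept confident predictions; flag low-confidence ones for review.

    The threshold is illustrative only and should be calibrated per
    deployment, ideally per language group.
    """
    if confidence >= threshold:
        return {"category": category, "needs_review": False}
    return {"category": category, "needs_review": True}

print(route_prediction("FACTOID", 0.94))     # confident: accepted
print(route_prediction("EXPERIENCE", 0.55))  # uncertain: flagged
```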
Ethical Considerations
- Transparency: Users should be informed when interacting with automated classification systems
- Privacy: The model processes text locally and does not store or transmit user queries
- Fairness: Regular audits should be conducted to ensure equitable performance across languages and user groups
- Accountability: Human oversight is recommended for applications affecting user experience or decisions
Citation
If you use this model in your research, please cite:
@misc{nfqa-multilingual-2026,
author = {Ali Salman},
title = {NFQA Multilingual Question Classifier},
year = {2026},
publisher = {HuggingFace},
journal = {HuggingFace Model Hub},
howpublished = {\url{https://huggingface.co/AliSalman29/nfqa-multilingual-classifier}}
}
Please also cite the training dataset:
@dataset{nfqa_multilingual_dataset_2026,
author = {Ali Salman},
title = {NFQA Multilingual Dataset: A Large-Scale Dataset for Non-Factoid Question Classification},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset}}
}
Related Resources
- Training Dataset: NFQA Multilingual Dataset
- WebFAQ Dataset: PaDaS-Lab/webfaq
- XLM-RoBERTa: xlm-roberta-base
Model Card Contact
For questions, feedback, or issues:
- GitHub Issues: https://github.com/Ali-Salman29/nfqa-multilingual-classifier
- Email: salman.khuwaja29@gmail.com
- Organization: University of Passau
Acknowledgments
- Training dataset: NFQA Multilingual Dataset
- Source data: WebFAQ Dataset
- Built on the XLM-RoBERTa foundation model by Meta AI
- Annotation and generation using Llama 3.1, Gemma 2, and Qwen 2.5
Model Version: 1.0
Last Updated: February 2026
Status: Production Ready