
NFQA Multilingual Question Classifier

A multilingual question classification model that categorizes questions into 8 distinct types based on the Non-Factoid Question Answering (NFQA) taxonomy.

Model Description

This model classifies questions across 49 languages into 8 categories of question types, enabling better understanding of user intent and question characteristics for information retrieval and question answering systems.

Model Details

  • Model Type: Multilingual Text Classification
  • Base Model: xlm-roberta-base
  • Languages: 49 (European, Asian, and Middle Eastern)
  • Categories: 8 NFQA question types
  • Parameters: ~278M parameters
  • Training Date: January 2026
  • License: apache-2.0

Developers

Developed by Ali Salman for research in multilingual question understanding and classification.

Architecture

The model is based on XLM-RoBERTa (Cross-lingual Language Model - Robustly Optimized BERT Approach), a transformer-based multilingual encoder:

  • Base Architecture: 12-layer transformer encoder
  • Hidden Size: 768
  • Attention Heads: 12
  • Parameters: ~278M
  • Vocabulary Size: 250,000 tokens (SentencePiece)
  • Pre-training: Trained on 2.5TB of CommonCrawl data in 100 languages
  • Fine-tuning: Classification head with dropout (0.2) for 8-class NFQA classification
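
The ~278M figure can be sanity-checked from the sizes above. Below is a rough back-of-the-envelope count in Python; the intermediate feed-forward size (3072) and position-embedding count (514) are standard XLM-RoBERTa-base values assumed here, not stated in this card, and the released checkpoint also carries a pooler layer and a slightly larger vocabulary, so the estimate lands just under the quoted ~278M:

```python
# Back-of-the-envelope parameter count from the architecture numbers above.
hidden = 768
layers = 12
vocab = 250_000          # SentencePiece vocabulary (approximate, per the card)
ffn = 4 * hidden         # 3072, the standard intermediate size (assumed)
max_pos = 514            # XLM-R position embeddings (assumed)

# Embedding tables (word + position + token-type) plus their LayerNorm
embeddings = (vocab + max_pos + 1) * hidden + 2 * hidden

# One encoder layer: Q/K/V/O projections, feed-forward, two LayerNorms
attention = 4 * (hidden * hidden + hidden)
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
layer_norms = 2 * 2 * hidden
per_layer = attention + feed_forward + layer_norms

total = embeddings + layers * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # → ~277M parameters
```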

Intended Use

Primary Use Cases

  • Question Type Classification: Automatically categorize user questions to route them to appropriate answering systems
  • Search Intent Understanding: Enhance search engines by understanding the type of information users seek
  • Chatbot Development: Improve conversational AI by identifying question types
  • FAQ Organization: Automatically organize FAQ databases by question type
  • Content Recommendation: Suggest relevant content based on question type

Out-of-Scope Use

  • This model is NOT designed for content moderation or filtering
  • Should not be used as the sole decision-maker in high-stakes applications
  • Not suitable for detecting malicious intent or harmful content

Training Data

Dataset

The model was trained on the NFQA Multilingual Dataset, a large-scale multilingual dataset for non-factoid question classification.

Dataset Composition:

  • Training: 33,602 examples (70%)
  • Validation: 6,979 examples (15%)
  • Test: 7,696 examples (15%)
  • Total: 48,277 balanced examples

Source Distribution:

  • 54% from WebFAQ dataset (annotated with LLM ensemble)
  • 46% AI-generated to balance language-category combinations

Key Features:

  • 392 unique (language, category) combinations
  • Target of ~125 examples per combination
  • Stratified sampling to ensure balanced representation
  • Ensemble annotation using Llama 3.1, Gemma 2, and Qwen 2.5

For detailed information about dataset generation, annotation methodology, and data composition, please visit the dataset page.
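
The stratified sampling described above can be sketched in plain Python. The `stratified_split` helper below is illustrative (not the dataset's actual build script): it splits every (language, category) group 70/15/15 so that each combination appears in all three splits:

```python
import random
from collections import defaultdict

def stratified_split(examples, train=0.70, val=0.15, seed=42):
    """Split examples 70/15/15 while keeping every
    (language, category) combination represented in each split.
    `examples` is a list of dicts with 'language' and 'category' keys."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[(ex["language"], ex["category"])].append(ex)

    splits = {"train": [], "validation": [], "test": []}
    for group in groups.values():
        rng.shuffle(group)
        n_train = int(len(group) * train)
        n_val = int(len(group) * val)
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        splits["test"].extend(group[n_train + n_val:])
    return splits

# Toy example: 2 languages x 2 categories, 20 examples per combination
data = [{"language": lang, "category": cat, "text": f"q{i}"}
        for lang in ("en", "de") for cat in ("FACTOID", "REASON")
        for i in range(20)]
splits = stratified_split(data)
print(len(splits["train"]), len(splits["validation"]), len(splits["test"]))
# → 56 12 12
```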

Languages Supported

European Languages (29): English (en), German (de), French (fr), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl), Polish (pl), Romanian (ro), Czech (cs), Slovak (sk), Bulgarian (bg), Croatian (hr), Serbian (sr), Slovenian (sl), Albanian (sq), Estonian (et), Latvian (lv), Lithuanian (lt), Danish (da), Norwegian (no), Swedish (sv), Finnish (fi), Icelandic (is), Greek (el), Turkish (tr), Ukrainian (uk), Russian (ru), Hungarian (hu)

Asian Languages (12): Chinese (zh), Japanese (ja), Korean (ko), Hindi (hi), Bengali (bn), Marathi (mr), Thai (th), Vietnamese (vi), Indonesian (id), Malay (ms), Tagalog/Filipino (tl), Urdu (ur)

Middle Eastern Languages (8): Arabic (ar), Persian/Farsi (fa), Hebrew (he), Georgian (ka), Azerbaijani (az), Kazakh (kk), Uzbek (uz)

Classification Categories

The model classifies questions into 8 distinct categories:

1. NOT-A-QUESTION (Label 0)

Statements or phrases that are not actual questions.

Examples:

  • "Price of dental treatment"
  • "Best restaurants nearby"
  • "Weather today"

2. FACTOID (Label 1)

Questions seeking factual, objective answers (who, what, when, where).

Examples:

  • "What is the capital of France?"
  • "When was the Eiffel Tower built?"
  • "Who invented the telephone?"

3. DEBATE (Label 2)

Hypothetical, opinion-based, or debatable questions.

Examples:

  • "Is artificial intelligence dangerous?"
  • "Should we colonize Mars?"
  • "Is remote work better than office work?"

4. EVIDENCE-BASED (Label 3)

Questions about definitions, features, or characteristics.

Examples:

  • "What are the symptoms of flu?"
  • "What features does this phone have?"
  • "What is machine learning?"

5. INSTRUCTION (Label 4)

How-to questions requiring step-by-step procedural answers.

Examples:

  • "How do I reset my password?"
  • "How to bake chocolate chip cookies?"
  • "How can I install Python on Windows?"

6. REASON (Label 5)

Why/how questions seeking explanations or reasoning.

Examples:

  • "Why is the sky blue?"
  • "How does photosynthesis work?"
  • "Why do birds migrate?"

7. EXPERIENCE (Label 6)

Questions seeking personal experiences, recommendations, or advice.

Examples:

  • "What's the best laptop for students?"
  • "Has anyone tried this restaurant?"
  • "Which hotel would you recommend?"

8. COMPARISON (Label 7)

Questions comparing two or more options.

Examples:

  • "iPhone vs Android: which is better?"
  • "What's the difference between RNA and DNA?"
  • "Compare electric and gas cars"
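
The eight labels above map to integer ids 0-7. A minimal reference mapping, built from the label names and ids listed in this section (the checkpoint's `config.id2label` should agree):

```python
# Label ids as listed in the Classification Categories section
ID2LABEL = {
    0: "NOT-A-QUESTION",
    1: "FACTOID",
    2: "DEBATE",
    3: "EVIDENCE-BASED",
    4: "INSTRUCTION",
    5: "REASON",
    6: "EXPERIENCE",
    7: "COMPARISON",
}
LABEL2ID = {name: i for i, name in ID2LABEL.items()}

print(ID2LABEL[4])         # → INSTRUCTION
print(LABEL2ID["REASON"])  # → 5
```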

Model Performance

Test Set Results (7,696 examples)

  • Overall Accuracy: 88.1%
  • Macro-Average F1: 88.1%
  • Best Validation F1: 88.1% (achieved at epoch 6)

Per-Category Performance

Category         Precision  Recall  F1-Score  Support
NOT-A-QUESTION        0.96    0.92      0.94      950
FACTOID               0.84    0.79      0.81      980
DEBATE                0.90    0.95      0.92      916
EVIDENCE-BASED        0.86    0.92      0.89      950
INSTRUCTION           0.85    0.92      0.88      980
REASON                0.88    0.86      0.87      960
EXPERIENCE            0.82    0.76      0.79      980
COMPARISON            0.93    0.93      0.93      980

Key Observations

  • Strongest Performance: NOT-A-QUESTION, COMPARISON, and DEBATE categories (F1 ≥ 0.92)
  • Good Performance: EVIDENCE-BASED, INSTRUCTION, and REASON categories (F1 ≥ 0.87)
  • Moderate Performance: FACTOID and EXPERIENCE categories (F1 ~ 0.79-0.81)
  • The model generalizes well across all 49 languages with balanced test set distribution
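
The reported macro-average F1 can be sanity-checked from the per-category table: it is the unweighted mean of the eight F1 scores, which comes to ~0.879 from the rounded two-decimal values, consistent with the reported 88.1% once per-class rounding is accounted for:

```python
# Per-category F1 scores from the table above (rounded to 2 decimals)
f1_scores = {
    "NOT-A-QUESTION": 0.94, "FACTOID": 0.81, "DEBATE": 0.92,
    "EVIDENCE-BASED": 0.89, "INSTRUCTION": 0.88, "REASON": 0.87,
    "EXPERIENCE": 0.79, "COMPARISON": 0.93,
}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(f"Macro-F1 from rounded per-class scores: {macro_f1:.3f}")  # → 0.879
```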

Confusion Matrix


The confusion matrix shows the model's prediction patterns across all 8 categories. The diagonal elements represent correct classifications, while off-diagonal elements show misclassifications between categories.

Training Procedure

Hardware

  • Training Device: CUDA-enabled GPU (NVIDIA)
  • Training Time: 6 epochs to reach best performance

Hyperparameters

{
  "model_name": "xlm-roberta-base",
  "max_length": 128,              # Maximum sequence length
  "batch_size": 16,                # Training batch size
  "learning_rate": 2e-5,           # AdamW learning rate
  "num_epochs": 6,                 # Total epochs trained
  "warmup_steps": 500,             # Linear warmup steps
  "weight_decay": 0.01,            # L2 regularization
  "dropout": 0.2,                  # Dropout probability
  "optimizer": "AdamW",            # Optimizer
  "scheduler": "linear_warmup",    # Learning rate scheduler
  "gradient_clipping": 1.0,        # Max gradient norm
  "random_seed": 42                # Reproducibility
}

Training Process

  1. Data Preparation: Pre-split balanced dataset from NFQA Multilingual Dataset

    • Training: 33,602 examples (70%)
    • Validation: 6,979 examples (15%)
    • Test: 7,696 examples (15%)
  2. Preprocessing: Tokenization using XLM-RoBERTa tokenizer (max length: 128 tokens)

  3. Training Strategy: Supervised fine-tuning with stratified train/val/test splits

    • Stratified by (language, category) combinations to maintain balance
  4. Optimization: AdamW optimizer with linear warmup and gradient clipping

    • Total training steps: 12,606 (33,602 examples × 6 epochs ÷ 16 batch size)
    • Warmup steps: 500
  5. Best Model Selection: Model checkpoint with highest validation F1 score (epoch 6)

  6. Evaluation: Comprehensive testing on held-out test set with per-category and per-language analysis
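
The step arithmetic and the linear warmup/decay schedule referenced above can be sketched in plain Python. This shows the shape of the schedule under the stated hyperparameters, not the exact implementation used in training:

```python
import math

# Values from the Hyperparameters and Training Process sections
examples, batch_size, epochs = 33_602, 16, 6
warmup_steps, peak_lr = 500, 2e-5

steps_per_epoch = math.ceil(examples / batch_size)  # 2101 (last batch is partial)
total_steps = steps_per_epoch * epochs
print(total_steps)  # → 12606

def lr_at(step):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

print(f"{lr_at(0):.1e}  {lr_at(warmup_steps):.1e}  {lr_at(total_steps):.1e}")
```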

Training Curves


The training curves show the model's learning progress across 6 epochs:

  • Left panel: Training and validation loss over time
  • Middle panel: Training and validation accuracy progression
  • Right panel: Validation F1 score (macro average) with best checkpoint marked

The model improved steadily throughout training, reaching its best validation F1 at the final epoch (epoch 6) with minimal overfitting.

Usage

Try it in Google Colab


Test the model instantly in your browser without any setup! The Colab notebook includes examples in multiple languages and demonstrates all classification categories.

Installation

pip install transformers torch

Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "AliSalman29/nfqa-multilingual-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example questions in different languages
questions = [
    "What is the capital of France?",           # English - FACTOID
    "¿Cómo hacer una tortilla española?",       # Spanish - INSTRUCTION
    "Warum ist der Himmel blau?",               # German - REASON
    "iPhone還是Android更好?",                    # Chinese - COMPARISON
]

# Classify questions
for question in questions:
    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=128)

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = predictions[0][predicted_class].item()

    # Get category name
    category = model.config.id2label[predicted_class]

    print(f"Question: {question}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.2%}\n")

Output Example

Question: What is the capital of France?
Category: FACTOID
Confidence: 94.32%

Question: ¿Cómo hacer una tortilla española?
Category: INSTRUCTION
Confidence: 89.17%

Question: Warum ist der Himmel blau?
Category: REASON
Confidence: 85.63%

Question: iPhone還是Android更好?
Category: COMPARISON
Confidence: 91.24%

Batch Processing

def classify_questions_batch(questions, model, tokenizer, batch_size=32):
    """Classify multiple questions efficiently"""
    model.eval()
    results = []

    for i in range(0, len(questions), batch_size):
        batch = questions[i:i+batch_size]

        # Tokenize batch
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            truncation=True,
            max_length=128,
            padding=True
        )

        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=-1)
            confidences = predictions[range(len(batch)), predicted_classes]

        # Store results
        for j, question in enumerate(batch):
            results.append({
                'question': question,
                'category': model.config.id2label[predicted_classes[j].item()],
                'label_id': predicted_classes[j].item(),
                'confidence': confidences[j].item()
            })

    return results

# Usage
questions = ["Question 1", "Question 2"]  # replace with your own questions
results = classify_questions_batch(questions, model, tokenizer)

Integration with Pipelines

from transformers import pipeline

# Create classification pipeline
classifier = pipeline(
    "text-classification",
    model="AliSalman29/nfqa-multilingual-classifier",
    tokenizer="AliSalman29/nfqa-multilingual-classifier",
    device=0  # Use GPU if available (0), or -1 for CPU
)

# Classify single question
result = classifier("How do I learn Python?", truncation=True, max_length=128)
print(result)
# Output: [{'label': 'INSTRUCTION', 'score': 0.91}]

# Classify multiple questions
results = classifier(
    ["What is AI?", "Why do cats purr?", "Best pizza in town?"],
    truncation=True,
    max_length=128
)
for r in results:
    print(f"{r['label']}: {r['score']:.2%}")

Limitations and Biases

Known Limitations

  1. Language Imbalance: While supporting 49 languages, the model may perform better on high-resource languages (English, Spanish, French) compared to low-resource languages
  2. Domain Specificity: Trained primarily on FAQ-style questions; may not generalize perfectly to other question formats (e.g., academic questions, technical queries)
  3. Category Overlap: Some questions may legitimately belong to multiple categories, but the model outputs a single prediction
  4. Short Questions: Very short questions (1-2 words) may lack sufficient context for accurate classification
  5. Context Dependency: The model analyzes questions in isolation without conversational context

Potential Biases

  • Annotation Bias: Labels are based on LLM ensemble predictions (Llama 3.1, Gemma 2, Qwen 2.5) rather than human annotations, which may introduce systematic biases from these underlying models
  • Training Data Bias: The model inherits biases from the WebFAQ dataset and AI-generated examples
  • Language Representation: While the dataset includes 49 languages, some language families may have different performance characteristics
  • Category Distribution: The balanced dataset has similar representation across categories (~980 examples each in test set), which may differ from real-world distributions
  • Domain Specificity: Trained primarily on FAQ-style and general questions; performance may vary on domain-specific questions

Recommendations for Use

  • Use confidence scores to identify uncertain predictions
  • Consider ensemble approaches for critical applications
  • Validate performance on your specific domain and languages before production deployment
  • Implement human review for high-stakes decisions
  • Monitor performance across different language groups in your application
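
The first recommendation can be made concrete with a simple threshold check. The 0.70 cutoff and the "human_review" route below are illustrative, not values from this model card; tune the threshold on your own validation data:

```python
def route_prediction(label, confidence, threshold=0.70):
    """Route a classified question, deferring low-confidence predictions
    to human review. Threshold is illustrative; calibrate per deployment."""
    if confidence < threshold:
        return "human_review"
    return label

print(route_prediction("INSTRUCTION", 0.91))  # → INSTRUCTION
print(route_prediction("FACTOID", 0.52))      # → human_review
```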

Ethical Considerations

  • Transparency: Users should be informed when interacting with automated classification systems
  • Privacy: The model processes text locally and does not store or transmit user queries
  • Fairness: Regular audits should be conducted to ensure equitable performance across languages and user groups
  • Accountability: Human oversight is recommended for applications affecting user experience or decisions

Citation

If you use this model in your research, please cite:

@misc{nfqa-multilingual-2026,
  author = {Ali Salman},
  title = {NFQA Multilingual Question Classifier},
  year = {2026},
  publisher = {HuggingFace},
  journal = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/AliSalman29/nfqa-multilingual-classifier}}
}

Please also cite the training dataset:

@dataset{nfqa_multilingual_dataset_2026,
  author = {Ali Salman},
  title = {NFQA Multilingual Dataset: A Large-Scale Dataset for Non-Factoid Question Classification},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset}}
}

  • Model Version: 1.0
  • Last Updated: February 2026
  • Status: Production Ready
