# NFQA Multilingual Question Classifier

A multilingual question classification model that categorizes questions into 8 distinct types based on the Non-Factoid Question Answering (NFQA) taxonomy.

## Model Description

This model classifies questions across **49 languages** into **8 categories** of question types, enabling better understanding of user intent and question characteristics for information retrieval and question answering systems.

### Model Details

- **Model Type**: Multilingual Text Classification
- **Base Model**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
- **Languages**: 49 languages (European, Asian, and Middle Eastern languages)
- **Categories**: 8 NFQA question types
- **Parameters**: ~278M
- **Training Date**: January 2026
- **License**: apache-2.0

### Developers

Developed by Ali Salman for research in multilingual question understanding and classification.

### Architecture

The model is based on XLM-RoBERTa (Cross-lingual Language Model - Robustly Optimized BERT Pretraining Approach), a transformer-based multilingual encoder:

- **Base Architecture**: 12-layer transformer encoder
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Parameters**: ~278M
- **Vocabulary Size**: 250,000 tokens (SentencePiece)
- **Pre-training**: Trained on 2.5TB of CommonCrawl data in 100 languages
- **Fine-tuning**: Classification head with dropout (0.2) for 8-class NFQA classification
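
These settings can be double-checked against the published checkpoint by inspecting its configuration. A minimal sketch; the field names follow the standard Hugging Face config for XLM-RoBERTa classifiers:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("AliSalman29/nfqa-multilingual-classifier")

print(config.num_hidden_layers)    # expected: 12
print(config.hidden_size)          # expected: 768
print(config.num_attention_heads)  # expected: 12
print(config.num_labels)           # expected: 8
print(config.id2label)             # label-id to category mapping
```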

## Intended Use

### Primary Use Cases

- **Question Type Classification**: Automatically categorize user questions to route them to appropriate answering systems
- **Search Intent Understanding**: Enhance search engines by understanding the type of information users seek
- **Chatbot Development**: Improve conversational AI by identifying question types
- **FAQ Organization**: Automatically organize FAQ databases by question type
- **Content Recommendation**: Suggest relevant content based on question type

### Out-of-Scope Use

- This model is NOT designed for content moderation or filtering
- Should not be used as the sole decision-maker in high-stakes applications
- Not suitable for detecting malicious intent or harmful content

## Training Data

### Dataset

The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)**, a large-scale multilingual dataset for non-factoid question classification.

**Dataset Composition**:
- **Training**: 33,602 examples (70%)
- **Validation**: 6,979 examples (15%)
- **Test**: 7,696 examples (15%)
- **Total**: 48,277 balanced examples

**Source Distribution**:
- 54% from WebFAQ dataset (annotated with LLM ensemble)
- 46% AI-generated to balance language-category combinations

**Key Features**:
- 392 unique (language, category) combinations
- Target of ~125 examples per combination
- Stratified sampling to ensure balanced representation
- Ensemble annotation using Llama 3.1, Gemma 2, and Qwen 2.5

For detailed information about dataset generation, annotation methodology, and data composition, please visit the [dataset page](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset).
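
The dataset can also be loaded directly with the `datasets` library. A minimal sketch, assuming the standard `train`/`validation`/`test` split names implied by the composition above:

```python
from datasets import load_dataset

# Split names are assumed from the 70/15/15 composition described above
dataset = load_dataset("AliSalman29/nfqa-multilingual-dataset")

print(dataset)              # inspect available splits and features
print(dataset["train"][0])  # peek at a single training example
```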

### Languages Supported

**European Languages** (29): English (en), German (de), French (fr), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl), Polish (pl), Romanian (ro), Czech (cs), Slovak (sk), Bulgarian (bg), Croatian (hr), Serbian (sr), Slovenian (sl), Albanian (sq), Estonian (et), Latvian (lv), Lithuanian (lt), Danish (da), Norwegian (no), Swedish (sv), Finnish (fi), Icelandic (is), Greek (el), Turkish (tr), Ukrainian (uk), Russian (ru), Hungarian (hu)

**Asian Languages** (12): Chinese (zh), Japanese (ja), Korean (ko), Hindi (hi), Bengali (bn), Marathi (mr), Thai (th), Vietnamese (vi), Indonesian (id), Malay (ms), Tagalog/Filipino (tl), Urdu (ur)

**Middle Eastern Languages** (8): Arabic (ar), Persian/Farsi (fa), Hebrew (he), Georgian (ka), Azerbaijani (az), Kazakh (kk), Uzbek (uz)

## Classification Categories

The model classifies questions into 8 distinct categories:

### 1. NOT-A-QUESTION (Label 0)
Statements or phrases that are not actual questions.

**Examples:**
- "Price of dental treatment"
- "Best restaurants nearby"
- "Weather today"

### 2. FACTOID (Label 1)
Questions seeking factual, objective answers (who, what, when, where).

**Examples:**
- "What is the capital of France?"
- "When was the Eiffel Tower built?"
- "Who invented the telephone?"

### 3. DEBATE (Label 2)
Hypothetical, opinion-based, or debatable questions.

**Examples:**
- "Is artificial intelligence dangerous?"
- "Should we colonize Mars?"
- "Is remote work better than office work?"

### 4. EVIDENCE-BASED (Label 3)
Questions about definitions, features, or characteristics.

**Examples:**
- "What are the symptoms of flu?"
- "What features does this phone have?"
- "What is machine learning?"

### 5. INSTRUCTION (Label 4)
How-to questions requiring step-by-step procedural answers.

**Examples:**
- "How do I reset my password?"
- "How to bake chocolate chip cookies?"
- "How can I install Python on Windows?"

### 6. REASON (Label 5)
Why/how questions seeking explanations or reasoning.

**Examples:**
- "Why is the sky blue?"
- "How does photosynthesis work?"
- "Why do birds migrate?"

### 7. EXPERIENCE (Label 6)
Questions seeking personal experiences, recommendations, or advice.

**Examples:**
- "What's the best laptop for students?"
- "Has anyone tried this restaurant?"
- "Which hotel would you recommend?"

### 8. COMPARISON (Label 7)
Questions comparing two or more options.

**Examples:**
- "iPhone vs Android: which is better?"
- "What's the difference between RNA and DNA?"
- "Compare electric and gas cars"

## Model Performance

### Test Set Results (7,696 examples)

- **Overall Accuracy**: 88.1%
- **Macro-Average F1**: 88.1%
- **Best Validation F1**: 88.1% (achieved at epoch 6)

### Per-Category Performance

| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| NOT-A-QUESTION | 0.96 | 0.92 | 0.94 | 950 |
| FACTOID | 0.84 | 0.79 | 0.81 | 980 |
| DEBATE | 0.90 | 0.95 | 0.92 | 916 |
| EVIDENCE-BASED | 0.86 | 0.92 | 0.89 | 950 |
| INSTRUCTION | 0.85 | 0.92 | 0.88 | 980 |
| REASON | 0.88 | 0.86 | 0.87 | 960 |
| EXPERIENCE | 0.82 | 0.76 | 0.79 | 980 |
| COMPARISON | 0.93 | 0.93 | 0.93 | 980 |

### Key Observations

- **Strongest Performance**: NOT-A-QUESTION, COMPARISON, and DEBATE categories (F1 ≥ 0.92)
- **Good Performance**: EVIDENCE-BASED, INSTRUCTION, and REASON categories (F1 ≥ 0.87)
- **Moderate Performance**: FACTOID and EXPERIENCE categories (F1 ~ 0.79-0.81)
- The model generalizes well across all 49 languages; the balanced test-set distribution ensures aggregate metrics are not dominated by any single language
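
A table like the one above can be regenerated from test-set predictions with scikit-learn. A minimal sketch; the toy `y_true`/`y_pred` lists are placeholders for the real gold and predicted label ids over the 7,696 test examples:

```python
from sklearn.metrics import classification_report

# Placeholder labels; replace with the gold test labels (y_true) and the
# model's predicted label ids (y_pred) over the held-out test set.
y_true = [0, 1, 2, 3, 4, 5, 6, 7]
y_pred = [0, 1, 2, 3, 4, 5, 6, 7]

category_names = [
    "NOT-A-QUESTION", "FACTOID", "DEBATE", "EVIDENCE-BASED",
    "INSTRUCTION", "REASON", "EXPERIENCE", "COMPARISON",
]
print(classification_report(y_true, y_pred, target_names=category_names))
```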

### Confusion Matrix

![Confusion Matrix](confusion_matrix.png)

The confusion matrix shows the model's prediction patterns across all 8 categories. The diagonal elements represent correct classifications, while off-diagonal elements show misclassifications between categories.

## Training Procedure

### Hardware

- Training Device: CUDA-enabled GPU (NVIDIA)
- Training Duration: 6 epochs (best checkpoint reached at epoch 6)

### Hyperparameters

```python
{
  "model_name": "xlm-roberta-base",
  "max_length": 128,              # Maximum sequence length
  "batch_size": 16,                # Training batch size
  "learning_rate": 2e-5,           # AdamW learning rate
  "num_epochs": 6,                 # Total epochs trained
  "warmup_steps": 500,             # Linear warmup steps
  "weight_decay": 0.01,            # L2 regularization
  "dropout": 0.2,                  # Dropout probability
  "optimizer": "AdamW",            # Optimizer
  "scheduler": "linear_warmup",    # Learning rate scheduler
  "gradient_clipping": 1.0,        # Max gradient norm
  "random_seed": 42                # Reproducibility
}
```

### Training Process

1. **Data Preparation**: Pre-split balanced dataset from [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
   - Training: 33,602 examples (70%)
   - Validation: 6,979 examples (15%)
   - Test: 7,696 examples (15%)

2. **Preprocessing**: Tokenization using XLM-RoBERTa tokenizer (max length: 128 tokens)

3. **Training Strategy**: Supervised fine-tuning with stratified train/val/test splits
   - Stratified by (language, category) combinations to maintain balance

4. **Optimization**: AdamW optimizer with linear warmup and gradient clipping
   - Total training steps: 12,606 (33,602 examples at batch size 16 ≈ 2,101 steps per epoch × 6 epochs)
   - Warmup steps: 500

5. **Best Model Selection**: Model checkpoint with highest validation F1 score (epoch 6)

6. **Evaluation**: Comprehensive testing on held-out test set with per-category and per-language analysis
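
For reference, a training setup that mirrors these hyperparameters could look roughly like the sketch below. This is not the actual training script: the Hugging Face `Trainer` is assumed, as are the dataset column names (`question`, `label`); argument names follow recent `transformers` releases.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Column names "question" and "label" are assumptions; check the dataset schema.
dataset = load_dataset("AliSalman29/nfqa-multilingual-dataset")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize(batch):
    return tokenizer(batch["question"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Macro-averaged F1, matching the model-selection criterion above
    logits, labels = eval_pred
    return {"f1": f1_score(labels, np.argmax(logits, axis=-1), average="macro")}

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=8, classifier_dropout=0.2
)

args = TrainingArguments(
    output_dir="nfqa-classifier",
    num_train_epochs=6,
    per_device_train_batch_size=16,
    learning_rate=2e-5,           # AdamW is the Trainer default optimizer
    warmup_steps=500,
    weight_decay=0.01,
    max_grad_norm=1.0,            # gradient clipping
    lr_scheduler_type="linear",   # linear decay after warmup
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # keep the checkpoint with the best macro-F1
    metric_for_best_model="f1",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
```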

### Training Curves

![Training Curves](training_curves.png)

The training curves show the model's learning progress across 6 epochs:
- **Left panel**: Training and validation loss over time
- **Middle panel**: Training and validation accuracy progression
- **Right panel**: Validation F1 score (macro average) with best checkpoint marked

The model converged steadily, reaching its best validation F1 at epoch 6 with minimal overfitting.

## Usage

### Try it in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1cxgJVwKbmQFtTzRTeXVtpn7vKj-sdU1n?usp=sharing)

Test the model instantly in your browser without any setup! The Colab notebook includes examples in multiple languages and demonstrates all classification categories.

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "AliSalman29/nfqa-multilingual-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example questions in different languages
questions = [
    "What is the capital of France?",           # English - FACTOID
    "¿Cómo hacer una tortilla española?",       # Spanish - INSTRUCTION
    "Warum ist der Himmel blau?",               # German - REASON
    "iPhone還是Android更好?",                    # Chinese - COMPARISON
]

# Classify questions
for question in questions:
    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=128)

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = predictions[0][predicted_class].item()

    # Get category name
    category = model.config.id2label[predicted_class]

    print(f"Question: {question}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.2%}\n")
```

### Output Example

```
Question: What is the capital of France?
Category: FACTOID
Confidence: 94.32%

Question: ¿Cómo hacer una tortilla española?
Category: INSTRUCTION
Confidence: 89.17%

Question: Warum ist der Himmel blau?
Category: REASON
Confidence: 85.63%

Question: iPhone還是Android更好?
Category: COMPARISON
Confidence: 91.24%
```

### Batch Processing

```python
def classify_questions_batch(questions, model, tokenizer, batch_size=32):
    """Classify multiple questions efficiently"""
    model.eval()
    results = []

    for i in range(0, len(questions), batch_size):
        batch = questions[i:i+batch_size]

        # Tokenize batch
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            truncation=True,
            max_length=128,
            padding=True
        )

        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=-1)
            confidences = predictions[range(len(batch)), predicted_classes]

        # Store results
        for j, question in enumerate(batch):
            results.append({
                'question': question,
                'category': model.config.id2label[predicted_classes[j].item()],
                'label_id': predicted_classes[j].item(),
                'confidence': confidences[j].item()
            })

    return results

# Usage
questions = ["Question 1", "Question 2", ...]
results = classify_questions_batch(questions, model, tokenizer)
```

### Integration with Pipelines

```python
from transformers import pipeline

# Create classification pipeline
classifier = pipeline(
    "text-classification",
    model="AliSalman29/nfqa-multilingual-classifier",
    tokenizer="AliSalman29/nfqa-multilingual-classifier",
    device=0  # 0 = first GPU; use device=-1 to run on CPU
)

# Classify single question
result = classifier("How do I learn Python?", truncation=True, max_length=128)
print(result)
# Output: [{'label': 'INSTRUCTION', 'score': 0.91}]

# Classify multiple questions
results = classifier(
    ["What is AI?", "Why do cats purr?", "Best pizza in town?"],
    truncation=True,
    max_length=128
)
for r in results:
    print(f"{r['label']}: {r['score']:.2%}")
```

## Limitations and Biases

### Known Limitations

1. **Language Imbalance**: While supporting 49 languages, the model may perform better on high-resource languages (English, Spanish, French) compared to low-resource languages
2. **Domain Specificity**: Trained primarily on FAQ-style questions; may not generalize perfectly to other question formats (e.g., academic questions, technical queries)
3. **Category Overlap**: Some questions may legitimately belong to multiple categories, but the model outputs a single prediction
4. **Short Questions**: Very short questions (1-2 words) may lack sufficient context for accurate classification
5. **Context Dependency**: The model analyzes questions in isolation without conversational context

### Potential Biases

- **Annotation Bias**: Labels are based on LLM ensemble predictions (Llama 3.1, Gemma 2, Qwen 2.5) rather than human annotations, which may introduce systematic biases from these underlying models
- **Training Data Bias**: The model inherits biases from the WebFAQ dataset and AI-generated examples
- **Language Representation**: While the dataset includes 49 languages, some language families may have different performance characteristics
- **Category Distribution**: The balanced dataset has similar representation across categories (~960 examples per category in the test set), which may differ from real-world distributions
- **Domain Specificity**: Trained primarily on FAQ-style and general questions; performance may vary on domain-specific questions

### Recommendations for Use

- Use confidence scores to identify uncertain predictions (see the sketch after this list)
- Consider ensemble approaches for critical applications
- Validate performance on your specific domain and languages before production deployment
- Implement human review for high-stakes decisions
- Monitor performance across different language groups in your application
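
A simple confidence gate for the first recommendation might look like the following sketch; the 0.7 threshold is an arbitrary example and should be tuned on your own data:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "AliSalman29/nfqa-multilingual-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def classify_with_fallback(question, threshold=0.7):
    """Return the predicted category, or None when confidence is below threshold."""
    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    confidence, label_id = probs.max(dim=-1)
    if confidence.item() < threshold:
        return None  # route to human review or a fallback system
    return model.config.id2label[label_id.item()]

print(classify_with_fallback("Why is the sky blue?"))  # e.g. REASON
```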

## Ethical Considerations

- **Transparency**: Users should be informed when interacting with automated classification systems
- **Privacy**: The model processes text locally and does not store or transmit user queries
- **Fairness**: Regular audits should be conducted to ensure equitable performance across languages and user groups
- **Accountability**: Human oversight is recommended for applications affecting user experience or decisions

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{nfqa-multilingual-2026,
  author = {Ali Salman},
  title = {NFQA Multilingual Question Classifier},
  year = {2026},
  publisher = {HuggingFace},
  journal = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/AliSalman29/nfqa-multilingual-classifier}}
}
```

Please also cite the training dataset:

```bibtex
@dataset{nfqa_multilingual_dataset_2026,
  author = {Ali Salman},
  title = {NFQA Multilingual Dataset: A Large-Scale Dataset for Non-Factoid Question Classification},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset}}
}
```

## Related Resources

- **Training Dataset**: [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
- **WebFAQ Dataset**: [PaDaS-Lab/webfaq](https://huggingface.co/datasets/PaDaS-Lab/webfaq)
- **XLM-RoBERTa**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)

## Model Card Contact

For questions, feedback, or issues:
- **GitHub Issues**: https://github.com/Ali-Salman29/nfqa-multilingual-classifier
- **Email**: salman.khuwaja29@gmail.com
- **Organization**: University of Passau

## Acknowledgments

- Training dataset: [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
- Source data: [WebFAQ Dataset](https://huggingface.co/datasets/PaDaS-Lab/webfaq)
- Built on the [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) foundation model by Meta AI
- Annotation and generation using Llama 3.1, Gemma 2, and Qwen 2.5

---

- **Model Version**: 1.0
- **Last Updated**: February 2026
- **Status**: Production Ready