# Multilingual Sentence-Type Classifiers (6-Class)
A collection of 7 multilingual text classification models trained to classify sentences into 6 semantic-syntactic types. Each model is exported as ONNX for cross-platform inference.
## Model Details

- **Task:** Multi-class text classification (6 classes)
- **Architecture:** TF-IDF word features (unigrams + bigrams) + language-specific categorical features → LinearSVC
- **Export format:** ONNX (TF-IDF + LinearSVC core) + scikit-learn pickle (full pipeline with categorical features)
- **Training framework:** scikit-learn
- **Test-set performance:** 94% average macro F1 across 7 languages (v0.8.0, with categorical features)
## Languages & Performance
Performance on 15% held-out test split (1,485 samples per language, perfectly balanced).
| Language | Code | TF-IDF only (F1) | + Categorical (F1) | Delta |
|---|---|---|---|---|
| English | en | 0.95 | 0.96 | +0.01 |
| French | fr | 0.94 | 0.94 | — |
| Dutch | nl | 0.93 | 0.93 | — |
| German | de | 0.92 | 0.96 | +0.04 |
| Spanish | es | 0.90 | 0.95 | +0.05 |
| Portuguese | pt | 0.90 | 0.94 | +0.04 |
| Italian | it | 0.90 | 0.94 | +0.04 |
Categorical features bring measurable gains for DE, ES, PT, and IT (+4–5 pp). FR and NL rely more on word-order and morphology cues already captured by the TF-IDF bigrams, so those two languages see no benefit from the rule-based feature layer.
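Macro F1, the metric reported above, is the unweighted mean of per-class F1 scores, so every class counts equally regardless of its support. A minimal pure-Python sketch of the computation (toy labels, not real model output):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall, 0.0 when both are 0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

y_true = ["command", "statement", "wh_question", "statement", "request", "command"]
y_pred = ["command", "statement", "wh_question", "request", "request", "command"]
print(round(macro_f1(y_true, y_pred), 3))  # one statement→request confusion
```

In practice `sklearn.metrics.f1_score(..., average="macro")` computes the same quantity.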
## Label Taxonomy (6 Classes)
| Label | Definition | Examples |
|---|---|---|
| `command` | Direct imperative with no polite framing; verb-initial constructions ordering action. | "Close the door", "Stop talking", "Send me the file" |
| `exclamation` | Expressive, emphatic sentences conveying emotion or emphasis, often with "What a…!" or "How…!" constructions. | "What a beautiful sunset!", "How wonderful!", "That's incredible!" |
| `polar_question` | Yes/no questions seeking binary affirmation or negation, typically via auxiliary inversion or modal forms. | "Do you like coffee?", "Can you help me?", "Is it raining?" |
| `request` | Polite ask using conditional or modal forms ("Could you", "Would you", "Can you", "May I", "Might I"); frames action as an option rather than a command. | "Could you pass the salt?", "Would you mind closing the window?", "May I borrow your pen?" |
| `statement` | Declarative sentences reporting facts, states, or observations with no interrogative or imperative structure. | "The Earth orbits the Sun", "I live in Paris", "She is a doctor" |
| `wh_question` | Open-ended information-seeking questions using wh-words (Who, What, When, Where, Why, How); expects a substantive answer, not a binary response. | "Where are you from?", "What time is it?", "How does photosynthesis work?" |
## Dataset & Training
Training Dataset: TigreGotico/sentence-types-multilingual
- Size: 69,300 sentences (9,900 per language)
- Split: 85% train (8,415 per language), 15% test (1,485 per language)
- Balance: 1,650 examples per class per language (perfectly balanced)
- Languages: English, Spanish, French, German, Italian, Portuguese, Dutch
Methodology:
- Expanded English dataset with hand-authored examples covering diverse contexts
- Applied rule-based label validation to fix classification drift
- Translated to 7 languages using Tower-Plus-2B-GGUF (q4_k_m quantization)
- Trained separate LinearSVC model per language (no transfer learning)
- Added language-specific categorical features (18 per language: intent signals, punctuation counts, lexical metrics, negation) — mutual information pruning kept only features with MI ≥ 0.01 bits
- Exported TF-IDF + LinearSVC core to ONNX via skl2onnx; full pipeline (including categorical features) available as pickle
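Stripped of the categorical features and the ONNX export, the per-language core described above is a standard scikit-learn pipeline. A minimal sketch, with toy sentences standing in for the real 8,415-sentence training split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data: one example per class (illustrative only).
train_texts = [
    "Close the door.", "What time is it?", "Could you pass the salt?",
    "The Earth orbits the Sun.", "Do you like coffee?", "What a beautiful sunset!",
]
train_labels = [
    "command", "wh_question", "request",
    "statement", "polar_question", "exclamation",
]

pipeline = Pipeline([
    # Unigrams + bigrams, as described under Model Details
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svc", LinearSVC()),
])
pipeline.fit(train_texts, train_labels)
print(pipeline.predict(["Where are you from?"]))
```

The real training script additionally concatenates the 18 categorical features onto the TF-IDF matrix before the LinearSVC, which is why the full pipeline ships as a pickle rather than ONNX.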
## Categorical Features
Each language model uses 18 hand-engineered features alongside TF-IDF:
| Group | Features |
|---|---|
| Intent signals | starts_wh, starts_polite, starts_command, starts_exclamation |
| Punctuation | ends_question, ends_exclamation, ends_period, question_mark_count, exclamation_mark_count, ellipsis_count |
| Lexical | sentence_length, unique_token_count, lexical_diversity, avg_token_length |
| Structural | has_polite_words, has_negation, is_short, is_long |
Keyword lists (WH starters, command verbs, polite modals, negation words) are localised per language. See `train/lang/feature_extractors.py` for the full definitions.
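As an illustration of the feature layer, here is a simplified English-style extractor covering a few of the 18 features. The keyword lists and function name below are assumptions for the sketch; the authoritative per-language definitions live in `train/lang/feature_extractors.py`:

```python
# Illustrative English keyword lists (assumed, not the project's actual lists).
WH_STARTERS = ("who", "what", "when", "where", "why", "how")
POLITE_MODALS = ("could you", "would you", "can you", "may i", "might i")
NEGATION_WORDS = {"not", "no", "never"}

def extract_features(text: str) -> dict:
    """Return a subset of the categorical features for one sentence."""
    lower = text.lower().strip()
    tokens = lower.rstrip("?!.").split()  # naive tokenisation for the sketch
    unique = set(tokens)
    return {
        # Intent signals
        "starts_wh": int(lower.startswith(WH_STARTERS)),
        "starts_polite": int(lower.startswith(POLITE_MODALS)),
        # Punctuation
        "ends_question": int(text.rstrip().endswith("?")),
        "ends_exclamation": int(text.rstrip().endswith("!")),
        "question_mark_count": text.count("?"),
        # Lexical
        "sentence_length": len(tokens),
        "unique_token_count": len(unique),
        "lexical_diversity": len(unique) / len(tokens) if tokens else 0.0,
        # Structural
        "has_negation": int(any(t in NEGATION_WORDS for t in tokens)),
    }

print(extract_features("Where are you from?"))
```

These binary and count features are concatenated with the TF-IDF vector before the LinearSVC.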
## Usage

### ONNX Inference (Recommended)
The ONNX model is a full pipeline — TF-IDF featurisation is embedded. Pass raw text strings directly; no vectorizer setup needed.
```python
import json

import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession("sentence_type_EN_0.8.0.onnx")

# Class labels are stored in the model metadata
classes = json.loads(sess.get_modelmeta().custom_metadata_map["classes"])

texts = ["Could you pass the salt?", "What time is it?", "Close the door."]
# onnxruntime expects numpy arrays, not plain Python lists
pred_indices = sess.run(None, {"input": np.array(texts)})[0]
print([classes[i] for i in pred_indices])
# → ['request', 'wh_question', 'command']
```
## License
Apache 2.0
## Citation
```bibtex
@dataset{sentence_types_multilingual_2024,
  author       = {TigreGotico},
  title        = {Multilingual Sentence-Type Classifiers (6-Class)},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jarbas/sentence-types}}
}
```
