# Multilingual Sentence-Type Classifiers (6-Class)
A collection of 7 multilingual text classification models trained to classify sentences into 6 semantic-syntactic types. Each model is exported as ONNX for cross-platform inference.
## Model Details

- **Task:** Multi-class text classification (6 classes)
- **Architecture:** TF-IDF word features (unigrams + bigrams) + language-specific categorical features → LinearSVC
- **Export format:** ONNX (TF-IDF + LinearSVC core) + scikit-learn pickle (full pipeline with categorical features)
- **Training framework:** scikit-learn
- **Test-set performance:** 94% average macro F1 across 7 languages (v0.8.0, with categorical features)
## Languages & Performance
Performance on 15% held-out test split (1,485 samples per language, perfectly balanced).
| Language | Code | TF-IDF only (F1) | + Categorical (F1) | Delta |
|---|---|---|---|---|
| English | en | 0.95 | 0.96 | +0.01 |
| French | fr | 0.94 | 0.94 | — |
| Dutch | nl | 0.93 | 0.93 | — |
| German | de | 0.92 | 0.96 | +0.04 |
| Spanish | es | 0.90 | 0.95 | +0.05 |
| Portuguese | pt | 0.90 | 0.94 | +0.04 |
| Italian | it | 0.90 | 0.94 | +0.04 |
Categorical features bring measurable gains for DE, ES, PT, and IT (+4–5 pp). FR and NL rely more on word-order and morphology cues already captured by the TF-IDF bigrams, so those two languages see no benefit from the rule-based feature layer.
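Macro F1, the metric reported above, is the unweighted mean of per-class F1 scores, so every class counts equally regardless of its support. A minimal pure-Python sketch of the computation (toy labels, not real model output):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall, 0.0 when both are 0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

y_true = ["command", "statement", "wh_question", "statement", "request", "command"]
y_pred = ["command", "statement", "wh_question", "request", "request", "command"]
print(round(macro_f1(y_true, y_pred), 3))  # one statement→request confusion
```

In practice `sklearn.metrics.f1_score(..., average="macro")` computes the same quantity.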
## Label Taxonomy (6 Classes)
| Label | Definition | Examples |
|---|---|---|
| `command` | Direct imperative with no polite framing; verb-initial constructions ordering action. | "Close the door", "Stop talking", "Send me the file" |
| `exclamation` | Expressive, emphatic sentences conveying emotion or emphasis, often with "What a…!" or "How…!" constructions. | "What a beautiful sunset!", "How wonderful!", "That's incredible!" |
| `polar_question` | Yes/no questions seeking binary affirmation or negation, typically via auxiliary inversion or modal forms. | "Do you like coffee?", "Can you help me?", "Is it raining?" |
| `request` | Polite ask using conditional or modal forms ("Could you", "Would you", "Can you", "May I", "Might I"); frames action as an option rather than a command. | "Could you pass the salt?", "Would you mind closing the window?", "May I borrow your pen?" |
| `statement` | Declarative sentences reporting facts, states, or observations with no interrogative or imperative structure. | "The Earth orbits the Sun", "I live in Paris", "She is a doctor" |
| `wh_question` | Open-ended information-seeking questions using wh-words (Who, What, When, Where, Why, How); expects a substantive answer, not a binary response. | "Where are you from?", "What time is it?", "How does photosynthesis work?" |
## Dataset & Training
Training Dataset: TigreGotico/sentence-types-multilingual
- Size: 69,300 sentences (9,900 per language)
- Split: 85% train (8,415 per language), 15% test (1,485 per language)
- Balance: 1,650 examples per class per language (perfectly balanced)
- Languages: English, Spanish, French, German, Italian, Portuguese, Dutch
Methodology:
- Expanded English dataset with hand-authored examples covering diverse contexts
- Applied rule-based label validation to fix classification drift
- Translated to 7 languages using Tower-Plus-2B-GGUF (q4_k_m quantization)
- Trained separate LinearSVC model per language (no transfer learning)
- Added language-specific categorical features (18 per language: intent signals, punctuation counts, lexical metrics, negation) — mutual information pruning kept only features with MI ≥ 0.01 bits
- Exported TF-IDF + LinearSVC core to ONNX via skl2onnx; full pipeline (including categorical features) available as pickle
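Stripped of the categorical features and the ONNX export, the per-language core described above is a standard scikit-learn pipeline. A minimal sketch, with toy sentences standing in for the real 8,415-sentence training split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data: one example per class (illustrative only).
train_texts = [
    "Close the door.", "What time is it?", "Could you pass the salt?",
    "The Earth orbits the Sun.", "Do you like coffee?", "What a beautiful sunset!",
]
train_labels = [
    "command", "wh_question", "request",
    "statement", "polar_question", "exclamation",
]

pipeline = Pipeline([
    # Unigrams + bigrams, as described under Model Details
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svc", LinearSVC()),
])
pipeline.fit(train_texts, train_labels)
print(pipeline.predict(["Where are you from?"]))
```

The real training script additionally concatenates the 18 categorical features onto the TF-IDF matrix before the LinearSVC, which is why the full pipeline ships as a pickle rather than ONNX.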
## Categorical Features
Each language model uses 18 hand-engineered features alongside TF-IDF:
| Group | Features |
|---|---|
| Intent signals | starts_wh, starts_polite, starts_command, starts_exclamation |
| Punctuation | ends_question, ends_exclamation, ends_period, question_mark_count, exclamation_mark_count, ellipsis_count |
| Lexical | sentence_length, unique_token_count, lexical_diversity, avg_token_length |
| Structural | has_polite_words, has_negation, is_short, is_long |
Keyword lists (WH starters, command verbs, polite modals, negation words) are localised per language. See `train/lang/feature_extractors.py` for the full definitions.
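As an illustration of the feature layer, here is a simplified English-style extractor covering a few of the 18 features. The keyword lists and function name below are assumptions for the sketch; the authoritative per-language definitions live in `train/lang/feature_extractors.py`:

```python
# Illustrative English keyword lists (assumed, not the project's actual lists).
WH_STARTERS = ("who", "what", "when", "where", "why", "how")
POLITE_MODALS = ("could you", "would you", "can you", "may i", "might i")
NEGATION_WORDS = {"not", "no", "never"}

def extract_features(text: str) -> dict:
    """Return a subset of the categorical features for one sentence."""
    lower = text.lower().strip()
    tokens = lower.rstrip("?!.").split()  # naive tokenisation for the sketch
    unique = set(tokens)
    return {
        # Intent signals
        "starts_wh": int(lower.startswith(WH_STARTERS)),
        "starts_polite": int(lower.startswith(POLITE_MODALS)),
        # Punctuation
        "ends_question": int(text.rstrip().endswith("?")),
        "ends_exclamation": int(text.rstrip().endswith("!")),
        "question_mark_count": text.count("?"),
        # Lexical
        "sentence_length": len(tokens),
        "unique_token_count": len(unique),
        "lexical_diversity": len(unique) / len(tokens) if tokens else 0.0,
        # Structural
        "has_negation": int(any(t in NEGATION_WORDS for t in tokens)),
    }

print(extract_features("Where are you from?"))
```

These binary and count features are concatenated with the TF-IDF vector before the LinearSVC.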
## Usage

### ONNX Inference (Recommended)
The ONNX model is a full pipeline — TF-IDF featurisation is embedded. Pass raw text strings directly; no vectorizer setup needed.
```python
import json

import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession("sentence_type_EN_0.8.0.onnx")

# Class labels are stored in the model metadata
classes = json.loads(sess.get_modelmeta().custom_metadata_map["classes"])

texts = ["Could you pass the salt?", "What time is it?", "Close the door."]
# onnxruntime expects numpy arrays, not plain Python lists
pred_indices = sess.run(None, {"input": np.array(texts)})[0]
print([classes[i] for i in pred_indices])
# → ['request', 'wh_question', 'command']
```
## License
Apache 2.0
## Citation
```bibtex
@dataset{sentence_types_multilingual_2024,
  author       = {TigreGotico},
  title        = {Multilingual Sentence-Type Classifiers (6-Class)},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jarbas/sentence-types}}
}
```
