Multilingual Sentence-Type Classifiers (6-Class)

A collection of 7 multilingual text classification models trained to classify sentences into 6 semantic-syntactic types. Each model is exported as ONNX for cross-platform inference.

Model Details

  • Task: Multi-class text classification (6 classes)
  • Architecture: TF-IDF word features (unigrams + bigrams) + language-specific categorical features → LinearSVC
  • Export format: ONNX (TF-IDF + LinearSVC core) + scikit-learn pickle (full pipeline with categorical features)
  • Training framework: scikit-learn
  • Test-set performance: 94% average macro F1 across 7 languages (v0.8.0 with categorical features)

Languages & Performance

Performance on the 15% held-out test split (1,485 samples per language, perfectly balanced).

Language     Code   TF-IDF only (F1)   + Categorical (F1)   Delta
English      en     0.95               0.96                 +0.01
French       fr     0.94               0.94                 +0.00
Dutch        nl     0.93               0.93                 +0.00
German       de     0.92               0.96                 +0.04
Spanish      es     0.90               0.95                 +0.05
Portuguese   pt     0.90               0.94                 +0.04
Italian      it     0.90               0.94                 +0.04

Categorical feature impact

Categorical features bring measurable gains for DE, ES, PT, and IT (+4–5 pp). FR and NL rely more on word order and morphology already captured by the TF-IDF bigrams, so those two see no benefit from the rule-based feature layer.
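The gain comes from concatenating the dense categorical columns onto the sparse TF-IDF matrix before the SVM. A minimal sketch of that combination, with toy data and a two-feature extractor invented for illustration (the real models use 18 features per language):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration
texts = ["Could you help me?", "Close the door.", "What time is it?", "It is raining."]
labels = ["request", "command", "wh_question", "statement"]

# Two stand-in categorical features (the real models use 18 per language)
def tiny_features(text):
    first = text.lower().split()[0]
    return [float(text.rstrip().endswith("?")),
            float(first in {"who", "what", "when", "where", "why", "how"})]

tfidf = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams, as in the models
X_text = tfidf.fit_transform(texts)          # sparse TF-IDF matrix
X_cat = csr_matrix(np.array([tiny_features(t) for t in texts]))

# Concatenate the categorical columns onto the sparse TF-IDF columns
X = hstack([X_text, X_cat])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```

Because `hstack` keeps the result sparse, adding a handful of rule-based columns costs almost nothing at train or inference time.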

Label Taxonomy (6 Classes)

  • command: Direct imperative with no polite framing; verb-initial constructions ordering action. Examples: "Close the door", "Stop talking", "Send me the file"
  • exclamation: Expressive, emphatic sentences conveying emotion or emphasis, often with "What a…!" or "How…!" constructions. Examples: "What a beautiful sunset!", "How wonderful!", "That's incredible!"
  • polar_question: Yes/no questions seeking binary affirmation or negation, typically via auxiliary inversion or modal forms. Examples: "Do you like coffee?", "Can you help me?", "Is it raining?"
  • request: Polite ask using conditional or modal forms ("Could you", "Would you", "Can you", "May I", "Might I"); frames the action as an option rather than a command. Examples: "Could you pass the salt?", "Would you mind closing the window?", "May I borrow your pen?"
  • statement: Declarative sentences reporting facts, states, or observations with no interrogative or imperative structure. Examples: "The Earth orbits the Sun", "I live in Paris", "She is a doctor"
  • wh_question: Open-ended information-seeking questions using wh-words (Who, What, When, Where, Why, How); expects a substantive answer, not a binary response. Examples: "Where are you from?", "What time is it?", "How does photosynthesis work?"

Dataset & Training

Training Dataset: TigreGotico/sentence-types-multilingual

  • Size: 69,300 sentences (9,900 per language)
  • Split: 85% train (8,415 per language), 15% test (1,485 per language)
  • Balance: 1,650 examples per class per language (perfectly balanced)
  • Languages: English, Spanish, French, German, Italian, Portuguese, Dutch
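The 85/15 split can be reproduced with a stratified split so both partitions stay perfectly balanced per class. A sketch on stand-in data (the counts and random seed here are illustrative):

```python
from sklearn.model_selection import train_test_split

# Stand-in for one language's data: 6 classes, 100 examples each (the real set has 1,650)
labels = ["command", "exclamation", "polar_question",
          "request", "statement", "wh_question"] * 100
texts = [f"example {i}" for i in range(len(labels))]

# 85/15 split, stratified so the test set keeps an equal share of every class
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42)
print(len(X_train), len(X_test))  # 510 90
```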

Methodology:

  1. Expanded English dataset with hand-authored examples covering diverse contexts
  2. Applied rule-based label validation to fix classification drift
  3. Translated the English data into the remaining six languages using Tower-Plus-2B-GGUF (q4_k_m quantization)
  4. Trained separate LinearSVC model per language (no transfer learning)
  5. Added language-specific categorical features (18 per language: intent signals, punctuation counts, lexical metrics, negation) — mutual information pruning kept only features with MI ≥ 0.01 bits
  6. Exported TF-IDF + LinearSVC core to ONNX via skl2onnx; full pipeline (including categorical features) available as pickle
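Steps 4 and 6 amount to a per-language scikit-learn pipeline of TfidfVectorizer and LinearSVC. A minimal sketch on invented toy data (the real training uses the full dataset and adds the categorical features):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy per-language training set, invented for illustration
texts = ["Close the door.", "What a day!", "Is it raining?",
         "Could you help me?", "I live in Paris.", "Where are you from?"]
labels = ["command", "exclamation", "polar_question",
          "request", "statement", "wh_question"]

# One independent model per language (step 4): TF-IDF unigrams + bigrams into LinearSVC
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svc", LinearSVC()),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["Would you mind closing the window?"]))
```

A pipeline like this can then be serialized with skl2onnx's `convert_sklearn` given a `StringTensorType` input, which matches the export path described in step 6.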

Categorical Features

Each language model uses 18 hand-engineered features alongside TF-IDF:

  • Intent signals: starts_wh, starts_polite, starts_command, starts_exclamation
  • Punctuation: ends_question, ends_exclamation, ends_period, question_mark_count, exclamation_mark_count, ellipsis_count
  • Lexical: sentence_length, unique_token_count, lexical_diversity, avg_token_length
  • Structural: has_polite_words, has_negation, is_short, is_long

Keyword lists (WH starters, command verbs, polite modals, negation words) are localised per language. See train/lang/feature_extractors.py for the full definitions.
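As an illustration of the feature layer, here is a hypothetical English extractor covering a subset of the 18 features. The keyword lists below are invented stand-ins, not the localised lists from train/lang/feature_extractors.py:

```python
import re

# Invented English keyword stand-ins; the real localised lists live in
# train/lang/feature_extractors.py
WH_WORDS = {"who", "what", "when", "where", "why", "how", "which"}
POLITE_STARTERS = {"could", "would", "may", "might", "can", "please"}
NEGATION_WORDS = {"not", "no", "never", "none", "nobody", "nothing"}

def extract_features(text: str) -> dict:
    """Compute a subset of the categorical features for one sentence."""
    tokens = re.findall(r"[\w']+", text.lower())
    first = tokens[0] if tokens else ""
    n = len(tokens)
    return {
        "starts_wh": first in WH_WORDS,
        "starts_polite": first in POLITE_STARTERS,
        "ends_question": text.rstrip().endswith("?"),
        "ends_exclamation": text.rstrip().endswith("!"),
        "question_mark_count": text.count("?"),
        "exclamation_mark_count": text.count("!"),
        "sentence_length": n,
        "unique_token_count": len(set(tokens)),
        "lexical_diversity": len(set(tokens)) / n if n else 0.0,
        "has_negation": any(t in NEGATION_WORDS or t.endswith("n't") for t in tokens),
        "is_short": n <= 4,
    }

print(extract_features("Could you pass the salt?"))
```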

Usage

ONNX Inference (Recommended)

The ONNX model is a full pipeline — TF-IDF featurisation is embedded. Pass raw text strings directly; no vectorizer setup needed.

```python
import onnxruntime as rt
import json

sess = rt.InferenceSession("sentence_type_EN_0.8.0.onnx")

# Class labels are stored in model metadata
classes = json.loads(sess.get_modelmeta().custom_metadata_map["classes"])

texts = ["Could you pass the salt?", "What time is it?", "Close the door."]
pred_indices = sess.run(None, {"input": texts})[0]
print([classes[i] for i in pred_indices])
# → ['request', 'wh_question', 'command']
```

License

Apache 2.0

Citation

@dataset{sentence_types_multilingual_2024,
  author = {TigreGotico},
  title = {Multilingual Sentence-Type Classifiers (6-Class)},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jarbas/sentence-types}}
}