Colloquial-vs-Technical Query Classifier (SetFit → ONNX)

A binary classifier that decides whether a study query is phrased in colloquial (lay) language, meaning everyday English, lay synonyms, or partial vocabulary, or in technical (canonical) language that uses the field's named terminology. It is designed to gate query expansion in Hybrid RAG retrieval: expand the BM25 query with canonical synonyms when the user asks colloquially, and skip expansion when the user already uses the canonical term.


Base model	`sentence-transformers/paraphrase-MiniLM-L3-v2` (17M params, 384-dim, 3 transformer layers)
Training framework	SetFit 1.1.3 (contrastive fine-tuning + sklearn LR head)
Exported as	ONNX (encoder) + JSON (LR head) via `optimum.exporters.onnx`
Held-out test F1	0.994 (P=0.996, R=0.992, acc=0.994 on 480 examples)
Domains covered	24 (STEM + law + humanities + arts)
Training data size	2,399 unique labeled queries (1,919 train / 480 test)
Inference deps	`onnxruntime`, `transformers` (for tokenizer), `numpy` — no torch

Quick start

from pathlib import Path
import json
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

MODEL_DIR = Path("./")

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
encoder = ort.InferenceSession(str(MODEL_DIR / "model.onnx"))
head = json.loads((MODEL_DIR / "classifier_head.json").read_text())

def predict_colloquial_proba(texts):
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=128, return_tensors="np")
    feed = {
        "input_ids":      enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
        "token_type_ids": enc.get("token_type_ids",
                                  np.zeros_like(enc["input_ids"])).astype(np.int64),
    }
    last_hidden = encoder.run(None, feed)[0]  # (B, T, 384)

    # Mean-pool, mask-aware. SetFit's pipeline did NOT normalize.
    mask = enc["attention_mask"].astype(np.float32)[..., None]
    pooled = (last_hidden * mask).sum(axis=1) / mask.sum(axis=1).clip(min=1.0)

    # Single-class binary LR head
    coef = np.array(head["coef"])
    intercept = np.array(head["intercept"])
    logits = pooled @ coef.T + intercept
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))

# Demo
samples = [
    "heart attack treatment",            # colloquial
    "STEMI management protocol",         # technical
    "stare decisis",                     # technical
    "can someone sue me for slipping",   # colloquial
    "compute the gradient of f(x,y)",    # technical
    "what's a polygon",                  # colloquial
]
for s, p in zip(samples, predict_colloquial_proba(samples)):
    label = "colloquial" if p >= 0.5 else "technical"
    print(f"  {p:.3f}  {label:<12}  {s!r}")

A complete runnable version is in inference_example.py.

Inference contract

Output: scalar probability P(label == colloquial) per input.
Threshold: 0.50 by default. Higher threshold → stricter (only confidently-colloquial queries flagged).
Label map: {"technical": 0, "colloquial": 1} (see classifier_head.json).
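
As a minimal sketch of the contract above, the helper below turns probabilities into labels at a configurable threshold. The function name `to_labels` and the `LABELS` tuple are illustrative and not part of the repo; the label order follows the label map stated above.

```python
import numpy as np

# Index = class id, following the label map above: technical=0, colloquial=1.
LABELS = ("technical", "colloquial")

def to_labels(probs, threshold=0.5):
    """Map P(colloquial) values to label strings (hypothetical helper)."""
    probs = np.asarray(probs, dtype=np.float64)
    ids = (probs >= threshold).astype(int)  # 1 when at or above the threshold
    return [LABELS[i] for i in ids]

probs = [0.97, 0.62, 0.08]
print(to_labels(probs))                  # default threshold 0.50
print(to_labels(probs, threshold=0.8))   # stricter: 0.62 no longer counts as colloquial
```

Raising the threshold trades recall for precision on the colloquial class, which in the retrieval setting means fewer queries get expanded.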

Architecture

input text → BertTokenizer → ONNX encoder (3-layer MiniLM)
                                  ↓
                          last_hidden_state (B, T, 384)
                                  ↓
                          mean-pool (mask-aware, no L2 normalize)
                                  ↓
                          (B, 384) embedding
                                  ↓
                          Logistic Regression (1×384 + 1)
                                  ↓
                          sigmoid → P(colloquial)

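The LR head is shipped as raw weights in classifier_head.json (sklearn LogisticRegression.coef_ + .intercept_). Splitting one model across two files is unusual; the reason is that the sentence-transformer encoder is the kind of widely shared infrastructure that other components can reuse, while the LR head is task-specific. Note that the head only matches the fine-tuned encoder shipped here, so swapping the encoder also means retraining the head.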
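
To illustrate why the pooling step is mask-aware, here is a small NumPy-only sketch with random stand-in arrays (no real encoder output or head weights are used). It pools the same tokens with and without trailing padding positions and checks that the padding does not change the embedding. All names and values are illustrative.

```python
import numpy as np

def mean_pool(last_hidden, attention_mask):
    """Mask-aware mean over the token axis; no L2 normalization."""
    mask = attention_mask.astype(np.float32)[..., None]    # (B, T, 1)
    summed = (last_hidden * mask).sum(axis=1)              # (B, H)
    counts = np.clip(mask.sum(axis=1), 1.0, None)          # avoid divide-by-zero
    return summed / counts

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1, 5, 384)).astype(np.float32)   # 5 real tokens

# Same tokens followed by 3 padding positions holding arbitrary values.
padding = rng.normal(size=(1, 3, 384)).astype(np.float32)
padded = np.concatenate([tokens, padding], axis=1)         # (1, 8, 384)

unpadded_vec = mean_pool(tokens, np.ones((1, 5), dtype=np.int64))
padded_vec = mean_pool(padded, np.array([[1, 1, 1, 1, 1, 0, 0, 0]]))

print(unpadded_vec.shape)
print(np.allclose(unpadded_vec, padded_vec, atol=1e-5))
```

This property is what lets queries of different lengths share a batch without the shorter ones being skewed by padding.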

Training data

Generated by 6 parallel agents, each covering about 4 domains. Each domain contributes 100 queries (50 colloquial + 50 technical) of varied length (short / medium / long, roughly 25/50/25). The 24 domains are:

Mathematics      Computer_Science    Physics     Chemistry
Biology          Medicine            Agriculture_and_Veterinary   Earth_and_Environmental_Sciences
Information_Technology               Communications_Journalism_and_Information   Services
Psychology       Social_Sciences     Economics   Business
History          Philosophy          Literature  Religion_and_Theology
Law              Education           Art_and_Design   Music
Engineering

Each example includes:

query — the actual query text
label — colloquial or technical
domain — one of the 24 domains
length — short / medium / long
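
One way to represent a record with these four fields in Python is sketched below. The class name `QueryExample`, the validation rules, and the sample record (including its length bucket) are illustrative assumptions; the dataset's actual storage format is not specified here.

```python
from dataclasses import dataclass

VALID_LABELS = {"colloquial", "technical"}
VALID_LENGTHS = {"short", "medium", "long"}

@dataclass(frozen=True)
class QueryExample:
    query: str    # the actual query text
    label: str    # "colloquial" or "technical"
    domain: str   # one of the 24 domain names
    length: str   # "short", "medium", or "long"

    def __post_init__(self):
        if not self.query.strip():
            raise ValueError("query must be non-empty")
        if self.label not in VALID_LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if self.length not in VALID_LENGTHS:
            raise ValueError(f"unknown length bucket: {self.length!r}")

# Illustrative record only; not taken from the training set.
example = QueryExample(
    query="heart attack treatment",
    label="colloquial",
    domain="Medicine",
    length="short",
)
print(example)
```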

Labeling conventions:

Colloquial ≠ casual tone. A query is colloquial when the user uses lay vocabulary in place of the field's canonical term: "heart attack" (not "myocardial infarction"), "clogged arteries" (not "atherosclerosis"), "irregular heartbeat" (not "arrhythmia").
Technical includes anything that names a canonical concept, drug, theorem, doctrine, etc. — even if short or commonly known ("Maxwell's equations", "stare decisis", "Bloom's taxonomy", "ICD-10 codes").
Mixed register with any canonical term as the primary subject → technical. Only label colloquial when the primary subject is described in lay terms.

Performance

Held-out test set (480 examples, balanced):

Metric	Value
Accuracy	0.994
Precision (colloquial)	0.996
Recall (colloquial)	0.992
F1 (colloquial)	0.994

Per-domain F1 (240 colloquial + 240 technical, 10 of each per domain in test split):

Domain	F1	Notes
Agriculture_and_Veterinary	1.000
Art_and_Design	1.000
Biology	1.000
Business	1.000
Chemistry	0.889	2 colloquial misclassified as technical
Communications_Journalism_and_Information	1.000
Computer_Science	1.000
Earth_and_Environmental_Sciences	1.000
Economics	1.000
Education	1.000
Engineering	1.000
History	1.000
Information_Technology	1.000
Law	1.000
Literature	1.000
Mathematics	1.000
Medicine	1.000
Music	1.000
Philosophy	1.000
Physics	0.952	1 technical misclassified as colloquial
Psychology	1.000
Religion_and_Theology	1.000
Services	1.000
Social_Sciences	1.000
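
The headline numbers can be re-derived from the per-domain notes. The sketch below assumes the only test errors are the three listed in the Notes column (2 colloquial queries predicted technical in Chemistry, 1 technical query predicted colloquial in Physics); the function name `prf` is illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 for the positive (colloquial) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# Overall: 240 colloquial + 240 technical test queries.
# Assumed errors: 2 false negatives (Chemistry), 1 false positive (Physics).
tp, fp, fn = 240 - 2, 1, 2
tn = 240 - fp
precision, recall, f1 = prf(tp, fp, fn)
accuracy = (tp + tn) / 480

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} acc={accuracy:.4f}")

# Per-domain: 10 colloquial + 10 technical queries each.
_, _, chemistry_f1 = prf(tp=8, fp=0, fn=2)    # 16/18
_, _, physics_f1 = prf(tp=10, fp=1, fn=0)     # 20/21
print(f"chemistry_f1={chemistry_f1:.3f} physics_f1={physics_f1:.3f}")
```

Under that assumption the results agree with the tables above after rounding to three decimals.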

Intended use

Designed as a retrieval-time gate for BM25 query expansion in Hybrid RAG systems. Concrete pattern:

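# nearest_canonical_concepts and query_embedding are placeholders for your own concept lookup.
if predict_colloquial_proba([query])[0] >= 0.5:
    expansion_terms = nearest_canonical_concepts(query_embedding)  # list of canonical terms
    bm25_query = query + " " + " ".join(expansion_terms)  # in practice, weight the expansion terms
else:
    bm25_query = query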
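
A self-contained version of this pattern might look like the sketch below. The classifier and the concept lookup are passed in as callables so the example runs without the model; `build_bm25_query`, `fake_proba`, and `fake_lookup` are hypothetical stand-ins, and the plain string concatenation ignores the term weighting mentioned above.

```python
from typing import Callable, Sequence

def build_bm25_query(
    query: str,
    colloquial_proba: Callable[[str], float],
    lookup_canonical: Callable[[str], Sequence[str]],
    threshold: float = 0.5,
) -> str:
    """Append canonical terms only when the query is classified colloquial."""
    if colloquial_proba(query) < threshold:
        return query                          # already canonical: skip expansion
    # Drop terms the query already contains to avoid duplicating them.
    terms = [t for t in lookup_canonical(query) if t.lower() not in query.lower()]
    return " ".join([query, *terms])

# Stand-ins for the real classifier and concept index (assumptions for the demo).
def fake_proba(q: str) -> float:
    return 0.9 if "heart attack" in q else 0.1

def fake_lookup(q: str) -> list:
    return ["myocardial infarction"]

print(build_bm25_query("heart attack treatment", fake_proba, fake_lookup))
print(build_bm25_query("STEMI management protocol", fake_proba, fake_lookup))
```

In a real pipeline, `colloquial_proba` could wrap the quick-start function, for example `lambda q: float(predict_colloquial_proba([q])[0])`.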

The classifier is not a general "register detector" — it was specifically trained to capture the lay-vs-canonical distinction in study queries, where the field's canonical vocabulary is the relevant axis.

Out-of-scope / limitations

Out-of-domain: trained on study-query language across 24 university-level academic domains. Queries from other registers (legal briefs in their entirety, conversational chat, code snippets) are out of distribution.
Non-English: English-only. Other languages will likely classify near 0.5 (uncertain).
Adversarial inputs: the classifier doesn't refuse — it always outputs a probability. Pair with sanity checks (length, language detection) for production use.
"What is X" queries with canonical X: training data tended to label "what is X" patterns as colloquial regardless of how technical X is. If your downstream task disagrees, retrain with reweighted examples.
Long queries: training distribution had ~25% long examples (16+ words). Very long inputs (>128 tokens) get truncated.
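
The sanity checks suggested above could be sketched as a cheap pre-filter that runs before the classifier. Everything here is an assumption for illustration: the function name, the word-count limit used as a rough proxy for the 128-token truncation, and the ASCII-letter ratio used as a crude stand-in for real language detection.

```python
def precheck(query: str, max_words: int = 100, min_ascii_ratio: float = 0.8) -> list:
    """Return reasons to distrust the classifier on this input (empty list = ok)."""
    reasons = []
    text = query.strip()
    if not text:
        return ["empty query"]
    if len(text.split()) > max_words:
        # Whitespace words only approximate WordPiece tokens.
        reasons.append("very long input, may be truncated")
    letters = [c for c in text if c.isalpha()]
    if letters:
        ascii_ratio = sum(c.isascii() for c in letters) / len(letters)
        if ascii_ratio < min_ascii_ratio:
            reasons.append("mostly non-ASCII letters, possibly non-English")
    return reasons

print(precheck("what's a polygon"))
print(precheck("что такое многоугольник"))
print(precheck("word " * 300))
```

This heuristic will not catch non-English text written in Latin script; a dedicated language-identification library is a better fit for production.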

Caveats from training

Encoder is trained on this task — not a frozen feature extractor. The contrastive SetFit step fine-tuned the encoder weights. Don't substitute the base paraphrase-MiniLM-L3-v2 encoder; use the ONNX exported here.
No normalization in the inference pipeline. The LR head was trained on raw mean-pooled vectors. Adding L2 normalization will break predictions.
Tokenizer must match. Use the tokenizer files in this repo, not a freshly-downloaded paraphrase-MiniLM-L3-v2 tokenizer (vocab and special tokens are identical, but defensively pin to this repo's files).
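
The effect of adding L2 normalization can be seen with a toy NumPy example. The weights and embeddings below are random stand-ins, not the shipped head; the point is only that the logit is then computed from a rescaled vector, so the probability changes.

```python
import numpy as np

rng = np.random.default_rng(0)
coef = rng.normal(size=(1, 384))          # stand-in for head["coef"]
intercept = np.array([0.0])               # stand-in for head["intercept"]
pooled = 3.0 * rng.normal(size=(2, 384))  # stand-in for raw mean-pooled embeddings

def proba(x):
    logits = x @ coef.T + intercept
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))

# What an extra normalization step would feed to the head instead.
normalized = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

print("raw:       ", proba(pooled))
print("normalized:", proba(normalized))
```

With a zero intercept, normalization divides each logit by the embedding norm, so the sign is kept but, when the norm is greater than 1 as here, probabilities are pulled toward 0.5. With a non-zero intercept the predicted class can flip as well.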

Files

File	Purpose	Size
`model.onnx`	Sentence-encoder ONNX (3-layer MiniLM, batched, dynamic length)	66 MB
`config.json`	Encoder config (Bert architecture)	1 KB
`tokenizer.json`	Fast Bert WordPiece tokenizer	695 KB
`vocab.txt`	Vocab (30,522 tokens)	226 KB
`tokenizer_config.json`, `special_tokens_map.json`	Tokenizer metadata	< 2 KB
`classifier_head.json`	sklearn LR head (coef 1×384 + intercept) + label map	11 KB
`training_metrics.json`	Evaluation results from training run	1 KB
`inference_example.py`	Runnable end-to-end demo	—
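
Before loading the model, a quick check that the listed files are present can save a confusing stack trace later. This is a sketch: `missing_files` is a hypothetical helper, and the required set below is one reading of which files the quick start needs (the tokenizer may load from a subset).

```python
from pathlib import Path

# Files the quick start reads directly or via AutoTokenizer (assumed minimal set).
REQUIRED_FILES = (
    "model.onnx",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.txt",
    "classifier_head.json",
)

def missing_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

missing = missing_files("./")
if missing:
    print("missing:", ", ".join(missing))
else:
    print("all required files present")
```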

Citation / acknowledgments

Base encoder: sentence-transformers/paraphrase-MiniLM-L3-v2 (Reimers & Gurevych, 2019). Training framework: SetFit (Tunstall et al., 2022). Built for the EXAMI study app's Hybrid RAG retrieval pipeline.
