Colloquial-vs-Technical Query Classifier (SetFit → ONNX)
Binary classifier that decides whether a study query is phrased in colloquial / lay language (everyday English, lay synonyms, partial vocabulary) or technical / canonical language (uses the field's named terminology). Designed to gate query-expansion in Hybrid RAG retrieval: expand BM25 with canonical synonyms when the user queries colloquially, skip expansion when the user is already using the canonical term.
| Base model | sentence-transformers/paraphrase-MiniLM-L3-v2 (17M params, 384-dim, 3 transformer layers) |
| Training framework | SetFit 1.1.3 (contrastive fine-tuning + sklearn LR head) |
| Exported as | ONNX (encoder) + JSON (LR head) via optimum.exporters.onnx |
| Held-out test F1 | 0.994 (P=0.996, R=0.992, acc=0.994 on 480 examples) |
| Domains covered | 24 (STEM + law + humanities + arts) |
| Training data size | 2,399 unique labeled queries (1,919 train / 480 test) |
| Inference deps | onnxruntime, transformers (for tokenizer), numpy; no torch |
Quick start
from pathlib import Path
import json

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

MODEL_DIR = Path("./")
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
encoder = ort.InferenceSession(str(MODEL_DIR / "model.onnx"))
head = json.loads((MODEL_DIR / "classifier_head.json").read_text())

def predict_colloquial_proba(texts):
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=128, return_tensors="np")
    feed = {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
        "token_type_ids": enc.get("token_type_ids",
                                  np.zeros_like(enc["input_ids"])).astype(np.int64),
    }
    last_hidden = encoder.run(None, feed)[0]  # (B, T, 384)
    # Mean-pool, mask-aware. SetFit's pipeline did NOT normalize.
    mask = enc["attention_mask"].astype(np.float32)[..., None]
    pooled = (last_hidden * mask).sum(axis=1) / mask.sum(axis=1).clip(min=1.0)
    # Single-class binary LR head
    coef = np.array(head["coef"])
    intercept = np.array(head["intercept"])
    logits = pooled @ coef.T + intercept
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))

# Demo
samples = [
    "heart attack treatment",           # colloquial
    "STEMI management protocol",        # technical
    "stare decisis",                    # technical
    "can someone sue me for slipping",  # colloquial
    "compute the gradient of f(x,y)",   # technical
    "what's a polygon",                 # colloquial
]
for s, p in zip(samples, predict_colloquial_proba(samples)):
    label = "colloquial" if p >= 0.5 else "technical"
    print(f" {p:.3f} {label:<12} {s!r}")
A complete runnable version is in inference_example.py.
Inference contract
- Output: scalar probability P(label == colloquial) per input.
- Threshold: 0.50 by default. Higher threshold → stricter (only confidently-colloquial queries flagged).
- Label map: {"technical": 0, "colloquial": 1} (see classifier_head.json).
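The contract above can be sketched as a small decoding helper. This is a minimal illustration, not code from this repo: the name labels_from_proba is hypothetical, and the label map is copied from the list above rather than read from classifier_head.json.

```python
import numpy as np

# Label map as stated in the inference contract; inverted here for decoding.
LABEL2ID = {"technical": 0, "colloquial": 1}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def labels_from_proba(proba, threshold=0.5):
    """Map P(colloquial) values to label strings (hypothetical helper)."""
    proba = np.asarray(proba, dtype=np.float64)
    # p >= threshold means colloquial (id 1), matching the Quick start demo.
    ids = (proba >= threshold).astype(int)
    return [ID2LABEL[int(i)] for i in ids]


probs = [0.97, 0.62, 0.08]
print(labels_from_proba(probs))       # default threshold 0.50
print(labels_from_proba(probs, 0.8))  # stricter: 0.62 is no longer flagged colloquial
```

Raising the threshold trades recall on colloquial queries for fewer unnecessary expansions; the right value depends on how costly a spurious expansion is downstream.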
Architecture
input text → BertTokenizer → ONNX encoder (3-layer MiniLM)
        ↓
last_hidden_state (B, T, 384)
        ↓
mean-pool (mask-aware, no L2 normalize)
        ↓
(B, 384) embedding
        ↓
Logistic Regression (1×384 + 1)
        ↓
sigmoid → P(colloquial)
The LR head is shipped as raw weights in classifier_head.json (sklearn LogisticRegression.coef_ and .intercept_). Splitting one model across two files is unusual; the split reflects that the sentence-transformer encoder is widely shared infrastructure (and could be swapped), while the LR head is task-specific.
Training data
Generated by 6 parallel agents (one per ~4 domains). 100 queries per domain (50 colloquial + 50 technical), length-varied (short / medium / long, roughly 25/50/25). Examples spanning:
Mathematics Computer_Science Physics Chemistry
Biology Medicine Agriculture Earth_and_Environmental_Sciences
Information_Technology Communications_Journalism_and_Information Services
Psychology Social_Sciences Economics Business
History Philosophy Literature Religion_and_Theology
Law Education Art_and_Design Music
Engineering
Each example includes:
- query: the actual query text
- label: colloquial or technical
- domain: one of the 24 domains
- length: short / medium / long
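One way to picture a record with these four fields is a small validated dataclass. This is a sketch only: the class name QueryExample and the validation rules are assumptions, and the on-disk format of the training data is not described in this card.

```python
from dataclasses import dataclass

LABELS = {"colloquial", "technical"}
LENGTHS = {"short", "medium", "long"}


@dataclass(frozen=True)
class QueryExample:
    """One labeled query with the four fields listed above (hypothetical schema)."""
    query: str
    label: str
    domain: str
    length: str

    def __post_init__(self):
        if not self.query.strip():
            raise ValueError("query must be non-empty")
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if self.length not in LENGTHS:
            raise ValueError(f"unknown length bucket: {self.length!r}")


# Illustrative record; the domain and length values here are assumed, not taken from the dataset.
ex = QueryExample("heart attack treatment", "colloquial", "Medicine", "short")
print(ex)
```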
Labeling conventions:
- Colloquial does not mean casual tone. It means the user uses lay vocabulary in place of the field's canonical term: "heart attack" (not "myocardial infarction"), "clogged arteries" (not "atherosclerosis"), "irregular heartbeat" (not "arrhythmia").
- Technical includes anything that names a canonical concept, drug, theorem, doctrine, etc., even if short or commonly known ("Maxwell's equations", "stare decisis", "Bloom's taxonomy", "ICD-10 codes").
- Mixed register with any canonical term as the primary subject → technical. Only label colloquial when the primary subject is described in lay terms.
Performance
Held-out test set (480 examples, balanced):
| Metric | Value |
|---|---|
| Accuracy | 0.994 |
| Precision (colloquial) | 0.996 |
| Recall (colloquial) | 0.992 |
| F1 (colloquial) | 0.994 |
Per-domain F1 (240 colloquial + 240 technical, 10 of each per domain in test split):
| Domain | F1 | Notes |
|---|---|---|
| Agriculture_and_Veterinary | 1.000 | |
| Art_and_Design | 1.000 | |
| Biology | 1.000 | |
| Business | 1.000 | |
| Chemistry | 0.889 | 2 colloquial misclassified as technical |
| Communications_Journalism_and_Information | 1.000 | |
| Computer_Science | 1.000 | |
| Earth_and_Environmental_Sciences | 1.000 | |
| Economics | 1.000 | |
| Education | 1.000 | |
| Engineering | 1.000 | |
| History | 1.000 | |
| Information_Technology | 1.000 | |
| Law | 1.000 | |
| Literature | 1.000 | |
| Mathematics | 1.000 | |
| Medicine | 1.000 | |
| Music | 1.000 | |
| Philosophy | 1.000 | |
| Physics | 0.952 | 1 technical misclassified as colloquial |
| Psychology | 1.000 | |
| Religion_and_Theology | 1.000 | |
| Services | 1.000 | |
| Social_Sciences | 1.000 | |
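The headline metrics can be cross-checked against the Notes column. The sketch below assumes the three errors noted per domain (2 colloquial misclassified as technical in Chemistry, 1 technical misclassified as colloquial in Physics) are the only errors in the 480-example test set; the confusion counts are derived from that assumption, not read from training_metrics.json.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for the positive (colloquial) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Overall: 240 colloquial + 240 technical, 2 false negatives, 1 false positive.
tp, fn = 240 - 2, 2
fp, tn = 1, 240 - 1
p, r, f1 = prf(tp, fp, fn)
acc = (tp + tn) / (tp + tn + fp + fn)
print(f"overall   P={p:.3f} R={r:.3f} F1={f1:.3f} acc={acc:.3f}")

# Per domain: 10 colloquial + 10 technical each.
_, _, f1_chem = prf(tp=8, fp=0, fn=2)    # Chemistry: 2 colloquial missed
_, _, f1_phys = prf(tp=10, fp=1, fn=0)   # Physics: 1 technical flagged as colloquial
print(f"Chemistry F1={f1_chem:.3f}  Physics F1={f1_phys:.3f}")
```

Under that assumption the counts reproduce the reported values: P=0.996, R=0.992, F1=0.994, accuracy 0.994, and per-domain F1 of 0.889 (Chemistry) and 0.952 (Physics).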
Intended use
Designed as a retrieval-time gate for BM25 query expansion in Hybrid RAG systems. Concrete pattern:
if classifier.is_colloquial(query):
    expansion_terms = nearest_canonical_concepts(query_embedding)
    bm25_query = original_query + " " + expansion_terms  # weighted
else:
    bm25_query = original_query
The classifier is not a general "register detector": it was specifically trained to capture the lay-vs-canonical distinction in study queries, where the field's canonical vocabulary is the relevant axis.
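The gating pattern can be sketched as a runnable function with the classifier and the concept lookup passed in as callables. Everything named here (build_bm25_query, fake_proba, fake_expand) is hypothetical; in practice the probability function could be predict_colloquial_proba from the Quick start, and the term weighting mentioned in the pattern is not implemented in this sketch.

```python
from typing import Callable, Sequence


def build_bm25_query(
    query: str,
    colloquial_proba: Callable[[Sequence[str]], Sequence[float]],
    expand: Callable[[str], Sequence[str]],
    threshold: float = 0.5,
) -> str:
    """Append canonical expansion terms only when the query looks colloquial."""
    p = float(colloquial_proba([query])[0])
    if p < threshold:
        return query  # already uses canonical terms: skip expansion
    # Drop expansion terms that already appear in the query text.
    terms = [t for t in expand(query) if t.lower() not in query.lower()]
    return " ".join([query, *terms])


# Stand-ins so the sketch runs without the ONNX model or a concept index.
def fake_proba(texts):
    return [0.9 if "heart attack" in t else 0.1 for t in texts]


def fake_expand(query):
    return ["myocardial infarction"]


print(build_bm25_query("heart attack treatment", fake_proba, fake_expand))
print(build_bm25_query("STEMI management protocol", fake_proba, fake_expand))
```

Passing the classifier in as a callable keeps the gate testable without loading the ONNX session.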
Out-of-scope / limitations
- Out-of-domain: trained on study-query language across 24 university-level academic domains. Queries from other registers (legal briefs in their entirety, conversational chat, code snippets) are out of distribution.
- Non-English: English-only. Other languages will likely classify near 0.5 (uncertain).
- Adversarial inputs: the classifier does not refuse; it always outputs a probability. Pair with sanity checks (length, language detection) for production use.
- "What is X" queries with canonical X: training data tended to label "what is X" patterns as colloquial regardless of how technical X is. If your downstream task disagrees, retrain with reweighted examples.
- Long queries: training distribution had ~25% long examples (16+ words). Very long inputs (>128 tokens) get truncated.
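The sanity checks suggested above could look roughly like the pre-filter below. It is a sketch under loud assumptions: the function name, the 64-word cap, and the ASCII-letter ratio are illustrative choices, and the ASCII ratio is only a crude proxy that will not catch non-English text written in Latin script. A real language detector would be needed for that.

```python
def passes_sanity_checks(text: str, max_words: int = 64,
                         min_ascii_ratio: float = 0.9) -> bool:
    """Cheap pre-filter before calling the classifier (illustrative thresholds)."""
    stripped = text.strip()
    if not stripped:
        return False  # empty input
    if len(stripped.split()) > max_words:
        return False  # long inputs risk truncation at 128 tokens
    letters = [ch for ch in stripped if ch.isalpha()]
    if not letters:
        return False  # digits or punctuation only
    # Crude stand-in for language detection: share of ASCII letters.
    ascii_ratio = sum(ch.isascii() for ch in letters) / len(letters)
    return ascii_ratio >= min_ascii_ratio


for q in ["what's a polygon", "", "что такое многоугольник", "word " * 100]:
    print(repr(q[:30]), passes_sanity_checks(q))
```

Queries that fail the filter can be routed to a default path (for example, no expansion) instead of trusting a probability the model was never trained to produce.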
Caveats from training
- Encoder is trained on this task, not a frozen feature extractor. The contrastive SetFit step fine-tuned the encoder weights. Don't substitute the base paraphrase-MiniLM-L3-v2 encoder; use the ONNX exported here.
- No normalization in the inference pipeline. The LR head was trained on raw mean-pooled vectors. Adding L2 normalization will break predictions.
- Tokenizer must match. Use the tokenizer files in this repo, not a freshly-downloaded paraphrase-MiniLM-L3-v2 tokenizer (vocab and special tokens are identical, but defensively pin to this repo's files).
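The normalization caveat can be seen with a toy calculation. The 4-dimensional vectors below are made-up stand-ins for the real 384-dimensional embedding and LR weights, chosen only to show the mechanism: L2 normalization rescales the dot product while the intercept stays fixed, so the logit can change sign.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Toy stand-ins (assumed values, not taken from classifier_head.json).
pooled = np.array([[3.0, -1.0, 2.0, 0.5]])  # raw mean-pooled embedding
coef = np.array([[0.8, -0.4, 0.6, 0.2]])    # LR weights, shape (1, D)
intercept = np.array([-2.0])

# What the head was trained on: raw vectors.
raw_p = sigmoid(pooled @ coef.T + intercept)[0, 0]

# What happens if an L2 normalization step is added before the head.
unit = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
norm_p = sigmoid(unit @ coef.T + intercept)[0, 0]

print(f"raw: {raw_p:.3f}  normalized: {norm_p:.3f}")
```

In this toy case the raw vector lands above 0.5 and the normalized one below it, so the same input would flip from colloquial to technical.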
Files
| File | Purpose | Size |
|---|---|---|
| model.onnx | Sentence-encoder ONNX (3-layer MiniLM, batched, dynamic length) | 66 MB |
| config.json | Encoder config (Bert architecture) | 1 KB |
| tokenizer.json | Fast Bert WordPiece tokenizer | 695 KB |
| vocab.txt | Vocab (30,522 tokens) | 226 KB |
| tokenizer_config.json, special_tokens_map.json | Tokenizer metadata | < 2 KB |
| classifier_head.json | sklearn LR head (coef 1×384 + intercept) + label map | 11 KB |
| training_metrics.json | Evaluation results from training run | 1 KB |
| inference_example.py | Runnable end-to-end demo | - |
Citation / acknowledgments
Base encoder: sentence-transformers/paraphrase-MiniLM-L3-v2 (Reimers & Gurevych, 2019).
Training framework: SetFit (Tunstall et al., 2022).
Built for the EXAMI study app's Hybrid RAG retrieval pipeline.