crf-query-xtract — multilingual search-term extractor

Per-language CRF models that label the search term in a voice-assistant query: the minimal topic string to hand to a knowledge base or search engine. For example, "what is the speed of light?" yields "the speed of light", while a command with no topic ("set volume to fifty") yields "".

One kx_<lang>.pkl file per language: ca, da, de, en, es, eu, fr, gl, it, nl, pt. Load them through the crf_query_xtract package, which downloads the file from this repo on first use and caches it.

How to use

pip install crf_query_xtract

from crf_query_xtract import SearchtermExtractorCRF

kx = SearchtermExtractorCRF.from_pretrained("en")   # downloads kx_en.pkl from here, cached
kx.extract_keyword("what is the speed of light")    # 'the speed of light'

# your own / a fork:
kx = SearchtermExtractorCRF.from_pretrained("en", repo_id="me/my-models")
kx = SearchtermExtractorCRF.from_pretrained("en", repo_id="/path/to/local/dir")

Architecture

A single sklearn_crfsuite.CRF per language — no POS tagger, no deep learning. Tokenise with the quebra_frases regex tokenizer; describe each token with cheap orthographic features (lowercased form, 2/3-char prefixes/suffixes, word shape, case/digit flags, the ±2 neighbour tokens, BOS/EOS); predict O/B-KW/I-KW and join contiguous keyword tokens. CPU-friendly, millisecond inference. An ablation found Brill POS features add no accuracy, so they were dropped.
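
The feature and decoding steps described above can be sketched roughly as follows. This is an illustration, not the package's code: the regex tokenizer is a stand-in for quebra_frases, the feature names are invented for the example, and keeping only the first keyword run in join_keyword is an assumption. Dicts of this shape are what sklearn_crfsuite.CRF accepts as per-token input.

```python
import re

# Stand-in tokenizer (assumption): the package uses quebra_frases instead.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return TOKEN_RE.findall(text)


def word_shape(token):
    """Map each character to a class: X upper, x lower, d digit, else itself."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)


def token_features(tokens, i):
    """Orthographic features for tokens[i]; key names are hypothetical."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "prefix2": tok[:2],
        "prefix3": tok[:3],
        "suffix2": tok[-2:],
        "suffix3": tok[-3:],
        "shape": word_shape(tok),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "is_digit": tok.isdigit(),
    }
    # Lowercased neighbours within a window of two tokens on each side.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"{offset:+d}:lower"] = tokens[j].lower()
    if i == 0:
        feats["BOS"] = True
    if i == len(tokens) - 1:
        feats["EOS"] = True
    return feats


def join_keyword(tokens, labels):
    """Join the first contiguous B-KW/I-KW run; return "" if there is none."""
    span = []
    for tok, lab in zip(tokens, labels):
        if lab in ("B-KW", "I-KW"):
            if lab == "B-KW" and span:
                break  # a second keyword starts; keep only the first (assumption)
            span.append(tok)
        elif span:
            break
    return " ".join(span)
```

With a trained model, the labels would come from the CRF's prediction over [token_features(tokens, i) for i in range(len(tokens))]; here they have to be supplied by hand.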

Evaluation

Scored on the gold split of the training dataset. Because the extractor runs behind an intent gate, the headline numbers are for the in-scope subset (utterances that contain a search term): exact whole-keyword match and token F1. Negative rejection is reported on the remaining utterances, which contain no search term.

lang	exact	F1	neg-reject
ca	0.81	0.91	0.88
da	0.82	0.93	0.90
de	0.76	0.90	0.88
en	0.77	0.90	0.86
es	0.76	0.91	0.85
eu	0.48	0.71	0.99
fr	0.81	0.92	0.89
gl	0.83	0.95	0.97
it	0.76	0.89	0.83
nl	0.82	0.93	0.89
pt	0.79	0.90	0.89

eu is the weak spot (thin training data). The model has no forced fallback, so it returns "" when it labels no keyword.
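
For readers who want to reproduce numbers of this kind, here is a minimal sketch of the three columns. The exact definitions used by the scoring script are not stated in this card, so the following are assumptions: token F1 is computed SQuAD-style over lowercased whitespace tokens, exact match is plain string equality, and negative rejection is the share of no-term utterances for which the prediction is "". The function names are hypothetical.

```python
from collections import Counter


def token_f1(pred, gold):
    """Bag-of-tokens F1 between two strings (assumed definition)."""
    pred_toks = pred.lower().split()
    gold_toks = gold.lower().split()
    if not pred_toks or not gold_toks:
        # Both empty counts as a match, one empty as a miss.
        return float(pred_toks == gold_toks)
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def _mean(values):
    """Arithmetic mean, or None when there is nothing to average."""
    return sum(values) / len(values) if values else None


def evaluate(pairs):
    """Score (prediction, gold) pairs; gold == "" marks a no-term utterance."""
    in_scope = [(p, g) for p, g in pairs if g]
    negatives = [(p, g) for p, g in pairs if not g]
    return {
        "exact": _mean([float(p == g) for p, g in in_scope]),
        "f1": _mean([token_f1(p, g) for p, g in in_scope]),
        "neg_reject": _mean([float(p == "") for p, _ in negatives]),
    }
```

Splitting by whether the gold term is empty mirrors the in-scope versus rest distinction above, so a model that over-rejects scores well on one column and badly on the other two.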

Intended use & limitations

Built to sit between intent classification and a search/KB backend in OVOS common-query / DuckDuckGo / Wikipedia skills. It assumes its input is already a search query (it is not an intent classifier). Trained largely on templated + silver (LLM-labelled) data — see the dataset card for provenance and caveats.
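
A minimal sketch of that placement, assuming the intent gate has already decided the utterance is a search query. The names answer_query, extract and search are hypothetical; in practice extract would be something like kx.extract_keyword and search a call into the skill's backend. The stubs below only stand in for both so the example runs on its own.

```python
def answer_query(utterance, extract, search):
    """Extract a search term and query the backend; None if no term is found."""
    term = extract(utterance).strip()
    if not term:
        # The extractor returned "": leave the fallback to the caller.
        return None
    return search(term)


# Stubs for illustration only (not the real extractor or a real backend).
def fake_extract(utterance):
    known = {"what is the speed of light": "the speed of light"}
    return known.get(utterance, "")


def fake_search(term):
    return f"results for: {term}"


found = answer_query("what is the speed of light", fake_extract, fake_search)
missing = answer_query("set volume to fifty", fake_extract, fake_search)
```

Returning None rather than searching for the whole utterance keeps the "no forced fallback" behaviour visible to the caller, which can then decide what to do.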

Training

Reproducible from the dataset with train/train_from_dataset.py in the package repo. Training data: TigreGotico/search-term-extraction (Apache-2.0).
