Retrieval Contract - Stage 2 (Candidate Generation / Grounding)
Stage 2 builds a high-recall candidate pool from a fixed canonical tag vocabulary. It does not invent tags and does not make final selection decisions.
Primary implementation: psq_rag/retrieval/psq_retrieval.py::psq_candidates_from_rewrite_phrases.
Inputs
- rewrite_phrases: Sequence[str]
- allow_nsfw_tags: bool
- context_tags: Optional[Sequence[str]] = None
- context_tag_weight: float = 1.0
- context_weight: float = 0.5
- per_phrase_k: int = 50
- per_phrase_final_k: int = 1
- global_k: int = 300
- min_tag_count: int = 0
- verbose: bool = False
- return_phrase_ranks: bool = False
Notes:
- In app runtime, these are usually overridden by env-backed settings:
  - PSQ_RETRIEVAL_GLOBAL_K (default 300)
  - PSQ_RETRIEVAL_PER_PHRASE_K (default 10)
  - PSQ_RETRIEVAL_PER_PHRASE_FINAL_K (default 1)
  - PSQ_MIN_TAG_COUNT (default 100 in app path)
- Stage 2 may be called with structural/probe tags in context_tags to improve context scoring.
Phrase Normalization
Each incoming phrase is normalized to:
- lowercase
- underscores converted to spaces
- trimmed and whitespace-collapsed
Then phrases are deduplicated in first-seen order.
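The normalization and dedup steps above can be sketched as follows. This is a minimal illustration, not the project's code: the function name normalize_phrases is hypothetical, and dropping phrases that normalize to an empty string is an assumption the contract does not state.

```python
import re
from typing import Iterable, List


def normalize_phrases(phrases: Iterable[str]) -> List[str]:
    """Lowercase, turn underscores into spaces, collapse whitespace, dedupe."""
    seen = set()
    out: List[str] = []
    for raw in phrases:
        norm = raw.lower().replace("_", " ")
        norm = re.sub(r"\s+", " ", norm).strip()
        # Assumption: phrases that become empty are dropped.
        if norm and norm not in seen:
            seen.add(norm)
            out.append(norm)  # first-seen order is preserved
    return out
```

For example, "Red_Hair" and "  red   hair " both normalize to "red hair", so only the first occurrence is kept.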
Head-Term Expansion
For each normalized multi-token phrase, Stage 2 may add the last token ("head") as an extra phrase.
A head is added only when:
- phrase has at least 2 tokens
- head length is at least 3
- head is not in the built-in stopword list (and, or, the, a, an, of, to, in, on, ...).
Final phrase list = deduped original phrases + valid heads (deduped again, order-preserving).
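A rough sketch of the head-term rule, assuming phrases are already normalized. The name expand_with_heads is hypothetical, and STOPWORDS holds only the entries listed above; the real built-in list is longer and is not reproduced here.

```python
from typing import List, Sequence

# Partial stopword list: only the entries named in this document.
STOPWORDS = {"and", "or", "the", "a", "an", "of", "to", "in", "on"}


def expand_with_heads(phrases: Sequence[str]) -> List[str]:
    """Append the last token of each multi-token phrase when it qualifies."""
    heads: List[str] = []
    for phrase in phrases:
        tokens = phrase.split()
        if len(tokens) < 2:
            continue  # single-token phrases have no separate head
        head = tokens[-1]
        if len(head) >= 3 and head not in STOPWORDS:
            heads.append(head)
    # dict.fromkeys dedupes while keeping first-seen order.
    return list(dict.fromkeys(list(phrases) + heads))
```

With input ["long tail", "sit on"], only "tail" is added: "on" is shorter than 3 characters and is also a stopword.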
Canonical Projection Rules
All candidate generation is projected into canonical tags via _project_to_canonicals(token):
- If token itself appears as canonical (token in tag_counts, or token has a TF-IDF row), keep token.
- Else if token is an alias (token in alias2tags), map to all alias targets.
- Else drop token.
min_tag_count filtering is applied during this projection.
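The projection rules can be sketched like this. The signature is illustrative (the real _project_to_canonicals takes only token and reads shared state), tfidf_tags stands in for "has a TF-IDF row", and treating a tag with no recorded count as count 0 is an assumption.

```python
from typing import Dict, List, Sequence, Set


def project_to_canonicals(
    token: str,
    tag_counts: Dict[str, int],
    tfidf_tags: Set[str],
    alias2tags: Dict[str, Sequence[str]],
    min_tag_count: int = 0,
) -> List[str]:
    """Map a token to canonical tags: canonical first, then alias, else drop."""
    if token in tag_counts or token in tfidf_tags:
        targets = [token]
    elif token in alias2tags:
        targets = list(alias2tags[token])
    else:
        return []
    # Assumption: a tag without a recorded count is treated as count 0.
    return [t for t in targets if tag_counts.get(t, 0) >= min_tag_count]
```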
Per-Phrase Candidate Generation
For each phrase:
- Build lookup = phrase.replace(" ", "_").
- Compute projected_lookup = _project_to_canonicals(lookup).
- FastText neighbors:
  - normally: fasttext_model.most_similar(lookup, topn=per_phrase_k)
  - special case: if PSQ_SKIP_FASTTEXT_FOR_EXACT_ALIAS is enabled (default "1") and projected_lookup is non-empty, the neighbor call is skipped for that phrase.
- For each neighbor token:
  - project to canonical tags
  - apply NSFW filter if needed
  - keep best FastText similarity per canonical tag
  - keep the alias_token that achieved that best similarity.
- Exact-match injection:
  - each canonical tag in projected_lookup is injected with score_fasttext = 1.0 and alias_token = lookup.
This creates a per-phrase map: tag -> score_fasttext (+ best token for debug).
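A sketch of how that per-phrase map might be assembled. To stay self-contained it takes the neighbor lookup and the projection as callables instead of calling FastText directly; per_phrase_map, neighbor_fn, and project are hypothetical names. Whether injected exact-match tags also pass through the NSFW filter is not stated above, so this sketch leaves them unfiltered as an assumption.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def per_phrase_map(
    lookup: str,
    neighbor_fn: Callable[[str], Iterable[Tuple[str, float]]],
    project: Callable[[str], List[str]],
    nsfw_tags: Set[str],
    allow_nsfw_tags: bool,
    skip_neighbors_on_exact: bool = True,
) -> Dict[str, Tuple[float, str]]:
    """Return tag -> (score_fasttext, alias_token) for one phrase."""
    projected_lookup = project(lookup)
    best: Dict[str, Tuple[float, str]] = {}
    # Skip the neighbor call when the lookup already resolves to canonical tags.
    if not (skip_neighbors_on_exact and projected_lookup):
        for token, sim in neighbor_fn(lookup):
            for tag in project(token):
                if not allow_nsfw_tags and tag in nsfw_tags:
                    continue
                if tag not in best or sim > best[tag][0]:
                    best[tag] = (sim, token)  # keep best similarity and its token
    # Exact-match injection (assumption: not NSFW-filtered in this sketch).
    for tag in projected_lookup:
        best[tag] = (1.0, lookup)
    return best
```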
Context Vector and Context Scores
Stage 2 builds one request-level pseudo TF-IDF vector from:
- normalized final phrases (lookup form, underscore)
- optional context_tags weighted by context_tag_weight.
Vector path:
- TF-IDF sparse vector -> SVD transform -> L2 normalize.
If zero norm:
- query_has_context = False
- all candidates use FastText-only combined score.
If non-zero:
- compute cosine vs normalized reduced TF-IDF row vectors for tags that have TF-IDF rows.
- tags with no TF-IDF row initially get score_context = None.
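The last two steps (L2 normalization and cosine scoring) can be sketched in plain Python. This assumes the query vector has already been through TF-IDF and SVD, and that tag_rows holds L2-normalized reduced rows; both names and the function context_scores are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def context_scores(
    query_vec: Sequence[float],
    tag_rows: Dict[str, Sequence[float]],
    candidate_tags: Sequence[str],
) -> Tuple[bool, Dict[str, Optional[float]]]:
    """Return (query_has_context, tag -> score_context or None)."""
    norm = math.sqrt(sum(x * x for x in query_vec))
    if norm == 0.0:
        # Zero-norm query: no context signal for any candidate.
        return False, {tag: None for tag in candidate_tags}
    q: List[float] = [x / norm for x in query_vec]
    scores: Dict[str, Optional[float]] = {}
    for tag in candidate_tags:
        row = tag_rows.get(tag)
        # Rows are assumed unit-length, so the dot product is the cosine.
        scores[tag] = None if row is None else sum(a * b for a, b in zip(q, row))
    return True, scores
```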
Missing-Context Imputation
When query_has_context = True, per phrase:
- gather non-None context scores for candidates in that phrase
- default_context_for_phrase = 10th percentile (q=0.10) of those scores
  - if none exist, default is 0.0.
- tags missing context receive this default and are marked context_imputed=True (verbose report only).
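A minimal sketch of the imputation step for one phrase. The helper names are hypothetical, and the percentile uses linear interpolation between order statistics (NumPy's default method); the interpolation the real code uses is not stated here, so treat that as an assumption.

```python
import math
from typing import Dict, List, Optional, Tuple


def percentile_linear(values: List[float], q: float) -> float:
    """q-quantile with linear interpolation between sorted neighbors."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def impute_context(
    scores: Dict[str, Optional[float]], q: float = 0.10
) -> Tuple[Dict[str, float], Dict[str, bool]]:
    """Fill missing context scores with the phrase-level default."""
    known = [s for s in scores.values() if s is not None]
    default = percentile_linear(known, q) if known else 0.0
    filled = {tag: (default if s is None else s) for tag, s in scores.items()}
    imputed = {tag: s is None for tag, s in scores.items()}
    return filled, imputed
```

Using a low percentile instead of the mean keeps tags without TF-IDF rows from outranking tags with measured context.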
Score Fusion
Per phrase candidate:
- if no context vector: score_combined = score_fasttext
- else: score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context
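The fusion rule written as a small function, with fuse_scores as a hypothetical name. It assumes score_context has already been imputed whenever a context vector exists.

```python
from typing import Optional


def fuse_scores(
    score_fasttext: float,
    score_context: Optional[float],
    context_weight: float,
    query_has_context: bool,
) -> float:
    """Linear blend of FastText and context scores, or FastText only."""
    if not query_has_context or score_context is None:
        return score_fasttext
    return (1 - context_weight) * score_fasttext + context_weight * score_context
```

With the default context_weight of 0.5, a candidate with score_fasttext 0.8 and score_context 0.4 gets a combined score of about 0.6.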
Per-Phrase Truncation and Required Tags
Per phrase:
- sort by score_combined desc
- keep top per_phrase_final_k.
Required tags:
- tags from projected_lookup for that phrase.
- each required tag must be present in that phrase's final list.
- if insertion is needed and list is full, evict lowest-ranked non-required item (or last item if all are required).
- re-sort by combined score after enforcement.
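The truncation and required-tag enforcement could look roughly like this. The name truncate_with_required is hypothetical, k is assumed to be at least 1, and required tags are visited in sorted order only to make the sketch deterministic.

```python
from typing import Dict, List, Set, Tuple


def truncate_with_required(
    scored: Dict[str, float], required: Set[str], k: int
) -> List[Tuple[str, float]]:
    """Keep the top k by score_combined, then force required tags in."""
    def by_score(item: Tuple[str, float]) -> float:
        return item[1]

    final = sorted(scored.items(), key=by_score, reverse=True)[:k]
    present = {tag for tag, _ in final}
    for tag in sorted(required):
        if tag in present or tag not in scored:
            continue
        if len(final) >= k:
            # Evict the lowest-ranked non-required item, else the last item.
            evict = len(final) - 1
            for i in range(len(final) - 1, -1, -1):
                if final[i][0] not in required:
                    evict = i
                    break
            present.discard(final[evict][0])
            del final[evict]
        final.append((tag, scored[tag]))
        present.add(tag)
    final.sort(key=by_score, reverse=True)  # re-sort after enforcement
    return final
```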
Merge Across Phrases
All per-phrase survivors are merged into a global map by canonical tag:
- sources: union of phrases that contributed this tag
- score_fasttext: max observed across phrases
- score_context: max observed across phrases (None-aware)
- score_combined: max observed across phrases
- count: from corpus tag counts
Then:
- sort by score_combined desc
- cut to global_k
- return ordered List[Candidate].
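A sketch of the merge, using plain dicts in place of Candidate objects to stay self-contained. merge_phrases and none_max are hypothetical names, and each per-phrase row is assumed to carry tag, score_fasttext, score_context, and score_combined.

```python
from typing import Dict, List, Optional


def none_max(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """max() that treats None as 'no observation'."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def merge_phrases(
    per_phrase: Dict[str, List[dict]], tag_counts: Dict[str, int], global_k: int
) -> List[dict]:
    """Merge per-phrase survivors by tag, then sort and cut to global_k."""
    merged: Dict[str, dict] = {}
    for phrase, rows in per_phrase.items():
        for row in rows:
            entry = merged.setdefault(row["tag"], {
                "tag": row["tag"], "sources": [],
                "score_fasttext": None, "score_context": None,
                "score_combined": None, "count": tag_counts.get(row["tag"]),
            })
            if phrase not in entry["sources"]:
                entry["sources"].append(phrase)
            for key in ("score_fasttext", "score_context", "score_combined"):
                entry[key] = none_max(entry[key], row.get(key))
    ordered = sorted(merged.values(), key=lambda e: e["score_combined"], reverse=True)
    return ordered[:global_k]
```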
Output Shapes
Candidate fields:
- tag: str
- score_combined: float
- score_fasttext: Optional[float]
- score_context: Optional[float]
- count: Optional[int]
- sources: List[str]
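For orientation, the field list above maps onto a container like the one below. Whether the real Candidate is a dataclass, a NamedTuple, or something else is not stated here, and the default values are assumptions added for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    """Illustrative shape only; field names and types follow the list above."""
    tag: str
    score_combined: float
    score_fasttext: Optional[float] = None  # default is an assumption
    score_context: Optional[float] = None   # default is an assumption
    count: Optional[int] = None             # default is an assumption
    sources: List[str] = field(default_factory=list)
```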
Return variants:
- default: List[Candidate]
- verbose: (List[Candidate], phrase_reports)
- return_phrase_ranks only: (List[Candidate], phrase_rank_by_tag)
- verbose + return_phrase_ranks: (List[Candidate], phrase_reports, phrase_rank_by_tag)
phrase_reports rows include:
- phrase, normalized, lookup, tfidf_vocab, oov_terms
- candidate rows with tag, alias_token, score_fasttext, score_context, score_combined, context_imputed, count.
Policy Filtering
NSFW filtering uses get_nsfw_tags() from psq_rag.retrieval.state.
Current source is word_rating_probabilities.csv with threshold 0.95.
Stage Boundary
Stage 2 is retrieval grounding only. Final tag choice happens in Stage 3 closed-set selection.