Retrieval Contract - Stage 2 (Candidate Generation / Grounding)

Stage 2 builds a high-recall candidate pool from a fixed canonical tag vocabulary. It does not invent tags and does not make final selection decisions.

Primary implementation: psq_rag/retrieval/psq_retrieval.py::psq_candidates_from_rewrite_phrases.


Inputs

  • rewrite_phrases: Sequence[str]
  • allow_nsfw_tags: bool
  • context_tags: Optional[Sequence[str]] = None
  • context_tag_weight: float = 1.0
  • context_weight: float = 0.5
  • per_phrase_k: int = 50
  • per_phrase_final_k: int = 1
  • global_k: int = 300
  • min_tag_count: int = 0
  • verbose: bool = False
  • return_phrase_ranks: bool = False

Notes:

  • In the app runtime, these are usually overridden by env-backed settings:
    • PSQ_RETRIEVAL_GLOBAL_K (default 300)
    • PSQ_RETRIEVAL_PER_PHRASE_K (default 10)
    • PSQ_RETRIEVAL_PER_PHRASE_FINAL_K (default 1)
    • PSQ_MIN_TAG_COUNT (default 100 in app path)
  • Stage 2 may be called with structural/probe tags in context_tags to improve context scoring.
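The env-backed override pattern can be sketched as follows (the helper name env_int is illustrative; only the variable names and defaults above are from the contract):

```python
import os

def env_int(name, default):
    # Read an integer setting from the environment, falling back to the default.
    raw = os.environ.get(name)
    return default if raw in (None, "") else int(raw)

# App-path defaults per the contract above.
settings = {
    "global_k": env_int("PSQ_RETRIEVAL_GLOBAL_K", 300),
    "per_phrase_k": env_int("PSQ_RETRIEVAL_PER_PHRASE_K", 10),
    "per_phrase_final_k": env_int("PSQ_RETRIEVAL_PER_PHRASE_FINAL_K", 1),
    "min_tag_count": env_int("PSQ_MIN_TAG_COUNT", 100),
}
```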

Phrase Normalization

Each incoming phrase is normalized to:

  • lowercase
  • underscores converted to spaces
  • trimmed and whitespace-collapsed

Then phrases are deduplicated in first-seen order.
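A minimal sketch of the normalization and dedupe steps (function names here are illustrative, not the actual implementation):

```python
import re

def normalize_phrase(phrase: str) -> str:
    # Lowercase, underscores -> spaces, trim, collapse internal whitespace.
    s = phrase.lower().replace("_", " ")
    return re.sub(r"\s+", " ", s).strip()

def dedupe_preserving_order(phrases):
    # Deduplicate in first-seen order.
    seen, out = set(), []
    for p in phrases:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out

normalized = dedupe_preserving_order(
    normalize_phrase(p) for p in ["Long_Hair ", "long  hair", "Red Dress"]
)
# normalized == ["long hair", "red dress"]
```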


Head-Term Expansion

For each normalized multi-token phrase, Stage 2 may add the last token ("head") as an extra phrase.

A head is added only when:

  • phrase has at least 2 tokens
  • head length is at least 3
  • head is not in the built-in stopword list (and, or, the, a, an, of, to, in, on, ...).

Final phrase list = deduped original phrases + valid heads (deduped again, order-preserving).
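The head-extraction rules above can be sketched like this (the stopword set is truncated to the tokens listed in the contract; the real list is longer):

```python
# Illustrative subset of the built-in stopword list.
STOPWORDS = {"and", "or", "the", "a", "an", "of", "to", "in", "on"}

def head_terms(phrases):
    heads = []
    for phrase in phrases:
        tokens = phrase.split()
        if len(tokens) < 2:
            continue  # single-token phrases get no head
        head = tokens[-1]
        if len(head) >= 3 and head not in STOPWORDS:
            heads.append(head)
    return heads

def expand_with_heads(phrases):
    # Final list = deduped originals + valid heads, order-preserving.
    seen, out = set(), []
    for p in list(phrases) + head_terms(phrases):
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out

expand_with_heads(["long hair", "red dress", "top of"])
# -> ["long hair", "red dress", "top of", "hair", "dress"]
```

Note that "top of" contributes no head: "of" fails both the length and stopword checks.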


Canonical Projection Rules

All candidate generation is projected into canonical tags via _project_to_canonicals(token):

  1. If token itself appears as canonical (token in tag_counts or token has a TF-IDF row), keep token.
  2. Else if token is an alias (token in alias2tags), map to all alias targets.
  3. Else drop token.

min_tag_count filtering is applied during this projection.
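A sketch of the projection rules, assuming tag_counts, tfidf_rows, and alias2tags are plain dict/set lookups (the real structures may differ):

```python
def project_to_canonicals(token, tag_counts, tfidf_rows, alias2tags, min_tag_count=0):
    # Rule 1: the token is itself canonical.
    if token in tag_counts or token in tfidf_rows:
        targets = [token]
    # Rule 2: the token is an alias; map to all alias targets.
    elif token in alias2tags:
        targets = list(alias2tags[token])
    # Rule 3: unknown tokens are dropped.
    else:
        return []
    # min_tag_count filtering is applied during this projection.
    return [t for t in targets if tag_counts.get(t, 0) >= min_tag_count]

tag_counts = {"long_hair": 5000, "red_dress": 40}
project_to_canonicals("longhair", tag_counts, {}, {"longhair": ["long_hair"]},
                      min_tag_count=100)
# -> ["long_hair"]
```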


Per-Phrase Candidate Generation

For each phrase:

  1. Build lookup = phrase.replace(" ", "_").
  2. Compute projected_lookup = _project_to_canonicals(lookup).
  3. FastText neighbors:
    • normally: fasttext_model.most_similar(lookup, topn=per_phrase_k)
    • special case:
      • if PSQ_SKIP_FASTTEXT_FOR_EXACT_ALIAS is enabled (default "1") and projected_lookup is non-empty, neighbor call is skipped for that phrase.
  4. For each neighbor token:
    • project to canonical tags
    • apply NSFW filter if needed
    • keep best FastText similarity per canonical tag
    • keep alias_token that achieved that best similarity.
  5. Exact-match injection:
    • each canonical tag in projected_lookup is injected with score_fasttext = 1.0
    • alias_token = lookup.

This creates a per-phrase map: tag -> score_fasttext (+ best token for debug).
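The per-phrase loop can be sketched as below. The neighbor and projection functions are injected stand-ins for the FastText model and _project_to_canonicals, and the NSFW filter is omitted for brevity:

```python
import os

def per_phrase_candidates(phrase, neighbors_fn, project_fn, per_phrase_k=50):
    lookup = phrase.replace(" ", "_")
    projected_lookup = project_fn(lookup)
    best = {}  # canonical tag -> (best score_fasttext, alias_token that achieved it)
    skip_ft = os.environ.get("PSQ_SKIP_FASTTEXT_FOR_EXACT_ALIAS", "1") == "1"
    # Neighbor call is skipped when the flag is on and the lookup projects exactly.
    if not (skip_ft and projected_lookup):
        for token, sim in neighbors_fn(lookup, per_phrase_k):
            for tag in project_fn(token):
                if tag not in best or sim > best[tag][0]:
                    best[tag] = (sim, token)
    # Exact-match injection: projected lookups get score_fasttext = 1.0.
    for tag in projected_lookup:
        best[tag] = (1.0, lookup)
    return best

fake_neighbors = lambda token, k: [("long_hair", 0.9), ("short_hair", 0.7)]
fake_project = lambda token: [token] if token in {"long_hair", "short_hair"} else []
per_phrase_candidates("blue hair", fake_neighbors, fake_project)
# -> {"long_hair": (0.9, "long_hair"), "short_hair": (0.7, "short_hair")}
```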


Context Vector and Context Scores

Stage 2 builds one request-level pseudo TF-IDF vector from:

  • normalized final phrases (lookup form, underscore)
  • optional context_tags weighted by context_tag_weight.

Vector path:

  • TF-IDF sparse vector -> SVD transform -> L2 normalize.

If zero norm:

  • query_has_context = False
  • all candidates use the FastText-only combined score.

If non-zero:

  • compute cosine vs normalized reduced TF-IDF row vectors for tags that have TF-IDF rows.
  • tags with no TF-IDF row initially get score_context = None.
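The vector path can be sketched in plain Python (the SVD transform is shown as a dense matrix multiply; the real pipeline operates on sparse TF-IDF vectors and pre-normalized reduced rows):

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return None if norm == 0.0 else [x / norm for x in vec]

def context_scores(query_tfidf, svd_components, tag_reduced_rows):
    # SVD transform: reduced[i] = sum_j components[i][j] * query[j].
    reduced = [sum(c * x for c, x in zip(row, query_tfidf)) for row in svd_components]
    q = l2_normalize(reduced)
    if q is None:
        return None  # zero norm -> query_has_context = False
    # Cosine vs normalized reduced TF-IDF row vectors (tags without rows get None).
    return {tag: sum(a * b for a, b in zip(q, row))
            for tag, row in tag_reduced_rows.items()}

identity = [[1.0, 0.0], [0.0, 1.0]]
rows = {"a": [0.6, 0.8], "b": [1.0, 0.0]}
scores = context_scores([3.0, 4.0], identity, rows)
# scores["a"] ≈ 1.0, scores["b"] ≈ 0.6
```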

Missing-Context Imputation

When query_has_context = True, per phrase:

  • gather non-None context scores for candidates in that phrase
  • default_context_for_phrase = 10th percentile (q=0.10) of those scores
    • if none exist, default is 0.0.
  • tags missing context receive this default and are marked context_imputed=True (verbose report only).
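A sketch of the imputation step, with a hand-rolled linear-interpolation quantile in place of whatever percentile routine the implementation actually uses:

```python
def quantile(values, q):
    # Linear-interpolated quantile over a small list.
    vals = sorted(values)
    pos = q * (len(vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(vals) - 1)
    return vals[lo] + (pos - lo) * (vals[hi] - vals[lo])

def impute_context(score_context_by_tag):
    # score_context_by_tag: tag -> score, or None when the tag has no TF-IDF row.
    present = [s for s in score_context_by_tag.values() if s is not None]
    default = quantile(present, 0.10) if present else 0.0
    out, imputed = {}, set()
    for tag, s in score_context_by_tag.items():
        if s is None:
            out[tag] = default
            imputed.add(tag)  # surfaced as context_imputed=True in verbose reports
        else:
            out[tag] = s
    return out, imputed

out, imputed = impute_context({"a": 0.5, "b": 0.1, "c": None})
# default = 10th percentile of [0.1, 0.5] = 0.14; out["c"] == 0.14
```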

Score Fusion

Per phrase candidate:

  • if no context vector:
    • score_combined = score_fasttext
  • else:
    • score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context
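The fusion rule is a straightforward convex combination; a minimal sketch:

```python
def fuse(score_fasttext, score_context, context_weight=0.5, has_context=True):
    # Without a context vector, the FastText score passes through unchanged.
    if not has_context:
        return score_fasttext
    return (1 - context_weight) * score_fasttext + context_weight * score_context

fuse(0.8, 0.4, context_weight=0.5)
# -> 0.6
```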

Per-Phrase Truncation and Required Tags

Per phrase:

  • sort by score_combined desc
  • keep top per_phrase_final_k.

Required tags:

  • tags from projected_lookup for that phrase.
  • each required tag must be present in that phrase's final list.
  • if insertion is needed and list is full, evict lowest-ranked non-required item (or last item if all are required).
  • re-sort by combined score after enforcement.
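The truncation-plus-enforcement logic can be sketched as below. The contract implies required tags are always present in the candidate map (they were injected with score 1.0); the .get fallback is defensive only:

```python
def enforce_required(ranked, required, k):
    # ranked: [(tag, score_combined), ...] sorted desc; required: tags that must survive.
    kept = ranked[:k]
    present = {t for t, _ in kept}
    by_tag = dict(ranked)
    for tag in required:
        if tag in present:
            continue
        if len(kept) >= k:
            # Evict the lowest-ranked non-required item, or the last item
            # if everything kept is required.
            evict_idx = len(kept) - 1
            for i in range(len(kept) - 1, -1, -1):
                if kept[i][0] not in required:
                    evict_idx = i
                    break
            kept.pop(evict_idx)
        kept.append((tag, by_tag.get(tag, 0.0)))
        present.add(tag)
    # Re-sort by combined score after enforcement.
    kept.sort(key=lambda ts: ts[1], reverse=True)
    return kept

enforce_required([("a", 0.9), ("b", 0.8), ("c", 0.7)], required={"c"}, k=2)
# -> [("a", 0.9), ("c", 0.7)]
```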

Merge Across Phrases

All per-phrase survivors are merged into a global map by canonical tag:

  • sources: union of phrases that contributed this tag
  • score_fasttext: max observed across phrases
  • score_context: max observed across phrases (None-aware)
  • score_combined: max observed across phrases
  • count: from corpus tag counts

Then:

  • sort by score_combined desc
  • cut to global_k
  • return ordered List[Candidate].
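The merge step can be sketched with plain dicts standing in for Candidate objects (field names match the output shape below; the row format is illustrative):

```python
def merge_candidates(per_phrase, tag_counts, global_k=300):
    # per_phrase: phrase -> list of candidate rows surviving per-phrase truncation.
    merged = {}
    for phrase, rows in per_phrase.items():
        for row in rows:
            tag = row["tag"]
            m = merged.setdefault(tag, {
                "tag": tag, "sources": [], "score_fasttext": None,
                "score_context": None, "score_combined": float("-inf"),
                "count": tag_counts.get(tag),
            })
            m["sources"].append(phrase)  # union of contributing phrases
            for key in ("score_fasttext", "score_context"):
                s = row.get(key)
                # None-aware max across phrases.
                if s is not None and (m[key] is None or s > m[key]):
                    m[key] = s
            m["score_combined"] = max(m["score_combined"], row["score_combined"])
    # Sort by combined score, cut to global_k.
    ranked = sorted(merged.values(), key=lambda c: c["score_combined"], reverse=True)
    return ranked[:global_k]
```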

Output Shapes

Candidate fields:

  • tag: str
  • score_combined: float
  • score_fasttext: Optional[float]
  • score_context: Optional[float]
  • count: Optional[int]
  • sources: List[str]
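The field list corresponds to a record shape like the following (a dataclass sketch; the real Candidate type in psq_retrieval.py may be defined differently):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Candidate:
    tag: str
    score_combined: float
    score_fasttext: Optional[float] = None
    score_context: Optional[float] = None
    count: Optional[int] = None
    sources: List[str] = field(default_factory=list)
```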

Return variants:

  • default: List[Candidate]
  • verbose: (List[Candidate], phrase_reports)
  • return_phrase_ranks only: (List[Candidate], phrase_rank_by_tag)
  • verbose + return_phrase_ranks: (List[Candidate], phrase_reports, phrase_rank_by_tag)

phrase_reports rows include:

  • phrase, normalized, lookup, tfidf_vocab, oov_terms
  • candidate rows with tag, alias_token, score_fasttext, score_context, score_combined, context_imputed, count.

Policy Filtering

NSFW filtering uses get_nsfw_tags() from psq_rag.retrieval.state. Current source is word_rating_probabilities.csv with threshold 0.95.
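A sketch of threshold-based filtering over such a CSV. The column names here are assumptions for illustration; only the filename and the 0.95 threshold come from the contract:

```python
import csv
import io

def load_nsfw_tags(csv_text, threshold=0.95):
    # Assumed columns "tag" and "probability"; the real CSV schema may differ.
    nsfw = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["probability"]) >= threshold:
            nsfw.add(row["tag"])
    return nsfw

sample = "tag,probability\nfoo,0.99\nbar,0.10\n"
load_nsfw_tags(sample)
# -> {"foo"}
```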


Stage Boundary

Stage 2 is retrieval grounding only. Final tag choice happens in Stage 3 closed-set selection.