Retrieval Contract - Stage 2 (Candidate Generation / Grounding)

Stage 2 builds a high-recall candidate pool from a fixed canonical tag vocabulary. It does not invent tags and does not make final selection decisions.

Primary implementation: psq_rag/retrieval/psq_retrieval.py::psq_candidates_from_rewrite_phrases.


Inputs

  • rewrite_phrases: Sequence[str]
  • allow_nsfw_tags: bool
  • context_tags: Optional[Sequence[str]] = None
  • context_tag_weight: float = 1.0
  • context_weight: float = 0.5
  • per_phrase_k: int = 50
  • per_phrase_final_k: int = 1
  • global_k: int = 300
  • min_tag_count: int = 0
  • verbose: bool = False
  • return_phrase_ranks: bool = False

Notes:

  • In the app runtime, these are usually overridden by env-backed settings:
    • PSQ_RETRIEVAL_GLOBAL_K (default 300)
    • PSQ_RETRIEVAL_PER_PHRASE_K (default 10)
    • PSQ_RETRIEVAL_PER_PHRASE_FINAL_K (default 1)
    • PSQ_MIN_TAG_COUNT (default 100 in app path)
  • Stage 2 may be called with structural/probe tags in context_tags to improve context scoring.
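The env-backed override pattern can be sketched as follows (the helper name env_int is illustrative; only the variable names and defaults above are from the contract):

```python
import os

def env_int(name, default):
    # Read an integer setting from the environment, falling back to the default.
    raw = os.environ.get(name)
    return default if raw in (None, "") else int(raw)

# App-path defaults per the contract above.
settings = {
    "global_k": env_int("PSQ_RETRIEVAL_GLOBAL_K", 300),
    "per_phrase_k": env_int("PSQ_RETRIEVAL_PER_PHRASE_K", 10),
    "per_phrase_final_k": env_int("PSQ_RETRIEVAL_PER_PHRASE_FINAL_K", 1),
    "min_tag_count": env_int("PSQ_MIN_TAG_COUNT", 100),
}
```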

Phrase Normalization

Each incoming phrase is normalized to:

  • lowercase
  • underscores converted to spaces
  • trimmed and whitespace-collapsed

Then phrases are deduplicated in first-seen order.
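A minimal sketch of the normalization and dedupe steps (function names here are illustrative, not the actual implementation):

```python
import re

def normalize_phrase(phrase: str) -> str:
    # Lowercase, underscores -> spaces, trim, collapse internal whitespace.
    s = phrase.lower().replace("_", " ")
    return re.sub(r"\s+", " ", s).strip()

def dedupe_preserving_order(phrases):
    # Deduplicate in first-seen order.
    seen, out = set(), []
    for p in phrases:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out

normalized = dedupe_preserving_order(
    normalize_phrase(p) for p in ["Long_Hair ", "long  hair", "Red Dress"]
)
# normalized == ["long hair", "red dress"]
```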


Head-Term Expansion

For each normalized multi-token phrase, Stage 2 may add the last token ("head") as an extra phrase.

A head is added only when:

  • phrase has at least 2 tokens
  • head length is at least 3
  • head is not in the built-in stopword list (and, or, the, a, an, of, to, in, on, ...).

Final phrase list = deduped original phrases + valid heads (deduped again, order-preserving).
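The head-extraction rules above can be sketched like this (the stopword set is truncated to the tokens listed in the contract; the real list is longer):

```python
# Illustrative subset of the built-in stopword list.
STOPWORDS = {"and", "or", "the", "a", "an", "of", "to", "in", "on"}

def head_terms(phrases):
    heads = []
    for phrase in phrases:
        tokens = phrase.split()
        if len(tokens) < 2:
            continue  # single-token phrases get no head
        head = tokens[-1]
        if len(head) >= 3 and head not in STOPWORDS:
            heads.append(head)
    return heads

def expand_with_heads(phrases):
    # Final list = deduped originals + valid heads, order-preserving.
    seen, out = set(), []
    for p in list(phrases) + head_terms(phrases):
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out

expand_with_heads(["long hair", "red dress", "top of"])
# -> ["long hair", "red dress", "top of", "hair", "dress"]
```

Note that "top of" contributes no head: "of" fails both the length and stopword checks.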


Canonical Projection Rules

All candidate generation is projected into canonical tags via _project_to_canonicals(token):

  1. If token itself appears as canonical (token in tag_counts or token has a TF-IDF row), keep token.
  2. Else if token is an alias (token in alias2tags), map to all alias targets.
  3. Else drop token.

min_tag_count filtering is applied during this projection.
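A sketch of the projection rules, assuming tag_counts, tfidf_rows, and alias2tags are plain dict/set lookups (the real structures may differ):

```python
def project_to_canonicals(token, tag_counts, tfidf_rows, alias2tags, min_tag_count=0):
    # Rule 1: the token is itself canonical.
    if token in tag_counts or token in tfidf_rows:
        targets = [token]
    # Rule 2: the token is an alias; map to all alias targets.
    elif token in alias2tags:
        targets = list(alias2tags[token])
    # Rule 3: unknown tokens are dropped.
    else:
        return []
    # min_tag_count filtering is applied during this projection.
    return [t for t in targets if tag_counts.get(t, 0) >= min_tag_count]

tag_counts = {"long_hair": 5000, "red_dress": 40}
project_to_canonicals("longhair", tag_counts, {}, {"longhair": ["long_hair"]},
                      min_tag_count=100)
# -> ["long_hair"]
```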


Per-Phrase Candidate Generation

For each phrase:

  1. Build lookup = phrase.replace(" ", "_").
  2. Compute projected_lookup = _project_to_canonicals(lookup).
  3. FastText neighbors:
    • normally: fasttext_model.most_similar(lookup, topn=per_phrase_k)
    • special case:
      • if PSQ_SKIP_FASTTEXT_FOR_EXACT_ALIAS is enabled (default "1") and projected_lookup is non-empty, neighbor call is skipped for that phrase.
  4. For each neighbor token:
    • project to canonical tags
    • apply NSFW filter if needed
    • keep best FastText similarity per canonical tag
    • keep alias_token that achieved that best similarity.
  5. Exact-match injection:
    • each canonical tag in projected_lookup is injected with score_fasttext = 1.0
    • alias_token = lookup.

This creates a per-phrase map: tag -> score_fasttext (+ best token for debug).
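The per-phrase loop can be sketched as below. The neighbor and projection functions are injected stand-ins for the FastText model and _project_to_canonicals, and the NSFW filter is omitted for brevity:

```python
import os

def per_phrase_candidates(phrase, neighbors_fn, project_fn, per_phrase_k=50):
    lookup = phrase.replace(" ", "_")
    projected_lookup = project_fn(lookup)
    best = {}  # canonical tag -> (best score_fasttext, alias_token that achieved it)
    skip_ft = os.environ.get("PSQ_SKIP_FASTTEXT_FOR_EXACT_ALIAS", "1") == "1"
    # Neighbor call is skipped when the flag is on and the lookup projects exactly.
    if not (skip_ft and projected_lookup):
        for token, sim in neighbors_fn(lookup, per_phrase_k):
            for tag in project_fn(token):
                if tag not in best or sim > best[tag][0]:
                    best[tag] = (sim, token)
    # Exact-match injection: projected lookups get score_fasttext = 1.0.
    for tag in projected_lookup:
        best[tag] = (1.0, lookup)
    return best

fake_neighbors = lambda token, k: [("long_hair", 0.9), ("short_hair", 0.7)]
fake_project = lambda token: [token] if token in {"long_hair", "short_hair"} else []
per_phrase_candidates("blue hair", fake_neighbors, fake_project)
# -> {"long_hair": (0.9, "long_hair"), "short_hair": (0.7, "short_hair")}
```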


Context Vector and Context Scores

Stage 2 builds one request-level pseudo TF-IDF vector from:

  • normalized final phrases (lookup form, underscore)
  • optional context_tags weighted by context_tag_weight.

Vector path:

  • TF-IDF sparse vector -> SVD transform -> L2 normalize.

If zero norm:

  • query_has_context = False
  • all candidates use the FastText-only combined score.

If non-zero:

  • compute cosine vs normalized reduced TF-IDF row vectors for tags that have TF-IDF rows.
  • tags with no TF-IDF row initially get score_context = None.
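The vector path can be sketched in plain Python (the SVD transform is shown as a dense matrix multiply; the real pipeline operates on sparse TF-IDF vectors and pre-normalized reduced rows):

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return None if norm == 0.0 else [x / norm for x in vec]

def context_scores(query_tfidf, svd_components, tag_reduced_rows):
    # SVD transform: reduced[i] = sum_j components[i][j] * query[j].
    reduced = [sum(c * x for c, x in zip(row, query_tfidf)) for row in svd_components]
    q = l2_normalize(reduced)
    if q is None:
        return None  # zero norm -> query_has_context = False
    # Cosine vs normalized reduced TF-IDF row vectors (tags without rows get None).
    return {tag: sum(a * b for a, b in zip(q, row))
            for tag, row in tag_reduced_rows.items()}

identity = [[1.0, 0.0], [0.0, 1.0]]
rows = {"a": [0.6, 0.8], "b": [1.0, 0.0]}
scores = context_scores([3.0, 4.0], identity, rows)
# scores["a"] ≈ 1.0, scores["b"] ≈ 0.6
```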

Missing-Context Imputation

When query_has_context = True, per phrase:

  • gather non-None context scores for candidates in that phrase
  • default_context_for_phrase = 10th percentile (q=0.10) of those scores
    • if none exist, default is 0.0.
  • tags missing context receive this default and are marked context_imputed=True (verbose report only).
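A sketch of the imputation step, with a hand-rolled linear-interpolation quantile in place of whatever percentile routine the implementation actually uses:

```python
def quantile(values, q):
    # Linear-interpolated quantile over a small list.
    vals = sorted(values)
    pos = q * (len(vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(vals) - 1)
    return vals[lo] + (pos - lo) * (vals[hi] - vals[lo])

def impute_context(score_context_by_tag):
    # score_context_by_tag: tag -> score, or None when the tag has no TF-IDF row.
    present = [s for s in score_context_by_tag.values() if s is not None]
    default = quantile(present, 0.10) if present else 0.0
    out, imputed = {}, set()
    for tag, s in score_context_by_tag.items():
        if s is None:
            out[tag] = default
            imputed.add(tag)  # surfaced as context_imputed=True in verbose reports
        else:
            out[tag] = s
    return out, imputed

out, imputed = impute_context({"a": 0.5, "b": 0.1, "c": None})
# default = 10th percentile of [0.1, 0.5] = 0.14; out["c"] == 0.14
```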

Score Fusion

Per phrase candidate:

  • if no context vector:
    • score_combined = score_fasttext
  • else:
    • score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context
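The fusion rule is a straightforward convex combination; a minimal sketch:

```python
def fuse(score_fasttext, score_context, context_weight=0.5, has_context=True):
    # Without a context vector, the FastText score passes through unchanged.
    if not has_context:
        return score_fasttext
    return (1 - context_weight) * score_fasttext + context_weight * score_context

fuse(0.8, 0.4, context_weight=0.5)
# -> 0.6
```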

Per-Phrase Truncation and Required Tags

Per phrase:

  • sort by score_combined desc
  • keep top per_phrase_final_k.

Required tags:

  • tags from projected_lookup for that phrase.
  • each required tag must be present in that phrase's final list.
  • if insertion is needed and list is full, evict lowest-ranked non-required item (or last item if all are required).
  • re-sort by combined score after enforcement.
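The truncation-plus-enforcement logic can be sketched as below. The contract implies required tags are always present in the candidate map (they were injected with score 1.0); the .get fallback is defensive only:

```python
def enforce_required(ranked, required, k):
    # ranked: [(tag, score_combined), ...] sorted desc; required: tags that must survive.
    kept = ranked[:k]
    present = {t for t, _ in kept}
    by_tag = dict(ranked)
    for tag in required:
        if tag in present:
            continue
        if len(kept) >= k:
            # Evict the lowest-ranked non-required item, or the last item
            # if everything kept is required.
            evict_idx = len(kept) - 1
            for i in range(len(kept) - 1, -1, -1):
                if kept[i][0] not in required:
                    evict_idx = i
                    break
            kept.pop(evict_idx)
        kept.append((tag, by_tag.get(tag, 0.0)))
        present.add(tag)
    # Re-sort by combined score after enforcement.
    kept.sort(key=lambda ts: ts[1], reverse=True)
    return kept

enforce_required([("a", 0.9), ("b", 0.8), ("c", 0.7)], required={"c"}, k=2)
# -> [("a", 0.9), ("c", 0.7)]
```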

Merge Across Phrases

All per-phrase survivors are merged into a global map by canonical tag:

  • sources: union of phrases that contributed this tag
  • score_fasttext: max observed across phrases
  • score_context: max observed across phrases (None-aware)
  • score_combined: max observed across phrases
  • count: from corpus tag counts

Then:

  • sort by score_combined desc
  • cut to global_k
  • return ordered List[Candidate].
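The merge step can be sketched with plain dicts standing in for Candidate objects (field names match the output shape below; the row format is illustrative):

```python
def merge_candidates(per_phrase, tag_counts, global_k=300):
    # per_phrase: phrase -> list of candidate rows surviving per-phrase truncation.
    merged = {}
    for phrase, rows in per_phrase.items():
        for row in rows:
            tag = row["tag"]
            m = merged.setdefault(tag, {
                "tag": tag, "sources": [], "score_fasttext": None,
                "score_context": None, "score_combined": float("-inf"),
                "count": tag_counts.get(tag),
            })
            m["sources"].append(phrase)  # union of contributing phrases
            for key in ("score_fasttext", "score_context"):
                s = row.get(key)
                # None-aware max across phrases.
                if s is not None and (m[key] is None or s > m[key]):
                    m[key] = s
            m["score_combined"] = max(m["score_combined"], row["score_combined"])
    # Sort by combined score, cut to global_k.
    ranked = sorted(merged.values(), key=lambda c: c["score_combined"], reverse=True)
    return ranked[:global_k]
```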

Output Shapes

Candidate fields:

  • tag: str
  • score_combined: float
  • score_fasttext: Optional[float]
  • score_context: Optional[float]
  • count: Optional[int]
  • sources: List[str]
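The field list corresponds to a record shape like the following (a dataclass sketch; the real Candidate type in psq_retrieval.py may be defined differently):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Candidate:
    tag: str
    score_combined: float
    score_fasttext: Optional[float] = None
    score_context: Optional[float] = None
    count: Optional[int] = None
    sources: List[str] = field(default_factory=list)
```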

Return variants:

  • default: List[Candidate]
  • verbose: (List[Candidate], phrase_reports)
  • return_phrase_ranks only: (List[Candidate], phrase_rank_by_tag)
  • verbose + return_phrase_ranks: (List[Candidate], phrase_reports, phrase_rank_by_tag)

phrase_reports rows include:

  • phrase, normalized, lookup, tfidf_vocab, oov_terms
  • candidate rows with tag, alias_token, score_fasttext, score_context, score_combined, context_imputed, count.

Policy Filtering

NSFW filtering uses get_nsfw_tags() from psq_rag.retrieval.state. Current source is word_rating_probabilities.csv with threshold 0.95.
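A sketch of threshold-based filtering over such a CSV. The column names here are assumptions for illustration; only the filename and the 0.95 threshold come from the contract:

```python
import csv
import io

def load_nsfw_tags(csv_text, threshold=0.95):
    # Assumed columns "tag" and "probability"; the real CSV schema may differ.
    nsfw = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["probability"]) >= threshold:
            nsfw.add(row["tag"])
    return nsfw

sample = "tag,probability\nfoo,0.99\nbar,0.10\n"
load_nsfw_tags(sample)
# -> {"foo"}
```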


Stage Boundary

Stage 2 is retrieval grounding only. Final tag choice happens in Stage 3 closed-set selection.