Prompt_Squirrel_RAG / docs /retrieval_contract.md

Food Desert

Refresh pipeline contract docs to match current runtime behavior

57b7339 12 days ago


	# Retrieval Contract - Stage 2 (Candidate Generation / Grounding)

	Stage 2 builds a high-recall candidate pool from a fixed canonical tag vocabulary.
	It does not invent tags and does not make final selection decisions.

	Primary implementation: `psq_rag/retrieval/psq_retrieval.py::psq_candidates_from_rewrite_phrases`.

	---

	## Inputs

	- `rewrite_phrases: Sequence[str]`
	- `allow_nsfw_tags: bool`
	- `context_tags: Optional[Sequence[str]] = None`
	- `context_tag_weight: float = 1.0`
	- `context_weight: float = 0.5`
	- `per_phrase_k: int = 50`
	- `per_phrase_final_k: int = 1`
	- `global_k: int = 300`
	- `min_tag_count: int = 0`
	- `verbose: bool = False`
	- `return_phrase_ranks: bool = False`

	Notes:
	- In app runtime, the function defaults above are usually overridden by env-backed settings:
	  - `PSQ_RETRIEVAL_GLOBAL_K` (default 300)
	  - `PSQ_RETRIEVAL_PER_PHRASE_K` (default 10)
	  - `PSQ_RETRIEVAL_PER_PHRASE_FINAL_K` (default 1)
	  - `PSQ_MIN_TAG_COUNT` (default 100 in app path)
	- Stage 2 may be called with structural/probe tags in `context_tags` to improve context scoring.
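As a rough illustration of the env-backed overrides listed above, the sketch below reads the four variables with their documented app defaults. The helper names `env_int` and `retrieval_settings` are hypothetical, and falling back to the default on an unparsable value is an assumption, not confirmed behavior of the app.

```python
import os
from typing import Dict, Mapping, Optional


def env_int(name: str, default: int, environ: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer setting, falling back to `default` when unset or invalid."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return int(raw)
    except ValueError:
        # Assumption: invalid values fall back to the default instead of raising.
        return default


def retrieval_settings(environ: Optional[Mapping[str, str]] = None) -> Dict[str, int]:
    """Collect the Stage 2 knobs using the app-path defaults from the notes above."""
    return {
        "global_k": env_int("PSQ_RETRIEVAL_GLOBAL_K", 300, environ),
        "per_phrase_k": env_int("PSQ_RETRIEVAL_PER_PHRASE_K", 10, environ),
        "per_phrase_final_k": env_int("PSQ_RETRIEVAL_PER_PHRASE_FINAL_K", 1, environ),
        "min_tag_count": env_int("PSQ_MIN_TAG_COUNT", 100, environ),
    }
```

Passing an explicit mapping instead of relying on `os.environ` keeps the sketch easy to exercise in isolation.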

	---

	## Phrase Normalization

	Each incoming phrase is normalized to:
	- lowercase
	- underscores converted to spaces
	- trimmed and whitespace-collapsed

	Then phrases are deduplicated in first-seen order.
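A minimal sketch of the normalization and deduplication rules above. The function names are illustrative, and dropping phrases that normalize to an empty string is an assumption the contract does not state.

```python
from typing import Iterable, List


def normalize_phrase(phrase: str) -> str:
    """Lowercase, turn underscores into spaces, trim and collapse whitespace."""
    return " ".join(phrase.lower().replace("_", " ").split())


def normalize_phrases(phrases: Iterable[str]) -> List[str]:
    """Normalize every phrase, then deduplicate in first-seen order."""
    normalized = (normalize_phrase(p) for p in phrases)
    # Assumption: phrases that normalize to the empty string are dropped.
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys(p for p in normalized if p))
```

For example, `"  Red_Fox "` and `"red  fox"` both normalize to `"red fox"`, so only the first survives.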

	---

	## Head-Term Expansion

	For each normalized multi-token phrase, Stage 2 may add the last token ("head") as an extra phrase.

	A head is added only when:
	- phrase has at least 2 tokens
	- head length is at least 3
	- head is not in the built-in stopword list (`and`, `or`, `the`, `a`, `an`, `of`, `to`, `in`, `on`, ...).

	Final phrase list = deduped original phrases + valid heads (deduped again, order-preserving).
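The head-term rules can be sketched as follows. `STOPWORDS` here contains only the words named above (the built-in list is longer, as the trailing `...` indicates), and `expand_heads` is a hypothetical name.

```python
from typing import List, Sequence

# Only the stopwords spelled out in this contract; the real list has more entries.
STOPWORDS = frozenset({"and", "or", "the", "a", "an", "of", "to", "in", "on"})


def expand_heads(phrases: Sequence[str]) -> List[str]:
    """Append valid last-token heads to already normalized, deduplicated phrases."""
    heads = []
    for phrase in phrases:
        tokens = phrase.split()
        if len(tokens) < 2:
            continue  # single-token phrases have no separate head
        head = tokens[-1]
        if len(head) >= 3 and head not in STOPWORDS:
            heads.append(head)
    # Originals first, then heads; dedupe again while preserving order.
    return list(dict.fromkeys(list(phrases) + heads))
```

With `["blue sky", "night sky"]` as input, the shared head `sky` is appended once at the end.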

	---

	## Canonical Projection Rules

	All candidate generation is projected into canonical tags via `_project_to_canonicals(token)`:

	1. If the token is itself canonical (`token in tag_counts`, or the token has a TF-IDF row), keep the token.
	2. Else if the token is an alias (`token in alias2tags`), map it to all of its alias targets.
	3. Else drop the token.

	`min_tag_count` filtering is applied during this projection.
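The projection rules above might look roughly like this. The real `_project_to_canonicals` takes only `token` and reads shared state; the explicit parameters here (`tag_counts`, `alias2tags`, `tfidf_tags`) are stand-ins for that state, and treating a tag with no recorded count as count 0 is an assumption.

```python
from typing import Collection, List, Mapping, Sequence


def project_to_canonicals(
    token: str,
    tag_counts: Mapping[str, int],
    alias2tags: Mapping[str, Sequence[str]],
    tfidf_tags: Collection[str],
    min_tag_count: int = 0,
) -> List[str]:
    """Map a token to canonical tags: itself, its alias targets, or nothing."""
    if token in tag_counts or token in tfidf_tags:
        targets = [token]
    elif token in alias2tags:
        targets = list(alias2tags[token])
    else:
        return []
    # Assumption: a tag without a recorded count is treated as count 0.
    return [t for t in targets if tag_counts.get(t, 0) >= min_tag_count]
```

Because the count filter runs inside the projection, an alias can lose some of its targets while keeping others.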

	---

	## Per-Phrase Candidate Generation

	For each phrase:

	1. Build `lookup = phrase.replace(" ", "_")`.
	2. Compute `projected_lookup = _project_to_canonicals(lookup)`.
	3. FastText neighbors:
	   - normally: `fasttext_model.most_similar(lookup, topn=per_phrase_k)`
	   - special case: if `PSQ_SKIP_FASTTEXT_FOR_EXACT_ALIAS` is enabled (default `"1"`) and `projected_lookup` is non-empty, the neighbor call is skipped for that phrase.
	4. For each neighbor token:
	   - project to canonical tags
	   - apply NSFW filter if needed
	   - keep best FastText similarity per canonical tag
	   - keep `alias_token` that achieved that best similarity.
	5. Exact-match injection:
	   - each canonical tag in `projected_lookup` is injected with `score_fasttext = 1.0`
	   - `alias_token = lookup`.

	This creates a per-phrase map: `tag -> score_fasttext` (+ best token for debug).
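The five steps above can be sketched with the projection and the neighbor lookup passed in as callables, so the example runs without a FastText model. `candidates_for_phrase`, `project`, and `neighbors` are hypothetical names, and applying the NSFW filter to exact-match injections as well is an assumption.

```python
from typing import Callable, Collection, Dict, List, Sequence, Tuple


def candidates_for_phrase(
    phrase: str,
    project: Callable[[str], Sequence[str]],
    neighbors: Callable[[str, int], List[Tuple[str, float]]],
    nsfw_tags: Collection[str],
    allow_nsfw_tags: bool,
    per_phrase_k: int = 50,
    skip_neighbors_on_exact: bool = True,
) -> Dict[str, Tuple[float, str]]:
    """Return tag -> (score_fasttext, alias_token) for one normalized phrase."""
    lookup = phrase.replace(" ", "_")
    projected_lookup = list(project(lookup))
    best: Dict[str, Tuple[float, str]] = {}

    def allowed(tag: str) -> bool:
        return allow_nsfw_tags or tag not in nsfw_tags

    # Step 3: skip the neighbor call when the lookup already resolves exactly.
    if not (skip_neighbors_on_exact and projected_lookup):
        for token, similarity in neighbors(lookup, per_phrase_k):
            # Step 4: project, filter, keep the best similarity per canonical tag.
            for tag in project(token):
                if allowed(tag) and (tag not in best or similarity > best[tag][0]):
                    best[tag] = (similarity, token)

    # Step 5: exact matches are injected with score 1.0.
    for tag in projected_lookup:
        if allowed(tag):  # assumption: the policy filter also covers exact matches
            best[tag] = (1.0, lookup)
    return best
```

In the real pipeline the `neighbors` callable would wrap `fasttext_model.most_similar(lookup, topn=per_phrase_k)`.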

	---

	## Context Vector and Context Scores

	Stage 2 builds one request-level pseudo TF-IDF vector from:
	- normalized final phrases (lookup form, underscore)
	- optional `context_tags` weighted by `context_tag_weight`.

	Vector path:
	- TF-IDF sparse vector -> SVD transform -> L2 normalize.

	If zero norm:
	- `query_has_context = False`
	- all candidates use FastText-only combined score.

	If non-zero:
	- compute cosine vs normalized reduced TF-IDF row vectors for tags that have TF-IDF rows.
	- tags with no TF-IDF row initially get `score_context = None`.
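A small sketch of the scoring half of this section, assuming the TF-IDF and SVD steps have already produced a reduced query vector and that the per-tag rows in `tag_rows` are already L2-normalized. The function names and the plain-list vector representation are illustrative only.

```python
import math
from typing import Dict, List, Mapping, Optional, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> Optional[List[float]]:
    """Return the unit-length vector, or None when the norm is zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return None
    return [x / norm for x in vec]


def context_scores(
    reduced_query: Sequence[float],
    tag_rows: Mapping[str, Sequence[float]],
    candidate_tags: Sequence[str],
) -> Tuple[bool, Dict[str, Optional[float]]]:
    """Return (query_has_context, tag -> cosine score or None)."""
    query = l2_normalize(reduced_query)
    if query is None:
        # Zero norm: no context signal, every candidate stays FastText-only.
        return False, {tag: None for tag in candidate_tags}
    scores: Dict[str, Optional[float]] = {}
    for tag in candidate_tags:
        row = tag_rows.get(tag)
        # With unit vectors on both sides, the dot product is the cosine.
        scores[tag] = None if row is None else sum(a * b for a, b in zip(query, row))
    return True, scores
```

Tags without a row keep `None` at this point; the next section describes how that gap is filled.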

	---

	## Missing-Context Imputation

	When `query_has_context = True`, per phrase:
	- gather non-None context scores for candidates in that phrase
	- `default_context_for_phrase` = 10th percentile (q=0.10) of those scores
	- if none exist, default is `0.0`.
	- tags missing context receive this default and are marked `context_imputed=True` (verbose report only).
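The imputation rule above could be sketched like this. The contract gives only `q=0.10`; the linear interpolation between order statistics used in `percentile_linear` is an assumption about how the percentile is computed, and both function names are hypothetical.

```python
import math
from typing import Dict, Mapping, Optional, Sequence, Tuple


def percentile_linear(values: Sequence[float], q: float) -> float:
    """q-quantile with linear interpolation between neighboring order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    low, high = math.floor(position), math.ceil(position)
    fraction = position - low
    return ordered[low] + (ordered[high] - ordered[low]) * fraction


def impute_missing_context(
    scores: Mapping[str, Optional[float]], q: float = 0.10
) -> Tuple[Dict[str, float], Dict[str, bool]]:
    """Fill None context scores for one phrase; also report which tags were imputed."""
    known = [s for s in scores.values() if s is not None]
    default_context_for_phrase = percentile_linear(known, q) if known else 0.0
    filled: Dict[str, float] = {}
    context_imputed: Dict[str, bool] = {}
    for tag, score in scores.items():
        context_imputed[tag] = score is None
        filled[tag] = default_context_for_phrase if score is None else score
    return filled, context_imputed
```

Using a low percentile rather than the mean means a tag with no TF-IDF row is scored like one of the weaker context matches in its phrase.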

	---

	## Score Fusion

	Per phrase candidate:

	- if no context vector:
	  - `score_combined = score_fasttext`
	- else:
	  - `score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context`
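The fusion rule is a plain linear blend. A minimal sketch, with `fuse_scores` as a hypothetical name and the assumption that `score_context` has already been imputed whenever a context vector exists:

```python
from typing import Optional


def fuse_scores(
    score_fasttext: float,
    score_context: Optional[float],
    query_has_context: bool,
    context_weight: float = 0.5,
) -> float:
    """Blend FastText and context scores; FastText-only when there is no context."""
    if not query_has_context:
        return score_fasttext
    # Assumption: score_context is never None here because missing values
    # were imputed beforehand.
    return (1 - context_weight) * score_fasttext + context_weight * score_context
```

With the default `context_weight = 0.5`, a candidate with `score_fasttext = 0.8` and `score_context = 0.4` lands at about `0.6`; setting `context_weight = 0.0` reproduces the FastText-only ranking.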

	---

	## Per-Phrase Truncation and Required Tags

	Per phrase:
	- sort by `score_combined` desc
	- keep top `per_phrase_final_k`.

	Required tags:
	- tags from `projected_lookup` for that phrase.
	- each required tag must be present in that phrase's final list.
	- if insertion is needed and list is full, evict lowest-ranked non-required item (or last item if all are required).
	- re-sort by combined score after enforcement.
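One way to read the truncation and required-tag rules is sketched below. `truncate_with_required` is a hypothetical name, the input is assumed to be a `tag -> score_combined` map for one phrase, and required tags absent from that map (for example, removed by the NSFW filter) are assumed to be skipped.

```python
from typing import Dict, List, Sequence


def truncate_with_required(
    scored: Dict[str, float], required: Sequence[str], final_k: int
) -> List[str]:
    """Keep the top `final_k` tags by combined score, then force in required tags."""

    def by_score(tags):
        return sorted(tags, key=lambda t: scored[t], reverse=True)

    kept = by_score(scored)[:final_k]
    required_set = set(required)
    for tag in required:
        if tag in kept or tag not in scored:
            continue
        if kept and len(kept) >= final_k:
            # Evict the lowest-ranked non-required item, or the last item
            # when everything currently kept is required.
            non_required = [t for t in kept if t not in required_set]
            victim = non_required[-1] if non_required else kept[-1]
            kept.remove(victim)
        kept.append(tag)
        kept = by_score(kept)  # re-sort after enforcement
    return kept
```

This matters most at the default `per_phrase_final_k = 1`: an exact match whose combined score was pulled down by context still displaces a higher-scoring neighbor.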

	---

	## Merge Across Phrases

	All per-phrase survivors are merged into a global map by canonical tag:

	- `sources`: union of phrases that contributed this tag
	- `score_fasttext`: max observed across phrases
	- `score_context`: max observed across phrases (None-aware)
	- `score_combined`: max observed across phrases
	- `count`: from corpus tag counts

	Then:
	- sort by `score_combined` desc
	- cut to `global_k`
	- return ordered `List[Candidate]`.
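The merge step might be sketched as follows, using plain dicts in place of `Candidate` objects. The `Row` tuple layout and the function names are illustrative, and `sources` is kept as a first-seen ordered list, which is an assumption about how the union is stored.

```python
from typing import Dict, List, Mapping, Optional, Sequence, Tuple

# (tag, score_fasttext, score_context, score_combined) for one phrase survivor
Row = Tuple[str, float, Optional[float], float]


def none_aware_max(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """max() that treats None as 'no value' rather than failing."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def merge_across_phrases(
    per_phrase: Mapping[str, Sequence[Row]],
    tag_counts: Mapping[str, int],
    global_k: int,
) -> List[dict]:
    """Merge per-phrase survivors by tag, sort by combined score, cut to global_k."""
    merged: Dict[str, dict] = {}
    for phrase, rows in per_phrase.items():
        for tag, score_fasttext, score_context, score_combined in rows:
            entry = merged.setdefault(tag, {
                "tag": tag,
                "score_fasttext": None,
                "score_context": None,
                "score_combined": score_combined,
                "count": tag_counts.get(tag),
                "sources": [],
            })
            entry["score_fasttext"] = none_aware_max(entry["score_fasttext"], score_fasttext)
            entry["score_context"] = none_aware_max(entry["score_context"], score_context)
            entry["score_combined"] = max(entry["score_combined"], score_combined)
            if phrase not in entry["sources"]:
                entry["sources"].append(phrase)
    ranked = sorted(merged.values(), key=lambda e: e["score_combined"], reverse=True)
    return ranked[:global_k]
```

Taking the max independently per field means the merged `score_combined` is not necessarily the fusion of the merged `score_fasttext` and `score_context`; each field reports its own best observation.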

	---

	## Output Shapes

	`Candidate` fields:
	- `tag: str`
	- `score_combined: float`
	- `score_fasttext: Optional[float]`
	- `score_context: Optional[float]`
	- `count: Optional[int]`
	- `sources: List[str]`

	Return variants:
	- default: `List[Candidate]`
	- verbose: `(List[Candidate], phrase_reports)`
	- return_phrase_ranks only: `(List[Candidate], phrase_rank_by_tag)`
	- verbose + return_phrase_ranks: `(List[Candidate], phrase_reports, phrase_rank_by_tag)`

	`phrase_reports` rows include:
	- `phrase`, `normalized`, `lookup`, `tfidf_vocab`, `oov_terms`
	- candidate rows with `tag`, `alias_token`, `score_fasttext`, `score_context`, `score_combined`, `context_imputed`, `count`.
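For orientation, the shapes above can be mirrored in a short sketch. Modeling `Candidate` as a dataclass and the helper `pack_result` are assumptions made for illustration; only the field names, types, and tuple orderings come from this contract.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Candidate:
    tag: str
    score_combined: float
    score_fasttext: Optional[float] = None
    score_context: Optional[float] = None
    count: Optional[int] = None
    sources: List[str] = field(default_factory=list)


def pack_result(
    candidates: List[Candidate],
    phrase_reports: List[Dict[str, Any]],
    phrase_rank_by_tag: Dict[str, Any],
    verbose: bool = False,
    return_phrase_ranks: bool = False,
):
    """Choose the return variant from the two flags, in the documented order."""
    if verbose and return_phrase_ranks:
        return candidates, phrase_reports, phrase_rank_by_tag
    if verbose:
        return candidates, phrase_reports
    if return_phrase_ranks:
        return candidates, phrase_rank_by_tag
    return candidates
```

Because the return arity depends on the flags, callers need to unpack according to the flags they passed, for example `candidates, reports = ...` only when `verbose=True` and `return_phrase_ranks=False`.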

	---

	## Policy Filtering

	NSFW filtering uses `get_nsfw_tags()` from `psq_rag.retrieval.state`.
	Current source is `word_rating_probabilities.csv` with threshold `0.95`.

	---

	## Stage Boundary

	Stage 2 is retrieval grounding only.
	Final tag choice happens in Stage 3 closed-set selection.