# Retrieval Contract - Stage 2 (Candidate Generation / Grounding)

Stage 2 builds a high-recall candidate pool from a fixed canonical tag vocabulary.
It does not invent tags and does not make final selection decisions.

Primary implementation: `psq_rag/retrieval/psq_retrieval.py::psq_candidates_from_rewrite_phrases`.

---

## Inputs

- `rewrite_phrases: Sequence[str]`
- `allow_nsfw_tags: bool`
- `context_tags: Optional[Sequence[str]] = None`
- `context_tag_weight: float = 1.0`
- `context_weight: float = 0.5`
- `per_phrase_k: int = 50`
- `per_phrase_final_k: int = 1`
- `global_k: int = 300`
- `min_tag_count: int = 0`
- `verbose: bool = False`
- `return_phrase_ranks: bool = False`

Notes:

- At app runtime, these are usually overridden by env-backed settings:
  - `PSQ_RETRIEVAL_GLOBAL_K` (default 300)
  - `PSQ_RETRIEVAL_PER_PHRASE_K` (default 10)
  - `PSQ_RETRIEVAL_PER_PHRASE_FINAL_K` (default 1)
  - `PSQ_MIN_TAG_COUNT` (default 100 in the app path)
- Stage 2 may be called with structural/probe tags in `context_tags` to improve context scoring.

---

## Phrase Normalization

Each incoming phrase is normalized to:

- lowercase
- underscores converted to spaces
- trimmed and whitespace-collapsed

Then phrases are deduplicated in first-seen order.
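The normalization and dedup steps can be sketched as follows (the helper names are illustrative, not the module's actual functions):

```python
import re

def normalize_phrase(phrase: str) -> str:
    """Lowercase, convert underscores to spaces, trim, collapse whitespace."""
    p = phrase.lower().replace("_", " ")
    return re.sub(r"\s+", " ", p).strip()

def dedupe_first_seen(phrases):
    """Deduplicate while preserving first-seen order."""
    seen, out = set(), []
    for p in phrases:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out
```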
---

## Head-Term Expansion

For each normalized multi-token phrase, Stage 2 may add the phrase's last token (its "head") as an extra phrase.

A head is added only when:

- the phrase has at least 2 tokens
- the head is at least 3 characters long
- the head is not in the built-in stopword list (`and`, `or`, `the`, `a`, `an`, `of`, `to`, `in`, `on`, ...).

Final phrase list = deduped original phrases + valid heads (deduped again, order-preserving).
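A minimal sketch of the expansion rule, assuming a shortened stopword set (the built-in list is longer):

```python
STOPWORDS = {"and", "or", "the", "a", "an", "of", "to", "in", "on"}  # abbreviated

def expand_heads(phrases):
    """Append the last token of each multi-token phrase when it qualifies,
    then dedupe order-preserving."""
    out = list(phrases)
    for p in phrases:
        tokens = p.split()
        head = tokens[-1]
        if len(tokens) >= 2 and len(head) >= 3 and head not in STOPWORDS:
            out.append(head)
    seen, final = set(), []
    for p in out:
        if p not in seen:
            seen.add(p)
            final.append(p)
    return final
```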
---

## Canonical Projection Rules

All candidate generation is projected into canonical tags via `_project_to_canonicals(token)`:

1. If the token itself is canonical (`token in tag_counts`, or the token has a TF-IDF row), keep the token.
2. Else, if the token is an alias (`token in alias2tags`), map it to all alias targets.
3. Else, drop the token.

`min_tag_count` filtering is applied during this projection.
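A sketch of the projection under simplifying assumptions: the "is canonical" test is collapsed into `tag_counts` membership (the real rule also accepts tokens that only have a TF-IDF row), and the count floor is applied to the projected targets:

```python
def project_to_canonicals(token, tag_counts, alias2tags, min_tag_count=0):
    """Map a lookup token to canonical tags; drop targets below the count floor."""
    if token in tag_counts:
        targets = [token]
    elif token in alias2tags:
        targets = list(alias2tags[token])
    else:
        return []  # neither canonical nor alias: token is dropped
    return [t for t in targets if tag_counts.get(t, 0) >= min_tag_count]
```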
---

## Per-Phrase Candidate Generation

For each phrase:

1. Build `lookup = phrase.replace(" ", "_")`.
2. Compute `projected_lookup = _project_to_canonicals(lookup)`.
3. FastText neighbors:
   - normally: `fasttext_model.most_similar(lookup, topn=per_phrase_k)`
   - special case: if `PSQ_SKIP_FASTTEXT_FOR_EXACT_ALIAS` is enabled (default `"1"`) **and** `projected_lookup` is non-empty, the neighbor call is skipped for that phrase.
4. For each neighbor token:
   - project it to canonical tags
   - apply the NSFW filter if needed
   - keep the best FastText similarity per canonical tag
   - keep the `alias_token` that achieved that best similarity.
5. Exact-match injection:
   - each canonical tag in `projected_lookup` is injected with `score_fasttext = 1.0`
   - `alias_token = lookup`.

This produces a per-phrase map: `tag -> score_fasttext` (plus the best token, for debugging).
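Steps 3-5 can be sketched like this, with the FastText call and the projection abstracted into plain arguments (`neighbors` stands in for `most_similar` output; `project` for `_project_to_canonicals`):

```python
def phrase_candidates(phrase, neighbors, project, nsfw_tags=frozenset()):
    """Build {tag: (best fasttext score, alias_token)} for one phrase.

    neighbors: list of (token, similarity) pairs.
    project:   callable mapping a token to a list of canonical tags.
    """
    lookup = phrase.replace(" ", "_")
    best = {}
    for token, sim in neighbors:
        for tag in project(token):
            if tag in nsfw_tags:
                continue
            if tag not in best or sim > best[tag][0]:
                best[tag] = (sim, token)  # keep best similarity + its token
    # exact-match injection: canonical projections of the lookup get score 1.0
    for tag in project(lookup):
        best[tag] = (1.0, lookup)
    return best
```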
---

## Context Vector and Context Scores

Stage 2 builds one request-level pseudo TF-IDF vector from:

- the normalized final phrases (lookup form, underscored)
- optional `context_tags`, weighted by `context_tag_weight`.

Vector path: TF-IDF sparse vector -> SVD transform -> L2 normalize.

If the norm is zero:

- `query_has_context = False`
- all candidates use the FastText-only combined score.

If the norm is non-zero:

- compute cosine similarity against the normalized reduced TF-IDF row vectors of tags that have TF-IDF rows
- tags with no TF-IDF row initially get `score_context = None`.
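The scoring branch can be sketched as below, assuming the TF-IDF -> SVD reduction has already been applied and per-tag row vectors are pre-normalized (both assumptions; the real path computes the reduction inline):

```python
import numpy as np

def context_scores(query_vec, tag_rows):
    """Cosine of the reduced query vector against reduced per-tag rows.

    tag_rows: {tag: L2-normalized reduced row vector}.
    Returns (scores, query_has_context); scores are all None when the query
    vector has zero norm.
    """
    norm = np.linalg.norm(query_vec)
    if norm == 0.0:
        return {tag: None for tag in tag_rows}, False
    q = query_vec / norm  # L2 normalize, so the dot product is the cosine
    return {tag: float(q @ row) for tag, row in tag_rows.items()}, True
```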
---

## Missing-Context Imputation

When `query_has_context = True`, per phrase:

- gather the non-None context scores of that phrase's candidates
- set `default_context_for_phrase` to the 10th percentile (q=0.10) of those scores; if none exist, the default is `0.0`
- tags missing a context score receive this default and are marked `context_imputed=True` (verbose report only).
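A sketch of the imputation for one phrase (function name is illustrative):

```python
import numpy as np

def impute_context(scores, q=0.10):
    """Fill None context scores with the phrase's 10th-percentile score.

    scores: {tag: float | None}. Returns (filled scores, default used).
    """
    present = [s for s in scores.values() if s is not None]
    default = float(np.quantile(present, q)) if present else 0.0
    return {tag: (default if s is None else s) for tag, s in scores.items()}, default
```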
---

## Score Fusion

Per phrase candidate:

- if there is no context vector:
  - `score_combined = score_fasttext`
- else:
  - `score_combined = (1 - context_weight) * score_fasttext + context_weight * score_context`
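The fusion rule as a one-liner (sketch; the real code inlines this):

```python
def combined_score(score_fasttext, score_context, context_weight=0.5, has_context=True):
    """Linear blend of FastText and context scores; FastText-only without context."""
    if not has_context:
        return score_fasttext
    return (1 - context_weight) * score_fasttext + context_weight * score_context
```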
---

## Per-Phrase Truncation and Required Tags

Per phrase:

- sort by `score_combined` descending
- keep the top `per_phrase_final_k`.

Required tags:

- the tags from `projected_lookup` for that phrase
- each required tag must be present in that phrase's final list
- if an insertion is needed and the list is full, evict the lowest-ranked non-required item (or the last item if all are required)
- re-sort by combined score after enforcement.
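The truncation plus required-tag enforcement might look like this (sketch; names are illustrative):

```python
def truncate_with_required(candidates, required, k):
    """Keep top-k tags by combined score, forcing required tags into the list.

    candidates: {tag: score_combined}; required: set of tags that must survive.
    """
    ranked = sorted(candidates, key=lambda t: candidates[t], reverse=True)
    kept = ranked[:k]
    for tag in required:
        if tag in kept:
            continue
        if len(kept) < k:
            kept.append(tag)
        else:
            # evict the lowest-ranked non-required entry (or the last one)
            for i in range(len(kept) - 1, -1, -1):
                if kept[i] not in required:
                    kept.pop(i)
                    break
            else:
                kept.pop()
            kept.append(tag)
    # re-sort by combined score after enforcement
    return sorted(kept, key=lambda t: candidates[t], reverse=True)
```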
---

## Merge Across Phrases

All per-phrase survivors are merged into a global map keyed by canonical tag:

- `sources`: union of the phrases that contributed this tag
- `score_fasttext`: max observed across phrases
- `score_context`: max observed across phrases (None-aware)
- `score_combined`: max observed across phrases
- `count`: from the corpus tag counts

Then:

- sort by `score_combined` descending
- cut to `global_k`
- return the ordered `List[Candidate]`.
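A compact sketch of the merge, using short dict keys in place of the real field names:

```python
def merge_phrases(per_phrase):
    """Merge per-phrase survivors: union sources, None-aware max of scores.

    per_phrase: {phrase: {tag: {"ft": float, "ctx": float | None, "comb": float}}}
    Returns (tag, merged) pairs sorted by combined score, descending.
    """
    merged = {}
    for phrase, tags in per_phrase.items():
        for tag, s in tags.items():
            m = merged.setdefault(
                tag, {"sources": [], "ft": None, "ctx": None, "comb": float("-inf")}
            )
            m["sources"].append(phrase)
            m["ft"] = s["ft"] if m["ft"] is None else max(m["ft"], s["ft"])
            if s["ctx"] is not None:  # None-aware max for context scores
                m["ctx"] = s["ctx"] if m["ctx"] is None else max(m["ctx"], s["ctx"])
            m["comb"] = max(m["comb"], s["comb"])
    return sorted(merged.items(), key=lambda kv: kv[1]["comb"], reverse=True)
```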
---

## Output Shapes

`Candidate` fields:

- `tag: str`
- `score_combined: float`
- `score_fasttext: Optional[float]`
- `score_context: Optional[float]`
- `count: Optional[int]`
- `sources: List[str]`
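As a dataclass, the listed fields would look like this (a sketch of the shape; the actual class definition and defaults live in the module):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Candidate:
    """One Stage 2 output row, matching the fields listed above."""
    tag: str
    score_combined: float
    score_fasttext: Optional[float] = None
    score_context: Optional[float] = None
    count: Optional[int] = None
    sources: List[str] = field(default_factory=list)
```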
Return variants:

- default: `List[Candidate]`
- verbose: `(List[Candidate], phrase_reports)`
- return_phrase_ranks only: `(List[Candidate], phrase_rank_by_tag)`
- verbose + return_phrase_ranks: `(List[Candidate], phrase_reports, phrase_rank_by_tag)`

`phrase_reports` rows include:

- `phrase`, `normalized`, `lookup`, `tfidf_vocab`, `oov_terms`
- candidate rows with `tag`, `alias_token`, `score_fasttext`, `score_context`, `score_combined`, `context_imputed`, `count`.

---

## Policy Filtering

NSFW filtering uses `get_nsfw_tags()` from `psq_rag.retrieval.state`.
Its current source is `word_rating_probabilities.csv` with threshold `0.95`.

---

## Stage Boundary

Stage 2 is retrieval grounding only.
Final tag selection happens in Stage 3 closed-set selection.