# Quick Start for New Sessions

## What is Prompt Squirrel?

A RAG system that converts natural language prompts → e621-style tags for furry art generation.

## Three-Stage Pipeline

1. Stage 1 (Rewrite): Natural language → tag-shaped phrases (LLM)
2. Stage 2 (Retrieval): Phrases → candidate tags (FastText + TF-IDF/SVD, closed vocab)
3. Stage 3 (Selection): Candidates → final selected tags (LLM)
4. Stage 3s (Structural): Selected tags → structural inferences (optional, e.g., clothing → topless)
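
The four steps above can be sketched as a simple orchestrator. This is a minimal illustration only; the function names and signatures are placeholders, not the actual `psq_rag` API:

```python
# Hypothetical sketch of the pipeline flow; stage callables are injected
# so each stage boundary stays strict and independently testable.
def run_pipeline(prompt, rewrite, retrieve, select, infer_structural=None):
    phrases = rewrite(prompt)          # Stage 1: NL -> tag-shaped phrases (LLM)
    candidates = retrieve(phrases)     # Stage 2: phrases -> candidate tags (closed vocab)
    tags = select(prompt, candidates)  # Stage 3: candidates -> final tags (LLM)
    if infer_structural is not None:   # Stage 3s (optional): structural inferences
        tags = tags + [t for t in infer_structural(tags) if t not in tags]
    return tags
```

Passing the stages in as callables mirrors the "stage boundaries are strict" contract: retrieval code never sees selection logic.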

## Latest Features (Feb 13-14, 2026)

- Tag Categorization: organized suggestions by e621 checklist categories (species, clothing, posture, etc.)
- Category Parser: parses the checklist with tiers (CRITICAL/IMPORTANT/NICE_TO_HAVE/META) and constraints
- Evaluation Metrics: per-category P/R/F1, ranking metrics (MRR, P@K, nDCG)
- Multi-select Constraints: fixed body_type, species, gender to allow multiple tags
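
The ranking metrics named above follow their standard definitions; here is a minimal reference sketch (variable names are illustrative and not tied to the eval scripts):

```python
import math

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant tag (0.0 if none appears)."""
    for i, tag in enumerate(ranked, start=1):
        if tag in relevant:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked tags that are relevant."""
    return sum(1 for t in ranked[:k] if t in relevant) / k

def ndcg_at_k(ranked, relevant, k):
    """DCG of the top-k list, normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, t in enumerate(ranked[:k], start=1) if t in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0
```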

## Key Files

- `app.py` - Gradio web interface
- `psq_rag/tagging/categorized_suggestions.py` - Category-based tag suggestions
- `psq_rag/tagging/category_parser.py` - Parse e621 checklist
- `scripts/eval_pipeline.py` - Main evaluation harness
- `scripts/eval_categorized.py` - Per-category metrics
- `scripts/analyze_threshold_grid.py` - Threshold grid analysis (score/global rank/phrase rank)
- `scripts/analyze_caption_evident_audit.py` - Caption-evident audit vs. retrieval
- `docs/retrieval_contract.md` - Stage 2 spec
- `docs/stage3_contract.md` - Stage 3 spec
- `tagging_checklist.txt` - e621 tagging guidelines

## Running Code

```bash
# Always from repo root
.venv/Scripts/python.exe -m pip install -r requirements.txt  # Windows
.venv/Scripts/python.exe app.py
```

## Recent Git History (Last 5 commits)

```
0f73a4b - Fix eval_categorized.py to work with eval_pipeline.py output
ff407fc - Remove binary PNG files (use Hugging Face XET storage instead)
8ba971a - Add eval results for debugging
51b7109 - Add ranking metrics infrastructure to eval pipeline
edba146 - Add per-category evaluation metrics script
```

## Key Contracts to Remember

1. Stage boundaries are strict: don't mix retrieval (Stage 2) with selection (Stage 3)
2. Keep diffs small: one focused change per commit
3. Code matches contracts: update code to match docs, not vice versa
4. No feature flags: delete old code paths; no legacy behavior switches

## Quick Orientation Commands

```bash
# View project structure
ls -la

# View recent commits
git log --oneline -10

# Check current branch
git branch

# List Python modules
ls -la psq_rag/

# View evaluation results
ls -la data/eval_results/
```

## Common Tasks

- Add a category: edit `tagging_checklist.txt`, then update the parser
- Eval changes: run `scripts/eval_pipeline.py`, then `scripts/eval_categorized.py`
- Threshold sweeps: run `scripts/analyze_threshold_grid.py` (see `--mode score|rank|phrase_rank`)
- Caption-evident audit: run `scripts/analyze_caption_evident_audit.py`
- Test retrieval: use `scripts/smoke_test.py`
- Debug Stage 3: use `scripts/stage3_debug.py` (`--phrases` is optional; if omitted, it runs Stage 1 rewrite first, then Stage 2 retrieval on the rewritten phrases)

## Data Artifacts (Lazy-loaded)

- FastText embeddings (semantic similarity)
- TF-IDF + SVD matrices (context similarity)
- Alias → canonical tag mappings
- Tag counts, implications, groups, wiki definitions
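
A minimal sketch of the lazy-loading pattern, assuming a memoized loader in the style of `functools.lru_cache` (the real loader and the alias data below are placeholders, not the actual artifacts):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def load_alias_map():
    # Stand-in for reading the real alias -> canonical tag mapping from disk;
    # lru_cache means the file is only loaded on first access, then reused.
    return {"wolfs": "wolf", "foxes": "fox"}

def canonicalize(tag):
    """Map an alias to its canonical tag, or return the tag unchanged."""
    return load_alias_map().get(tag, tag)
```

The benefit is that a session that never touches, say, wiki definitions never pays the cost of loading them.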

## Eval Datasets

- `data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_expanded.jsonl` - base eval set (implication-expanded ground truth)
- `data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident.jsonl` - caption-evident ground-truth subset (10 samples), used to estimate the retrieval ceiling from text alone
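
Both files are JSONL (one JSON object per line). A hedged reader sketch; the field names in the example (`"caption"`, `"tags"`) are assumptions about the sample schema, not confirmed by the files themselves:

```python
import json

def read_eval_samples(path):
    """Load one JSON object per line from a JSONL file; blank lines are skipped."""
    with open(path, encoding="utf-8") as f:
        return parse_eval_lines(f)

def parse_eval_lines(lines):
    # Split out for testability: works on any iterable of strings.
    return [json.loads(line) for line in lines if line.strip()]
```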

## New Eval Features (Feb 2026)

- `eval_pipeline.py` now logs Stage 3 selection scores and ranks:
  - `stage3_selected_scores` (retrieval score)
  - `stage3_selected_ranks` (global rank)
  - `stage3_selected_phrase_ranks` (per-phrase rank)
- New CLI flag: `--per-phrase-final-k` to control the per-phrase retrieval cap
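
One way the three logged fields could be derived from Stage 2 candidates. The data shapes here are assumptions for illustration, not the actual `eval_pipeline.py` internals:

```python
def selection_log_fields(candidates, selected):
    """candidates: list of (tag, phrase, score); selected: list of chosen tags.
    Returns (scores, global_ranks, phrase_ranks) aligned with `selected`."""
    score_of = {tag: score for tag, _, score in candidates}
    # Global rank: position in the full candidate pool sorted by score (desc).
    ranked = sorted(candidates, key=lambda c: -c[2])
    global_rank = {tag: i + 1 for i, (tag, _, _) in enumerate(ranked)}
    # Per-phrase rank: position among candidates retrieved for the same phrase.
    per_phrase = {}
    for tag, phrase, score in candidates:
        per_phrase.setdefault(phrase, []).append((tag, score))
    phrase_rank = {}
    for items in per_phrase.values():
        for i, (tag, _) in enumerate(sorted(items, key=lambda t: -t[1])):
            phrase_rank[tag] = i + 1
    return (
        [score_of[t] for t in selected],
        [global_rank[t] for t in selected],
        [phrase_rank[t] for t in selected],
    )
```

A tag can rank first for its own phrase yet poorly overall, which is exactly the gap the global vs. per-phrase split makes visible.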

## NSFW Handling

- Filtered via `word_rating_probabilities.csv` (threshold 0.95)
- Stage 2 removes NSFW tags when `allow_nsfw_tags=False`
- Stage 3 doesn't need policy flags (defense-in-depth only)
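
A sketch of the rating-threshold filter described above. The CSV column names (`tag`, `explicit_prob`) are assumptions about `word_rating_probabilities.csv`, not its confirmed schema:

```python
import csv
import io

NSFW_THRESHOLD = 0.95  # threshold stated in the section above

def load_nsfw_tags(csv_text, threshold=NSFW_THRESHOLD):
    """Collect tags whose explicit-rating probability meets the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["tag"] for row in reader if float(row["explicit_prob"]) >= threshold}

def filter_candidates(tags, nsfw_tags, allow_nsfw_tags=False):
    """Stage 2-style filter: drop NSFW tags unless explicitly allowed."""
    if allow_nsfw_tags:
        return list(tags)
    return [t for t in tags if t not in nsfw_tags]
```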