# Prompt Squirrel RAG: System Overview

This document explains what Prompt Squirrel does, why it is structured this way, and how data moves through the system.

## Purpose

Prompt Squirrel converts a rough natural-language prompt into a structured, editable tag list drawn from a fixed image-tag vocabulary, then lets the user refine that list interactively.

Design goals:

- Keep generation grounded in a closed tag vocabulary.
- Balance recall (find good candidates) with precision (avoid bad tags).
- Keep the UI editable so users remain in control.
- Run reliably in a Hugging Face Space with constrained resources.

## What Each Step Does

- `Rewrite`: Turns the user prompt into short, tag-like pseudo-phrases that are easier to match in vector retrieval. These phrases are optimized as search queries for candidate lookup.
- `Structural Inference`: Runs an LLM call over a fixed set of high-level structure tags (for example character count, body type, gender, clothing state, gaze/text). It outputs only the structural tags it believes are supported.
- `Probe Inference`: Runs a separate LLM call over a small, curated set of informative tags. This is a targeted check for tags that are often useful for reranking and final selection.
- `Retrieval Candidates`: Uses the rewrite phrases (plus structural/probe context) to fetch candidate tags from the fixed vocabulary, prioritizing recall.
- `Closed-Set Selection`: Runs an LLM call that can only choose from the retrieved candidate list; it cannot invent new tags.
- `Implication Expansion`: Adds parent/related tags implied by the selected tags according to the implication graph.
- `Ranked Rows`: Groups and orders suggested tags into row categories for editing.
- `Toggle UI and Suggested Prompt`: Lets the user turn tags on/off and see the resulting prompt text update immediately.

## Design Rationale

- Rewrite and retrieval are separate so search-phrase generation stays flexible while candidate generation stays deterministic.
- Retrieval and closed-set selection are separate to keep high recall first, then apply higher-precision filtering.
- Structural and probe inference run in parallel with rewrite so they can add context without adding much latency.
- Users control the final prompt by toggling suggested tags on/off; the prompt text is generated from those toggle states.

## Data Inputs (Broad)

- Tag vocabulary and alias mappings
- Tag counts (frequency)
- Tag implications graph
- Group/category mappings for row display
- Optional wiki definitions (used for hover help)

## Technologies Used

- FastText embeddings for semantic tag retrieval.
- HNSW approximate nearest-neighbor indexes for efficient retrieval at runtime.
- Reduced TF-IDF vectors for context-aware ranking and row scoring.
- OpenRouter-served instruction LLMs for rewrite, structural inference, probe inference, and closed-set selection. Default model: `mistralai/mistral-small-24b-instruct-2501`, chosen empirically from internal caption-evident test-set comparisons (model choice remains configurable).
- Gradio for the interactive web UI (tag toggles, ranked rows, and suggested prompt text).
- Python pipeline orchestration with CSV/JSON data sources and implication-graph expansion.

## Evaluation (Broad)

The current evaluation compares selected tags against ground-truth tags on caption-evident samples.

Primary metrics:

- Precision: `TP / (TP + FP)`
- Recall: `TP / (TP + FN)`
- F1: harmonic mean of precision and recall

The evaluation focus is practical:

- Is the returned tag set useful and mostly correct?
- Does it miss important prompt-evident tags?
- Does UI ranking surface likely-correct tags early?

## Evaluation Dataset Snapshot

- File: `data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident_n30.jsonl`
- Construction: manually curated caption-evident subset, where ground-truth tags are intended to be directly supported by the caption text.
- Size: 30 images
- Total ground-truth tag assignments: 440
- Unique tags represented: 205
- Average tags per image: 14.67
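
The per-sample metrics above reduce to a set comparison between the pipeline's selected tags and the ground-truth tags. A minimal sketch (the function name, variable names, and example tags here are hypothetical illustrations, not taken from the codebase):

```python
def tag_prf1(predicted: set[str], ground_truth: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one sample's selected tags.

    TP = tags both predicted and in ground truth; FP = predicted but not
    in ground truth; FN = in ground truth but not predicted.
    """
    tp = len(predicted & ground_truth)
    fp = len(predicted - ground_truth)
    fn = len(ground_truth - predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical example: 3 correct tags, 1 spurious, 2 missed.
p, r, f = tag_prf1(
    {"canine", "outdoors", "smiling", "hat"},
    {"canine", "outdoors", "smiling", "solo", "standing"},
)
# precision = 3/4 = 0.75, recall = 3/5 = 0.60
```

Over the 30-image snapshot, these per-sample scores would typically be averaged; whether that averaging is per-sample (macro) or pooled over all 440 assignments (micro) is a reporting choice the evaluation should state explicitly.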