# Prompt Squirrel RAG: System Overview

This document explains what Prompt Squirrel does, why it is structured this way, and how data moves through the system.

## Purpose

Prompt Squirrel converts a rough natural-language prompt into a structured, editable tag list drawn from a fixed image-tag vocabulary, then lets the user refine that list interactively.

Design goals:

- Keep generation grounded in a closed tag vocabulary.
- Balance recall (find good candidates) with precision (avoid bad tags).
- Keep the UI editable so users remain in control.
- Run reliably in a Hugging Face Space with constrained resources.
## What Each Step Does

- `Rewrite`:
  Turns the user prompt into short, tag-like pseudo-phrases that are easier to match in vector retrieval. These phrases are optimized as search queries for candidate lookup.
- `Structural Inference`:
  Runs an LLM call over a fixed set of high-level structure tags (for example character count, body type, gender, clothing state, gaze/text). It outputs only the structural tags it believes are supported.
- `Probe Inference`:
  Runs a separate LLM call over a small, curated set of informative tags. This is a targeted check for tags that are often useful for reranking and final selection.
- `Retrieval Candidates`:
  Uses the rewrite phrases (plus structural/probe context) to fetch candidate tags from the fixed vocabulary, prioritizing recall.
- `Closed-Set Selection`:
  Runs an LLM call that can only choose from the retrieved candidate list. It cannot invent new tags.
- `Implication Expansion`:
  Adds parent/related tags implied by selected tags according to the implication graph.
- `Ranked Rows`:
  Groups and orders suggested tags into row categories for editing.
- `Toggle UI and Suggested Prompt`:
  Lets the user turn tags on/off and see the resulting prompt text update immediately.
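The steps above can be sketched as a simple orchestration. This is an illustrative toy, not the actual implementation: the real rewrite and selection steps are LLM calls, and retrieval uses vector search rather than substring matching.

```python
# Illustrative sketch of the pipeline flow described above.
# Every function body here is a stand-in for the real component.

def rewrite(prompt: str) -> list[str]:
    # Real system: an LLM call producing tag-like pseudo-phrases.
    return [p.strip() for p in prompt.lower().split(",") if p.strip()]

def retrieve_candidates(phrases, vocabulary):
    # Real system: embedding + ANN lookup; here, naive substring match.
    return sorted({t for t in vocabulary for p in phrases if p in t})

def closed_set_select(candidates):
    # Real system: an LLM restricted to the candidate list; here, pass-through.
    return list(candidates)

def expand_implications(tags, implications):
    # Add parent tags implied by selected tags (transitively).
    out, stack = set(tags), list(tags)
    while stack:
        for parent in implications.get(stack.pop(), []):
            if parent not in out:
                out.add(parent)
                stack.append(parent)
    return sorted(out)

vocab = ["red scarf", "scarf", "snow", "snowman"]
implications = {"red scarf": ["scarf"]}

phrases = rewrite("red scarf, snow")
selected = closed_set_select(retrieve_candidates(phrases, vocab))
final = expand_implications(selected, implications)
print(final)  # ['red scarf', 'scarf', 'snow', 'snowman']
```

Note how `scarf` enters the final list only through implication expansion: it was never retrieved directly, but `red scarf` implies it.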
## Design Rationale

- Rewrite and retrieval are separate so search phrase generation stays flexible while candidate generation stays deterministic.
- Retrieval and closed-set selection are separate to keep high recall first, then apply higher-precision filtering.
- Structural and probe inference run in parallel with rewrite so they can add context without adding much latency.
- Users control the final prompt by toggling suggested tags on/off; the prompt text is generated from those toggle states.
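The parallelism point above can be sketched with `asyncio.gather`. The "LLM calls" here are simulated; the real system makes OpenRouter requests, but the concurrency pattern is the same: the three front-half calls share no inputs beyond the prompt, so total latency is roughly the slowest call rather than the sum.

```python
import asyncio

# Sketch of running rewrite, structural inference, and probe inference
# concurrently. The awaits below are stand-ins for real LLM requests.

async def rewrite(prompt):
    await asyncio.sleep(0)          # stand-in for an LLM call
    return ["phrase_a", "phrase_b"]

async def structural_inference(prompt):
    await asyncio.sleep(0)
    return ["solo"]

async def probe_inference(prompt):
    await asyncio.sleep(0)
    return ["looking_at_viewer"]

async def front_half(prompt):
    # All three calls start immediately; gather preserves argument order.
    return await asyncio.gather(
        rewrite(prompt),
        structural_inference(prompt),
        probe_inference(prompt),
    )

phrases, structural, probe = asyncio.run(front_half("a lone figure"))
print(phrases, structural, probe)
```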
## Data Inputs (Broad)

- Tag vocabulary and alias mappings
- Tag counts (frequency)
- Tag implications graph
- Group/category mappings for row display
- Optional wiki definitions (used for hover help)
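As an illustration of how two of these inputs can be loaded and used, here is a minimal sketch. The CSV column names (`antecedent`, `consequent`) and the alias-chain behavior are assumptions for the example, not the project's actual schema.

```python
import csv
import io

# Hypothetical loaders for the implication graph and alias mapping.
# Column layout is assumed, not taken from the real data files.

def load_implications(fh):
    # Build antecedent -> [consequents] from a two-column CSV.
    graph = {}
    for row in csv.DictReader(fh):
        graph.setdefault(row["antecedent"], []).append(row["consequent"])
    return graph

def resolve_alias(tag, aliases):
    # Follow alias chains until a canonical tag is reached.
    while tag in aliases:
        tag = aliases[tag]
    return tag

implications = load_implications(io.StringIO(
    "antecedent,consequent\nred scarf,scarf\n"))
aliases = {"neckerchief": "scarf"}

print(implications["red scarf"])                  # ['scarf']
print(resolve_alias("neckerchief", aliases))      # scarf
```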
## Technologies Used

- FastText embeddings for semantic tag retrieval.
- HNSW approximate nearest-neighbor indexes for efficient retrieval at runtime.
- Reduced TF-IDF vectors for context-aware ranking and row scoring.
- OpenRouter-served instruction LLMs for rewrite, structural inference, probe inference, and closed-set selection.
  Default model: `mistralai/mistral-small-24b-instruct-2501`, chosen empirically from internal caption-evident test-set comparisons (with model choice remaining configurable).
- Gradio for the interactive web UI (tag toggles, ranked rows, and suggested prompt text).
- Python pipeline orchestration with CSV/JSON data sources and implication-graph expansion.
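The embedding-retrieval idea behind the first two bullets can be shown with a toy example: embed a query phrase, then return the top-k nearest tags by cosine similarity. The real system uses FastText vectors and an HNSW index; brute-force search and hand-picked 2-D vectors are used here only to keep the sketch self-contained.

```python
import math

# Toy stand-in for FastText + HNSW retrieval: rank tags by cosine
# similarity to a query vector and keep the top k.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

tag_vectors = {             # pretend embeddings, not real FastText output
    "scarf":   [0.9, 0.1],
    "snow":    [0.1, 0.9],
    "snowman": [0.2, 0.8],
}

def top_k(query_vec, k=2):
    # Brute force here; an HNSW index answers the same query in
    # approximate sub-linear time over a large vocabulary.
    ranked = sorted(tag_vectors,
                    key=lambda t: cosine(query_vec, tag_vectors[t]),
                    reverse=True)
    return ranked[:k]

print(top_k([0.15, 0.85]))  # ['snow', 'snowman']
```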
## Evaluation (Broad)

Current evaluation style compares selected tags against ground-truth tags on caption-evident samples.

Primary metrics:

- Precision: `TP / (TP + FP)`
- Recall: `TP / (TP + FN)`
- F1: harmonic mean of precision and recall
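For a set-based tag evaluation, these metrics reduce to set intersections. A worked example on one sample:

```python
# Precision/recall/F1 for one sample, treating predictions and
# ground truth as tag sets.

def prf1(predicted, truth):
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 2 of 3 predictions are correct (precision 2/3);
# 2 of 4 ground-truth tags were found (recall 1/2).
p, r, f = prf1({"scarf", "snow", "hat"},
               {"scarf", "snow", "snowman", "tree"})
print(p, r, f)
```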
The evaluation focus is practical:

- Is the returned tag set useful and mostly correct?
- Does it miss important prompt-evident tags?
- Does UI ranking surface likely-correct tags early?
## Evaluation Dataset Snapshot

- File: `data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident_n30.jsonl`
- Construction: manually curated caption-evident subset, where ground-truth tags are intended to be directly supported by the caption text.
- Size: 30 images
- Total ground-truth tag assignments: 440
- Unique tags represented: 205
- Average tags per image: 14.67
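Snapshot numbers like these can be recomputed directly from the JSONL file. The field name `tags` in this sketch is an assumption about the file's schema, not confirmed from the actual data.

```python
import io
import json

# Recompute sample count, total tag assignments, unique tags, and
# average tags per image from JSONL lines. The "tags" key is assumed.

def tag_stats(lines):
    samples = [json.loads(line) for line in lines if line.strip()]
    total = sum(len(s["tags"]) for s in samples)
    unique = {t for s in samples for t in s["tags"]}
    avg = round(total / len(samples), 2)
    return len(samples), total, len(unique), avg

demo = io.StringIO('{"tags": ["a", "b"]}\n{"tags": ["b", "c", "d"]}\n')
print(tag_stats(demo))  # (2, 5, 4, 2.5)
```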