---
title: Satellite Patch Retrieve + Generate
emoji: 🛰️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.10"
app_file: app.py
pinned: false
---
# Final Project Summary (Satellite Patch Retrieval + Generation)

This document summarizes Parts 1–5 of our project: dataset generation, EDA, embeddings, the end-to-end pipeline, and the deployed Gradio app.

---
## Part 1 – Synthetic Data Generation (with key terms)

In Part 1, we built a **synthetic satellite-like image dataset** using a pre-trained Hugging Face generative model. We used **`stabilityai/sd-turbo`** (a fast Stable Diffusion "Turbo" model) to generate **30 land-type classes** with **50 images per class** (**1,500 images total**). Each label had its own prompt (e.g., forest, water, urban, runway), and we used a **negative prompt** to reduce unwanted artifacts such as text, logos, or cartoonish styles. The images were saved in a clean folder structure (`images/<label>/...jpg`) and documented in `metadata.csv` (`id`, `filename`, `label`, `prompt`, `seed`, `model_id`) so later parts (EDA, embeddings, and the app) could load and reuse the dataset easily.
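The per-label prompting and metadata bookkeeping described above can be sketched as follows. This is a minimal illustration, not the exact project code: the label names, prompt wording, and seed scheme here are assumptions.

```python
import itertools

# Illustrative subset of the 30 land-type classes used in the project.
LABELS = ["DenseForest", "SnowIce", "Grassland"]
NEGATIVE_PROMPT = "text, logo, watermark, cartoon"  # artifacts to suppress
MODEL_ID = "stabilityai/sd-turbo"

def build_prompt(label: str) -> str:
    # One prompt per land-type class, phrased as a top-down satellite view.
    return f"aerial satellite view of {label.lower()} terrain, top-down, photorealistic"

def metadata_rows(labels, per_class=50):
    # Yields one metadata.csv row per image: id, filename, label, prompt, seed, model_id.
    ids = itertools.count()
    for label in labels:
        for i in range(per_class):
            idx = next(ids)
            yield {
                "id": idx,
                "filename": f"images/{label}/{label}_{i:03d}.jpg",
                "label": label,
                "prompt": build_prompt(label),
                "seed": 1000 + idx,  # a deterministic per-image seed for reproducibility
                "model_id": MODEL_ID,
            }

rows = list(metadata_rows(LABELS, per_class=50))
```

With the full 30-class list, the same loop yields the 1,500 rows recorded in `metadata.csv`.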
### Key terms used

- **Diffusers:** Hugging Face library providing ready-to-use pipelines for diffusion-based generative models (e.g., Stable Diffusion). It loads the model and generates images from prompts.
- **Transformers:** Hugging Face library for Transformer-based models across text and vision. Used both as a dependency and later for the embedding models (CLIP/ViT/DINOv2).
- **Tokenizers:** Converts text prompts into tokens/IDs the model can process; required for text-conditioned models (e.g., text-to-image).
- **Pillow (PIL):** Python imaging library for loading/manipulating/saving images (JPG/PNG), resizing, and file I/O.
- **`stabilityai/sd-turbo`:** Chosen because it is optimized for **speed** and can generate strong results with **1–2 inference steps**, enabling fast large-scale dataset creation.

---
## Part 2 – Exploratory Data Analysis (EDA)

- **Loaded and inspected metadata:** Read `metadata.csv` (1,500 rows) with the expected columns (`id`, `filename`, `label`, `prompt`, `seed`, `model_id`) and confirmed **30 classes**.
- **Integrity validation:** Verified **0 missing image files**, **0 duplicate ids**, **0 duplicate filenames**, and **0 duplicate (label, seed)** pairs.
- **Class balance check:** Confirmed a perfectly balanced dataset with **50 images per label** (min/max = 50/50).
- **Image consistency:** Confirmed all images share the same resolution (**384×384**).
- **Global image statistics:** Computed per-image RGB mean/std, **brightness** (luminance proxy), and a **sharpness proxy** (gradient-based), then reviewed distributions and summaries.
- **Outlier analysis:** Observed meaningful extremes consistent with labels:
  - darkest samples mainly **DenseForest**
  - brightest samples mainly **SnowIce**
  - lowest-sharpness samples often from smoother-texture classes such as **Grassland / DesertSand / SeaOpenWater**
- **Class-level insights:** Aggregated statistics by label (brightness/color tendencies) and used a simple **PCA projection** to visualize similarity/overlap between visually related classes.
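The brightness and sharpness proxies can be sketched roughly like this. The exact formulas in the EDA notebook may differ; this version assumes a standard Rec. 601 luminance weighting and a mean-absolute-gradient sharpness measure.

```python
import numpy as np

def brightness(img: np.ndarray) -> float:
    # Luminance proxy: Rec. 601 weighted average of the RGB channels.
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return float(np.mean(0.299 * r + 0.587 * g + 0.114 * b))

def sharpness(img: np.ndarray) -> float:
    # Gradient-based proxy: mean absolute difference between neighboring pixels.
    gray = img.mean(axis=-1)
    gx = np.abs(np.diff(gray, axis=1))  # horizontal gradients
    gy = np.abs(np.diff(gray, axis=0))  # vertical gradients
    return float(gx.mean() + gy.mean())

# Sanity check: a flat gray patch has zero sharpness; a checkerboard scores high.
flat = np.full((8, 8, 3), 128.0)
checker = np.indices((8, 8)).sum(axis=0) % 2 * 255.0
checker = np.stack([checker] * 3, axis=-1)
```

On real patches, smooth classes like Grassland score low on this proxy while high-texture classes score high, which matches the outlier analysis above.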
---

## Part 3 – Embeddings (Similarity Search)

- **Goal:** Convert each satellite patch image into a compact vector (embedding) to enable **similarity search / retrieval** and support the later app pipeline.
- **Models tested (HF backbones):**
  - **CLIP ViT-B/32** (`openai/clip-vit-base-patch32`)
  - **ViT-Base** (`google/vit-base-patch16-224-in21k`)
  - **DINOv2-Small** (`facebook/dinov2-small`)
- **Embedding extraction:**
  - Used the **CLS token** from `last_hidden_state` as a single global image representation (standard for ViT-style models).
  - Applied **L2-normalization** so cosine similarity becomes a fast dot product (stable and efficient retrieval).
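The normalization trick above is easy to verify in NumPy: once each embedding is divided by its L2 norm, the plain dot product of two vectors equals their cosine similarity.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Divide each row vector by its Euclidean norm so every row has unit length.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

rng = np.random.default_rng(0)
emb = l2_normalize(rng.normal(size=(4, 384)))  # four fake 384-dim embeddings

# On normalized vectors, the dot product equals cosine similarity.
dot = emb[0] @ emb[1]
cos = (emb[0] @ emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
```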
- **Evaluation metric (retrieval-focused):** `label_agree@5` and `label_agree@10`
  - For each image, retrieve its **top-k nearest neighbors** (cosine similarity).
  - Measure the fraction of neighbors with the **same label** as the query.
  - Average across all 1,500 images.
  - This measures retrieval quality directly (not classifier accuracy).
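A compact implementation of this metric might look like the following. One assumption made here: each query image is excluded from its own neighbor list (otherwise the trivial self-match inflates the score).

```python
import numpy as np

def label_agree_at_k(emb: np.ndarray, labels: np.ndarray, k: int) -> float:
    # emb: (N, D) L2-normalized embeddings; labels: (N,) integer class ids.
    sim = emb @ emb.T                       # cosine similarity matrix (dot products)
    np.fill_diagonal(sim, -np.inf)          # exclude each image from its own neighbors
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of the k nearest neighbors
    agree = labels[topk] == labels[:, None] # same-label mask, shape (N, k)
    return float(agree.mean())              # average agreement over all queries

# Tiny sanity check: two tight, well-separated clusters should score 1.0 at k=1.
pts = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.14, 0.99]])
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
labs = np.array([0, 0, 1, 1])
score = label_agree_at_k(pts, labs, k=1)
```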
- **Key results (quality + efficiency):**
  - **DINOv2-Small performed best:** `agree@5 ≈ 0.9247`, `agree@10 ≈ 0.9006`
  - It also produced **smaller embeddings** (384-dim) than CLIP/ViT (768-dim), reducing storage and improving retrieval efficiency.
  - Selected **DINOv2-Small** as the optimal embedding model.
- **Saved outputs (reusable):**
  - Embeddings: `*_embeddings.npy` (NumPy)
  - Metadata mapping: `*_metadata.csv` (CSV)
  - Comparison table: `embedding_model_comparison.csv` (CSV)
- **Qualitative validation:**
  - **PCA scatter plot** to visualize clustering in 2D (sanity check for overlap/separability).
  - **Nearest-neighbor gallery** to confirm retrieved results make sense visually and align with labels.
---

## Part 4 – End-to-End Pipeline (Retrieve + Generate)

- **Goal:** Build a production-style **Input → Processing → Output** pipeline that can be plugged directly into an app.
  The user provides a satellite patch image plus a text prompt, and the system returns:
  1. **Most similar images from the dataset (retrieval)**
  2. **Newly generated images** via **image-to-image** and **text-to-image**

  with user-controlled counts (**0–5 each**).
- **System architecture: two engines working together**
  - **Retrieval engine (embedding-based):**
    - Embed the user image with **DINOv2-Small** (best model from Part 3).
    - Compare the query embedding against the stored embedding index:
      - `best_embeddings.npy` (vectors) + `best_metadata.csv` (filename/label mapping).
    - Compute similarity using **cosine similarity** (a dot product thanks to L2 normalization).
    - Return **Top-K** results (K ≤ 5), each including image, label, similarity score, and filename.
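The retrieval engine boils down to one matrix product plus a sort. A sketch under the assumptions above (the function name and metadata shape are illustrative, not the actual `pipeline.py` API):

```python
import numpy as np

def retrieve_top_k(query_emb, index_emb, metadata, k=5):
    # query_emb: (D,) L2-normalized; index_emb: (N, D) with L2-normalized rows.
    # metadata: list of dicts with "filename" and "label", mirroring best_metadata.csv.
    k = max(0, min(int(k), 5))             # hard cap at 5, matching the app limits
    scores = index_emb @ query_emb         # cosine similarities via dot products
    order = np.argsort(-scores)[:k]        # indices of the k best matches
    return [
        {"filename": metadata[i]["filename"],
         "label": metadata[i]["label"],
         "similarity": float(scores[i])}
        for i in order
    ]

# Toy index: three orthogonal unit vectors; the query matches the first exactly.
index = np.eye(3)
meta = [{"filename": f"img{i}.jpg", "label": f"L{i}"} for i in range(3)]
hits = retrieve_top_k(index[0], index, meta, k=2)
```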
  - **Generation engine (Diffusers):**
    - Use **`stabilityai/sd-turbo`** for fast generation (works well with 1–2 steps).
    - Support two generation modes:
      - **img2img:** generates variants that stay visually close to the user image, guided by the prompt.
      - **txt2img:** generates new images purely from the prompt.
    - The user controls how many images to generate (0–5 each).
- **Pipeline inputs:**
  - `user_img` – user-provided PIL image
  - `user_prompt` – user-provided prompt (required for generation)
  - `k_retrieve` – number of retrieved images (0–5)
  - `n_i2i`, `n_t2i` – generated image counts (0–5 each)
  - `strength_i2i` – img2img closeness (lower = closer to input)
  - `steps` – generation steps (sd-turbo typically 1–2)
  - `gen_size` – output size (e.g., 384 or 512)
  - `seed` – reproducibility
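Sanitizing these inputs before they reach the pipeline is straightforward. A minimal sketch; the step ceiling of 4 and the 384/512 snap are illustrative assumptions, only the 0–5 count caps come directly from the design above.

```python
def clamp_inputs(k_retrieve, n_i2i, n_t2i, steps, gen_size):
    # Illustrative input sanitization: counts capped at 0-5 per the app limits,
    # steps kept in a small sd-turbo-friendly range, size snapped to 384 or 512.
    cap = lambda n: max(0, min(int(n), 5))
    return {
        "k_retrieve": cap(k_retrieve),
        "n_i2i": cap(n_i2i),
        "n_t2i": cap(n_t2i),
        "steps": max(1, min(int(steps), 4)),
        "gen_size": 512 if gen_size >= 512 else 384,
    }

cfg = clamp_inputs(k_retrieve=9, n_i2i=-1, n_t2i=3, steps=0, gen_size=384)
```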
- **Stability safeguards (app-ready):**
  - Hard caps on counts (**0–5**) for retrieval and generation to prevent overload.
  - A **safe-step rule** for img2img to avoid the "0 effective steps" Diffusers crash when strength is low.
  - GPU optimizations when available: **fp16 + `torch.autocast`** for speed.
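The safe-step rule exists because Diffusers' img2img pipelines run roughly `int(steps * strength)` denoising steps, which rounds down to zero when both values are small. One way to guard against that (the exact rule in `pipeline.py` may differ):

```python
import math

def safe_steps(steps: int, strength: float) -> int:
    # Diffusers img2img effectively runs about int(steps * strength) denoising
    # steps; raise `steps` until at least one step survives the rounding.
    if strength <= 0:
        raise ValueError("strength must be > 0 for img2img")
    return max(steps, math.ceil(1.0 / strength))

# With sd-turbo at steps=2 and strength=0.3, 2 * 0.3 would round to 0 steps;
# the guard bumps steps up so that int(steps * strength) >= 1.
adjusted = safe_steps(2, 0.3)
```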
- **Reading the dataset directly from HF (course requirement):**
  - Instead of local files, dataset images are loaded using **`hf_hub_download`** from:
    - `LevyJonas/sat_land_patches`
  - A cache directory is used to avoid repeated downloads.
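A sketch of that loading path, assuming the repo mirrors Part 1's `images/<label>/...` layout; the helper names, `repo_type="dataset"`, and cache location are assumptions, not the project's exact code.

```python
import os

def patch_path_in_repo(label: str, filename: str) -> str:
    # Files in the dataset repo are assumed to follow Part 1's images/<label>/... layout.
    return f"images/{label}/{os.path.basename(filename)}"

def fetch_patch(label: str, filename: str, cache_dir: str = "./hf_cache") -> str:
    # Lazy import so the path helper above works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(
        repo_id="LevyJonas/sat_land_patches",
        repo_type="dataset",           # assumption: the repo is a dataset repo
        filename=patch_path_in_repo(label, filename),
        cache_dir=cache_dir,           # reuse downloads across pipeline calls
    )

path = patch_path_in_repo("DenseForest", "DenseForest_000.jpg")
```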
- **Pipeline outputs:**
  - `retrieved`: up to 5 retrieved items (PIL image, label, similarity, filename)
  - `gen_i2i`: up to 5 generated img2img images
  - `gen_t2i`: up to 5 generated txt2img images
  - `info`: summary dictionary (prompt, counts, steps/strength, dataset id, etc.)
- **Key takeaway:**
  - Part 4 combines **retrieval (real examples from the dataset)** with **generation (new synthetic variants)** in one workflow, and is modular/UI-ready for Part 5 (Gradio sliders + galleries).
---

## Part 5 – Application (HF Space with Gradio)

- **Goal:** Deploy an interactive application that demonstrates the full workflow:
  **Upload image + prompt → retrieve similar examples → generate new variants**.
  This turns the pipeline from Part 4 into a user-facing, product-like demo.
- **Platform:** Hugging Face **Spaces** using **Gradio** (`app.py` as the entry point).
- **UI inputs (user controls):**
  - **Image upload:** user provides a satellite patch (PIL image).
  - **Prompt textbox:** user writes the prompt (required for generation).
  - **Sliders (0–5):**
    - `k_retrieve`: number of retrieved dataset images (0–5)
    - `n_i2i`: number of img2img generated images (0–5)
    - `n_t2i`: number of txt2img generated images (0–5)
  - **Generation settings:**
    - `strength_i2i`: controls how close img2img stays to the input (lower = closer)
    - `steps`: generation steps (1–2 recommended for sd-turbo)
    - `gen_size`: output size (384 or 512)
    - `seed`: reproducibility
- **Backend logic (connected to Part 4):**
  - `app.py` calls `run_search_and_generate(...)` from `pipeline.py`.
  - The pipeline:
    - Embeds the uploaded image (DINOv2-Small)
    - Retrieves the Top-K similar images from the embedding index (`best_embeddings.npy` + `best_metadata.csv`)
    - Generates new images using `stabilityai/sd-turbo` with:
      - **img2img** conditioned on the uploaded image + prompt
      - **txt2img** conditioned on the prompt only
- **Outputs shown to the user:**
  - **Gallery 1 (Retrieved from dataset):** Top-K nearest neighbors with labels + cosine similarity scores.
  - **Gallery 2 (Generated img2img):** New image variants close to the uploaded input.
  - **Gallery 3 (Generated txt2img):** New images generated from the prompt.
  - **Summary panel:** Displays the chosen parameters and pipeline metadata (counts, steps, strength, dataset id, etc.).
- **Course requirement: read directly from the HF dataset repo**
  - Dataset images are loaded at runtime using `hf_hub_download` from:
    - `LevyJonas/sat_land_patches`
  - A local cache is used in the Space to avoid repeated downloads.
- **Deployment notes:**
  - For practical generation speed, the Space should run on **GPU** hardware.
  - Embedding files (`best_embeddings.npy`, `best_metadata.csv`) are stored in the Space repo so the app can start instantly.