---
title: Satellite Patch Retrieve + Generate
emoji: πŸ›°οΈ
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.10"
app_file: app.py
pinned: false
---
# Final Project Summary (Satellite Patch Retrieval + Generation)
This document summarizes Parts 1–5 of our project: dataset generation, EDA, embeddings, the end-to-end pipeline, and the deployed Gradio app.
---
## Part 1 β€” Synthetic Data Generation (with key terms)
In Part 1, we built a **synthetic satellite-like image dataset** using a pre-trained Hugging Face generative model. We used **`stabilityai/sd-turbo`** (a fast Stable Diffusion β€œTurbo” model) to generate **30 land-type classes** with **50 images per class** (**1500 images total**). Each label had its own prompt (e.g., forest, water, urban, runway), and we used a **negative prompt** to reduce unwanted artifacts such as text, logos, or cartoonish styles. The images were saved in a clean folder structure (`images/<label>/...jpg`) and documented in `metadata.csv` (`id`, `filename`, `label`, `prompt`, `seed`, `model_id`) so later parts (EDA, embeddings, and the app) could load and reuse the dataset easily.
### Key terms used
- **Diffusers:** Hugging Face library providing ready-to-use pipelines for diffusion-based generative models (e.g., Stable Diffusion). It loads the model and generates images from prompts.
- **Transformers:** Hugging Face library for Transformer-based models across text and vision. Used both as a dependency and later for embedding models (CLIP/ViT/DINOv2).
- **Tokenizers:** Converts text prompts into tokens/IDs the model can process; required for text-conditioned models (e.g., text-to-image).
- **Pillow (PIL):** Python imaging library for loading/manipulating/saving images (JPG/PNG), resizing, and file I/O.
- **`stabilityai/sd-turbo`:** Chosen because it is optimized for **speed** and can generate strong results with **1–2 inference steps**, enabling fast large-scale dataset creation.
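The Part 1 generation loop can be sketched as below. The label name, prompt text, negative prompt, and filename scheme are illustrative placeholders (the real ones live in the Part 1 notebook), and the heavy `diffusers` import is deferred so the metadata helpers stay lightweight:

```python
import csv

MODEL_ID = "stabilityai/sd-turbo"
NEGATIVE_PROMPT = "text, logo, watermark, cartoon"  # illustrative negative prompt

def metadata_row(idx, label, prompt, seed):
    """One row of metadata.csv: id, filename, label, prompt, seed, model_id."""
    return {
        "id": idx,
        "filename": f"images/{label}/{label}_{seed}.jpg",  # hypothetical scheme
        "label": label,
        "prompt": prompt,
        "seed": seed,
        "model_id": MODEL_ID,
    }

def write_metadata(rows, path="metadata.csv"):
    """Persist the rows so later parts (EDA, embeddings, app) can reload them."""
    fields = ["id", "filename", "label", "prompt", "seed", "model_id"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)

def generate_class(label, prompt, n_images=50, base_seed=0):
    """Generate n_images for one label with sd-turbo (few-step sampling)."""
    import torch
    from diffusers import AutoPipelineForText2Image  # deferred: heavy download

    pipe = AutoPipelineForText2Image.from_pretrained(MODEL_ID)
    rows = []
    for i in range(n_images):
        seed = base_seed + i
        image = pipe(
            prompt,
            negative_prompt=NEGATIVE_PROMPT,
            num_inference_steps=2,  # sd-turbo works well with 1-2 steps
            generator=torch.Generator().manual_seed(seed),
        ).images[0]
        row = metadata_row(len(rows), label, prompt, seed)
        image.save(row["filename"])
        rows.append(row)
    return rows
```

Deferring the model import keeps the metadata helpers importable on machines without a GPU or the model weights.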
---
## Part 2 β€” Exploratory Data Analysis (EDA)
- **Loaded and inspected metadata:** Read `metadata.csv` (1500 rows) with expected columns (`id`, `filename`, `label`, `prompt`, `seed`, `model_id`) and confirmed **30 classes**.
- **Integrity validation:** Verified **0 missing image files**, **0 duplicate ids**, **0 duplicate filenames**, and **0 duplicate (label, seed)** pairs.
- **Class balance check:** Confirmed a perfectly balanced dataset with **50 images per label** (min/max = 50/50).
- **Image consistency:** Confirmed all images have the same resolution (**384Γ—384**).
- **Global image statistics:** Computed per-image RGB mean/std, **brightness** (luminance proxy), and a **sharpness proxy** (gradient-based), then reviewed distributions and summaries.
- **Outlier analysis:** Observed meaningful extremes consistent with labels:
- darkest samples mainly **DenseForest**
- brightest samples mainly **SnowIce**
- lowest-sharpness samples often from smoother-texture classes like **Grassland / DesertSand / SeaOpenWater**
- **Class-level insights:** Aggregated statistics by label (brightness/color tendencies) and used a simple **PCA projection** to visualize similarity/overlap between visually related classes.
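The brightness and sharpness proxies above can be sketched roughly as follows; the exact luminance weights and gradient formulation in the EDA notebook may differ, so treat this as a minimal stand-in:

```python
import numpy as np

# Rec. 601 luma weights (an assumption; the notebook's weighting may differ)
LUMA = np.array([0.299, 0.587, 0.114])

def brightness(img):
    """Luminance proxy: mean of the luma-weighted RGB channels."""
    arr = np.asarray(img, dtype=np.float64)
    return float((arr @ LUMA).mean())

def sharpness(img):
    """Gradient-based sharpness proxy: mean absolute pixel difference
    along both axes of the grayscale image. Smooth textures score low."""
    gray = np.asarray(img, dtype=np.float64) @ LUMA
    gx = np.diff(gray, axis=1)  # horizontal gradients
    gy = np.diff(gray, axis=0)  # vertical gradients
    return float(np.abs(gx).mean() + np.abs(gy).mean())
```

A perfectly flat patch scores 0 on the sharpness proxy, which matches the observation that smooth classes like Grassland and SeaOpenWater sit at the low end.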
---
## Part 3 β€” Embeddings (Similarity Search)
- **Goal:** Convert each satellite patch image into a compact vector (embedding) to enable **similarity search / retrieval** and support the later app pipeline.
- **Models tested (HF backbones):**
- **CLIP ViT-B/32** (`openai/clip-vit-base-patch32`)
- **ViT-Base** (`google/vit-base-patch16-224-in21k`)
- **DINOv2-Small** (`facebook/dinov2-small`)
- **Embedding extraction:**
- Used the **CLS token** from `last_hidden_state` as a single global image representation (standard for ViT-style models).
- Applied **L2-normalization** so cosine similarity becomes a fast dot product (stable and efficient retrieval).
- **Evaluation metric (retrieval-focused):** `label_agree@5` and `label_agree@10`
- For each image, retrieve its **top-k nearest neighbors** (cosine similarity).
- Measure the fraction of neighbors with the **same label** as the query.
- Average across all 1,500 images.
- This measures retrieval quality directly (not classifier accuracy).
- **Key results (quality + efficiency):**
- **DINOv2-Small performed best:** `agree@5 β‰ˆ 0.9247`, `agree@10 β‰ˆ 0.9006`
- Also produced **smaller embeddings** (384-dim) than CLIP/ViT (768-dim), reducing storage and improving retrieval efficiency.
- Selected **DINOv2-Small** as the optimal embedding model.
- **Saved outputs (reusable):**
- Embeddings: `*_embeddings.npy` (NumPy)
- Metadata mapping: `*_metadata.csv` (CSV)
- Comparison table: `embedding_model_comparison.csv` (CSV)
- **Qualitative validation:**
- **PCA scatter plot** to visualize clustering in 2D (sanity check for overlap/separability).
- **Nearest-neighbor gallery** to confirm retrieved results make sense visually and align with labels.
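The normalization and `label_agree@k` metric described above can be sketched with plain NumPy; the function names are our own, but the logic (L2-normalize, cosine via dot product, exclude self-matches, average label agreement) follows the description:

```python
import numpy as np

def l2_normalize(X):
    """Row-wise L2 normalization so cosine similarity reduces to a dot product."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def label_agree_at_k(embeddings, labels, k):
    """For each image, take its k nearest neighbors (self excluded) by cosine
    similarity and return the mean fraction sharing the query's label."""
    X = l2_normalize(np.asarray(embeddings, dtype=np.float64))
    labels = np.asarray(labels)
    sims = X @ X.T                            # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)           # never retrieve the query itself
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best matches
    return float((labels[topk] == labels[:, None]).mean())
```

On the real data, `embeddings` would be the CLS vectors from `last_hidden_state` (e.g., 384-dim for DINOv2-Small) and `labels` the column from the metadata CSV.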
---
## Part 4 β€” End-to-End Pipeline (Retrieve + Generate)
- **Goal:** Build a production-style **Input β†’ Processing β†’ Output** pipeline that can be plugged directly into an app.
  The user provides a satellite patch image plus a text prompt, and the system returns:
  1) **most similar images from the dataset (retrieval)**, and
  2) **newly generated images** via **image-to-image** and **text-to-image**,
  with user-controlled counts (**0–5 each**).
- **System architecture: two engines working together**
- **Retrieval engine (embedding-based):**
- Embed the user image with **DINOv2-Small** (best model from Part 3).
- Compare the query embedding against the stored embedding index:
- `best_embeddings.npy` (vectors) + `best_metadata.csv` (filename/label mapping).
- Compute similarity using **cosine similarity** (dot product due to L2 normalization).
- Return **Top-K** results (K ≀ 5), each including image, label, similarity score, and filename.
- **Generation engine (Diffusers):**
- Use **`stabilityai/sd-turbo`** for fast generation (works well with 1–2 steps).
- Support two generation modes:
- **img2img:** generates variants that stay visually close to the user image, guided by the prompt.
- **txt2img:** generates new images purely from the prompt.
- User controls how many images to generate (0–5 each).
- **Pipeline inputs:**
- `user_img` β€” user-provided PIL image
- `user_prompt` β€” user-provided prompt (required for generation)
- `k_retrieve` β€” number of retrieved images (0–5)
- `n_i2i`, `n_t2i` β€” generated image counts (0–5 each)
- `strength_i2i` β€” img2img closeness (lower = closer to input)
- `steps` β€” generation steps (sd-turbo typically 1–2)
- `gen_size` β€” output size (e.g., 384 or 512)
- `seed` β€” reproducibility
- **Stability safeguards (app-ready):**
- Hard caps on counts (**0–5**) for retrieval and generation to prevent overload.
- A **safe-step rule** for img2img to avoid the β€œ0 effective steps” Diffusers crash when strength is low.
- GPU optimizations when available: **fp16 + `torch.autocast`** for speed.
- **Reading the dataset directly from HF (course requirement):**
- Instead of local files, dataset images are loaded using **`hf_hub_download`** from:
- `LevyJonas/sat_land_patches`
- A cache directory is used to avoid repeated downloads.
- **Pipeline outputs:**
- `retrieved`: up to 5 retrieved items (PIL image, label, similarity, filename)
- `gen_i2i`: up to 5 generated img2img images
- `gen_t2i`: up to 5 generated txt2img images
- `info`: summary dictionary (prompt, counts, steps/strength, dataset id, etc.)
- **Key takeaway:**
- Part 4 combines **retrieval (real examples from the dataset)** with **generation (new synthetic variants)** in one workflow, and is modular/UI-ready for Part 5 (Gradio sliders + galleries).
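Two of the safeguards above can be sketched in isolation: the safe-step rule (Diffusers img2img effectively runs about `int(steps * strength)` denoising steps and fails when that rounds to 0) and the dot-product retrieval over the L2-normalized index. Function names are ours; the rule's exact form in `pipeline.py` may differ:

```python
import math
import numpy as np

def safe_steps(steps, strength):
    """Bump the step count so img2img keeps at least one effective step
    even at low strength, avoiding the '0 effective steps' crash."""
    return max(steps, math.ceil(1.0 / strength))

def retrieve_top_k(query_vec, index, k=5):
    """Cosine retrieval over an L2-normalized index: similarity is a plain
    dot product, and argsort yields the k best (index, score) pairs."""
    sims = index @ query_vec
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]
```

In the app, `index` would be the rows of `best_embeddings.npy` and the returned indices would be mapped to filenames and labels via `best_metadata.csv`.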
---
## Part 5 β€” Application (HF Space with Gradio)
- **Goal:** Deploy an interactive application that demonstrates the full workflow:
  **Upload image + prompt β†’ retrieve similar examples β†’ generate new variants**.
  This turns the pipeline from Part 4 into a user-facing product-like demo.
- **Platform:** Hugging Face **Spaces** using **Gradio** (`app.py` as the entry point).
- **UI Inputs (user controls):**
- **Image upload**: user provides a satellite patch (PIL image).
- **Prompt textbox**: user writes the prompt (required for generation).
- **Sliders (0–5)**:
- `k_retrieve`: number of retrieved dataset images (0–5)
- `n_i2i`: number of img2img generated images (0–5)
- `n_t2i`: number of txt2img generated images (0–5)
- **Generation settings**:
- `strength_i2i`: controls how close img2img stays to the input (lower = closer)
- `steps`: generation steps (1–2 recommended for sd-turbo)
- `gen_size`: output size (384 or 512)
- `seed`: reproducibility
- **Backend logic (connected to Part 4):**
- `app.py` calls `run_search_and_generate(...)` from `pipeline.py`.
- The pipeline:
- Embeds the uploaded image (DINOv2-Small)
- Retrieves Top-K similar images from the embedding index (`best_embeddings.npy` + `best_metadata.csv`)
- Generates new images using `stabilityai/sd-turbo` with:
- **img2img** conditioned on the uploaded image + prompt
- **txt2img** conditioned on the prompt only
- **Outputs shown to the user:**
- **Gallery 1 (Retrieved from dataset):** Top-K nearest neighbors with labels + cosine similarity scores.
- **Gallery 2 (Generated img2img):** New image variants close to the uploaded input.
- **Gallery 3 (Generated txt2img):** New images generated from the prompt.
- **Summary panel:** displays the chosen parameters and pipeline metadata (counts, steps, strength, dataset id, etc.).
- **Course requirement: read directly from HF dataset repo**
- Dataset images are loaded at runtime using `hf_hub_download` from:
- `LevyJonas/sat_land_patches`
- A local cache is used in the Space to avoid repeated downloads.
- **Deployment notes:**
- For practical generation speed, the Space should run on **GPU** hardware.
- Embedding files (`best_embeddings.npy`, `best_metadata.csv`) are stored in the Space repo so the app can start instantly.
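The runtime dataset access can be sketched as below. This assumes `LevyJonas/sat_land_patches` is a dataset repo with the `images/<label>/...` layout from Part 1; the `huggingface_hub` import is deferred because it needs network access:

```python
DATASET_REPO = "LevyJonas/sat_land_patches"
CACHE_DIR = "hf_cache"  # repeated requests hit local disk, not the Hub

def repo_path(label, filename):
    """Path of one patch inside the dataset repo (images/<label>/<file>)."""
    return f"images/{label}/{filename}"

def fetch_patch(label, filename):
    """Download one dataset image from the Hub, or reuse the cached copy."""
    from huggingface_hub import hf_hub_download  # deferred: network-dependent
    return hf_hub_download(
        repo_id=DATASET_REPO,
        repo_type="dataset",
        filename=repo_path(label, filename),
        cache_dir=CACHE_DIR,
    )
```

`hf_hub_download` already deduplicates by revision, so pointing it at a fixed `cache_dir` is enough to make Space restarts cheap after the first run.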