---
title: Satellite Patch Retrieve + Generate
emoji: πŸ›°οΈ
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.10"
app_file: app.py
pinned: false
---

# Final Project Summary (Satellite Patch Retrieval + Generation)

This document summarizes Parts 1–5 of our project: dataset generation, EDA, embeddings, the end-to-end pipeline, and the Gradio app.

---

## Part 1 β€” Synthetic Data Generation (with key terms)

In Part 1, we built a **synthetic satellite-like image dataset** using a pre-trained Hugging Face generative model. We used **`stabilityai/sd-turbo`** (a fast Stable Diffusion β€œTurbo” model) to generate **30 land-type classes** with **50 images per class** (**1500 images total**). Each label had its own prompt (e.g., forest, water, urban, runway), and we used a **negative prompt** to reduce unwanted artifacts such as text, logos, or cartoonish styles. The images were saved in a clean folder structure (`images/<label>/...jpg`) and documented in `metadata.csv` (`id`, `filename`, `label`, `prompt`, `seed`, `model_id`) so later parts (EDA, embeddings, and the app) could load and reuse the dataset easily.
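The bookkeeping side of that loop can be sketched as below. This is a hypothetical, minimal version: the actual `stabilityai/sd-turbo` call is stubbed out, the prompt template and seed scheme are illustrative, and the label list is truncated, but the folder layout and `metadata.csv` schema match the text.

```python
import csv
import os

LABELS = ["DenseForest", "SnowIce", "Grassland"]  # 30 labels in the real run
IMAGES_PER_LABEL = 2                              # 50 in the real run
MODEL_ID = "stabilityai/sd-turbo"

def generate_image(prompt, seed):
    """Placeholder for the diffusers call, e.g. pipe(prompt, ...).images[0]."""
    return None

def build_dataset(root="images", meta_path="metadata.csv"):
    rows, idx = [], 0
    for li, label in enumerate(LABELS):
        os.makedirs(os.path.join(root, label), exist_ok=True)
        prompt = f"satellite view of {label} terrain"  # per-label prompt (illustrative)
        for i in range(IMAGES_PER_LABEL):
            seed = 1000 * li + i                       # reproducible seed scheme
            filename = f"{label}/{label}_{i:03d}.jpg"
            generate_image(prompt, seed)               # image.save(...) would go here
            rows.append({"id": idx, "filename": filename, "label": label,
                         "prompt": prompt, "seed": seed, "model_id": MODEL_ID})
            idx += 1
    # Write the metadata.csv schema documented above.
    with open(meta_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows

rows = build_dataset(root="demo_images", meta_path="demo_metadata.csv")
```

Later parts only need to read `metadata.csv` and join on `filename`, which is why the schema is kept flat.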

### Key terms used
- **Diffusers:** Hugging Face library providing ready-to-use pipelines for diffusion-based generative models (e.g., Stable Diffusion). It loads the model and generates images from prompts.
- **Transformers:** Hugging Face library for Transformer-based models across text and vision. Used both as a dependency and later for embedding models (CLIP/ViT/DINOv2).
- **Tokenizers:** Converts text prompts into tokens/IDs the model can process; required for text-conditioned models (e.g., text-to-image).
- **Pillow (PIL):** Python imaging library for loading/manipulating/saving images (JPG/PNG), resizing, and file I/O.
- **`stabilityai/sd-turbo`:** Chosen because it is optimized for **speed** and can generate strong results with **1–2 inference steps**, enabling fast large-scale dataset creation.

---

## Part 2 β€” Exploratory Data Analysis (EDA)

- **Loaded and inspected metadata:** Read `metadata.csv` (1500 rows) with expected columns (`id`, `filename`, `label`, `prompt`, `seed`, `model_id`) and confirmed **30 classes**.
- **Integrity validation:** Verified **0 missing image files**, **0 duplicate ids**, **0 duplicate filenames**, and **0 duplicate (label, seed)** pairs.
- **Class balance check:** Confirmed a perfectly balanced dataset with **50 images per label** (min/max = 50/50).
- **Image consistency:** Confirmed all images have the same resolution (**384Γ—384**).
- **Global image statistics:** Computed per-image RGB mean/std, **brightness** (luminance proxy), and a **sharpness proxy** (gradient-based), then reviewed distributions and summaries.
- **Outlier analysis:** Observed meaningful extremes consistent with labels:
  - darkest samples mainly **DenseForest**
  - brightest samples mainly **SnowIce**
  - lowest-sharpness samples often from smoother-texture classes like **Grassland / DesertSand / SeaOpenWater**
- **Class-level insights:** Aggregated statistics by label (brightness/color tendencies) and used a simple **PCA projection** to visualize similarity/overlap between visually related classes.
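The integrity and balance checks above are a few lines of pandas. A minimal sketch, run here on a tiny hand-made frame standing in for the real `metadata.csv`:

```python
import pandas as pd

meta = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "label": ["DenseForest", "DenseForest", "SnowIce", "SnowIce"],
    "seed": [0, 1, 0, 1],
})

dup_ids = meta["id"].duplicated().sum()                      # duplicate ids
dup_files = meta["filename"].duplicated().sum()              # duplicate filenames
dup_label_seed = meta.duplicated(subset=["label", "seed"]).sum()  # (label, seed) pairs
counts = meta["label"].value_counts()
balanced = counts.min() == counts.max()                      # perfect balance check
```

In the real run all three duplicate counts were 0 and `counts.min() == counts.max() == 50`.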

---

## Part 3 β€” Embeddings (Similarity Search)

- **Goal:** Convert each satellite patch image into a compact vector (embedding) to enable **similarity search / retrieval** and support the later app pipeline.
- **Models tested (HF backbones):**
  - **CLIP ViT-B/32** (`openai/clip-vit-base-patch32`)
  - **ViT-Base** (`google/vit-base-patch16-224-in21k`)
  - **DINOv2-Small** (`facebook/dinov2-small`)
- **Embedding extraction:**
  - Used the **CLS token** from `last_hidden_state` as a single global image representation (standard for ViT-style models).
  - Applied **L2-normalization** so cosine similarity becomes a fast dot product (stable and efficient retrieval).
- **Evaluation metric (retrieval-focused):** `label_agree@5` and `label_agree@10`
  - For each image, retrieve its **top-k nearest neighbors** (cosine similarity).
  - Measure the fraction of neighbors with the **same label** as the query.
  - Average across all 1,500 images.
  - This measures retrieval quality directly (not classifier accuracy).
- **Key results (quality + efficiency):**
  - **DINOv2-Small performed best:** `agree@5 β‰ˆ 0.9247`, `agree@10 β‰ˆ 0.9006`
  - Also produced **smaller embeddings** (384-dim) than CLIP/ViT (768-dim), reducing storage and improving retrieval efficiency.
  - Selected **DINOv2-Small** as the optimal embedding model.
- **Saved outputs (reusable):**
  - Embeddings: `*_embeddings.npy` (NumPy)
  - Metadata mapping: `*_metadata.csv` (CSV)
  - Comparison table: `embedding_model_comparison.csv` (CSV)
- **Qualitative validation:**
  - **PCA scatter plot** to visualize clustering in 2D (sanity check for overlap/separability).
  - **Nearest-neighbor gallery** to confirm retrieved results make sense visually and align with labels.
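The normalization trick and the `label_agree@k` metric can be sketched on toy data. The 6x4 matrix below stands in for the real 1500x384 DINOv2 embeddings; everything else follows the steps listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4)).astype(np.float32)  # toy embeddings
labels = np.array([0, 0, 0, 1, 1, 1])

emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows

def label_agree_at_k(emb, labels, k):
    sims = emb @ emb.T                   # cosine similarity = dot product
    np.fill_diagonal(sims, -np.inf)      # exclude the query itself
    topk = np.argsort(-sims, axis=1)[:, :k]          # k nearest neighbors
    agree = (labels[topk] == labels[:, None]).mean() # fraction with same label
    return float(agree)

score = label_agree_at_k(emb, labels, k=2)
```

On random vectors the score is near chance; on the real DINOv2 embeddings it reached about 0.92 at k=5.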

---

## Part 4 β€” End-to-End Pipeline (Retrieve + Generate)

- **Goal:** Build a production-style **Input β†’ Processing β†’ Output** pipeline that can be plugged directly into an app.  
  The user provides a satellite patch image plus a text prompt, and the system returns:
  1) **Most similar images from the dataset (retrieval)**  
  2) **Newly generated images** via **image-to-image** and **text-to-image**  
  with user-controlled counts (**0–5 each**).

- **System architecture: two engines working together**
  - **Retrieval engine (embedding-based):**
    - Embed the user image with **DINOv2-Small** (best model from Part 3).
    - Compare the query embedding against the stored embedding index:
      - `best_embeddings.npy` (vectors) + `best_metadata.csv` (filename/label mapping).
    - Compute similarity using **cosine similarity** (dot product due to L2 normalization).
    - Return **Top-K** results (K ≀ 5), each including image, label, similarity score, and filename.
  - **Generation engine (Diffusers):**
    - Use **`stabilityai/sd-turbo`** for fast generation (works well with 1–2 steps).
    - Support two generation modes:
      - **img2img:** generates variants that stay visually close to the user image, guided by the prompt.
      - **txt2img:** generates new images purely from the prompt.
    - User controls how many images to generate (0–5 each).

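The retrieval engine's query step reduces to a dot product plus an argsort, assuming index and query vectors are already L2-normalized as in Part 3. A toy 5x4 index stands in for `best_embeddings.npy` here:

```python
import numpy as np

rng = np.random.default_rng(1)
index = rng.normal(size=(5, 4))
index /= np.linalg.norm(index, axis=1, keepdims=True)  # normalized index rows
query = index[2] + 0.01 * rng.normal(size=4)           # query close to item 2
query /= np.linalg.norm(query)

def top_k(query, index, k=3):
    sims = index @ query                 # cosine similarity via dot product
    order = np.argsort(-sims)[:k]        # indices of the k best matches
    return [(int(i), float(sims[i])) for i in order]

hits = top_k(query, index, k=3)          # item 2 should come back first
```

In the app, each returned index is then mapped through `best_metadata.csv` to recover the filename and label.
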
- **Pipeline inputs:**
  - `user_img` β€” user-provided PIL image
  - `user_prompt` β€” user-provided prompt (required for generation)
  - `k_retrieve` β€” number of retrieved images (0–5)
  - `n_i2i`, `n_t2i` β€” generated image counts (0–5 each)
  - `strength_i2i` β€” img2img closeness (lower = closer to input)
  - `steps` β€” generation steps (sd-turbo typically 1–2)
  - `gen_size` β€” output size (e.g., 384 or 512)
  - `seed` β€” reproducibility

- **Stability safeguards (app-ready):**
  - Hard caps on counts (**0–5**) for retrieval and generation to prevent overload.
  - A **safe-step rule** for img2img to avoid the β€œ0 effective steps” Diffusers crash when strength is low.
  - GPU optimizations when available: **fp16 + `torch.autocast`** for speed.
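One way to implement the safe-step rule, assuming (as the Diffusers img2img pipeline does) that roughly `int(steps * strength)` denoising steps actually run, so the product must not round down to zero:

```python
import math

def safe_steps(requested_steps: int, strength: float) -> int:
    # img2img effectively runs about int(steps * strength) denoising steps;
    # if that rounds to 0 the pipeline fails. Raise the step count so at
    # least one effective step survives.
    min_steps = math.ceil(1.0 / max(strength, 1e-6))
    return max(requested_steps, min_steps)

safe_steps(2, 0.3)   # low strength forces extra steps
safe_steps(2, 0.6)   # already safe: int(2 * 0.6) = 1 effective step
```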

- **Reading the dataset directly from HF (course requirement):**
  - Instead of local files, dataset images are loaded using **`hf_hub_download`** from:
    - `LevyJonas/sat_land_patches`
  - A cache directory is used to avoid repeated downloads.

- **Pipeline outputs:**
  - `retrieved`: up to 5 retrieved items (PIL image, label, similarity, filename)
  - `gen_i2i`: up to 5 generated img2img images
  - `gen_t2i`: up to 5 generated txt2img images
  - `info`: summary dictionary (prompt, counts, steps/strength, dataset id, etc.)

- **Key takeaway:**
  - Part 4 combines **retrieval (real examples from the dataset)** with **generation (new synthetic variants)** in one workflow, and is modular/UI-ready for Part 5 (Gradio sliders + galleries).

---

## Part 5 β€” Application (HF Space with Gradio)

- **Goal:** Deploy an interactive application that demonstrates the full workflow:
  **Upload image + prompt β†’ retrieve similar examples β†’ generate new variants**.
  This turns the pipeline from Part 4 into a user-facing product-like demo.

- **Platform:** Hugging Face **Spaces** using **Gradio** (`app.py` as the entry point).

- **UI Inputs (user controls):**
  - **Image upload**: user provides a satellite patch (PIL image).
  - **Prompt textbox**: user writes the prompt (required for generation).
  - **Sliders (0–5)**:
    - `k_retrieve`: number of retrieved dataset images (0–5)
    - `n_i2i`: number of img2img generated images (0–5)
    - `n_t2i`: number of txt2img generated images (0–5)
  - **Generation settings**:
    - `strength_i2i`: controls how close img2img stays to the input (lower = closer)
    - `steps`: generation steps (1–2 recommended for sd-turbo)
    - `gen_size`: output size (384 or 512)
    - `seed`: reproducibility

- **Backend logic (connected to Part 4):**
  - `app.py` calls `run_search_and_generate(...)` from `pipeline.py`.
  - The pipeline:
    - Embeds the uploaded image (DINOv2-Small)
    - Retrieves Top-K similar images from the embedding index (`best_embeddings.npy` + `best_metadata.csv`)
    - Generates new images using `stabilityai/sd-turbo` with:
      - **img2img** conditioned on the uploaded image + prompt
      - **txt2img** conditioned on the prompt only

- **Outputs shown to the user:**
  - **Gallery 1 (Retrieved from dataset):** Top-K nearest neighbors with labels + cosine similarity scores.
  - **Gallery 2 (Generated img2img):** New image variants close to the uploaded input.
  - **Gallery 3 (Generated txt2img):** New images generated from the prompt.
  - **Summary panel:** displays the chosen parameters and pipeline metadata (counts, steps, strength, dataset id, etc.).

- **Course requirement: read directly from HF dataset repo**
  - Dataset images are loaded at runtime using `hf_hub_download` from:
    - `LevyJonas/sat_land_patches`
  - A local cache is used in the Space to avoid repeated downloads.

- **Deployment notes:**
  - For practical generation speed, the Space should run on **GPU** hardware.
  - Embedding files (`best_embeddings.npy`, `best_metadata.csv`) are stored in the Space repo so the app can start instantly.