---
title: Satellite Patch Retrieve + Generate
emoji: 🛰️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.10"
app_file: app.py
pinned: false
---
# Final Project Summary (Satellite Patch Retrieval + Generation)
This document summarizes Parts 1–5 of our project: dataset generation, EDA, embeddings, the end-to-end pipeline, and the Gradio app.
---
## Part 1 – Synthetic Data Generation (with key terms)
In Part 1, we built a **synthetic satellite-like image dataset** using a pre-trained Hugging Face generative model. We used **`stabilityai/sd-turbo`** (a fast Stable Diffusion “Turbo” model) to generate **30 land-type classes** with **50 images per class** (**1,500 images total**). Each label had its own prompt (e.g., forest, water, urban, runway), and we used a **negative prompt** to reduce unwanted artifacts such as text, logos, or cartoonish styles. The images were saved in a clean folder structure (`images/<label>/...jpg`) and documented in `metadata.csv` (`id`, `filename`, `label`, `prompt`, `seed`, `model_id`) so later parts (EDA, embeddings, and the app) could load and reuse the dataset easily.
### Key terms used
- **Diffusers:** Hugging Face library providing ready-to-use pipelines for diffusion-based generative models (e.g., Stable Diffusion). It loads the model and generates images from prompts.
- **Transformers:** Hugging Face library for Transformer-based models across text and vision. Used both as a dependency and later for embedding models (CLIP/ViT/DINOv2).
- **Tokenizers:** Converts text prompts into tokens/IDs the model can process; required for text-conditioned models (e.g., text-to-image).
- **Pillow (PIL):** Python imaging library for loading/manipulating/saving images (JPG/PNG), resizing, and file I/O.
- **`stabilityai/sd-turbo`:** Chosen because it is optimized for **speed** and can generate strong results with **1–2 inference steps**, enabling fast large-scale dataset creation.
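The Part 1 generation loop can be sketched as below. The prompt template, filename pattern, and seed scheme here are illustrative assumptions (the real per-label prompts and seeds are recorded in `metadata.csv`); the diffusers call follows the standard `AutoPipelineForText2Image` usage.

```python
import csv
from pathlib import Path

MODEL_ID = "stabilityai/sd-turbo"
NEGATIVE = "text, watermark, logo, cartoon, illustration"  # example negative prompt

def build_prompt(label: str) -> str:
    # Hypothetical prompt template; the real per-label prompts live in metadata.csv.
    return f"aerial satellite photo of {label.lower()} terrain, top-down view"

def metadata_row(idx: int, label: str, seed: int) -> dict:
    # One metadata.csv row with the columns documented above.
    return {
        "id": idx,
        "filename": f"images/{label}/{label}_{seed:05d}.jpg",
        "label": label,
        "prompt": build_prompt(label),
        "seed": seed,
        "model_id": MODEL_ID,
    }

def generate_dataset(labels, per_class=50):
    # Heavy imports kept inside the function so the helpers above stay light.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(MODEL_ID)
    if torch.cuda.is_available():
        pipe = pipe.to("cuda")
    rows = []
    for ci, label in enumerate(labels):
        Path("images", label).mkdir(parents=True, exist_ok=True)
        for i in range(per_class):
            seed = 1000 * ci + i                      # deterministic per-image seed
            gen = torch.Generator().manual_seed(seed)
            # sd-turbo is tuned for 1-2 steps. Caveat: at guidance_scale=0.0
            # (the usual Turbo setting) the negative prompt is ignored, so a
            # small guidance value may be needed for it to take effect.
            img = pipe(build_prompt(label), negative_prompt=NEGATIVE,
                       num_inference_steps=2, guidance_scale=0.0,
                       generator=gen).images[0]
            row = metadata_row(len(rows), label, seed)
            img.save(row["filename"], quality=95)
            rows.append(row)
    with open("metadata.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the seed in both the filename and the metadata makes every image reproducible from its row alone.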
---
## Part 2 – Exploratory Data Analysis (EDA)
- **Loaded and inspected metadata:** Read `metadata.csv` (1500 rows) with expected columns (`id`, `filename`, `label`, `prompt`, `seed`, `model_id`) and confirmed **30 classes**.
- **Integrity validation:** Verified **0 missing image files**, **0 duplicate ids**, **0 duplicate filenames**, and **0 duplicate (label, seed)** pairs.
- **Class balance check:** Confirmed a perfectly balanced dataset with **50 images per label** (min/max = 50/50).
- **Image consistency:** Confirmed all images have the same resolution (**384×384**).
- **Global image statistics:** Computed per-image RGB mean/std, **brightness** (luminance proxy), and a **sharpness proxy** (gradient-based), then reviewed distributions and summaries.
- **Outlier analysis:** Observed meaningful extremes consistent with labels:
- darkest samples mainly **DenseForest**
- brightest samples mainly **SnowIce**
- lowest-sharpness samples often from smoother-texture classes like **Grassland / DesertSand / SeaOpenWater**
- **Class-level insights:** Aggregated statistics by label (brightness/color tendencies) and used a simple **PCA projection** to visualize similarity/overlap between visually related classes.
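The per-image statistics above can be sketched as follows. The exact formulas used in the notebook are assumptions here; Rec. 601 luminance weighting and mean gradient magnitude are the standard choices for a brightness and sharpness proxy.

```python
import numpy as np

def image_stats(rgb: np.ndarray) -> dict:
    """Per-image statistics for EDA. rgb: HxWx3 uint8 array."""
    x = rgb.astype(np.float32)
    mean = x.mean(axis=(0, 1))              # per-channel RGB mean
    std = x.std(axis=(0, 1))                # per-channel RGB std
    # Brightness proxy: Rec. 601 luminance.
    lum = 0.299 * x[..., 0] + 0.587 * x[..., 1] + 0.114 * x[..., 2]
    # Sharpness proxy: mean magnitude of finite-difference gradients
    # (low for smooth textures like open water, high for busy urban scenes).
    gy, gx = np.gradient(lum)
    sharpness = float(np.sqrt(gx**2 + gy**2).mean())
    return {"mean": mean, "std": std,
            "brightness": float(lum.mean()), "sharpness": sharpness}
```

Running this over all 1,500 images gives the distributions behind the outlier analysis (dark DenseForest, bright SnowIce, low-sharpness Grassland/DesertSand/SeaOpenWater).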
---
## Part 3 – Embeddings (Similarity Search)
- **Goal:** Convert each satellite patch image into a compact vector (embedding) to enable **similarity search / retrieval** and support the later app pipeline.
- **Models tested (HF backbones):**
- **CLIP ViT-B/32** (`openai/clip-vit-base-patch32`)
- **ViT-Base** (`google/vit-base-patch16-224-in21k`)
- **DINOv2-Small** (`facebook/dinov2-small`)
- **Embedding extraction:**
- Used the **CLS token** from `last_hidden_state` as a single global image representation (standard for ViT-style models).
- Applied **L2-normalization** so cosine similarity becomes a fast dot product (stable and efficient retrieval).
- **Evaluation metric (retrieval-focused):** `label_agree@5` and `label_agree@10`
- For each image, retrieve its **top-k nearest neighbors** (cosine similarity).
- Measure the fraction of neighbors with the **same label** as the query.
- Average across all 1,500 images.
- This measures retrieval quality directly (not classifier accuracy).
- **Key results (quality + efficiency):**
- **DINOv2-Small performed best:** `agree@5 ≈ 0.9247`, `agree@10 ≈ 0.9006`
- Also produced **smaller embeddings** (384-dim) than CLIP/ViT (768-dim), reducing storage and improving retrieval efficiency.
- Selected **DINOv2-Small** as the optimal embedding model.
- **Saved outputs (reusable):**
- Embeddings: `*_embeddings.npy` (NumPy)
- Metadata mapping: `*_metadata.csv` (CSV)
- Comparison table: `embedding_model_comparison.csv` (CSV)
- **Qualitative validation:**
- **PCA scatter plot** to visualize clustering in 2D (sanity check for overlap/separability).
- **Nearest-neighbor gallery** to confirm retrieved results make sense visually and align with labels.
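A minimal sketch of the Part 3 machinery: CLS-token embedding with DINOv2-Small (standard HF `transformers` usage) and the `label_agree@k` metric. Function names are illustrative, not the notebook's actual identifiers.

```python
import numpy as np

def l2_normalize(emb: np.ndarray) -> np.ndarray:
    # After L2 normalization, cosine similarity is a plain dot product.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def label_agree_at_k(emb: np.ndarray, labels: list, k: int) -> float:
    """Mean fraction of each image's top-k neighbours sharing its label."""
    emb = l2_normalize(emb)
    sims = emb @ emb.T                      # all-pairs cosine similarity
    np.fill_diagonal(sims, -np.inf)         # exclude each query from its own results
    agree = []
    for i in range(len(labels)):
        topk = np.argsort(-sims[i])[:k]
        agree.append(np.mean([labels[j] == labels[i] for j in topk]))
    return float(np.mean(agree))

def embed_images(pil_images):
    # Assumed usage of the HF transformers API for DINOv2.
    import torch
    from transformers import AutoImageProcessor, AutoModel
    proc = AutoImageProcessor.from_pretrained("facebook/dinov2-small")
    model = AutoModel.from_pretrained("facebook/dinov2-small").eval()
    with torch.no_grad():
        inputs = proc(images=pil_images, return_tensors="pt")
        out = model(**inputs)
    cls = out.last_hidden_state[:, 0]       # CLS token as the global descriptor
    return l2_normalize(cls.numpy())        # 384-dim for DINOv2-Small
```

Because the metric only needs the saved `*_embeddings.npy` matrix and the label column of `*_metadata.csv`, all three backbones can be scored without re-running the models.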
---
## Part 4 – End-to-End Pipeline (Retrieve + Generate)
- **Goal:** Build a production-style **Input → Processing → Output** pipeline that can be plugged directly into an app.
The user provides a satellite patch image plus a text prompt, and the system returns:
1) **Most similar images from the dataset (retrieval)**
2) **Newly generated images** via **image-to-image** and **text-to-image**
with user-controlled counts (**0–5 each**).
- **System architecture: two engines working together**
- **Retrieval engine (embedding-based):**
- Embed the user image with **DINOv2-Small** (best model from Part 3).
- Compare the query embedding against the stored embedding index:
- `best_embeddings.npy` (vectors) + `best_metadata.csv` (filename/label mapping).
- Compute similarity using **cosine similarity** (dot product due to L2 normalization).
- Return **Top-K** results (K ≤ 5), each including image, label, similarity score, and filename.
- **Generation engine (Diffusers):**
- Use **`stabilityai/sd-turbo`** for fast generation (works well with 1–2 steps).
- Support two generation modes:
- **img2img:** generates variants that stay visually close to the user image, guided by the prompt.
- **txt2img:** generates new images purely from the prompt.
- User controls how many images to generate (0–5 each).
- **Pipeline inputs:**
- `user_img`: user-provided PIL image
- `user_prompt`: user-provided prompt (required for generation)
- `k_retrieve`: number of retrieved images (0–5)
- `n_i2i`, `n_t2i`: generated image counts (0–5 each)
- `strength_i2i`: img2img closeness (lower = closer to input)
- `steps`: generation steps (sd-turbo typically 1–2)
- `gen_size`: output size (e.g., 384 or 512)
- `seed`: reproducibility
- **Stability safeguards (app-ready):**
- Hard caps on counts (**0–5**) for retrieval and generation to prevent overload.
- A **safe-step rule** for img2img to avoid the “0 effective steps” Diffusers crash when strength is low.
- GPU optimizations when available: **fp16 + `torch.autocast`** for speed.
- **Reading the dataset directly from HF (course requirement):**
- Instead of local files, dataset images are loaded using **`hf_hub_download`** from:
- `LevyJonas/sat_land_patches`
- A cache directory is used to avoid repeated downloads.
- **Pipeline outputs:**
- `retrieved`: up to 5 retrieved items (PIL image, label, similarity, filename)
- `gen_i2i`: up to 5 generated img2img images
- `gen_t2i`: up to 5 generated txt2img images
- `info`: summary dictionary (prompt, counts, steps/strength, dataset id, etc.)
- **Key takeaway:**
- Part 4 combines **retrieval (real examples from the dataset)** with **generation (new synthetic variants)** in one workflow, and is modular/UI-ready for Part 5 (Gradio sliders + galleries).
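Two of the safeguards above can be sketched in a few lines (function names are illustrative; the safe-step formula mirrors how diffusers scales the img2img schedule by `strength`, which yields roughly `int(steps * strength)` effective steps and crashes when that rounds to 0):

```python
import math
import numpy as np

def topk_retrieve(query: np.ndarray, index: np.ndarray, k: int):
    """query: (d,) L2-normalized vector; index: (N, d) L2-normalized rows
    (e.g., loaded from best_embeddings.npy). Returns (row, similarity) pairs."""
    k = max(0, min(k, 5))                   # hard cap 0-5, as in the app
    sims = index @ query                    # cosine similarity via dot product
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

def safe_steps(steps: int, strength: float) -> int:
    """Smallest step count >= steps that guarantees >= 1 effective img2img step."""
    return max(steps, math.ceil(1.0 / max(strength, 1e-6)))
```

For example, with the sd-turbo default of 2 steps and a low `strength_i2i` of 0.3, the effective count would round to 0, so `safe_steps` bumps the request to 4 steps before calling the pipeline.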
---
## Part 5 – Application (HF Space with Gradio)
- **Goal:** Deploy an interactive application that demonstrates the full workflow:
**Upload image + prompt → retrieve similar examples → generate new variants**.
This turns the pipeline from Part 4 into a user-facing product-like demo.
- **Platform:** Hugging Face **Spaces** using **Gradio** (`app.py` as the entry point).
- **UI Inputs (user controls):**
- **Image upload**: user provides a satellite patch (PIL image).
- **Prompt textbox**: user writes the prompt (required for generation).
- **Sliders (0–5)**:
- `k_retrieve`: number of retrieved dataset images (0–5)
- `n_i2i`: number of img2img generated images (0–5)
- `n_t2i`: number of txt2img generated images (0–5)
- **Generation settings**:
- `strength_i2i`: controls how close img2img stays to the input (lower = closer)
- `steps`: generation steps (1–2 recommended for sd-turbo)
- `gen_size`: output size (384 or 512)
- `seed`: reproducibility
- **Backend logic (connected to Part 4):**
- `app.py` calls `run_search_and_generate(...)` from `pipeline.py`.
- The pipeline:
- Embeds the uploaded image (DINOv2-Small)
- Retrieves Top-K similar images from the embedding index (`best_embeddings.npy` + `best_metadata.csv`)
- Generates new images using `stabilityai/sd-turbo` with:
- **img2img** conditioned on the uploaded image + prompt
- **txt2img** conditioned on the prompt only
- **Outputs shown to the user:**
- **Gallery 1 (Retrieved from dataset):** Top-K nearest neighbors with labels + cosine similarity scores.
- **Gallery 2 (Generated img2img):** New image variants close to the uploaded input.
- **Gallery 3 (Generated txt2img):** New images generated from the prompt.
- **Summary panel:** displays the chosen parameters and pipeline metadata (counts, steps, strength, dataset id, etc.).
- **Course requirement: read directly from HF dataset repo**
- Dataset images are loaded at runtime using `hf_hub_download` from:
- `LevyJonas/sat_land_patches`
- A local cache is used in the Space to avoid repeated downloads.
- **Deployment notes:**
- For practical generation speed, the Space should run on **GPU** hardware.
- Embedding files (`best_embeddings.npy`, `best_metadata.csv`) are stored in the Space repo so the app can start instantly. |
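The `app.py` wiring can be sketched as below. The `run_search_and_generate` keyword arguments follow the inputs listed in Part 4, but the exact signature, the shape of the returned `retrieved` items, and the Gradio layout are assumptions here, not the actual Space code.

```python
def clamp_count(n) -> int:
    """Enforce the app's hard 0-5 cap on retrieval/generation counts."""
    return max(0, min(int(n), 5))

def build_demo():
    # Heavy imports kept inside the builder so the helper above stays light.
    import gradio as gr
    from pipeline import run_search_and_generate  # Part 4 entry point

    def app_fn(image, prompt, k, n_i2i, n_t2i, strength, steps, size, seed):
        out = run_search_and_generate(
            user_img=image, user_prompt=prompt,
            k_retrieve=clamp_count(k), n_i2i=clamp_count(n_i2i),
            n_t2i=clamp_count(n_t2i), strength_i2i=float(strength),
            steps=int(steps), gen_size=int(size), seed=int(seed))
        # Assumed dict items; map each hit to an (image, caption) gallery tuple.
        hits = [(r["image"], f'{r["label"]} ({r["similarity"]:.3f})')
                for r in out["retrieved"]]
        return hits, out["gen_i2i"], out["gen_t2i"], out["info"]

    return gr.Interface(
        fn=app_fn,
        inputs=[gr.Image(type="pil"), gr.Textbox(label="Prompt"),
                gr.Slider(0, 5, step=1, value=5, label="k_retrieve"),
                gr.Slider(0, 5, step=1, value=2, label="n_i2i"),
                gr.Slider(0, 5, step=1, value=2, label="n_t2i"),
                gr.Slider(0.1, 0.9, value=0.5, label="strength_i2i"),
                gr.Slider(1, 4, step=1, value=2, label="steps"),
                gr.Radio([384, 512], value=384, label="gen_size"),
                gr.Number(value=0, label="seed")],
        outputs=[gr.Gallery(label="Retrieved from dataset"),
                 gr.Gallery(label="Generated (img2img)"),
                 gr.Gallery(label="Generated (txt2img)"),
                 gr.JSON(label="Summary")])

# In app.py: build_demo().launch()
```

Clamping the counts in the app layer as well as the pipeline keeps the 0–5 guarantee even if a client bypasses the slider limits.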