---
title: Satellite Patch Retrieve + Generate
emoji: πŸ›°οΈ
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.10"
app_file: app.py
pinned: false
---

# Final Project Summary (Satellite Patch Retrieval + Generation)

This document summarizes Parts 1–5 of our project: dataset generation, EDA, embeddings, the end-to-end pipeline, and the Gradio app.

---

## Part 1 β€” Synthetic Data Generation (with key terms)

In Part 1, we built a **synthetic satellite-like image dataset** using a pre-trained Hugging Face generative model. We used **`stabilityai/sd-turbo`** (a fast Stable Diffusion β€œTurbo” model) to generate **30 land-type classes** with **50 images per class** (**1500 images total**). Each label had its own prompt (e.g., forest, water, urban, runway), and we used a **negative prompt** to reduce unwanted artifacts such as text, logos, or cartoonish styles. The images were saved in a clean folder structure (`images/<label>/...jpg`) and documented in `metadata.csv` (`id`, `filename`, `label`, `prompt`, `seed`, `model_id`) so later parts (EDA, embeddings, and the app) could load and reuse the dataset easily.
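The bookkeeping side of that loop can be sketched as below. This is a hypothetical, minimal version: the actual `stabilityai/sd-turbo` call is stubbed out, the prompt template and seed scheme are illustrative, and the label list is truncated, but the folder layout and `metadata.csv` schema match the text.

```python
import csv
import os

LABELS = ["DenseForest", "SnowIce", "Grassland"]  # 30 labels in the real run
IMAGES_PER_LABEL = 2                              # 50 in the real run
MODEL_ID = "stabilityai/sd-turbo"

def generate_image(prompt, seed):
    """Placeholder for the diffusers call, e.g. pipe(prompt, ...).images[0]."""
    return None

def build_dataset(root="images", meta_path="metadata.csv"):
    rows, idx = [], 0
    for li, label in enumerate(LABELS):
        os.makedirs(os.path.join(root, label), exist_ok=True)
        prompt = f"satellite view of {label} terrain"  # per-label prompt (illustrative)
        for i in range(IMAGES_PER_LABEL):
            seed = 1000 * li + i                       # reproducible seed scheme
            filename = f"{label}/{label}_{i:03d}.jpg"
            generate_image(prompt, seed)               # image.save(...) would go here
            rows.append({"id": idx, "filename": filename, "label": label,
                         "prompt": prompt, "seed": seed, "model_id": MODEL_ID})
            idx += 1
    # Write the metadata.csv schema documented above.
    with open(meta_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows

rows = build_dataset(root="demo_images", meta_path="demo_metadata.csv")
```

Later parts only need to read `metadata.csv` and join on `filename`, which is why the schema is kept flat.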

### Key terms used
- **Diffusers:** Hugging Face library providing ready-to-use pipelines for diffusion-based generative models (e.g., Stable Diffusion). It loads the model and generates images from prompts.
- **Transformers:** Hugging Face library for Transformer-based models across text and vision. Used both as a dependency and later for embedding models (CLIP/ViT/DINOv2).
- **Tokenizers:** Converts text prompts into tokens/IDs the model can process; required for text-conditioned models (e.g., text-to-image).
- **Pillow (PIL):** Python imaging library for loading/manipulating/saving images (JPG/PNG), resizing, and file I/O.
- **`stabilityai/sd-turbo`:** Chosen because it is optimized for **speed** and can generate strong results with **1–2 inference steps**, enabling fast large-scale dataset creation.

---

## Part 2 β€” Exploratory Data Analysis (EDA)

- **Loaded and inspected metadata:** Read `metadata.csv` (1500 rows) with expected columns (`id`, `filename`, `label`, `prompt`, `seed`, `model_id`) and confirmed **30 classes**.
- **Integrity validation:** Verified **0 missing image files**, **0 duplicate ids**, **0 duplicate filenames**, and **0 duplicate (label, seed)** pairs.
- **Class balance check:** Confirmed a perfectly balanced dataset with **50 images per label** (min/max = 50/50).
- **Image consistency:** Confirmed all images have the same resolution (**384Γ—384**).
- **Global image statistics:** Computed per-image RGB mean/std, **brightness** (luminance proxy), and a **sharpness proxy** (gradient-based), then reviewed distributions and summaries.
- **Outlier analysis:** Observed meaningful extremes consistent with labels:
  - darkest samples mainly **DenseForest**
  - brightest samples mainly **SnowIce**
  - lowest-sharpness samples often from smoother-texture classes like **Grassland / DesertSand / SeaOpenWater**
- **Class-level insights:** Aggregated statistics by label (brightness/color tendencies) and used a simple **PCA projection** to visualize similarity/overlap between visually related classes.
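The integrity and balance checks above are a few lines of pandas. A minimal sketch, run here on a tiny hand-made frame standing in for the real `metadata.csv`:

```python
import pandas as pd

meta = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "label": ["DenseForest", "DenseForest", "SnowIce", "SnowIce"],
    "seed": [0, 1, 0, 1],
})

dup_ids = meta["id"].duplicated().sum()                      # duplicate ids
dup_files = meta["filename"].duplicated().sum()              # duplicate filenames
dup_label_seed = meta.duplicated(subset=["label", "seed"]).sum()  # (label, seed) pairs
counts = meta["label"].value_counts()
balanced = counts.min() == counts.max()                      # perfect balance check
```

In the real run all three duplicate counts were 0 and `counts.min() == counts.max() == 50`.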

---

## Part 3 β€” Embeddings (Similarity Search)

- **Goal:** Convert each satellite patch image into a compact vector (embedding) to enable **similarity search / retrieval** and support the later app pipeline.
- **Models tested (HF backbones):**
  - **CLIP ViT-B/32** (`openai/clip-vit-base-patch32`)
  - **ViT-Base** (`google/vit-base-patch16-224-in21k`)
  - **DINOv2-Small** (`facebook/dinov2-small`)
- **Embedding extraction:**
  - Used the **CLS token** from `last_hidden_state` as a single global image representation (standard for ViT-style models).
  - Applied **L2-normalization** so cosine similarity becomes a fast dot product (stable and efficient retrieval).
- **Evaluation metric (retrieval-focused):** `label_agree@5` and `label_agree@10`
  - For each image, retrieve its **top-k nearest neighbors** (cosine similarity).
  - Measure the fraction of neighbors with the **same label** as the query.
  - Average across all 1,500 images.
  - This measures retrieval quality directly (not classifier accuracy).
- **Key results (quality + efficiency):**
  - **DINOv2-Small performed best:** `agree@5 β‰ˆ 0.9247`, `agree@10 β‰ˆ 0.9006`
  - Also produced **smaller embeddings** (384-dim) than CLIP/ViT (768-dim), reducing storage and improving retrieval efficiency.
  - Selected **DINOv2-Small** as the optimal embedding model.
- **Saved outputs (reusable):**
  - Embeddings: `*_embeddings.npy` (NumPy)
  - Metadata mapping: `*_metadata.csv` (CSV)
  - Comparison table: `embedding_model_comparison.csv` (CSV)
- **Qualitative validation:**
  - **PCA scatter plot** to visualize clustering in 2D (sanity check for overlap/separability).
  - **Nearest-neighbor gallery** to confirm retrieved results make sense visually and align with labels.
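The normalization trick and the `label_agree@k` metric can be sketched on toy data. The 6x4 matrix below stands in for the real 1500x384 DINOv2 embeddings; everything else follows the steps listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4)).astype(np.float32)  # toy embeddings
labels = np.array([0, 0, 0, 1, 1, 1])

emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows

def label_agree_at_k(emb, labels, k):
    sims = emb @ emb.T                   # cosine similarity = dot product
    np.fill_diagonal(sims, -np.inf)      # exclude the query itself
    topk = np.argsort(-sims, axis=1)[:, :k]          # k nearest neighbors
    agree = (labels[topk] == labels[:, None]).mean() # fraction with same label
    return float(agree)

score = label_agree_at_k(emb, labels, k=2)
```

On random vectors the score is near chance; on the real DINOv2 embeddings it reached about 0.92 at k=5.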

---

## Part 4 β€” End-to-End Pipeline (Retrieve + Generate)

- **Goal:** Build a production-style **Input β†’ Processing β†’ Output** pipeline that can be plugged directly into an app.  
  The user provides a satellite patch image plus a text prompt, and the system returns:
  1) **Most similar images from the dataset (retrieval)**  
  2) **Newly generated images** via **image-to-image** and **text-to-image**  
  with user-controlled counts (**0–5 each**).

- **System architecture: two engines working together**
  - **Retrieval engine (embedding-based):**
    - Embed the user image with **DINOv2-Small** (best model from Part 3).
    - Compare the query embedding against the stored embedding index:
      - `best_embeddings.npy` (vectors) + `best_metadata.csv` (filename/label mapping).
    - Compute similarity using **cosine similarity** (dot product due to L2 normalization).
    - Return **Top-K** results (K ≀ 5), each including image, label, similarity score, and filename.
  - **Generation engine (Diffusers):**
    - Use **`stabilityai/sd-turbo`** for fast generation (works well with 1–2 steps).
    - Support two generation modes:
      - **img2img:** generates variants that stay visually close to the user image, guided by the prompt.
      - **txt2img:** generates new images purely from the prompt.
    - User controls how many images to generate (0–5 each).

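The retrieval engine's query step reduces to a dot product plus an argsort, assuming index and query vectors are already L2-normalized as in Part 3. A toy 5x4 index stands in for `best_embeddings.npy` here:

```python
import numpy as np

rng = np.random.default_rng(1)
index = rng.normal(size=(5, 4))
index /= np.linalg.norm(index, axis=1, keepdims=True)  # normalized index rows
query = index[2] + 0.01 * rng.normal(size=4)           # query close to item 2
query /= np.linalg.norm(query)

def top_k(query, index, k=3):
    sims = index @ query                 # cosine similarity via dot product
    order = np.argsort(-sims)[:k]        # indices of the k best matches
    return [(int(i), float(sims[i])) for i in order]

hits = top_k(query, index, k=3)          # item 2 should come back first
```

In the app, each returned index is then mapped through `best_metadata.csv` to recover the filename and label.
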
- **Pipeline inputs:**
  - `user_img` β€” user-provided PIL image
  - `user_prompt` β€” user-provided prompt (required for generation)
  - `k_retrieve` β€” number of retrieved images (0–5)
  - `n_i2i`, `n_t2i` β€” generated image counts (0–5 each)
  - `strength_i2i` β€” img2img closeness (lower = closer to input)
  - `steps` β€” generation steps (sd-turbo typically 1–2)
  - `gen_size` β€” output size (e.g., 384 or 512)
  - `seed` β€” reproducibility

- **Stability safeguards (app-ready):**
  - Hard caps on counts (**0–5**) for retrieval and generation to prevent overload.
  - A **safe-step rule** for img2img to avoid the β€œ0 effective steps” Diffusers crash when strength is low.
  - GPU optimizations when available: **fp16 + `torch.autocast`** for speed.
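One way to implement the safe-step rule, assuming (as the Diffusers img2img pipeline does) that roughly `int(steps * strength)` denoising steps actually run, so the product must not round down to zero:

```python
import math

def safe_steps(requested_steps: int, strength: float) -> int:
    # img2img effectively runs about int(steps * strength) denoising steps;
    # if that rounds to 0 the pipeline fails. Raise the step count so at
    # least one effective step survives.
    min_steps = math.ceil(1.0 / max(strength, 1e-6))
    return max(requested_steps, min_steps)

safe_steps(2, 0.3)   # low strength forces extra steps
safe_steps(2, 0.6)   # already safe: int(2 * 0.6) = 1 effective step
```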

- **Reading the dataset directly from HF (course requirement):**
  - Instead of local files, dataset images are loaded using **`hf_hub_download`** from:
    - `LevyJonas/sat_land_patches`
  - A cache directory is used to avoid repeated downloads.

- **Pipeline outputs:**
  - `retrieved`: up to 5 retrieved items (PIL image, label, similarity, filename)
  - `gen_i2i`: up to 5 generated img2img images
  - `gen_t2i`: up to 5 generated txt2img images
  - `info`: summary dictionary (prompt, counts, steps/strength, dataset id, etc.)

- **Key takeaway:**
  - Part 4 combines **retrieval (real examples from the dataset)** with **generation (new synthetic variants)** in one workflow, and is modular/UI-ready for Part 5 (Gradio sliders + galleries).

---

## Part 5 β€” Application (HF Space with Gradio)

- **Goal:** Deploy an interactive application that demonstrates the full workflow:
  **Upload image + prompt β†’ retrieve similar examples β†’ generate new variants**.
  This turns the pipeline from Part 4 into a user-facing product-like demo.

- **Platform:** Hugging Face **Spaces** using **Gradio** (`app.py` as the entry point).

- **UI Inputs (user controls):**
  - **Image upload**: user provides a satellite patch (PIL image).
  - **Prompt textbox**: user writes the prompt (required for generation).
  - **Sliders (0–5)**:
    - `k_retrieve`: number of retrieved dataset images (0–5)
    - `n_i2i`: number of img2img generated images (0–5)
    - `n_t2i`: number of txt2img generated images (0–5)
  - **Generation settings**:
    - `strength_i2i`: controls how close img2img stays to the input (lower = closer)
    - `steps`: generation steps (1–2 recommended for sd-turbo)
    - `gen_size`: output size (384 or 512)
    - `seed`: reproducibility

- **Backend logic (connected to Part 4):**
  - `app.py` calls `run_search_and_generate(...)` from `pipeline.py`.
  - The pipeline:
    - Embeds the uploaded image (DINOv2-Small)
    - Retrieves Top-K similar images from the embedding index (`best_embeddings.npy` + `best_metadata.csv`)
    - Generates new images using `stabilityai/sd-turbo` with:
      - **img2img** conditioned on the uploaded image + prompt
      - **txt2img** conditioned on the prompt only

- **Outputs shown to the user:**
  - **Gallery 1 (Retrieved from dataset):** Top-K nearest neighbors with labels + cosine similarity scores.
  - **Gallery 2 (Generated img2img):** New image variants close to the uploaded input.
  - **Gallery 3 (Generated txt2img):** New images generated from the prompt.
  - **Summary panel:** displays the chosen parameters and pipeline metadata (counts, steps, strength, dataset id, etc.).

- **Course requirement: read directly from HF dataset repo**
  - Dataset images are loaded at runtime using `hf_hub_download` from:
    - `LevyJonas/sat_land_patches`
  - A local cache is used in the Space to avoid repeated downloads.

- **Deployment notes:**
  - For practical generation speed, the Space should run on **GPU** hardware.
  - Embedding files (`best_embeddings.npy`, `best_metadata.csv`) are stored in the Space repo so the app can start instantly.