---
title: Satellite Patch Retrieve + Generate
emoji: 🛰️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.50.0
python_version: '3.10'
app_file: app.py
pinned: false
---
# Final Project Summary (Satellite Patch Retrieval + Generation)
This document summarizes Parts 1–5 of our project: dataset generation, EDA, embeddings, the end-to-end pipeline, and the Gradio app.
## Part 1 – Synthetic Data Generation (with key terms)
In Part 1, we built a synthetic satellite-like image dataset using a pre-trained Hugging Face generative model. We used `stabilityai/sd-turbo` (a fast Stable Diffusion "Turbo" model) to generate 30 land-type classes with 50 images per class (1,500 images total). Each label had its own prompt (e.g., forest, water, urban, runway), and we used a negative prompt to reduce unwanted artifacts such as text, logos, or cartoonish styles. The images were saved in a clean folder structure (`images/<label>/...jpg`) and documented in `metadata.csv` (id, filename, label, prompt, seed, model_id) so later parts (EDA, embeddings, and the app) could load and reuse the dataset easily.
### Key terms used
- Diffusers: Hugging Face library providing ready-to-use pipelines for diffusion-based generative models (e.g., Stable Diffusion). It loads the model and generates images from prompts.
- Transformers: Hugging Face library for Transformer-based models across text and vision. Used both as a dependency and later for embedding models (CLIP/ViT/DINOv2).
- Tokenizers: Converts text prompts into tokens/IDs the model can process; required for text-conditioned models (e.g., text-to-image).
- Pillow (PIL): Python imaging library for loading/manipulating/saving images (JPG/PNG), resizing, and file I/O.
- `stabilityai/sd-turbo`: Chosen because it is optimized for speed and can generate strong results with 1–2 inference steps, enabling fast large-scale dataset creation.
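The Part 1 generation loop can be sketched roughly as below. This is a minimal illustration, not the project's exact code: the prompt table, seed scheme, and file-naming helper are assumptions, and only two of the 30 labels are shown.

```python
import csv
import os

# Illustrative prompt table -- the real project used 30 labels x 50 images each.
PROMPTS = {
    "forest": "aerial satellite view of a dense forest canopy, top-down",
    "water": "aerial satellite view of open water, top-down",
}
NEGATIVE = "text, logo, watermark, cartoon, illustration"

def metadata_row(idx, label, prompt, seed, model_id="stabilityai/sd-turbo"):
    """Build one row of metadata.csv: id, filename, label, prompt, seed, model_id."""
    filename = f"images/{label}/{label}_{idx:04d}.jpg"
    return {"id": idx, "filename": filename, "label": label,
            "prompt": prompt, "seed": seed, "model_id": model_id}

def generate_dataset(images_per_label=50):
    """Generation loop; not run on import (needs diffusers, torch, and a GPU)."""
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sd-turbo", torch_dtype=torch.float16).to("cuda")
    rows, idx = [], 0
    for li, (label, prompt) in enumerate(sorted(PROMPTS.items())):
        os.makedirs(f"images/{label}", exist_ok=True)
        for i in range(images_per_label):
            seed = li * 1000 + i  # deterministic per (label, image) pair
            gen = torch.Generator("cuda").manual_seed(seed)
            image = pipe(prompt, negative_prompt=NEGATIVE,
                         num_inference_steps=2, guidance_scale=0.0,
                         height=384, width=384, generator=gen).images[0]
            row = metadata_row(idx, label, prompt, seed)
            image.save(row["filename"], quality=95)
            rows.append(row)
            idx += 1
    with open("metadata.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

Note that with sd-turbo's recommended `guidance_scale=0.0` the negative prompt has limited effect; the project may have used a nonzero guidance scale to make it bite.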
## Part 2 – Exploratory Data Analysis (EDA)
- Loaded and inspected metadata: Read `metadata.csv` (1,500 rows) with the expected columns (id, filename, label, prompt, seed, model_id) and confirmed 30 classes.
- Integrity validation: Verified 0 missing image files, 0 duplicate ids, 0 duplicate filenames, and 0 duplicate (label, seed) pairs.
- Class balance check: Confirmed a perfectly balanced dataset with 50 images per label (min/max = 50/50).
- Image consistency: Confirmed all images have the same resolution (384×384).
- Global image statistics: Computed per-image RGB mean/std, brightness (luminance proxy), and a sharpness proxy (gradient-based), then reviewed distributions and summaries.
- Outlier analysis: Observed meaningful extremes consistent with labels:
- darkest samples mainly DenseForest
- brightest samples mainly SnowIce
- lowest-sharpness samples often from smoother-texture classes like Grassland / DesertSand / SeaOpenWater
- Class-level insights: Aggregated statistics by label (brightness/color tendencies) and used a simple PCA projection to visualize similarity/overlap between visually related classes.
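The per-image statistics above can be sketched with NumPy. The Rec. 601 luminance weights and the gradient-magnitude sharpness proxy are one reasonable choice, not necessarily the exact formulas the notebook used:

```python
import numpy as np

def image_stats(rgb):
    """Per-image stats for an HxWx3 uint8 array: RGB mean/std, a brightness
    (luminance) proxy, and a gradient-based sharpness proxy."""
    arr = rgb.astype(np.float64)
    rgb_mean = arr.mean(axis=(0, 1))   # per-channel means
    rgb_std = arr.std(axis=(0, 1))     # per-channel stds
    # Rec. 601 luminance as the brightness proxy.
    lum = 0.299 * arr[..., 0] + 0.587 * arr[..., 1] + 0.114 * arr[..., 2]
    # Sharpness proxy: mean gradient magnitude of the luminance channel
    # (smooth textures like grassland or open water score low here).
    gy, gx = np.gradient(lum)
    return {"rgb_mean": rgb_mean, "rgb_std": rgb_std,
            "brightness": lum.mean(), "sharpness": np.hypot(gx, gy).mean()}
```

On a uniform patch the sharpness proxy is exactly zero, which matches the intuition behind the low-sharpness outliers above.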
## Part 3 – Embeddings (Similarity Search)
- Goal: Convert each satellite patch image into a compact vector (embedding) to enable similarity search / retrieval and support the later app pipeline.
- Models tested (HF backbones):
  - CLIP ViT-B/32 (`openai/clip-vit-base-patch32`)
  - ViT-Base (`google/vit-base-patch16-224-in21k`)
  - DINOv2-Small (`facebook/dinov2-small`)
- Embedding extraction:
  - Used the CLS token from `last_hidden_state` as a single global image representation (standard for ViT-style models).
  - Applied L2 normalization so cosine similarity becomes a fast dot product (stable and efficient retrieval).
- Evaluation metric (retrieval-focused):
label_agree@5andlabel_agree@10- For each image, retrieve its top-k nearest neighbors (cosine similarity).
- Measure the fraction of neighbors with the same label as the query.
- Average across all 1,500 images.
- This measures retrieval quality directly (not classifier accuracy).
- Key results (quality + efficiency):
  - DINOv2-Small performed best: `agree@5 ≈ 0.9247`, `agree@10 ≈ 0.9006`.
  - It also produced smaller embeddings (384-dim) than CLIP/ViT (768-dim), reducing storage and improving retrieval efficiency.
  - Selected DINOv2-Small as the optimal embedding model.
- Saved outputs (reusable):
  - Embeddings: `*_embeddings.npy` (NumPy)
  - Metadata mapping: `*_metadata.csv` (CSV)
  - Comparison table: `embedding_model_comparison.csv` (CSV)
- Qualitative validation:
- PCA scatter plot to visualize clustering in 2D (sanity check for overlap/separability).
- Nearest-neighbor gallery to confirm retrieved results make sense visually and align with labels.
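The `label_agree@k` metric is simple to state in code. A NumPy sketch follows; the vectorized top-k and the self-match exclusion are my assumptions about the implementation details:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Row-wise L2 normalization so cosine similarity reduces to a dot product."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def label_agree_at_k(emb, labels, k=5):
    """Mean fraction of each query's top-k neighbors (self excluded) that share
    the query's label, averaged over all rows."""
    emb = l2_normalize(emb)
    labels = np.asarray(labels)
    sims = emb @ emb.T                       # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)          # never retrieve the query itself
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest neighbors
    return float((labels[topk] == labels[:, None]).mean())
```

With two cleanly separated clusters this returns 1.0, and it degrades toward the class prior as clusters overlap, which is exactly what the comparison table measured.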
## Part 4 – End-to-End Pipeline (Retrieve + Generate)
Goal: Build a production-style Input → Processing → Output pipeline that can be plugged directly into an app. The user provides a satellite patch image plus a text prompt, and the system returns:

- The most similar images from the dataset (retrieval)
- Newly generated images via image-to-image and text-to-image, with user-controlled counts (0–5 each)
System architecture: two engines working together
- Retrieval engine (embedding-based):
  - Embed the user image with DINOv2-Small (the best model from Part 3).
  - Compare the query embedding against the stored embedding index: `best_embeddings.npy` (vectors) + `best_metadata.csv` (filename/label mapping).
  - Compute similarity using cosine similarity (a dot product, thanks to L2 normalization).
  - Return the Top-K results (K ≤ 5), each including image, label, similarity score, and filename.
- Generation engine (Diffusers):
  - Use `stabilityai/sd-turbo` for fast generation (works well with 1–2 steps).
  - Support two generation modes:
    - img2img: generates variants that stay visually close to the user image, guided by the prompt.
    - txt2img: generates new images purely from the prompt.
  - The user controls how many images to generate (0–5 each).
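The retrieval engine can be sketched as below, assuming the index vectors in `best_embeddings.npy` were saved already L2-normalized (as in Part 3). The query-embedding step is kept inside a function since it needs `transformers` and downloads model weights:

```python
import numpy as np

def retrieve_topk(query_vec, index_vecs, metadata_rows, k=5):
    """Cosine top-k against a pre-normalized index; returns (row, score) pairs."""
    q = np.asarray(query_vec, dtype=np.float64)
    q = q / (np.linalg.norm(q) + 1e-12)
    sims = index_vecs @ q                               # dot product == cosine here
    order = np.argsort(-sims)[:max(0, min(int(k), 5))]  # hard cap K <= 5
    return [(metadata_rows[i], float(sims[i])) for i in order]

def embed_query(pil_image):
    """CLS embedding of the user image; not run on import (loads DINOv2-Small)."""
    import torch
    from transformers import AutoImageProcessor, AutoModel
    proc = AutoImageProcessor.from_pretrained("facebook/dinov2-small")
    model = AutoModel.from_pretrained("facebook/dinov2-small").eval()
    with torch.no_grad():
        out = model(**proc(images=pil_image, return_tensors="pt"))
    vec = out.last_hidden_state[0, 0].numpy()  # CLS token
    return vec / (np.linalg.norm(vec) + 1e-12)
```

`metadata_rows` here stands in for the rows of `best_metadata.csv`; the actual pipeline also attaches the PIL image and label to each hit.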
Pipeline inputs:

- `user_img` – user-provided PIL image
- `user_prompt` – user-provided prompt (required for generation)
- `k_retrieve` – number of retrieved images (0–5)
- `n_i2i`, `n_t2i` – generated image counts (0–5 each)
- `strength_i2i` – img2img closeness (lower = closer to input)
- `steps` – generation steps (sd-turbo typically 1–2)
- `gen_size` – output size (e.g., 384 or 512)
- `seed` – reproducibility
Stability safeguards (app-ready):
- Hard caps on counts (0–5) for retrieval and generation to prevent overload.
- A safe-step rule for img2img to avoid the "0 effective steps" Diffusers crash when strength is low.
- GPU optimizations when available: fp16 + `torch.autocast` for speed.
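The safe-step rule can be as small as this. The exact clamp values are my choice, but the invariant matches Diffusers' behavior: img2img runs roughly `int(steps * strength)` denoising steps and errors out when that hits zero.

```python
import math

def safe_i2i_steps(steps: int, strength: float, min_effective: int = 1) -> int:
    """Return a step count such that Diffusers' img2img, which runs roughly
    int(steps * strength) denoising steps, keeps at least `min_effective` steps."""
    strength = min(max(float(strength), 0.05), 1.0)  # clamp to a sane slider range
    needed = math.ceil(min_effective / strength)     # smallest count that survives scaling
    return max(int(steps), needed)
```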
Reading the dataset directly from HF (course requirement):

- Instead of local files, dataset images are loaded using `hf_hub_download` from `LevyJonas/sat_land_patches`.
- A cache directory is used to avoid repeated downloads.
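Loading one patch from the dataset repo might look like this; `repo_filename` mirrors the `images/<label>/...` layout from Part 1, and layering `lru_cache` on top of the hub's own `cache_dir` is my own assumption, not documented project code:

```python
from functools import lru_cache

DATASET_ID = "LevyJonas/sat_land_patches"

def repo_filename(label: str, fname: str) -> str:
    """Path of one patch inside the dataset repo (images/<label>/<file> layout)."""
    return f"images/{label}/{fname}"

@lru_cache(maxsize=None)
def fetch_image_path(label: str, fname: str, cache_dir: str = "./hf_cache") -> str:
    """Download once, then serve from cache_dir; lru_cache also skips repeat
    hub lookups within a session. Network call, so imported lazily."""
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=DATASET_ID, repo_type="dataset",
                           filename=repo_filename(label, fname),
                           cache_dir=cache_dir)
```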
Pipeline outputs:

- `retrieved`: up to 5 retrieved items (PIL image, label, similarity, filename)
- `gen_i2i`: up to 5 generated img2img images
- `gen_t2i`: up to 5 generated txt2img images
- `info`: summary dictionary (prompt, counts, steps/strength, dataset id, etc.)
Key takeaway:
- Part 4 combines retrieval (real examples from the dataset) with generation (new synthetic variants) in one workflow, and is modular/UI-ready for Part 5 (Gradio sliders + galleries).
## Part 5 – Application (HF Space with Gradio)
Goal: Deploy an interactive application that demonstrates the full workflow: upload image + prompt → retrieve similar examples → generate new variants. This turns the pipeline from Part 4 into a user-facing, product-like demo.
Platform: Hugging Face Spaces using Gradio (`app.py` as the entry point).

UI inputs (user controls):
- Image upload: user provides a satellite patch (PIL image).
- Prompt textbox: user writes the prompt (required for generation).
- Sliders (0–5):
  - `k_retrieve`: number of retrieved dataset images (0–5)
  - `n_i2i`: number of img2img generated images (0–5)
  - `n_t2i`: number of txt2img generated images (0–5)
- Generation settings:
  - `strength_i2i`: controls how close img2img stays to the input (lower = closer)
  - `steps`: generation steps (1–2 recommended for sd-turbo)
  - `gen_size`: output size (384 or 512)
  - `seed`: reproducibility
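A sketch of how these controls could be wired up in Gradio Blocks. Component defaults and ordering are illustrative, and `run_fn` is assumed to return the four outputs (retrieved gallery, img2img gallery, txt2img gallery, info dict) in that order:

```python
def clamp_count(n, lo=0, hi=5):
    """Hard cap for the app's 0-5 count sliders."""
    return max(lo, min(hi, int(n)))

def build_demo(run_fn):
    """Assemble the UI; not run on import (needs the `gradio` package).

    run_fn(user_img, user_prompt, k, n_i2i, n_t2i, strength, steps, gen_size, seed)
    must return (retrieved_gallery, i2i_gallery, t2i_gallery, info_dict).
    """
    import gradio as gr

    with gr.Blocks(title="Satellite Patch Retrieve + Generate") as demo:
        img = gr.Image(type="pil", label="Satellite patch")
        prompt = gr.Textbox(label="Prompt (required for generation)")
        k = gr.Slider(0, 5, value=3, step=1, label="k_retrieve")
        n_i2i = gr.Slider(0, 5, value=2, step=1, label="n_i2i")
        n_t2i = gr.Slider(0, 5, value=2, step=1, label="n_t2i")
        strength = gr.Slider(0.1, 0.9, value=0.5, label="strength_i2i")
        steps = gr.Slider(1, 4, value=2, step=1, label="steps")
        gen_size = gr.Radio([384, 512], value=384, label="gen_size")
        seed = gr.Number(value=0, precision=0, label="seed")
        run = gr.Button("Run")
        gal_ret = gr.Gallery(label="Retrieved from dataset")
        gal_i2i = gr.Gallery(label="Generated (img2img)")
        gal_t2i = gr.Gallery(label="Generated (txt2img)")
        info = gr.JSON(label="Summary")
        run.click(run_fn,
                  inputs=[img, prompt, k, n_i2i, n_t2i, strength, steps, gen_size, seed],
                  outputs=[gal_ret, gal_i2i, gal_t2i, info])
    return demo
```

`app.py` would then do something like `build_demo(run_search_and_generate).launch()`.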
Backend logic (connected to Part 4):

- `app.py` calls `run_search_and_generate(...)` from `pipeline.py`.
- The pipeline:
  - Embeds the uploaded image (DINOv2-Small)
  - Retrieves the Top-K similar images from the embedding index (`best_embeddings.npy` + `best_metadata.csv`)
  - Generates new images using `stabilityai/sd-turbo` with:
    - img2img conditioned on the uploaded image + prompt
    - txt2img conditioned on the prompt only
Outputs shown to the user:
- Gallery 1 (Retrieved from dataset): Top-K nearest neighbors with labels + cosine similarity scores.
- Gallery 2 (Generated img2img): New image variants close to the uploaded input.
- Gallery 3 (Generated txt2img): New images generated from the prompt.
- Summary panel: displays the chosen parameters and pipeline metadata (counts, steps, strength, dataset id, etc.).
Course requirement: read directly from the HF dataset repo

- Dataset images are loaded at runtime using `hf_hub_download` from `LevyJonas/sat_land_patches`.
- A local cache is used in the Space to avoid repeated downloads.
Deployment notes:
- For practical generation speed, the Space should run on GPU hardware.
- Embedding files (`best_embeddings.npy`, `best_metadata.csv`) are stored in the Space repo so the app can start instantly.