---
title: Satellite Patch Retrieve + Generate
emoji: 🛰️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.50.0
python_version: '3.10'
app_file: app.py
pinned: false
---

Final Project Summary (Satellite Patch Retrieval + Generation)

This document summarizes Parts 1–5 of our project: dataset generation, EDA, embeddings, the end-to-end pipeline, and the Gradio app.


Part 1 β€” Synthetic Data Generation (with key terms)

In Part 1, we built a synthetic satellite-like image dataset using a pre-trained Hugging Face generative model. We used stabilityai/sd-turbo (a fast Stable Diffusion β€œTurbo” model) to generate 30 land-type classes with 50 images per class (1500 images total). Each label had its own prompt (e.g., forest, water, urban, runway), and we used a negative prompt to reduce unwanted artifacts such as text, logos, or cartoonish styles. The images were saved in a clean folder structure (images/<label>/...jpg) and documented in metadata.csv (id, filename, label, prompt, seed, model_id) so later parts (EDA, embeddings, and the app) could load and reuse the dataset easily.

Key terms used

  • Diffusers: Hugging Face library providing ready-to-use pipelines for diffusion-based generative models (e.g., Stable Diffusion). It loads the model and generates images from prompts.
  • Transformers: Hugging Face library for Transformer-based models across text and vision. Used both as a dependency and later for embedding models (CLIP/ViT/DINOv2).
  • Tokenizers: Converts text prompts into tokens/IDs the model can process; required for text-conditioned models (e.g., text-to-image).
  • Pillow (PIL): Python imaging library for loading/manipulating/saving images (JPG/PNG), resizing, and file I/O.
  • stabilityai/sd-turbo: Chosen because it is optimized for speed and can generate strong results with 1–2 inference steps, enabling fast large-scale dataset creation.
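The Part 1 generation loop can be sketched as follows. This is a minimal, hedged sketch: the three prompts shown are illustrative stand-ins for the real 30-label prompt set, and `generate_dataset` (which needs a GPU and the diffusers library) is only indicative of how the pipeline was driven; the helper `metadata_row` mirrors the documented `metadata.csv` schema.

```python
import csv
from pathlib import Path

# Illustrative prompts for three of the 30 labels (the real prompt set is larger).
PROMPTS = {
    "forest": "satellite image of dense forest, top-down aerial view, realistic",
    "water": "satellite image of open water, top-down aerial view, realistic",
    "urban": "satellite image of a dense urban area, top-down aerial view, realistic",
}
NEGATIVE = "text, watermark, logo, cartoon, illustration, border"

def metadata_row(idx, label, seed, model_id="stabilityai/sd-turbo"):
    """One row of metadata.csv, matching the documented schema."""
    return {
        "id": idx,
        "filename": f"images/{label}/{label}_{seed:04d}.jpg",
        "label": label,
        "prompt": PROMPTS[label],
        "seed": seed,
        "model_id": model_id,
    }

def generate_dataset(per_label=50, size=384):
    """Generate images with sd-turbo and write metadata.csv (needs GPU + diffusers)."""
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sd-turbo", torch_dtype=torch.float16
    ).to("cuda")
    rows = []
    for label in PROMPTS:
        for seed in range(per_label):
            row = metadata_row(len(rows), label, seed)
            image = pipe(
                prompt=row["prompt"],
                negative_prompt=NEGATIVE,  # only takes effect when guidance_scale > 1
                num_inference_steps=2,     # sd-turbo: 1-2 steps suffice
                height=size, width=size,
                generator=torch.Generator("cuda").manual_seed(seed),
            ).images[0]
            path = Path(row["filename"])
            path.parent.mkdir(parents=True, exist_ok=True)
            image.save(path)
            rows.append(row)
    with open("metadata.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```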

Part 2 β€” Exploratory Data Analysis (EDA)

  • Loaded and inspected metadata: Read metadata.csv (1500 rows) with expected columns (id, filename, label, prompt, seed, model_id) and confirmed 30 classes.
  • Integrity validation: Verified 0 missing image files, 0 duplicate ids, 0 duplicate filenames, and 0 duplicate (label, seed) pairs.
  • Class balance check: Confirmed a perfectly balanced dataset with 50 images per label (min/max = 50/50).
  • Image consistency: Confirmed all images have the same resolution (384Γ—384).
  • Global image statistics: Computed per-image RGB mean/std, brightness (luminance proxy), and a sharpness proxy (gradient-based), then reviewed distributions and summaries.
  • Outlier analysis: Observed meaningful extremes consistent with labels:
    • darkest samples mainly DenseForest
    • brightest samples mainly SnowIce
    • lowest-sharpness samples often from smoother-texture classes like Grassland / DesertSand / SeaOpenWater
  • Class-level insights: Aggregated statistics by label (brightness/color tendencies) and used a simple PCA projection to visualize similarity/overlap between visually related classes.
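The per-image statistics above could be computed along these lines; this is a sketch under stated assumptions (Rec. 601 luminance weights for the brightness proxy, mean gradient magnitude for the sharpness proxy), not necessarily the exact formulas used in the notebook.

```python
import numpy as np

def image_stats(rgb):
    """Per-image stats for a uint8 HxWx3 array: RGB mean/std,
    a luminance-based brightness proxy, and a gradient-based sharpness proxy."""
    x = rgb.astype(np.float32) / 255.0
    channel_mean = x.mean(axis=(0, 1))
    channel_std = x.std(axis=(0, 1))
    # Brightness proxy: Rec. 601 luminance weights.
    lum = 0.299 * x[..., 0] + 0.587 * x[..., 1] + 0.114 * x[..., 2]
    # Sharpness proxy: mean gradient magnitude of the luminance channel.
    gy, gx = np.gradient(lum)
    return {
        "mean": channel_mean,
        "std": channel_std,
        "brightness": float(lum.mean()),
        "sharpness": float(np.hypot(gx, gy).mean()),
    }
```

Smooth-texture classes (e.g., Grassland, SeaOpenWater) score low on the sharpness proxy because their luminance gradients are small almost everywhere.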

Part 3 β€” Embeddings (Similarity Search)

  • Goal: Convert each satellite patch image into a compact vector (embedding) to enable similarity search / retrieval and support the later app pipeline.
  • Models tested (HF backbones):
    • CLIP ViT-B/32 (openai/clip-vit-base-patch32)
    • ViT-Base (google/vit-base-patch16-224-in21k)
    • DINOv2-Small (facebook/dinov2-small)
  • Embedding extraction:
    • Used the CLS token from last_hidden_state as a single global image representation (standard for ViT-style models).
    • Applied L2-normalization so cosine similarity becomes a fast dot product (stable and efficient retrieval).
  • Evaluation metric (retrieval-focused): label_agree@5 and label_agree@10
    • For each image, retrieve its top-k nearest neighbors (cosine similarity).
    • Measure the fraction of neighbors with the same label as the query.
    • Average across all 1,500 images.
    • This measures retrieval quality directly (not classifier accuracy).
  • Key results (quality + efficiency):
    • DINOv2-Small performed best: agree@5 β‰ˆ 0.9247, agree@10 β‰ˆ 0.9006
    • Also produced smaller embeddings (384-dim) than CLIP/ViT (768-dim), reducing storage and improving retrieval efficiency.
    • Selected DINOv2-Small as the optimal embedding model.
  • Saved outputs (reusable):
    • Embeddings: *_embeddings.npy (NumPy)
    • Metadata mapping: *_metadata.csv (CSV)
    • Comparison table: embedding_model_comparison.csv (CSV)
  • Qualitative validation:
    • PCA scatter plot to visualize clustering in 2D (sanity check for overlap/separability).
    • Nearest-neighbor gallery to confirm retrieved results make sense visually and align with labels.
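The extraction and metric described above can be sketched as follows. `embed_images` assumes the standard transformers `AutoModel` interface (it downloads the model when called, so it is shown only as a sketch); `l2_normalize` and `label_agree_at_k` are self-contained and match the CLS-token / cosine-similarity setup described.

```python
import numpy as np

def embed_images(images, model_name="facebook/dinov2-small"):
    """CLS-token embeddings via transformers (downloads the model on first call)."""
    import torch
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    with torch.no_grad():
        inputs = processor(images=images, return_tensors="pt")
        cls = model(**inputs).last_hidden_state[:, 0]  # CLS token = global image vector
    return cls.numpy()

def l2_normalize(emb, eps=1e-12):
    """Row-wise L2 normalization so cosine similarity becomes a dot product."""
    return emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), eps)

def label_agree_at_k(emb, labels, k=5):
    """Mean fraction of top-k neighbors (excluding self) sharing the query's label."""
    emb = l2_normalize(emb)
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-matches
    labels = np.asarray(labels)
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float((labels[topk] == labels[:, None]).mean())
```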

Part 4 β€” End-to-End Pipeline (Retrieve + Generate)

  • Goal: Build a production-style Input β†’ Processing β†’ Output pipeline that can be plugged directly into an app.
    The user provides a satellite patch image plus a text prompt, and the system returns:

    1. Most similar images from the dataset (retrieval)
    2. Newly generated images via image-to-image and text-to-image, with user-controlled counts (0–5 each).
  • System architecture: two engines working together

    • Retrieval engine (embedding-based):
      • Embed the user image with DINOv2-Small (best model from Part 3).
      • Compare the query embedding against the stored embedding index:
        • best_embeddings.npy (vectors) + best_metadata.csv (filename/label mapping).
      • Compute similarity using cosine similarity (dot product due to L2 normalization).
      • Return Top-K results (K ≀ 5), each including image, label, similarity score, and filename.
    • Generation engine (Diffusers):
      • Use stabilityai/sd-turbo for fast generation (works well with 1–2 steps).
      • Support two generation modes:
        • img2img: generates variants that stay visually close to the user image, guided by the prompt.
        • txt2img: generates new images purely from the prompt.
      • User controls how many images to generate (0–5 each).
  • Pipeline inputs:

    • user_img β€” user-provided PIL image
    • user_prompt β€” user-provided prompt (required for generation)
    • k_retrieve β€” number of retrieved images (0–5)
    • n_i2i, n_t2i β€” generated image counts (0–5 each)
    • strength_i2i β€” img2img closeness (lower = closer to input)
    • steps β€” generation steps (sd-turbo typically 1–2)
    • gen_size β€” output size (e.g., 384 or 512)
    • seed β€” reproducibility
  • Stability safeguards (app-ready):

    • Hard caps on counts (0–5) for retrieval and generation to prevent overload.
    • A safe-step rule for img2img to avoid the β€œ0 effective steps” Diffusers crash when strength is low.
    • GPU optimizations when available: fp16 + torch.autocast for speed.
  • Reading the dataset directly from HF (course requirement):

    • Instead of local files, dataset images are loaded using hf_hub_download from:
      • LevyJonas/sat_land_patches
    • A cache directory is used to avoid repeated downloads.
  • Pipeline outputs:

    • retrieved: up to 5 retrieved items (PIL image, label, similarity, filename)
    • gen_i2i: up to 5 generated img2img images
    • gen_t2i: up to 5 generated txt2img images
    • info: summary dictionary (prompt, counts, steps/strength, dataset id, etc.)
  • Key takeaway:

    • Part 4 combines retrieval (real examples from the dataset) with generation (new synthetic variants) in one workflow, and is modular/UI-ready for Part 5 (Gradio sliders + galleries).
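Two of the pieces above are small enough to sketch directly: Top-K retrieval over the L2-normalized index, and the safe-step rule. The sketch assumes diffusers img2img runs roughly `int(steps * strength)` denoising steps, which is why low strength can yield zero effective steps; function names here are illustrative, not the exact ones in pipeline.py.

```python
import numpy as np

def retrieve_top_k(query_emb, index_emb, k=5):
    """Top-K by cosine similarity; both sides assumed L2-normalized,
    so similarity is a plain dot product. K is hard-capped at 5."""
    k = max(0, min(k, 5))
    sims = index_emb @ query_emb
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

def safe_steps(steps, strength):
    """img2img runs about int(steps * strength) denoising steps; raise `steps`
    so at least one effective step remains and Diffusers does not crash."""
    if strength <= 0:
        return steps                          # no denoising requested
    min_steps = int(np.ceil(1.0 / strength))  # smallest steps with >= 1 effective step
    return max(steps, min_steps)
```

For example, `steps=2, strength=0.3` would give `int(0.6) = 0` effective steps, so the rule bumps `steps` to 4.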

Part 5 β€” Application (HF Space with Gradio)

  • Goal: Deploy an interactive application that demonstrates the full workflow: Upload image + prompt β†’ retrieve similar examples β†’ generate new variants. This turns the pipeline from Part 4 into a user-facing product-like demo.

  • Platform: Hugging Face Spaces using Gradio (app.py as the entry point).

  • UI Inputs (user controls):

    • Image upload: user provides a satellite patch (PIL image).
    • Prompt textbox: user writes the prompt (required for generation).
    • Sliders (0–5):
      • k_retrieve: number of retrieved dataset images (0–5)
      • n_i2i: number of img2img generated images (0–5)
      • n_t2i: number of txt2img generated images (0–5)
    • Generation settings:
      • strength_i2i: controls how close img2img stays to the input (lower = closer)
      • steps: generation steps (1–2 recommended for sd-turbo)
      • gen_size: output size (384 or 512)
      • seed: reproducibility
  • Backend logic (connected to Part 4):

    • app.py calls run_search_and_generate(...) from pipeline.py.
    • The pipeline:
      • Embeds the uploaded image (DINOv2-Small)
      • Retrieves Top-K similar images from the embedding index (best_embeddings.npy + best_metadata.csv)
      • Generates new images using stabilityai/sd-turbo with:
        • img2img conditioned on the uploaded image + prompt
        • txt2img conditioned on the prompt only
  • Outputs shown to the user:

    • Gallery 1 (Retrieved from dataset): Top-K nearest neighbors with labels + cosine similarity scores.
    • Gallery 2 (Generated img2img): New image variants close to the uploaded input.
    • Gallery 3 (Generated txt2img): New images generated from the prompt.
    • Summary panel: displays the chosen parameters and pipeline metadata (counts, steps, strength, dataset id, etc.).
  • Course requirement: read directly from HF dataset repo

    • Dataset images are loaded at runtime using hf_hub_download from:
      • LevyJonas/sat_land_patches
    • A local cache is used in the Space to avoid repeated downloads.
  • Deployment notes:

    • For practical generation speed, the Space should run on GPU hardware.
    • Embedding files (best_embeddings.npy, best_metadata.csv) are stored in the Space repo so the app can start instantly.
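A minimal app.py skeleton along these lines would wire the UI to the pipeline. This is a hedged sketch: it assumes `run_search_and_generate` returns a dict with the `retrieved` / `gen_i2i` / `gen_t2i` / `info` keys listed in Part 4 and that retrieved items expose image/label/similarity fields; the real app may structure this differently. The gradio import is deferred into `build_demo` so the slider spec is importable on its own.

```python
SLIDERS = {
    # name: (min, max, default) -- the 0-5 hard caps from the pipeline
    "k_retrieve": (0, 5, 3),
    "n_i2i": (0, 5, 2),
    "n_t2i": (0, 5, 2),
}

def build_demo(run_search_and_generate):
    """Wire the Part 4 pipeline function into a Gradio Blocks UI (requires gradio)."""
    import gradio as gr

    def run(img, prompt, k, n_i2i, n_t2i, strength, steps, size, seed):
        out = run_search_and_generate(
            user_img=img, user_prompt=prompt, k_retrieve=int(k),
            n_i2i=int(n_i2i), n_t2i=int(n_t2i), strength_i2i=float(strength),
            steps=int(steps), gen_size=int(size), seed=int(seed),
        )
        retrieved = [(item["image"], f'{item["label"]} ({item["similarity"]:.3f})')
                     for item in out["retrieved"]]
        return retrieved, out["gen_i2i"], out["gen_t2i"], out["info"]

    with gr.Blocks(title="Satellite Patch Retrieve + Generate") as demo:
        img = gr.Image(type="pil", label="Satellite patch")
        prompt = gr.Textbox(label="Prompt (required for generation)")
        sliders = [gr.Slider(lo, hi, value=v, step=1, label=name)
                   for name, (lo, hi, v) in SLIDERS.items()]
        strength = gr.Slider(0.1, 0.9, value=0.5, label="strength_i2i (lower = closer)")
        steps = gr.Slider(1, 4, value=2, step=1, label="steps")
        size = gr.Radio([384, 512], value=384, label="gen_size")
        seed = gr.Number(value=0, label="seed", precision=0)
        g1 = gr.Gallery(label="Retrieved from dataset")
        g2 = gr.Gallery(label="Generated (img2img)")
        g3 = gr.Gallery(label="Generated (txt2img)")
        info = gr.JSON(label="Summary")
        gr.Button("Run").click(
            run,
            inputs=[img, prompt, *sliders, strength, steps, size, seed],
            outputs=[g1, g2, g3, info],
        )
    return demo
```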