RSEdit: Text-Guided Image Editing for Remote Sensing

RSEdit is a unified framework that adapts pretrained text-to-image diffusion models into instruction-following editors for remote sensing (RS) imagery. By closing the gap in RS world knowledge and correcting the misalignment of conditioning schemes in pretrained models, RSEdit achieves precise, physically coherent edits while preserving geospatial content across scenarios such as urban growth, disaster impacts, and seasonal shifts.

[Paper] [Code] [Project Page]

RSEdit-UNet Text Encoder Ablation Models

This repository contains the UNet-based ablation models (text encoder variants) for RSEdit. These models use the standard InstructPix2Pix pipeline structure.

Quick Start

To generate an edited image using a pre-trained RSEdit UNet ablation model, you can use the diffusers library:

import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline, UNet2DConditionModel

# Example: DGTRS-CLIP-ViT-L-14 ablation model
# Each variant directory is self-contained with all components
checkpoint_path = "BiliSakura/RSEdit-UNet-text-ablation" 
variant = "DGTRS-CLIP-ViT-L-14"

# Load pipeline from checkpoint
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    checkpoint_path,
    subfolder=variant,
    torch_dtype=torch.bfloat16,
    safety_checker=None,
    requires_safety_checker=False,
)

# Optional: Override UNet with trained EMA weights if specifically required
# pipe.unet = UNet2DConditionModel.from_pretrained(
#     f"{checkpoint_path}/{variant}/checkpoint-30000/unet_ema",
#     torch_dtype=torch.bfloat16,
# )

pipe = pipe.to("cuda")

# Load source satellite image
source_image = Image.open("satellite_image.png").convert("RGB")

# Edit with instruction
prompt = "Flood the coastal area"
edited_image = pipe(
    prompt=prompt,
    image=source_image,
    num_inference_steps=50,
    guidance_scale=7.5,
    image_guidance_scale=1.5,
).images[0]

# Save result
edited_image.save("edited_image.png")
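The two scales above correspond to the dual classifier-free guidance used by InstructPix2Pix: `guidance_scale` weights the text condition and `image_guidance_scale` weights the source-image condition. A minimal NumPy sketch of how the three noise predictions combine (following the InstructPix2Pix formulation; the toy arrays stand in for real UNet outputs and are purely illustrative):

```python
import numpy as np

def combine_guidance(eps_uncond, eps_img, eps_full, s_txt=7.5, s_img=1.5):
    """Dual classifier-free guidance as in InstructPix2Pix.

    eps_uncond: prediction with neither text nor image conditioning
    eps_img:    prediction with image conditioning only
    eps_full:   prediction with both text and image conditioning
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))

# Illustrative toy tensors in place of real UNet noise predictions
e_u, e_i, e_f = np.zeros(4), np.ones(4), np.full(4, 2.0)
print(combine_guidance(e_u, e_i, e_f))  # each element: 0 + 1.5*1 + 7.5*1 = 9.0
```

Raising `image_guidance_scale` keeps the edit closer to the source image; raising `guidance_scale` makes the edit follow the instruction more aggressively.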

Model Structure

Each ablation model directory is self-contained and includes:

  • text_encoder/: The specific text encoder variant (e.g., CLIP, DGTRS).
  • tokenizer/: Associated tokenizer.
  • vae/: VAE component.
  • scheduler/: PNDM scheduler.
  • unet/: Base UNet weights.
  • checkpoint-30000/unet_ema/: Trained UNet EMA weights optimized for RS editing.
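Since each variant directory is self-contained, a small helper (hypothetical, not part of the repository) can sanity-check a locally downloaded variant for the components listed above before calling from_pretrained:

```python
from pathlib import Path

# Components every self-contained variant directory should contain
EXPECTED_COMPONENTS = ("text_encoder", "tokenizer", "vae", "scheduler", "unet")

def missing_components(variant_dir: str) -> list:
    """Return the expected component subdirectories absent from a variant directory."""
    root = Path(variant_dir)
    return [c for c in EXPECTED_COMPONENTS if not (root / c).is_dir()]

# Example: report anything missing in a local snapshot (path is illustrative)
missing = missing_components("RSEdit-UNet-text-ablation/DGTRS-CLIP-ViT-L-14")
if missing:
    print(f"Incomplete variant, missing: {missing}")
```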

Citation

If you find this work useful, please cite:

@misc{zhenyuan2026rsedittextguidedimageediting,
      title={RSEdit: Text-Guided Image Editing for Remote Sensing}, 
      author={Chen Zhenyuan and Zhang Zechuan and Zhang Feng},
      year={2026},
      eprint={2603.13708},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.13708}, 
}

Acknowledgments

This project builds upon Diffusers, Accelerate, and Stable Diffusion.
