# PANEL: A Domain-Specific Vision-Language Model for Photovoltaic Tasks in Remote Sensing

## Model Description
PANEL (PV-specific vision-lANguage modEL) is a domain-specific Vision-Language Model (VLM) tailored for large-scale photovoltaic (PV) mapping and interpretation in remote sensing (RS). Built upon the CLIP (ViT-B/16) architecture, PANEL is pre-trained on a curated worldwide PV vision-language dataset comprising over one million image-text pairs. It effectively aligns visual features of PV panels with diverse text prompts, enabling robust performance across varying spatial resolutions (0.1m to 20m) and complex urban/rural contexts.
PANEL is designed for:
- Zero-shot interpretation: Identifying and localizing PV panels without task-specific fine-tuning.
- Few-shot adaptation: Rapidly adapting to downstream tasks (e.g., segmentation) with minimal labels via the Knowledge Assistance Module (KAM).
## How to Use

### 1. Zero-Shot Inference (Semantic Localization Example)
For zero-shot tasks, PANEL leverages PANEL Surgery to reinforce vision-language alignment. Below is a simplified example of performing semantic localization to generate a similarity map for PV panels. For classification, similarity can be calculated using the cls_token.
```python
import torch
import panel
from PIL import Image
from torchvision.transforms import Compose, ToTensor, Normalize

# 1. Load the pre-trained PANEL model
# The 'panel' library should be installed/available in your environment
model_path = "PANEL-ViT-B-16_ImgSize256.pth"
model, _ = panel.load4panel(model_path, custom_resolution=256, device='cuda')
model.eval().to('cuda')

# 2. Prepare text prompts (ensemble of PV-related terms)
target_texts = ["PV panels", "solar panels", "photovoltaic modules", "solar arrays"]
prompt_templates = ['a remote sensing image of {}', 'a satellite imagery of {}', 'an aerial image of {}']

# Encode text features with the prompt ensemble and remove redundant features (Surgery)
text_features = panel.encode_text_with_prompt_ensemble(model, target_texts, 'cuda', prompt_templates)
redundant_features = panel.encode_text_with_prompt_ensemble(model, [""], 'cuda', prompt_templates)
valuable_text_features = text_features - redundant_features

# 3. Prepare the image
preprocess = Compose([ToTensor(), Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))])
image = Image.open("your_pv_image.tif").convert("RGB")
image_tensor = preprocess(image).unsqueeze(0).to('cuda')  # [B, C, H, W]

# 4. Inference
with torch.no_grad():
    # Extract patch-level image features
    image_features, _ = model.encode_image(image_tensor)  # [B, L, D]
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Compute the similarity map (drop the first token, which is the cls_token)
    similarity = (image_features @ valuable_text_features.t())[:, 1:, :]
    # Upsample to the input's spatial size (H, W), i.e. shape[2] and shape[3] of [B, C, H, W]
    similarity_map = panel.get_similarity_map(similarity, (image_tensor.shape[2], image_tensor.shape[3]))

# similarity_map now contains the localization priors for PV panels
print("Similarity map generated:", similarity_map.shape)
```
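The similarity map can then be thresholded into a coarse binary PV mask. The sketch below is a minimal post-processing example, not part of the PANEL API: the `[B, H, W, N]` map layout, the min-max normalization, and the 0.5 threshold are illustrative assumptions.

```python
import torch

def similarity_to_mask(similarity_map: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Threshold a [B, H, W, N] similarity map into a per-class binary mask.

    Illustrative helper (not the PANEL API): each map is min-max normalized
    to [0, 1] so a single threshold works across images and resolutions.
    """
    flat = similarity_map.flatten(1, 2)                         # [B, H*W, N]
    lo = flat.min(dim=1, keepdim=True).values.unsqueeze(1)      # [B, 1, 1, N]
    hi = flat.max(dim=1, keepdim=True).values.unsqueeze(1)      # [B, 1, 1, N]
    normed = (similarity_map - lo) / (hi - lo + 1e-6)
    return normed > threshold

# Example with a synthetic 1x8x8 map for a single "PV panels" class
demo = torch.zeros(1, 8, 8, 1)
demo[0, 2:6, 2:6, 0] = 1.0                 # a bright square where panels would be
mask = similarity_to_mask(demo)            # boolean mask, same shape as the map
coverage = mask.float().mean().item()      # fraction of pixels flagged as PV
```

On the synthetic example, the 4x4 bright square covers 16 of 64 pixels, so `coverage` is 0.25.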
### 2. Few-Shot Adaptation (with KAM)
For few-shot segmentation, we provide the Knowledge Assistance Module (KAM) to inject PANEL's vision-language priors into baseline models.
**Integration with `mmsegmentation`:**

To use KAM with a backbone (e.g., MiT), place the provided `mit_kam.py` file into the `mmseg/models/backbones/` directory. KAM interacts with the baseline via gated convolution and cross-attention fusion.
```python
# Example configuration snippet for mmsegmentation
model = dict(
    type='EncoderDecoder',
    backbone=dict(
        type='MixVisionTransformerKAM',  # MiT backbone integrated with KAM
        panel_priors=True,               # Enable PANEL prior injection
        pretrained='PANEL-ViT-B-16_ImgSize256.pth',
        ...),
    decode_head=dict(...),
)
```
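The released module lives in `mit_kam.py`; as a rough illustration of the gated-convolution plus cross-attention fusion pattern described above, here is a minimal sketch. All layer names, sizes, and wiring below are assumptions for exposition, not the actual KAM implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Illustrative gated-conv + cross-attention fusion block.

    NOT the released KAM code: only a sketch of the pattern in which
    backbone features attend to gated vision-language prior features.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Gated convolution: a sigmoid gate decides how much prior to let in
        self.gate = nn.Sequential(
            nn.Conv2d(dim * 2, dim, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        # Cross-attention: backbone features query the gated prior features
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # feat, prior: [B, C, H, W] backbone features and PANEL priors
        b, c, h, w = feat.shape
        gate = self.gate(torch.cat([feat, prior], dim=1))   # [B, C, H, W]
        gated_prior = gate * prior                          # suppress noisy priors
        q = feat.flatten(2).transpose(1, 2)                 # [B, HW, C]
        kv = gated_prior.flatten(2).transpose(1, 2)         # [B, HW, C]
        fused, _ = self.attn(q, kv, kv)
        out = self.norm(q + fused)                          # residual + norm
        return out.transpose(1, 2).reshape(b, c, h, w)

# Shape check with toy tensors
m = GatedCrossAttentionFusion(dim=32)
y = m(torch.randn(1, 32, 8, 8), torch.randn(1, 32, 8, 8))
```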
## Citation
If you use PANEL in your research, please cite the following paper:
```bibtex
@article{deng2026panel,
  title={PANEL: A Domain-Specific Vision-Language Model for Zero-Shot and Few-Shot Photovoltaic Tasks in Remote Sensing (Under Review)},
  author={Deng, Ruizhe and Guo, Zhiling and Zhang, Penglei and Li, Jiaze and Xu, Xin and Chen, Qi and Chen, Yuntian and Yan, Jinyue},
  journal={ISPRS Journal of Photogrammetry and Remote Sensing},
  year={2026},
  publisher={Elsevier}
}
```
## Acknowledgements

This work was supported by the International Centre of Urban Energy Nexus (UEX) at The Hong Kong Polytechnic University and the Eastern Institute of Technology (Ningbo). We thank the authors of the original CLIP model and CLIP Surgery for their foundational work.
## Collection

This model is part of the UEX-RenewableEnergy Collection.