---
license: cc-by-nc-sa-4.0
---

# Model Card: LISAt

## Model Description

LISAT (Language-Image Segmentation and Text) is a vision-language model (VLM) designed for complex remote-sensing imagery. Unlike traditional segmentation models, which are limited to recognizing a pre-defined set of objects, LISAT can reason over intricate user queries that refer to multiple objects of interest. This allows it to generate segmentation masks from complex and implicit query text, offering a more flexible and intuitive approach to image understanding.

LISAT was trained on a newly curated geospatial reasoning-segmentation dataset, GRES, which contains 27,615 annotations across 9,205 images, as well as a multimodal geospatial pre-training dataset, PreGRES, with over 1 million QA pairs. These datasets allow LISAT not only to describe remote-sensing images and answer complex questions, but also to identify and segment specific objects within them.

Our model surpasses existing geospatial foundation models, such as RS-GPT4V, by 10.04% (BLEU-4) on remote-sensing visual description tasks. On remote-sensing reasoning-segmentation tasks, LISAT outperforms state-of-the-art open-domain models by 143.36% (gIoU).

With its advanced reasoning capabilities and strong performance on geospatial tasks, LISAT is a powerful tool for anyone working with remote-sensing data, enabling more accurate and detailed analysis of complex visual information.

## Model Details

- **Model architecture**: Inspired by LISA ([Lai et al., 2024](https://arxiv.org/pdf/2308.00692)), LISAT integrates a multimodal large language model (LLM) with a segmentation model. Its architecture is shown below.

![LISAT Model Architecture](https://huggingface.co/jquenum/LISAt-7b/resolve/main/model_architecture.png)

- **Training data**: We introduce the Geospatial Reasoning Segmentation Dataset (GRES), a collection of vision-and-language data designed around remote-sensing applications. GRES consists of two core components: PreGRES, a dataset of over 1M remote-sensing-specific visual instruction-tuning Q/A pairs for pre-training geospatial models, and GRES, a semi-synthetic dataset specialized for reasoning segmentation of remote-sensing data, consisting of 9,205 images and 27,615 natural-language queries/answers about those images. From this dataset, we generate train, test, and validation splits of 7,205, 1,500, and 500 images, respectively.

- **Implementation details**: LISAT and LISAT-PRE are trained on eight DGX A100 80GB GPUs. In the first stage, we pre-train LISAT-PRE (context length = 2048) using LoRA for 1 epoch on PreGRES with a next-token-prediction cross-entropy loss. We employ the AdamW optimizer with a learning rate of 3e-4 and a cosine-decay learning-rate scheduler, setting the batch size to 2 and gradient-accumulation steps to 6.

  In the second stage, we train LISAT on GRES, as well as on two traditional natural-image referring-segmentation datasets, FP-Ref-COCO and ReasonSeg. LoRA is applied to LISAT-PRE, while the SAM decoder undergoes full fine-tuning. The learning rate is set to 3e-4, with all other configurations remaining the same. For the loss function, the weights for the text-generation loss (λ_txt) and the mask loss (λ_mask) are set to 1.0, while the binary cross-entropy (BCE) loss (λ_bce) and Dice loss (λ_dice) are assigned weights of 2.0 and 0.5, respectively. The total training time was approximately 12 hours on eight DGX A100 80GB GPUs.

- **License**: cc-by-nc-sa-4.0
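
The second-stage loss weighting described above can be sketched as a simple weighted sum (a minimal illustration of the stated weights, not the actual training code; the scalar loss values passed in are hypothetical placeholders):

```python
# Minimal sketch of the second-stage loss weighting from the model card:
# λ_txt = 1.0, λ_mask = 1.0, λ_bce = 2.0, λ_dice = 0.5.
LAMBDA_TXT = 1.0
LAMBDA_MASK = 1.0
LAMBDA_BCE = 2.0
LAMBDA_DICE = 0.5

def total_loss(ce_text: float, bce_mask: float, dice_mask: float) -> float:
    """Combine scalar text-generation and mask losses as weighted sums."""
    mask_loss = LAMBDA_BCE * bce_mask + LAMBDA_DICE * dice_mask
    return LAMBDA_TXT * ce_text + LAMBDA_MASK * mask_loss

print(total_loss(0.5, 0.2, 0.4))  # 0.5 + (2.0 * 0.2 + 0.5 * 0.4) = 1.1
```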

## Comparative Performance of LISAT-7B on GRES

The following table compares LISAT-7B against LISA-7B and LISA-13B-Llama2-v1 on the GRES dataset across object sizes. LISAT-7B consistently outperforms both baselines, particularly on small objects.

| **Model**              | **Object Size** | **cIoU**          | **gIoU**          |
|------------------------|-----------------|-------------------|-------------------|
| **LISA-7B**            | All             | 0.122 ± 0.014     | 0.113 ± 0.007     |
|                        | Small           | 0.104 ± 0.022     | 0.062 ± 0.008     |
|                        | Large           | 0.157 ± 0.017     | 0.222 ± 0.013     |
| **LISA-13B-Llama2-v1** | All             | 0.122 ± 0.014     | 0.139 ± 0.006     |
|                        | Small           | 0.106 ± 0.016     | 0.089 ± 0.007     |
|                        | Large           | 0.148 ± 0.018     | 0.244 ± 0.019     |
| **LISAT (Ours)**       | All             | **0.245 ± 0.023** | **0.275 ± 0.009** |
|                        | Small           | **0.232 ± 0.024** | **0.240 ± 0.009** |
|                        | Large           | **0.250 ± 0.029** | **0.348 ± 0.015** |

Bolded values indicate the best result in each category.
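
The two metrics in the table can be sketched as follows (a minimal illustration on toy binary masks, not the official evaluation code): gIoU averages the per-image IoU scores, while cIoU accumulates intersection and union across the whole split before dividing.

```python
# Minimal sketch of gIoU vs. cIoU on binary masks (flattened to 0/1 lists).
def iou(pred, gt):
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0

def g_iou(preds, gts):
    # gIoU: mean of per-image IoU scores
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)

def c_iou(preds, gts):
    # cIoU: total intersection over total union, accumulated across images
    inter = sum(p & g for pr, gt in zip(preds, gts) for p, g in zip(pr, gt))
    union = sum(p | g for pr, gt in zip(preds, gts) for p, g in zip(pr, gt))
    return inter / union if union else 1.0

preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 1, 0, 0]]
print(g_iou(preds, gts), c_iou(preds, gts))  # 0.5 0.5
```

Note that cIoU is dominated by large objects (their pixels weigh more in the cumulative sums), which is why the table reports both metrics per object-size bucket.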

## Model Usage

Starting with `transformers` version 4.45.0, you can run **conversational inference** using the Transformers pipeline abstraction or by leveraging the Auto classes with the `generate()` function.

To use LISAT-7B with `transformers`, first update your installation to the latest version:

```bash
pip install --upgrade transformers
```

Once your installation is updated, you can use LISAT-7B for inference as follows:

```python
from transformers import AutoModelForImageSegmentation, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForImageSegmentation.from_pretrained("jquenum/LISAt-7b")
tokenizer = AutoTokenizer.from_pretrained("jquenum/LISAt-7b")

# Example usage for inference
input_image = "path/to/your/image.png"  # Replace with your input image
inputs = tokenizer(input_image, return_tensors="pt")

# Generate a segmentation mask or other output
outputs = model.generate(**inputs)
```

## Intended Use

### Intended Use Cases
LISAT-7B is intended for both **commercial** and **research** use, specifically in **geospatial** and **remote-sensing image analysis** tasks. The model is designed to generate segmentation masks, provide visual descriptions, and answer complex queries about remote-sensing images. It excels at reasoning over implicit queries referring to multiple objects, enabling more advanced analysis than traditional segmentation models. LISAT-7B is suitable for fields such as environmental monitoring, urban planning, and disaster response, where high-quality, detailed analysis of remote-sensing imagery is required.

Additionally, LISAT-7B can be adapted for various **natural language processing** tasks, such as generating captions, answering questions, or improving other models through synthetic data generation and fine-tuning. The LISAT-7B model is licensed under the **LISAT Community License**, allowing for these use cases in a responsible manner.

### Out-of-scope Use
LISAT-7B is **not** intended for use in any application that violates applicable laws or regulations, including trade-compliance and data-protection laws. Use in any way that violates the **Acceptable Use Policy** or the **LISAT Community License** is prohibited. Additionally, any use of LISAT-7B beyond the supported **remote-sensing imagery** tasks outlined in this model card, including use in unsupported domains or languages, is not permitted.

## Example Template

To help users understand the model's directory structure, here is an example of the files and their sizes as shown in the repository: