derektan committed
Commit 1496342 · 1 Parent(s): 90eae67

Updated description

Files changed (1): app.py +11 -11
app.py CHANGED
@@ -158,12 +158,12 @@ model.eval()
 # Gradio
 examples = [
     [
-        "Where can I find the shore birds (Animalia Chordata Aves Charadriiformes Laridae Larus marinus) in this image? Please output segmentation mask and explain why.",
         "./imgs/examples/Animalia_Chordata_Aves_Charadriiformes_Laridae_Larus_marinus/80645_39.76079_-74.10316.jpg",
+        "Where can I find the shore birds (Animalia Chordata Aves Charadriiformes Laridae Larus marinus) in this image? Please output segmentation mask and explain why.",
     ],
     [
-        "Where can I find the capybaras (Animalia Chordata Mammalia Rodentia Caviidae Hydrochoerus hydrochaeris) in this image? Please output segmentation mask.",
         "./imgs/examples/Animalia_Chordata_Mammalia_Rodentia_Caviidae_Hydrochoerus_hydrochaeris/28871_-12.80255_-69.29999.jpg",
+        "Where can I find the capybaras (Animalia Chordata Mammalia Rodentia Caviidae Hydrochoerus hydrochaeris) in this image? Please output segmentation mask.",
     ],
 ]
 output_labels = ["Segmentation Output"]
@@ -172,14 +172,14 @@ title = "LISA-AVS: LISA 7B Model Finetuned on AVS-Bench Dataset"
 
 description = """
 <font size=4>
-Note: This is an adapted version of the online demo for LISA, where we finetune from scratch the LISA model (7B) with data from AVS-Bench (Search-TTA). \n
-If multiple users are using it at the same time, they will enter a queue, which may delay some time. \n
-**Note**: **Different prompts can lead to significantly varied results**. \n
-**Note**: Please try to **standardize** your input text prompts to **avoid ambiguity**, and also pay attention to whether the **punctuations** of the input are correct. \n
+This is an adapted version of the online demo for <a href='https://github.com/dvlab-research/LISA' target='_blank'>LISA</a>, where we finetune the LISA model (7B) from scratch with data from <a href='https://search-tta.github.io/' target='_blank'>AVS-Bench (Search-TTA)</a>. \n
+**Note**: <br>
+&ensp;(a) If multiple users are using the demo at the same time, they will enter a queue, which may cause some delay. \n
+&ensp;(b) Different prompts can lead to significantly varied results. Please **standardize** your input text prompts to **avoid ambiguity**, and check that the **punctuation** of the input is correct. \n
 **Usage**: <br>
-&ensp;(1) To let LISA-AVS **segment something**, input prompt like: "Where can I find the <Common Name> (<Full Taxonomy Name>) in this image? Please output segmentation mask."; <br>
-&ensp;(2) To let LISA-AVS **output an explanation**, input prompt like: "Where can I find the <Common Name> (<Full Taxonomy Name>) in this image? Please output segmentation mask and explain why."; <br>
-&ensp;(3) To obtain **solely language output**, you can input like what you should do in current multi-modal LLM (e.g., LLaVA), like: "Where can I find the <Common Name> (<Full Taxonomy Name>) in this image?" <br>
+&ensp;(1) To have LISA-AVS **segment something**, input a prompt like: "Where can I find the <em>Common Name</em> (<em>Full Taxonomy Name</em>) in this image? Please output segmentation mask."; <br>
+&ensp;(2) To have LISA-AVS **output an explanation**, input a prompt like: "Where can I find the <em>Common Name</em> (<em>Full Taxonomy Name</em>) in this image? Please output segmentation mask and explain why."; <br>
+&ensp;(3) To obtain **language output only**, phrase the input as you would for a current multi-modal LLM (e.g., LLaVA): "Where can I find the <em>Common Name</em> (<em>Full Taxonomy Name</em>) in this image?" <br>
 
 </font>
 """
@@ -202,7 +202,7 @@ AVS-Bench
 
 ## to be implemented
 @spaces.GPU
-def inference(input_str, input_image):
+def inference(input_image, input_str):
     ## filter out special chars
     input_str = bleach.clean(input_str)
 
@@ -338,8 +338,8 @@ def inference(input_str, input_image):
 demo = gr.Interface(
     inference,
     inputs=[
-        gr.Textbox(lines=1, placeholder=None, label="Text Instruction"),
         gr.Image(type="filepath", label="Input Image"),
+        gr.Textbox(lines=1, placeholder=None, label="Text Instruction"),
     ],
     outputs=[
         gr.Image(type="pil", label="Segmentation Output"),
 
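The three swaps above (the inference signature, the gr.Interface inputs list, and the examples rows) are one coordinated change: gr.Interface passes the input components' values to the wrapped function positionally, in the order they appear in inputs, and each examples row must follow that same order. Below is a minimal sketch of that contract, not the app's real model code; fake_inference and the example image path are hypothetical placeholders.

import gradio as gr

# Hypothetical stand-in for the app's real inference(). Gradio calls the
# function with arguments in the same order as the `inputs` list below,
# so (input_image, input_str) must match [gr.Image(...), gr.Textbox(...)].
def fake_inference(input_image, input_str):
    return f"Prompt {input_str!r} received for image {input_image!r}"

demo = gr.Interface(
    fake_inference,
    inputs=[
        gr.Image(type="filepath", label="Input Image"),    # -> input_image
        gr.Textbox(lines=1, label="Text Instruction"),     # -> input_str
    ],
    outputs=[gr.Textbox(label="Output")],
    examples=[
        # Each example row is ordered like `inputs`: image path first,
        # prompt second (placeholder path, for illustration only).
        ["./imgs/example.jpg", "Where can I find the capybaras in this image?"],
    ],
)

if __name__ == "__main__":
    demo.launch()

Listing the image first and the prompt second, as this commit does, keeps the function signature, the input components, and the example rows consistent with one another.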
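The unchanged context line input_str = bleach.clean(input_str) is the prompt-sanitization step that the "filter out special chars" comment refers to. A small sketch of what that call does, assuming bleach's default settings:

import bleach

# With default settings, bleach.clean() escapes HTML tags that are not on
# its allow list, so markup in a user prompt cannot pass through as raw HTML.
raw = '<script>alert("hi")</script> capybaras'
print(bleach.clean(raw))
# Output: &lt;script&gt;alert("hi")&lt;/script&gt; capybaras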