
Enhanced Explanations for Kvasir-VQA

This repository contains our process for generating textual and visual explanations on top of the original SimulaMet/Kvasir-VQA-x1 dataset. The work enhances standard VQA answers with grounded reasoning, clinical language, and region-linked visual cues.


Textual Explanation Augmentation

We extended the original SimulaMet/Kvasir-VQA-x1 dataset with additional signals:

  • Natural VQA answers from SimulaMet/Kvasir-VQA-x1.
  • Ground-truth explanations from SimulaMet-HOST/Kvasir-VQA.
  • Visual descriptions generated by Gemma 27B, which captured contextual details of the images.

By combining these three sources for each image and question pair, we created enhanced explanations grounded in both natural responses and domain-specific cues.
Figure 1: Textual Explanation Overview
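The three sources above can be combined per image–question pair. A minimal sketch of the fusion step, assuming simple concatenation (the actual pipeline may merge the signals differently, e.g. with an LLM rewrite); the function name and field order are illustrative:

```python
def fuse_explanation(answer: str, gt_explanation: str, visual_description: str) -> str:
    """Fuse the natural VQA answer (Kvasir-VQA-x1), the ground-truth
    explanation (Kvasir-VQA HOST), and the Gemma-generated visual
    description into one enhanced explanation string."""
    # Plain concatenation shown here; the real fusion step may differ.
    return "\n".join([
        f"Answer: {answer}",
        f"Ground-truth explanation: {gt_explanation}",
        f"Visual description: {visual_description}",
    ])
```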


Visual Explanation Augmentation

To complement textual reasoning, we linked region-based visual cues to answers:

  • Used pseudo masks generated via prompt-guided segmentation (e.g., ClipSeg).
  • Integrated existing polyp and instrument masks from Kvasir-SEG.
  • Linked masks to related answers using metadata from SimulaMet/Kvasir-VQA-x1.

This allowed the model to ground its predictions in specific image regions (e.g., polyps, instruments, anatomical landmarks).
Figure 2: Visual Grounding Pipeline
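The mask-linking step can be sketched as follows. This assumes the prompt-guided segmenter (e.g., ClipSeg) returns a soft mask in [0, 1] and that a fixed threshold produces the binary pseudo mask; the threshold value, `coverage` field, and helper name are illustrative, while the record layout mirrors the example JSON entry later in this card:

```python
import numpy as np

def link_mask_to_answer(soft_mask: np.ndarray, record: dict,
                        mask_path: str, threshold: float = 0.5) -> dict:
    """Threshold a soft segmentation map into a binary pseudo mask and
    attach it to a VQA record as a region-linked visual explanation."""
    binary = (soft_mask >= threshold).astype(np.uint8)
    record = dict(record)  # avoid mutating the caller's metadata
    record["visual_explanation"] = [{
        "type": "segmentation_mask",
        "data": mask_path,
        "coverage": float(binary.mean()),  # fraction of pixels in the region
        "description": "Highlighted mask showing the region of interest "
                       "supporting the answer.",
    }]
    return record
```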


Training Details

We trained the Florence-2 model with LoRA fine-tuning in a three-stage pipeline, using a complexity-aware batching strategy.

  • LoRA config: r=128, lora_alpha=256
  • Tokens used:
    • <MedVQA> {question} → Standard VQA task
    • <MedVQA_EXPLAIN> {question} Explain in Detail → Textual explanation task
    • <REFERRING_EXPRESSION_SEGMENTATION> → Segmentation task (masks converted to Florence-supported location tokens)
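The multi-task prompting implied by the token table can be sketched as a small helper. The task-token strings come from the list above; the helper itself, the task names, and the assumption that the referring expression follows the segmentation token are illustrative:

```python
# Task-token templates as listed in the card (helper is a sketch).
TASK_PROMPTS = {
    "vqa": "<MedVQA> {question}",
    "explain": "<MedVQA_EXPLAIN> {question} Explain in Detail",
    # Assumption: the referring expression is appended after the task token.
    "segment": "<REFERRING_EXPRESSION_SEGMENTATION> {question}",
}

def build_prompt(task: str, question: str) -> str:
    """Format one training/inference prompt for the given task."""
    return TASK_PROMPTS[task].format(question=question)
```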

Dataset Partitioning

  • The original SimulaMet/Kvasir-VQA-x1 dataset (including the test split) was partitioned into three disjoint batches based on question complexity levels:
    • C1 = simple
    • C2 = moderate
    • C3 = complex
  • Each batch contained a different mix of complexity levels:
    • Batch 1 → Mostly simple questions (C1-heavy)
    • Batch 2 → Balanced mix (C2-heavy)
    • Batch 3 → Mostly complex questions (C3-heavy)

This setup allowed the model to progress gradually from simple tasks to more complex reasoning.
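The complexity-aware split can be sketched as below. The per-batch fractions used in the actual pipeline are not published, so the `ratios` argument is a placeholder; the sketch only demonstrates building three disjoint batches, each dominated by one complexity level:

```python
import random

def partition_by_complexity(records, ratios, seed=0):
    """Split records (each with a 'complexity' key in {'C1','C2','C3'}) into
    three disjoint batches. ratios[b][c] is the fraction of complexity-c
    items assigned to batch b (placeholder values; the card does not state
    the real ratios)."""
    rng = random.Random(seed)
    by_c = {"C1": [], "C2": [], "C3": []}
    for r in records:
        by_c[r["complexity"]].append(r)
    batches = [[], [], []]
    for c, items in by_c.items():
        rng.shuffle(items)
        n, start = len(items), 0
        for b in range(3):
            # last batch absorbs any rounding drift
            end = n if b == 2 else start + round(ratios[b][c] * n)
            batches[b].extend(items[start:end])
            start = end
    return batches
```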

Training Stages

The model was trained in three sequential stages, each time combining a new batch with the same augmented data (visual grounding masks + textual explanations):

  1. Stage 1: Train on Batch 1 + augmented data
  2. Stage 2: Train on Batch 2 + augmented data
  3. Stage 3: Train on Batch 3 + augmented data
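The three stages above amount to a simple curriculum loop. `train_one_stage` is a stand-in for an actual fine-tuning call (e.g., a Trainer invocation), not a real API:

```python
def run_curriculum(batches, augmented, train_one_stage):
    """Run the three-stage curriculum: at each stage, train on one
    complexity batch plus the same augmented data (masks + explanations)."""
    for stage, batch in enumerate(batches, start=1):
        stage_data = batch + augmented  # augmented data is reused every stage
        train_one_stage(stage, stage_data)
```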

Caption-Based Post-Processing

In addition to VQA answers and explanations, we appended an auto-generated caption using the <MORE_DETAILED_CAPTION> token.

  • Interestingly, the model learned to produce better-grounded captions after training, even though captioning was never an explicit training objective.
  • These captions serve as a natural clinical narrative to enrich explanations.
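A sketch of the caption-appending step, assuming the generated caption is joined onto the explanation the way the example JSON entry below shows (the field name and the "Overall explanation of image:" join format are taken from that example; the helper itself is illustrative):

```python
def append_caption(entry: dict, caption: str) -> dict:
    """Append a <MORE_DETAILED_CAPTION> generation to an entry's textual
    explanation as a closing clinical narrative."""
    entry = dict(entry)  # keep the input entry unmodified
    entry["textual_explanation"] = (
        f"{entry['textual_explanation']}\nOverall explanation of image: {caption}"
    )
    return entry
```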

Example JSON Entry

Below is an example of the final output format combining all signals:

{
  "val_id": 1002,
  "img_id": "cl8k2u1s71gx30832hzj38n7w",
  "question": "What colors are observed in the abnormal areas?",
  "answer": "red, pink, and white lesions noted",
  "textual_explanation": "The abnormality, a Paris Ip type polyp, is observed in multiple colors including red, pink, and white.\nOverall explanation of image: The image shows a single polyp located in the upper gastrointestinal tract. The polyp appears as a large, rounded shape with a red and pink coloration, and is classified as a Paris Ip type polyp. It is located towards the center of the image and is surrounded by a pinkish-red tissue.",
  "visual_explanation": [
    {
      "type": "segmentation_mask",
      "data": "visuals/_mask_1002.jpg",
      "description": "Highlighted mask showing the region of interest supporting the answer."
    }
  ],
  "confidence_score": 0.9633524969772056
}

Example mask for the JSON entry above:

Figure 3: Example Mask


Confidence Calculation

For each generated explanation, we also estimate a confidence score based on the model’s decoding stability:

  • At every decoding step, we compute the top-k probability mass (sum of probabilities of the k most likely tokens).
  • This top-k mass reflects how concentrated the model’s belief is in its most likely continuations.
  • We average these values across all generated tokens to get the final stability confidence score.

This score lies between 0 and 1, with higher values indicating that the model was consistently confident in its token predictions during explanation generation.
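The three steps above can be sketched directly from per-step logits. The value of k is an assumption (the card does not state the one used), and the function name is illustrative:

```python
import numpy as np

def stability_confidence(step_logits: np.ndarray, k: int = 5) -> float:
    """Average top-k probability mass over all generated tokens.
    step_logits has shape (num_steps, vocab_size); k is a placeholder."""
    # softmax per decoding step (numerically stable)
    z = step_logits - step_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    # sum of the k largest probabilities at each step, then average
    topk_mass = np.sort(probs, axis=-1)[:, -k:].sum(axis=-1)
    return float(topk_mass.mean())
```

With near-deterministic steps (one dominant logit) the score approaches 1.0; with a uniform distribution over V tokens it equals k/V, illustrating how the score reflects decoding stability.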

Summary

  • Textual explanations = Fusion of natural VQA, ground-truth HOST, and visual descriptions.
  • Visual explanations = Masks + segmentation linked to VQA metadata.
  • Training = Florence-2 with LoRA, multi-task prompting.
  • Post-processing = Appended auto-generated captions for better clinical context.