# Enhanced Explanations for Kvasir-VQA

This repository describes our pipeline for generating textual and visual explanations on top of the original SimulaMet/Kvasir-VQA-x1 dataset. The work enhances standard VQA answers with grounded reasoning, clinical language, and region-linked visual cues.
## Textual Explanation Augmentation
We extended the original SimulaMet/Kvasir-VQA-x1 dataset with additional signals:
- Natural VQA answers from SimulaMet/Kvasir-VQA-x1.
- Ground-truth explanations from SimulaMet-HOST/Kvasir-VQA.
- Visual descriptions generated by Gemma 27B, which captured contextual details of the images.
By combining these three sources for each image and question pair, we created enhanced explanations grounded in both natural responses and domain-specific cues.
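As a minimal sketch, this fusion can be thought of as a per-record merge of the three signals. The field names and the template below are illustrative assumptions, not the pipeline's actual code:

```python
# Illustrative sketch of fusing the three textual signals per
# (image, question) pair. Field names and the merge template are
# assumptions for illustration only.

def build_enhanced_explanation(vqa_answer: str,
                               host_explanation: str,
                               visual_description: str) -> str:
    """Merge the natural VQA answer, the HOST ground-truth explanation,
    and the Gemma-generated visual description into one grounded text."""
    return (
        f"{host_explanation.strip()} "
        f"(answer: {vqa_answer.strip()})\n"
        f"Overall explanation of image: {visual_description.strip()}"
    )

sample = build_enhanced_explanation(
    vqa_answer="red, pink, and white lesions noted",
    host_explanation="The abnormality, a Paris Ip type polyp, is observed "
                     "in multiple colors including red, pink, and white.",
    visual_description="The image shows a single polyp in the upper GI tract.",
)
```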
## Visual Explanation Augmentation
To complement textual reasoning, we linked region-based visual cues to answers:
- Used pseudo masks generated via prompt-guided segmentation (e.g., ClipSeg).
- Integrated existing polyp and instrument masks from Kvasir-SEG.
- Linked masks to related answers using metadata from SimulaMet/Kvasir-VQA-x1.
This allowed the model to ground its predictions in specific image regions (e.g., polyps, instruments, anatomical landmarks).
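A hedged sketch of the linking step, assuming masks and QA records share an image id (the key names `img_id` and `visual_explanation` mirror the JSON example later in this card, but the helper itself is illustrative):

```python
# Illustrative sketch: attach available region masks (ClipSeg pseudo masks
# or Kvasir-SEG ground-truth masks) to QA records via a shared image id.

def link_masks(qa_records, mask_index):
    """Return copies of qa_records, adding a visual_explanation entry
    whenever a mask exists for the record's image id."""
    linked = []
    for rec in qa_records:
        mask_path = mask_index.get(rec["img_id"])
        if mask_path is not None:
            rec = {**rec, "visual_explanation": [{
                "type": "segmentation_mask",
                "data": mask_path,
            }]}
        linked.append(rec)
    return linked

qa = [{"img_id": "abc", "question": "Where is the polyp?"},
      {"img_id": "xyz", "question": "Is an instrument visible?"}]
masks = {"abc": "visuals/_mask_abc.jpg"}
out = link_masks(qa, masks)
```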
## Training Details
We trained the Florence-2 model with LoRA fine-tuning in a three-stage pipeline, using a complexity-aware batching strategy.
- LoRA config: `r=128`, `alpha=256`
- Task tokens:
  - `<MedVQA> {question}` → standard VQA task
  - `<MedVQA_EXPLAIN> {question} Explain in Detail` → textual explanation task
  - `<REFERRING_EXPRESSION_SEGMENTATION>` → segmentation task (masks converted to Florence-supported location tokens)
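A small sketch of how these task prompts might be assembled before tokenization; the helper function is our own illustration, while the token strings follow the list above:

```python
# Illustrative prompt builder for the three tasks. Only the token strings
# come from the card; the function and task names are assumptions.

def build_prompt(task: str, question: str = "") -> str:
    """Prefix the question with the task token used during training."""
    if task == "vqa":
        return f"<MedVQA> {question}"
    if task == "explain":
        return f"<MedVQA_EXPLAIN> {question} Explain in Detail"
    if task == "segmentation":
        return "<REFERRING_EXPRESSION_SEGMENTATION>"
    raise ValueError(f"unknown task: {task}")
```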
### Dataset Partitioning
The original SimulaMet/Kvasir-VQA-x1 dataset (including the test split) was partitioned into three disjoint batches based on question complexity level:

- C1 = simple
- C2 = moderate
- C3 = complex

Each batch preserves a different mix of complexity levels:

- Batch 1 → mostly simple questions (C1-heavy)
- Batch 2 → balanced mix (C2-heavy)
- Batch 3 → mostly complex questions (C3-heavy)
This setup allowed the model to progress gradually from simple tasks to more complex reasoning.
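One way such a complexity-aware split could be implemented is sketched below. The exact per-batch mixing ratios are assumptions; only the C1/C2/C3 emphasis follows the description above:

```python
import random

# Assumed mixing ratios: each level is distributed across the three
# batches so that Batch 1 is C1-heavy, Batch 2 C2-heavy, Batch 3 C3-heavy.
RATIOS = {
    1: {"C1": 0.6, "C2": 0.3, "C3": 0.1},  # Batch 1: C1-heavy
    2: {"C1": 0.2, "C2": 0.6, "C3": 0.2},  # Batch 2: balanced, C2-heavy
    3: {"C1": 0.1, "C2": 0.3, "C3": 0.6},  # Batch 3: C3-heavy
}

def partition(records, seed=0):
    """Split records into three disjoint batches by complexity label,
    distributing each level across batches according to RATIOS."""
    rng = random.Random(seed)
    by_level = {"C1": [], "C2": [], "C3": []}
    for rec in records:
        by_level[rec["complexity"]].append(rec)
    batches = {1: [], 2: [], 3: []}
    for level, items in by_level.items():
        rng.shuffle(items)
        weights = [RATIOS[b][level] for b in (1, 2, 3)]
        total, start = sum(weights), 0
        for b, w in zip((1, 2, 3), weights):
            n = round(len(items) * w / total)
            batches[b].extend(items[start:start + n])
            start += n
        batches[3].extend(items[start:])  # any rounding remainder
    return batches

records = [{"id": f"{lvl}-{i}", "complexity": lvl}
           for lvl in ("C1", "C2", "C3") for i in range(10)]
batches = partition(records)
```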
### Training Stages
The model was trained in three sequential stages, each time combining a new batch with the same augmented data (visual grounding masks + textual explanations):
- Stage 1: Train on Batch 1 + augmented data
- Stage 2: Train on Batch 2 + augmented data
- Stage 3: Train on Batch 3 + augmented data
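The schedule reduces to a simple loop: each stage pairs a new complexity batch with the same augmented data. In this sketch, `train_stage` is a placeholder for the actual Florence-2 + LoRA training call:

```python
# Illustrative three-stage schedule: every stage combines a new batch
# with the same augmented data (masks + textual explanations).
# train_stage stands in for the real fine-tuning call.

def run_stages(batches, augmented, train_stage):
    for stage, batch in enumerate(batches, start=1):
        train_stage(stage, batch + augmented)

log = []
run_stages(
    batches=[["b1"], ["b2"], ["b3"]],
    augmented=["aug"],
    train_stage=lambda stage, data: log.append((stage, data)),
)
```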
### Caption-Based Post-Processing
In addition to VQA answers and explanations, we appended an auto-generated caption produced with the `<MORE_DETAILED_CAPTION>` token.

- Interestingly, the model learned to produce better-grounded captions after training, even though captioning was never an explicit training objective.
- These captions serve as a natural clinical narrative that enriches the explanations.
## Example JSON Entry
Below is an example of the final output format combining all signals:
```json
{
  "val_id": 1002,
  "img_id": "cl8k2u1s71gx30832hzj38n7w",
  "question": "What colors are observed in the abnormal areas?",
  "answer": "red, pink, and white lesions noted",
  "textual_explanation": "The abnormality, a Paris Ip type polyp, is observed in multiple colors including red, pink, and white.\nOverall explaination of image: The image shows a single polyp located in the upper gastrointestinal tract. The polyp appears as a large, rounded shape with a red and pink coloration, and is classified as a Paris Ip type polyp. It is located towards the center of the image and is surrounded by a pinkish-red tissue.",
  "visual_explanation": [
    {
      "type": "segmentation_mask",
      "data": "visuals/_mask_1002.jpg",
      "description": "Highlighted mask showing the region of interest supporting the answer."
    }
  ],
  "confidence_score": 0.9633524969772056
}
```
Example Mask for the above:
## Confidence Calculation
For each generated explanation, we also estimate a confidence score based on the model’s decoding stability:
- At every decoding step, we compute the top-k probability mass (sum of probabilities of the k most likely tokens).
- This top-k mass reflects how concentrated the model’s belief is in its most likely continuations.
- We average these values across all generated tokens to get the final stability confidence score.
This score lies between 0 and 1, with higher values indicating that the model was consistently confident in its token predictions during explanation generation.
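The steps above can be sketched in a few lines of pure Python; in practice the per-step distributions would come from the model's softmaxed logits, and the function name here is our own:

```python
# Illustrative stability-confidence computation: at each decoding step,
# sum the probabilities of the k most likely tokens (top-k mass), then
# average those masses over all generated tokens.

def stability_confidence(step_probs, k=5):
    """step_probs: one probability distribution per decoding step,
    each summing to 1. Returns the mean top-k mass, in (0, 1]."""
    masses = []
    for probs in step_probs:
        top_k = sorted(probs, reverse=True)[:k]
        masses.append(sum(top_k))
    return sum(masses) / len(masses)

# Example: two decoding steps over a six-token vocabulary.
steps = [
    [0.5, 0.2, 0.1, 0.1, 0.05, 0.05],
    [0.9, 0.04, 0.03, 0.01, 0.01, 0.01],
]
score = stability_confidence(steps, k=3)  # (0.8 + 0.97) / 2 = 0.885
```

A sharply peaked distribution at every step yields a score near 1, while flat, uncertain distributions pull it toward 0.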
## Summary

- Textual explanations = fusion of natural VQA answers, SimulaMet-HOST ground-truth explanations, and Gemma visual descriptions.
- Visual explanations = Masks + segmentation linked to VQA metadata.
- Training = Florence-2 with LoRA, multi-task prompting.
- Post-processing = Appended auto-generated captions for better clinical context.
