SurgViVQA-Audio: Audio-Adapted Qwen2-VL for Surgical Video QA
This repository contains the weights for SurgViVQA-Audio, an audio-adapted model built on the SurgViVQA benchmark introduced in the paper SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding.
This model adapts a Whisper audio encoder to feed directly into Qwen2-VL, enabling hands-free surgical video question answering without intermediate ASR transcription.
⚠️ Research prototype only. Not for clinical use.
Model Description
- Paper: SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
- Base Vision-Language Model: Qwen2-VL-7B-Instruct
- Audio Encoder: Whisper Large v3 Turbo (frozen)
- Audio Projector: Linear layer (1280 → 3584) mapping Whisper features into the Qwen2-VL embedding space (see the sketch after this list)
- Training Method: QLoRA (4-bit base model + BF16 LoRA adapters)
- Domain: Colonoscopy surgical procedures
- Code: GitHub Repository
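The projector and training setup can be summarized in a minimal sketch. The class name `AudioProjector` and the specific QLoRA hyperparameters (rank, alpha, target modules) below are illustrative assumptions; only the 1280 → 3584 linear mapping and the 4-bit base + BF16 LoRA recipe come from the description above.

```python
import torch
import torch.nn as nn
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Audio projector: maps frozen Whisper encoder features (d=1280) into the
# Qwen2-VL-7B embedding space (d=3584) with a single linear layer.
class AudioProjector(nn.Module):  # class name is illustrative
    def __init__(self, whisper_dim: int = 1280, qwen_dim: int = 3584):
        super().__init__()
        self.proj = nn.Linear(whisper_dim, qwen_dim)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # (batch, n_audio_frames, 1280) -> (batch, n_audio_frames, 3584)
        return self.proj(audio_features)

# QLoRA: 4-bit quantized base model with BF16 LoRA adapters.
# The r / alpha / target_modules values are assumptions for illustration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```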
Why Skip ASR?
Standard pipelines (Audio → ASR → Text → LLM) add latency and propagate transcription errors. Injecting audio embeddings directly into the vision-language model instead (a sketch of this direct path follows the list):
- Delivers roughly 2.5× faster inference (0.9s vs 2.3s end-to-end), measured on single-sample inference on 1× RTX 4090 with the same preprocessing and prompt length for both pipelines.
- Avoids mis-transcriptions entirely, so ASR errors cannot propagate into the model's answers.
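To make "injecting audio embeddings directly" concrete, here is a minimal sketch of the direct path, assuming a loaded Qwen2-VL model, tokenizer, and a trained projector as sketched earlier. The function name, prompt split, and omission of the video frames are simplifications for illustration, not the repository's exact implementation.

```python
import torch

def answer_from_audio(model, tokenizer, projector, audio_features,
                      prompt_prefix, prompt_suffix, max_new_tokens=64):
    """Splice projected audio features into the input embeddings in place
    of a transcribed question (illustrative; video frames omitted)."""
    embed = model.get_input_embeddings()
    prefix_ids = tokenizer(prompt_prefix, return_tensors="pt").input_ids.to(model.device)
    suffix_ids = tokenizer(prompt_suffix, return_tensors="pt").input_ids.to(model.device)

    # Whisper features (1, n_frames, 1280) -> Qwen2-VL space (1, n_frames, 3584)
    audio_embeds = projector(audio_features.to(model.device)).to(embed.weight.dtype)

    # [prompt prefix] [audio embeddings] [prompt suffix] -> generate directly,
    # with no intermediate ASR transcription step.
    inputs_embeds = torch.cat([embed(prefix_ids), audio_embeds, embed(suffix_ids)], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long, device=model.device)
    out = model.generate(inputs_embeds=inputs_embeds, attention_mask=attention_mask,
                         max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```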
Intended Use
Appropriate uses:
- Research on multimodal medical AI
- Benchmarking audio-visual question answering
- Exploring hands-free interfaces for surgical assistance
- Educational demonstrations of VLM fine-tuning
Out of scope:
- Clinical decision-making or diagnosis
- Real patient data processing
- Production deployment without extensive validation
- Any use requiring regulatory approval
Privacy Note
No patient audio was used in training. All audio was synthetically generated using edge-tts from the text questions in the SurgViVQA benchmark. The video frames are from the publicly available SurgViVQA dataset.
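For reference, synthetic question audio of this kind can be produced with edge-tts roughly as follows; the voice, example question, and output path are placeholders rather than the exact settings used for the benchmark audio.

```python
import asyncio
import edge_tts

async def synthesize(question: str, out_path: str, voice: str = "en-US-AriaNeural"):
    # Convert a text question into a spoken audio file (no patient audio involved).
    communicate = edge_tts.Communicate(question, voice)
    await communicate.save(out_path)

asyncio.run(synthesize("Is the polyp occluded by the instrument?", "question_0001.mp3"))
```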
Training Data
Built on the SurgViVQA benchmark by Drago et al. (2025), with text questions converted to audio using edge-tts.
| Split | Samples | Video IDs | Purpose |
|---|---|---|---|
| Train | 2,302 | 002-001, 002-002, 002-003 | Model training |
| Eval | 398 | 002-001, 002-002, 002-003 | Validation |
| Test | 1,000 | 002-004 (held-out) | Generalization testing |
Results
Overall Performance
| Metric | Test Set (Held-Out) |
|---|---|
| Overall Accuracy | 63.4% (634/1000) |
| Zero-Shot Baseline | 46% |
| Improvement | +17.4 points |
Performance by Category
Strong (>75%):
- Occlusion detection: 84% (safety-critical)
- Tool presence, dye detection, visibility: 98-100%
Challenging (<50%):
- Motion direction (5-way): 20%
- Spatial localization (4-way): 20%
Usage
Requirements
```bash
pip install transformers peft torch librosa
```
Inference
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

# Load base model
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "kulsoom-abdullah/surgvivqa-qwen7b-audio")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
```
For full inference code including audio processing, see the GitHub repository.
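In the meantime, here is a hedged sketch of the audio side, assuming the frozen Whisper Large v3 Turbo encoder and the 1280 → 3584 projector described above; the function name is illustrative and the prompt assembly is left to the repository code.

```python
import librosa
import torch
from transformers import AutoFeatureExtractor, WhisperModel

# Frozen Whisper encoder used as the audio feature extractor.
feature_extractor = AutoFeatureExtractor.from_pretrained("openai/whisper-large-v3-turbo")
whisper_encoder = WhisperModel.from_pretrained(
    "openai/whisper-large-v3-turbo", torch_dtype=torch.bfloat16
).encoder.eval()

def encode_audio(path: str) -> torch.Tensor:
    # Whisper expects 16 kHz mono audio.
    waveform, _ = librosa.load(path, sr=16000, mono=True)
    inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        # (1, n_frames, 1280) encoder states, ready for the 1280 -> 3584 projector.
        return whisper_encoder(inputs.input_features.to(whisper_encoder.dtype)).last_hidden_state
```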
Citation
```bibtex
@software{abdullah2026surgvivqa,
  author    = {Abdullah, Kulsoom},
  title     = {SurgViVQA-Audio: Audio-Adapted Qwen2-VL for Surgical Video QA},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/kulsoom-abdullah/SurgViVQA-Audio}
}
```
Dataset Citation
```bibtex
@misc{drago2025surgvivqa,
  title         = {SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding},
  author        = {Mauro Orazio Drago et al.},
  year          = {2025},
  eprint        = {2511.03325},
  archivePrefix = {arXiv}
}
```
Links
- Code: GitHub Repository
- Demo Video: Loom Walkthrough
- Contact: Kulsoom Abdullah