SurgViVQA-Audio: Audio-Adapted Qwen2-VL for Surgical Video QA
This repository contains the weights for SurgViVQA-Audio, an audio-adapted model built on the SurgViVQA benchmark introduced in the paper SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding.
This model adapts a Whisper audio encoder to feed directly into Qwen2-VL, enabling hands-free surgical video question answering without intermediate ASR transcription.
⚠️ Research prototype only. Not for clinical use.
Model Description
- Paper: SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
- Base Vision-Language Model: Qwen2-VL-7B-Instruct
- Audio Encoder: Whisper Large v3 Turbo (frozen)
- Audio Projector: Linear layer (1280 → 3584) mapping Whisper features into the Qwen2-VL embedding space (see the sketch after this list)
- Training Method: QLoRA (4-bit base model + BF16 LoRA adapters)
- Domain: Colonoscopy surgical procedures
- Code: GitHub Repository
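The projector and training setup can be summarized in a minimal sketch. The class name `AudioProjector` and the specific QLoRA hyperparameters (rank, alpha, target modules) below are illustrative assumptions; only the 1280 → 3584 linear mapping and the 4-bit base + BF16 LoRA recipe come from the description above.

```python
import torch
import torch.nn as nn
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Audio projector: maps frozen Whisper encoder features (d=1280) into the
# Qwen2-VL-7B embedding space (d=3584) with a single linear layer.
class AudioProjector(nn.Module):  # class name is illustrative
    def __init__(self, whisper_dim: int = 1280, qwen_dim: int = 3584):
        super().__init__()
        self.proj = nn.Linear(whisper_dim, qwen_dim)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # (batch, n_audio_frames, 1280) -> (batch, n_audio_frames, 3584)
        return self.proj(audio_features)

# QLoRA: 4-bit quantized base model with BF16 LoRA adapters.
# The r / alpha / target_modules values are assumptions for illustration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```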
Why Skip ASR?
Standard pipelines (Audio → ASR → Text → LLM) add latency and propagate transcription errors. Injecting audio embeddings directly into the vision-language model instead (a sketch of this direct path follows the list):
- Delivers roughly 2.5× faster inference (0.9s vs 2.3s end-to-end), measured on single-sample inference on 1× RTX 4090 with the same preprocessing and prompt length for both pipelines.
- Avoids mis-transcriptions entirely, so ASR errors cannot propagate into the model's answers.
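To make "injecting audio embeddings directly" concrete, here is a minimal sketch of the direct path, assuming a loaded Qwen2-VL model, tokenizer, and a trained projector as sketched earlier. The function name, prompt split, and omission of the video frames are simplifications for illustration, not the repository's exact implementation.

```python
import torch

def answer_from_audio(model, tokenizer, projector, audio_features,
                      prompt_prefix, prompt_suffix, max_new_tokens=64):
    """Splice projected audio features into the input embeddings in place
    of a transcribed question (illustrative; video frames omitted)."""
    embed = model.get_input_embeddings()
    prefix_ids = tokenizer(prompt_prefix, return_tensors="pt").input_ids.to(model.device)
    suffix_ids = tokenizer(prompt_suffix, return_tensors="pt").input_ids.to(model.device)

    # Whisper features (1, n_frames, 1280) -> Qwen2-VL space (1, n_frames, 3584)
    audio_embeds = projector(audio_features.to(model.device)).to(embed.weight.dtype)

    # [prompt prefix] [audio embeddings] [prompt suffix] -> generate directly,
    # with no intermediate ASR transcription step.
    inputs_embeds = torch.cat([embed(prefix_ids), audio_embeds, embed(suffix_ids)], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long, device=model.device)
    out = model.generate(inputs_embeds=inputs_embeds, attention_mask=attention_mask,
                         max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```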
Intended Use
Appropriate uses:
- Research on multimodal medical AI
- Benchmarking audio-visual question answering
- Exploring hands-free interfaces for surgical assistance
- Educational demonstrations of VLM fine-tuning
Out of scope:
- Clinical decision-making or diagnosis
- Real patient data processing
- Production deployment without extensive validation
- Any use requiring regulatory approval
Privacy Note
No patient audio was used in training. All audio was synthetically generated using edge-tts from the text questions in the SurgViVQA benchmark. The video frames are from the publicly available SurgViVQA dataset.
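For reference, synthetic question audio of this kind can be produced with edge-tts roughly as follows; the voice, example question, and output path are placeholders rather than the exact settings used for the benchmark audio.

```python
import asyncio
import edge_tts

async def synthesize(question: str, out_path: str, voice: str = "en-US-AriaNeural"):
    # Convert a text question into a spoken audio file (no patient audio involved).
    communicate = edge_tts.Communicate(question, voice)
    await communicate.save(out_path)

asyncio.run(synthesize("Is the polyp occluded by the instrument?", "question_0001.mp3"))
```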
Training Data
Built on the SurgViVQA benchmark by Drago et al. (2025), with text questions converted to audio using edge-tts.
| Split | Samples | Video IDs | Purpose |
|---|---|---|---|
| Train | 2,302 | 002-001, 002-002, 002-003 | Model training |
| Eval | 398 | 002-001, 002-002, 002-003 | Validation |
| Test | 1,000 | 002-004 (held-out) | Generalization testing |
Results
Overall Performance
| Metric | Test Set (Held-Out) |
|---|---|
| Overall Accuracy | 63.4% (634/1000) |
| Zero-Shot Baseline | 46% |
| Improvement | +17.4 points |
Performance by Category
Strong (>75%):
- Occlusion detection: 84% (safety-critical)
- Tool presence, dye detection, visibility: 98-100%
Challenging (<50%):
- Motion direction (5-way): 20%
- Spatial localization (4-way): 20%
Usage
Requirements
```bash
pip install transformers peft torch librosa
```
Inference
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

# Load base model
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "kulsoom-abdullah/surgvivqa-qwen7b-audio")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
```
For full inference code including audio processing, see the GitHub repository.
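In the meantime, here is a hedged sketch of the audio side, assuming the frozen Whisper Large v3 Turbo encoder and the 1280 → 3584 projector described above; the function name is illustrative and the prompt assembly is left to the repository code.

```python
import librosa
import torch
from transformers import AutoFeatureExtractor, WhisperModel

# Frozen Whisper encoder used as the audio feature extractor.
feature_extractor = AutoFeatureExtractor.from_pretrained("openai/whisper-large-v3-turbo")
whisper_encoder = WhisperModel.from_pretrained(
    "openai/whisper-large-v3-turbo", torch_dtype=torch.bfloat16
).encoder.eval()

def encode_audio(path: str) -> torch.Tensor:
    # Whisper expects 16 kHz mono audio.
    waveform, _ = librosa.load(path, sr=16000, mono=True)
    inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        # (1, n_frames, 1280) encoder states, ready for the 1280 -> 3584 projector.
        return whisper_encoder(inputs.input_features.to(whisper_encoder.dtype)).last_hidden_state
```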
Citation
```bibtex
@software{abdullah2026surgvivqa,
  author    = {Abdullah, Kulsoom},
  title     = {SurgViVQA-Audio: Audio-Adapted Qwen2-VL for Surgical Video QA},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/kulsoom-abdullah/SurgViVQA-Audio}
}
```
Dataset Citation
```bibtex
@misc{drago2025surgvivqa,
  title         = {SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding},
  author        = {Mauro Orazio Drago et al.},
  year          = {2025},
  eprint        = {2511.03325},
  archivePrefix = {arXiv}
}
```
Links
- Code: GitHub Repository
- Demo Video: Loom Walkthrough
- Contact: Kulsoom Abdullah