SurgViVQA-Audio: Audio-Adapted Qwen2-VL for Surgical Video QA

This repository contains the weights for SurgViVQA-Audio, an audio-adapted model built on the SurgViVQA benchmark introduced in the paper SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding.

This model adapts a Whisper audio encoder to feed directly into Qwen2-VL, enabling hands-free surgical video question answering without intermediate ASR transcription.

⚠️ Research prototype only. Not for clinical use.

Model Description

Why Skip ASR?

Standard pipelines (Audio → ASR → Text → LLM) add latency and propagate transcription errors. Injecting audio embeddings directly into the vision-language model avoids both (see the sketch after this list):

  • 2.5× faster inference (0.9 s vs. 2.3 s end-to-end)
    • Measured on single-sample inference on 1× RTX 4090, with the same preprocessing and prompt length for both pipelines.
  • No ASR transcript in the loop, so mis-transcriptions cannot propagate into the answer
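The sketch below illustrates the general idea: a small projector maps Whisper encoder states into the LLM embedding space so audio tokens can be spliced into the input sequence alongside vision and text tokens. The class name, the two-layer MLP design, and the dimensions (1280 for the Whisper-large encoder, 3584 for Qwen2-VL-7B) are illustrative assumptions, not the exact adapter shipped with these weights.

import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    # Illustrative adapter: projects Whisper encoder hidden states into the
    # LLM embedding space so they can be consumed as "audio tokens".
    def __init__(self, audio_dim: int = 1280, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, whisper_hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, audio_frames, 1280) -> (batch, audio_frames, 3584)
        return self.proj(whisper_hidden_states)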

Intended Use

Appropriate uses:

  • Research on multimodal medical AI
  • Benchmarking audio-visual question answering
  • Exploring hands-free interfaces for surgical assistance
  • Educational demonstrations of VLM fine-tuning

Out of scope:

  • Clinical decision-making or diagnosis
  • Real patient data processing
  • Production deployment without extensive validation
  • Any use requiring regulatory approval

Privacy Note

No patient audio was used in training. All audio was synthetically generated using edge-tts from the text questions in the SurgViVQA benchmark. The video frames are from the publicly available SurgViVQA dataset.
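As a rough illustration of how such synthetic audio can be produced, the snippet below turns one benchmark question into speech with edge-tts. The voice, output format, and example question are illustrative choices, not necessarily those used to build the training set.

import asyncio
import edge_tts

async def synthesize(question: str, out_path: str, voice: str = "en-US-AriaNeural") -> None:
    # Convert a text question into spoken audio with edge-tts.
    communicate = edge_tts.Communicate(question, voice)
    await communicate.save(out_path)

asyncio.run(synthesize("Is the grasper visible in this clip?", "question.mp3"))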

Training Data

Built on the SurgViVQA benchmark by Drago et al. (2025), with text questions converted to audio using edge-tts.

Split   Samples   Video IDs                    Purpose
Train   2,302     002-001, 002-002, 002-003    Model training
Eval    398       002-001, 002-002, 002-003    Validation
Test    1,000     002-004 (held-out)           Generalization testing
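To reproduce a comparable split, the sketch below filters QA records by video ID so that 002-004 never appears in training. The file name and the video_id field are assumptions about the record layout, not the benchmark's actual schema.

import json

with open("surgvivqa_audio_qa.json") as f:  # hypothetical export of the QA records
    records = json.load(f)

HELD_OUT = {"002-004"}  # test video, never seen during training

train_eval = [r for r in records if r["video_id"] not in HELD_OUT]
test = [r for r in records if r["video_id"] in HELD_OUT]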

Results

Overall Performance

Metric               Test Set (Held-Out)
Overall Accuracy     63.4% (634/1000)
Zero-Shot Baseline   46%
Improvement          +17.4 points

Performance by Category

Strong (>75%):

  • Occlusion detection: 84% (safety-critical)
  • Tool presence, dye detection, visibility: 98-100%

Challenging (<50%):

  • Motion direction (5-way): 20%
  • Spatial localization (4-way): 20%

Usage

Requirements

pip install transformers peft torch librosa

Inference

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

# Load base model
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "kulsoom-abdullah/surgvivqa-qwen7b-audio")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

For full inference code including audio processing, see the GitHub repository.
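As a rough sketch of the audio side, the snippet below resamples a spoken question to 16 kHz with librosa and runs it through a Whisper encoder to obtain hidden states. The Whisper checkpoint and the assumption that the adapter consumes encoder hidden states are illustrative; the actual feature injection into Qwen2-VL is implemented in the repository code.

import librosa
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
whisper_encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder

# Whisper expects 16 kHz mono audio
waveform, _ = librosa.load("question.mp3", sr=16000)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
audio_hidden_states = whisper_encoder(inputs.input_features).last_hidden_state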

Citation

@software{abdullah2026surgvivqa,
  author = {Abdullah, Kulsoom},
  title = {SurgViVQA-Audio: Audio-Adapted Qwen2-VL for Surgical Video QA},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/kulsoom-abdullah/SurgViVQA-Audio}
}

Dataset Citation

@misc{drago2025surgvivqa,
  title={SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding},
  author={Drago, Mauro Orazio and others},
  year={2025},
  eprint={2511.03325},
  archivePrefix={arXiv}
}
