---
license: apache-2.0
language:
- en
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- audio
- multimodal
- qwen3-vl
---
# DeepFrame+ (main)

DeepFrame+ is an audio-visual language model based on Qwen3-VL-8B with an integrated Whisper audio encoder.
## Features
- Video Understanding: Process and understand video content with spatial-temporal reasoning
- Audio Understanding: Native audio processing via integrated Whisper encoder
- Audio-Visual Alignment: Aligned audio and video representations for multimodal understanding
- vLLM Compatible: KV cache enabled for efficient inference with vLLM
## Usage
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

# Load the merged model; the repo's custom code is required.
model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Processor for text, video, and audio inputs (also custom code).
processor = AutoProcessor.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    trust_remote_code=True,
)
```
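A minimal generation sketch follows. It assumes the Qwen3-VL message schema (`video` and `text` content parts consumed by `apply_chat_template`); since the model ships custom processing code, verify the exact schema, and how the audio track is supplied, against the repository.

```python
messages = [
    {
        "role": "user",
        "content": [
            # Hypothetical local path; we assume the custom processor
            # extracts the audio track from the video itself.
            {"type": "video", "video": "file:///path/to/clip.mp4"},
            {"type": "text", "text": "Describe this clip, including what is said in the audio."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```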
### With vLLM (recommended for inference)
```bash
vllm serve sdioteam/deepframe --revision main --trust-remote-code
```
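Once the server is up, it exposes vLLM's OpenAI-compatible API (port 8000 by default). Below is a hedged client sketch; whether this model accepts `video_url` content through vLLM's multimodal endpoint is an assumption, and the URL is a placeholder.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; "EMPTY" is the conventional dummy key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="sdioteam/deepframe",
    messages=[
        {
            "role": "user",
            "content": [
                # Placeholder URL; assumes vLLM's video_url multimodal input.
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
                {"type": "text", "text": "Summarize this video."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```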
## Training
This version includes supervised fine-tuning (SFT) on audio-visual Q&A tasks using LoRA adapters, which have been merged into the base model for inference efficiency.
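For reference, merging LoRA adapters into a base model is typically done with peft's `merge_and_unload`. This is an illustrative sketch only: the paths are hypothetical, and this repo already contains the merged weights.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Hypothetical paths: a pre-SFT checkpoint and its LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "path/to/deepframe-base", trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, "path/to/audio-visual-sft-lora")
merged = merged.merge_and_unload()  # bake adapter weights into the base
merged.save_pretrained("deepframe-merged")
```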
## Model Details
- Base Model: Qwen/Qwen3-VL-8B-Instruct
- Audio Encoder: Whisper-large-v3 (integrated)
- Audio Projector: QFormer-based alignment module
- Parameters: ~9.8B total
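To illustrate what a QFormer-based projector typically looks like, here is a conceptual sketch, not the repository's implementation: a small set of learned queries cross-attends to the Whisper encoder states and is projected into the LLM embedding space, yielding a fixed number of audio tokens per clip. The 1280/4096 widths are assumptions matching Whisper-large-v3 and an 8B Qwen3 backbone.

```python
import torch
import torch.nn as nn

class QFormerAudioProjector(nn.Module):
    """Illustrative only; not the actual DeepFrame+ module."""

    def __init__(self, audio_dim=1280, llm_dim=4096, num_queries=32,
                 num_heads=8, num_layers=2):
        super().__init__()
        # Learned query tokens that will summarize the audio sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=audio_dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_states):
        # audio_states: (B, T, audio_dim) from the Whisper encoder.
        q = self.queries.unsqueeze(0).expand(audio_states.size(0), -1, -1)
        x = self.decoder(tgt=q, memory=audio_states)  # queries attend to audio
        return self.out_proj(x)  # (B, num_queries, llm_dim) audio tokens
```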
## License
Apache 2.0