---
license: apache-2.0
language:
- en
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- audio
- multimodal
- qwen3-vl
---

# DeepFrame+ (main)

DeepFrame+ is an audio-visual language model based on Qwen3-VL-8B with an integrated Whisper audio encoder.

## Features

- **Video Understanding**: Process and understand video content with spatio-temporal reasoning
- **Audio Understanding**: Native audio processing via the integrated Whisper encoder
- **Audio-Visual Alignment**: Aligned audio and video representations for joint multimodal understanding
- **vLLM Compatible**: KV caching enabled for efficient inference with vLLM

## Usage

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

# Load the model; trust_remote_code is required for the custom
# audio-visual architecture shipped with this repo.
model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    trust_remote_code=True,
)
```
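Once loaded, the model can be prompted through the processor's chat template. The sketch below assumes the processor follows the standard Qwen-VL chat-template interface (`apply_chat_template` with multimodal content parts); the exact message schema for video and audio inputs in this repo's custom code may differ, and `clip.mp4` is a placeholder path.

```python
# Minimal generation sketch. Assumes the standard Qwen-VL style chat
# template; the content schema for this repo's custom code may differ.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "clip.mp4"},  # placeholder local file
            {"type": "text", "text": "Describe what is said and shown in this clip."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding so only the answer remains.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```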
### With vLLM (recommended for inference)

```bash
vllm serve sdioteam/deepframe --revision main --trust-remote-code
```
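`vllm serve` exposes an OpenAI-compatible API (default `http://localhost:8000/v1`), so the server can be queried with the standard `openai` client. A minimal sketch, assuming the default endpoint and that this vLLM build accepts `video_url` content parts for video input; the URL and prompt are placeholders.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; default endpoint and dummy API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="sdioteam/deepframe",
    messages=[
        {
            "role": "user",
            "content": [
                # `video_url` parts require a vLLM version with video support
                # for this model; the URL is a placeholder.
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
                {"type": "text", "text": "Summarize the audio and visuals of this clip."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```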
## Training

This version includes supervised fine-tuning (SFT) on audio-visual Q&A tasks using LoRA adapters, which have been merged into the base model for inference efficiency.
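For context on the merge step: folding LoRA adapters into base weights is typically done with `peft`'s `merge_and_unload`. The sketch below is illustrative only; the adapter repo name is hypothetical, the published checkpoint already ships pre-merged, and the real merge was performed on DeepFrame+'s custom audio-augmented model class rather than the vanilla base model.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForImageTextToText

# Load the base model, attach the LoRA adapter, then fold the adapter
# weights into the base weights so inference needs no peft at runtime.
base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "sdioteam/deepframe-lora")  # hypothetical adapter repo
merged = model.merge_and_unload()
merged.save_pretrained("deepframe-merged")
```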
## Model Details

- **Base Model**: Qwen/Qwen3-VL-8B-Instruct
- **Audio Encoder**: Whisper-large-v3 (integrated)
- **Audio Projector**: QFormer-based alignment module
- **Parameters**: ~9.8B total
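The QFormer-based projector itself is not documented here; as a rough illustration of the pattern, the following is a minimal sketch of such a module (learnable queries cross-attending to Whisper encoder states, then projected to the LLM hidden size). The class, dimensions, and defaults are assumptions, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class QFormerProjector(nn.Module):
    """QFormer-style audio projector sketch: a fixed set of learnable query
    tokens cross-attends to Whisper encoder states, and the result is
    projected to the LLM hidden size. Dimensions are illustrative, not the
    actual DeepFrame+ configuration."""

    def __init__(self, num_queries=64, audio_dim=1280, llm_dim=4096, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, frames, audio_dim) from the Whisper encoder.
        q = self.queries.unsqueeze(0).expand(audio_states.size(0), -1, -1)
        attended, _ = self.cross_attn(q, audio_states, audio_states)
        # Returns (batch, num_queries, llm_dim): fixed-length audio tokens.
        return self.proj(self.norm(attended))

# Example: 1500 Whisper frames compressed to 64 LLM-space tokens.
projector = QFormerProjector()
audio_tokens = projector(torch.randn(2, 1500, 1280))
print(audio_tokens.shape)  # torch.Size([2, 64, 4096])
```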
## License

Apache 2.0