---
license: apache-2.0
language:
  - en
pipeline_tag: video-text-to-text
library_name: transformers
tags:
  - video
  - audio
  - multimodal
  - qwen3-vl
---

# DeepFrame+ (main)

DeepFrame+ is an audio-visual language model based on Qwen3-VL-8B with an integrated Whisper audio encoder.

## Features

- **Video Understanding**: Process and understand video content with spatial-temporal reasoning
- **Audio Understanding**: Native audio processing via the integrated Whisper encoder
- **Audio-Visual Alignment**: Aligned audio and video representations for multimodal understanding
- **vLLM Compatible**: KV cache enabled for efficient inference with vLLM

## Usage

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # required: the model uses custom code for the audio pathway
)
processor = AutoProcessor.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    trust_remote_code=True,
)
```

### With vLLM (recommended for inference)

```bash
vllm serve sdioteam/deepframe --revision main --trust-remote-code
```
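Once serving, vLLM exposes an OpenAI-compatible HTTP API (by default on `http://localhost:8000`). A minimal sketch of a chat-completions request with a video input is shown below; the video URL is a placeholder, and the `video_url` content type is an assumption based on vLLM's multimodal chat format rather than something this card documents:

```python
import json

# Build an OpenAI-style chat-completions payload with a video plus a text
# prompt. The URL is a placeholder; replace it with your own video.
payload = {
    "model": "sdioteam/deepframe",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video_url",
                 "video_url": {"url": "https://example.com/clip.mp4"}},
                {"type": "text",
                 "text": "Describe what is happening in this video."},
            ],
        }
    ],
    "max_tokens": 256,
}

# To send it against a running server (needs the `requests` package):
# import requests
# resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
# print(resp.json()["choices"][0]["message"]["content"])
print(json.dumps(payload)[:40])
```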

## Training

This version includes supervised fine-tuning (SFT) on audio-visual Q&A tasks using LoRA adapters, which have been merged into the base model for inference efficiency.
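For intuition, merging a LoRA adapter means folding its low-rank update back into the frozen base weight, so inference runs a single matmul with no adapter overhead. A minimal NumPy sketch (illustrative only, not the actual training code; dimensions and the `alpha` scaling convention are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 8, 8, 2, 4.0

W_base = rng.standard_normal((d_out, d_in))    # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01   # LoRA "down" projection
B = rng.standard_normal((d_out, rank)) * 0.01  # LoRA "up" projection
scaling = alpha / rank

# Merging: W' = W + scaling * (B @ A) — the adapter disappears into the weight.
W_merged = W_base + scaling * (B @ A)

# A plain forward pass with W_merged matches base-plus-adapter inference.
x = rng.standard_normal(d_in)
assert np.allclose(W_merged @ x, W_base @ x + scaling * (B @ (A @ x)))
```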

## Model Details

- **Base Model**: Qwen/Qwen3-VL-8B-Instruct
- **Audio Encoder**: Whisper-large-v3 (integrated)
- **Audio Projector**: QFormer-based alignment module
- **Parameters**: ~9.8B total
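The core idea of a QFormer-style projector is that a fixed set of learnable query tokens cross-attends to variable-length Whisper features and emits a fixed number of audio tokens for the language model. The NumPy sketch below is a hypothetical simplification (single-head attention, made-up dimensions); the real module is transformer-based:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_audio, d_model, num_queries, t_audio = 16, 12, 4, 50

queries = rng.standard_normal((num_queries, d_model))  # learnable query tokens
W_k = rng.standard_normal((d_audio, d_model)) * 0.1    # key projection
W_v = rng.standard_normal((d_audio, d_model)) * 0.1    # value projection

# Variable-length audio features, standing in for Whisper encoder output.
audio_feats = rng.standard_normal((t_audio, d_audio))

K = audio_feats @ W_k                                  # (t_audio, d_model)
V = audio_feats @ W_v                                  # (t_audio, d_model)
attn = softmax(queries @ K.T / np.sqrt(d_model))       # (num_queries, t_audio)
audio_tokens = attn @ V                                # fixed-size output

# However long the audio is, the LM always receives `num_queries` tokens.
assert audio_tokens.shape == (num_queries, d_model)
```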

## License

Apache 2.0