---
license: apache-2.0
language:
- en
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- audio
- multimodal
- qwen3-vl
---
# DeepFrame+ (main)

DeepFrame+ is an audio-visual language model based on Qwen3-VL-8B with an integrated Whisper audio encoder.
## Features
- Video Understanding: Process and understand video content with spatial-temporal reasoning
- Audio Understanding: Native audio processing via integrated Whisper encoder
- Audio-Visual Alignment: Aligned audio and video representations for multimodal understanding
- vLLM Compatible: KV cache enabled for efficient inference with vLLM
## Usage
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

# Load the merged model; the repo's custom code is required.
model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Processor for text, video, and audio inputs (also custom code).
processor = AutoProcessor.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    trust_remote_code=True,
)
```
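A minimal generation sketch follows. It assumes the Qwen3-VL message schema (`video` and `text` content parts consumed by `apply_chat_template`); since the model ships custom processing code, verify the exact schema, and how the audio track is supplied, against the repository.

```python
messages = [
    {
        "role": "user",
        "content": [
            # Hypothetical local path; we assume the custom processor
            # extracts the audio track from the video itself.
            {"type": "video", "video": "file:///path/to/clip.mp4"},
            {"type": "text", "text": "Describe this clip, including what is said in the audio."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```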
### With vLLM (recommended for inference)
```bash
vllm serve sdioteam/deepframe --revision main --trust-remote-code
```
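Once the server is up, it exposes vLLM's OpenAI-compatible API (port 8000 by default). Below is a hedged client sketch; whether this model accepts `video_url` content through vLLM's multimodal endpoint is an assumption, and the URL is a placeholder.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; "EMPTY" is the conventional dummy key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="sdioteam/deepframe",
    messages=[
        {
            "role": "user",
            "content": [
                # Placeholder URL; assumes vLLM's video_url multimodal input.
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
                {"type": "text", "text": "Summarize this video."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```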
## Training
This version includes supervised fine-tuning (SFT) on audio-visual Q&A tasks using LoRA adapters, which have been merged into the base model for inference efficiency.
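For reference, merging LoRA adapters into a base model is typically done with peft's `merge_and_unload`. This is an illustrative sketch only: the paths are hypothetical, and this repo already contains the merged weights.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Hypothetical paths: a pre-SFT checkpoint and its LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "path/to/deepframe-base", trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, "path/to/audio-visual-sft-lora")
merged = merged.merge_and_unload()  # bake adapter weights into the base
merged.save_pretrained("deepframe-merged")
```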
## Model Details
- Base Model: Qwen/Qwen3-VL-8B-Instruct
- Audio Encoder: Whisper-large-v3 (integrated)
- Audio Projector: QFormer-based alignment module
- Parameters: ~9.8B total
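To illustrate what a QFormer-based projector typically looks like, here is a conceptual sketch, not the repository's implementation: a small set of learned queries cross-attends to the Whisper encoder states and is projected into the LLM embedding space, yielding a fixed number of audio tokens per clip. The 1280/4096 widths are assumptions matching Whisper-large-v3 and an 8B Qwen3 backbone.

```python
import torch
import torch.nn as nn

class QFormerAudioProjector(nn.Module):
    """Illustrative only; not the actual DeepFrame+ module."""

    def __init__(self, audio_dim=1280, llm_dim=4096, num_queries=32,
                 num_heads=8, num_layers=2):
        super().__init__()
        # Learned query tokens that will summarize the audio sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=audio_dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_states):
        # audio_states: (B, T, audio_dim) from the Whisper encoder.
        q = self.queries.unsqueeze(0).expand(audio_states.size(0), -1, -1)
        x = self.decoder(tgt=q, memory=audio_states)  # queries attend to audio
        return self.out_proj(x)  # (B, num_queries, llm_dim) audio tokens
```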
## License
Apache 2.0