---
license: apache-2.0
language:
- en
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- audio
- multimodal
- qwen3-vl
---

# DeepFrame+ (main)

DeepFrame+ is an audio-visual language model based on Qwen3-VL-8B with an integrated Whisper audio encoder.

## Features

- **Video Understanding**: Process and understand video content with spatio-temporal reasoning
- **Audio Understanding**: Native audio processing via the integrated Whisper encoder
- **Audio-Visual Alignment**: Aligned audio and video representations for joint multimodal understanding
- **vLLM Compatible**: KV caching enabled for efficient inference with vLLM

## Usage

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

# Load the model; trust_remote_code is required for the custom
# audio-visual architecture shipped with this repo.
model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    trust_remote_code=True,
)
```
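Once loaded, the model can be prompted through the processor's chat template. The sketch below assumes the processor follows the standard Qwen-VL chat-template interface (`apply_chat_template` with multimodal content parts); the exact message schema for video and audio inputs in this repo's custom code may differ, and `clip.mp4` is a placeholder path.

```python
# Minimal generation sketch. Assumes the standard Qwen-VL style chat
# template; the content schema for this repo's custom code may differ.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "clip.mp4"},  # placeholder local file
            {"type": "text", "text": "Describe what is said and shown in this clip."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding so only the answer remains.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```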
### With vLLM (recommended for inference)

```bash
vllm serve sdioteam/deepframe --revision main --trust-remote-code
```
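`vllm serve` exposes an OpenAI-compatible API (default `http://localhost:8000/v1`), so the server can be queried with the standard `openai` client. A minimal sketch, assuming the default endpoint and that this vLLM build accepts `video_url` content parts for video input; the URL and prompt are placeholders.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; default endpoint and dummy API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="sdioteam/deepframe",
    messages=[
        {
            "role": "user",
            "content": [
                # `video_url` parts require a vLLM version with video support
                # for this model; the URL is a placeholder.
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
                {"type": "text", "text": "Summarize the audio and visuals of this clip."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```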
## Training

This version includes supervised fine-tuning (SFT) on audio-visual Q&A tasks using LoRA adapters, which have been merged into the base model for inference efficiency.
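For context on the merge step: folding LoRA adapters into base weights is typically done with `peft`'s `merge_and_unload`. The sketch below is illustrative only; the adapter repo name is hypothetical, the published checkpoint already ships pre-merged, and the real merge was performed on DeepFrame+'s custom audio-augmented model class rather than the vanilla base model.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForImageTextToText

# Load the base model, attach the LoRA adapter, then fold the adapter
# weights into the base weights so inference needs no peft at runtime.
base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "sdioteam/deepframe-lora")  # hypothetical adapter repo
merged = model.merge_and_unload()
merged.save_pretrained("deepframe-merged")
```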
## Model Details

- **Base Model**: Qwen/Qwen3-VL-8B-Instruct
- **Audio Encoder**: Whisper-large-v3 (integrated)
- **Audio Projector**: QFormer-based alignment module
- **Parameters**: ~9.8B total
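The QFormer-based projector itself is not documented here; as a rough illustration of the pattern, the following is a minimal sketch of such a module (learnable queries cross-attending to Whisper encoder states, then projected to the LLM hidden size). The class, dimensions, and defaults are assumptions, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class QFormerProjector(nn.Module):
    """QFormer-style audio projector sketch: a fixed set of learnable query
    tokens cross-attends to Whisper encoder states, and the result is
    projected to the LLM hidden size. Dimensions are illustrative, not the
    actual DeepFrame+ configuration."""

    def __init__(self, num_queries=64, audio_dim=1280, llm_dim=4096, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, frames, audio_dim) from the Whisper encoder.
        q = self.queries.unsqueeze(0).expand(audio_states.size(0), -1, -1)
        attended, _ = self.cross_attn(q, audio_states, audio_states)
        # Returns (batch, num_queries, llm_dim): fixed-length audio tokens.
        return self.proj(self.norm(attended))

# Example: 1500 Whisper frames compressed to 64 LLM-space tokens.
projector = QFormerProjector()
audio_tokens = projector(torch.randn(2, 1500, 1280))
print(audio_tokens.shape)  # torch.Size([2, 64, 4096])
```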
## License

Apache 2.0