---
license: apache-2.0
language:
- en
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- audio
- multimodal
- qwen3-vl
---
# DeepFrame+ (main)
DeepFrame+ is an audio-visual language model built on Qwen3-VL-8B with an integrated Whisper audio encoder.
## Features
- **Video Understanding**: Process and understand video content with spatio-temporal reasoning
- **Audio Understanding**: Native audio processing via integrated Whisper encoder
- **Audio-Visual Alignment**: Aligned audio and video representations for multimodal understanding
- **vLLM Compatible**: KV cache enabled for efficient inference with vLLM
## Usage
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
# trust_remote_code is required: the audio encoder and projector are
# custom components shipped with the repository.
model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    trust_remote_code=True,
)
```
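Once the model and processor are loaded, prompts follow the Qwen-VL chat-message convention. A minimal sketch of building such a message (the exact content keys for video, and any audio-specific fields, are assumptions based on the Qwen3-VL processor API; verify against this model's remote code):

```python
# Build a chat message in the Qwen-VL content-list format.
# The "video" content type follows Qwen3-VL conventions; the custom
# processor for this model may expect different keys for audio input.
def build_messages(video_path: str, question: str) -> list:
    """Return a single-turn user message pairing a video with a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_messages("clip.mp4", "What happens in this clip?")
# Typical next steps (sketch, not verified against the remote code):
#   text = processor.apply_chat_template(messages, add_generation_prompt=True)
#   inputs = processor(text=text, videos=[...], return_tensors="pt").to(model.device)
#   output_ids = model.generate(**inputs, max_new_tokens=256)
```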
### With vLLM (recommended for inference)
```bash
vllm serve sdioteam/deepframe --revision main --trust-remote-code
```
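The server exposes vLLM's OpenAI-compatible API (port 8000 by default). A sketch of a chat-completion request payload for a video question; the `video_url` content part and its schema are assumptions that depend on your vLLM version's multimodal input support:

```python
import json

def chat_request(model: str, video_url: str, question: str) -> dict:
    """Build an OpenAI-style chat-completion payload for a video question."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # "video_url" is an assumption; check the vLLM docs for
                    # the multimodal content types your version accepts.
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 256,
    }

payload = chat_request(
    "sdioteam/deepframe",
    "https://example.com/clip.mp4",
    "Describe what happens in the video.",
)
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions,
# e.g. requests.post(url, json=payload)
```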
## Training
This version includes supervised fine-tuning (SFT) on audio-visual Q&A tasks using LoRA adapters,
which have been merged into the base model for inference efficiency.
## Model Details
- **Base Model**: Qwen/Qwen3-VL-8B-Instruct
- **Audio Encoder**: Whisper-large-v3 (integrated)
- **Audio Projector**: QFormer-based alignment module
- **Parameters**: ~9.8B total
## License
Apache 2.0