---
license: apache-2.0
language:
- en
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- audio
- multimodal
- qwen3-vl
---

# DeepFrame+ (main)

DeepFrame+ is an audio-visual language model based on Qwen3-VL-8B with an integrated Whisper audio encoder.

## Features

- **Video Understanding**: Process and understand video content with spatial-temporal reasoning
- **Audio Understanding**: Native audio processing via an integrated Whisper encoder
- **Audio-Visual Alignment**: Aligned audio and video representations for multimodal understanding
- **vLLM Compatible**: KV cache enabled for efficient inference with vLLM

## Usage

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

# Load the merged audio-visual model from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# The processor handles text, video, and audio inputs
processor = AutoProcessor.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    trust_remote_code=True,
)
```
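
Once the model and processor are loaded, a chat-style request can be assembled with the processor's chat template. The snippet below is a minimal sketch that assumes the custom processor follows a Qwen-VL-style message schema ("video" and "text" content entries) and supports `apply_chat_template` with tokenization; the exact keys and video-loading path may differ in the code shipped with the repo.

```python
# Hypothetical example: the message schema and video handling depend on the
# custom processor bundled with the repository (trust_remote_code).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "clip.mp4"},  # local path to a short clip
            {"type": "text", "text": "Describe the scene and summarize any speech."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the answer
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```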

### With vLLM (recommended for inference)

```bash
vllm serve sdioteam/deepframe --revision main --trust-remote-code
```
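
By default the vLLM server exposes an OpenAI-compatible API on port 8000, so any OpenAI client can query it. The example below is a sketch of a plain text request; how video and audio inputs are passed to the server depends on vLLM's multimodal support for this custom architecture and is not shown here.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (no real API key needed)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="sdioteam/deepframe",
    messages=[{"role": "user", "content": "What can this model do with video and audio?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```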

## Training

This version includes supervised fine-tuning (SFT) on audio-visual Q&A tasks using LoRA adapters, which have been merged into the base model for inference efficiency.
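
The published checkpoint already contains the merged weights, so no adapter loading is required at inference time. For reference, the sketch below shows how a LoRA adapter is typically folded into a base model with PEFT; the adapter path is a hypothetical placeholder and this is not the exact training pipeline.

```python
from peft import PeftModel

# `base_model` is an already-loaded base model (see Usage above);
# "path/to/lora-adapter" is a hypothetical placeholder.
merged = PeftModel.from_pretrained(base_model, "path/to/lora-adapter")
merged = merged.merge_and_unload()

# Persist the standalone merged checkpoint for efficient inference
merged.save_pretrained("deepframe-merged")
```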

## Model Details

- **Base Model**: Qwen/Qwen3-VL-8B-Instruct
- **Audio Encoder**: Whisper-large-v3 (integrated)
- **Audio Projector**: QFormer-based alignment module (see the sketch below)
- **Parameters**: ~9.8B total
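
To give an idea of what a QFormer-based projector looks like, the sketch below shows a generic module in which a fixed set of learned queries cross-attends to Whisper encoder features and is projected to the language model's hidden size. The dimensions, layer count, and class name are assumptions for illustration, not the actual DeepFrame+ configuration.

```python
import torch
import torch.nn as nn

class QFormerStyleProjector(nn.Module):
    """Generic QFormer-style audio projector (illustrative, not the real module)."""

    def __init__(self, num_queries=64, audio_dim=1280, llm_dim=4096, num_heads=8):
        super().__init__()
        # Learned query tokens that summarize the audio sequence
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        # Cross-attention: queries attend over Whisper encoder outputs
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)
        # Project the attended queries into the LLM embedding space
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats):  # audio_feats: (batch, time, audio_dim)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, audio_feats, audio_feats)
        return self.proj(self.norm(attended))  # (batch, num_queries, llm_dim)

# Example: project 30 s of Whisper-large-v3 features (1500 frames, dim 1280)
tokens = QFormerStyleProjector()(torch.randn(1, 1500, 1280))
print(tokens.shape)  # torch.Size([1, 64, 4096])
```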

## License

Apache 2.0