---
license: apache-2.0
language:
- en
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- audio
- multimodal
- qwen3-vl
---

# DeepFrame+ (main)

DeepFrame+ is an audio-visual language model based on Qwen3-VL-8B with an integrated Whisper audio encoder.

## Features

- **Video Understanding**: Process and understand video content with spatial-temporal reasoning
- **Audio Understanding**: Native audio processing via an integrated Whisper encoder
- **Audio-Visual Alignment**: Aligned audio and video representations for multimodal understanding
- **vLLM Compatible**: KV cache enabled for efficient inference with vLLM

## Usage

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

# Load the merged audio-visual model from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# The processor handles text, video, and audio inputs
processor = AutoProcessor.from_pretrained(
    "sdioteam/deepframe",
    revision="main",
    trust_remote_code=True,
)
```
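
Once the model and processor are loaded, a chat-style request can be assembled with the processor's chat template. The snippet below is a minimal sketch that assumes the custom processor follows a Qwen-VL-style message schema ("video" and "text" content entries) and supports `apply_chat_template` with tokenization; the exact keys and video-loading path may differ in the code shipped with the repo.

```python
# Hypothetical example: the message schema and video handling depend on the
# custom processor bundled with the repository (trust_remote_code).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "clip.mp4"},  # local path to a short clip
            {"type": "text", "text": "Describe the scene and summarize any speech."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the answer
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```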

### With vLLM (recommended for inference)

```bash
vllm serve sdioteam/deepframe --revision main --trust-remote-code
```
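
By default the vLLM server exposes an OpenAI-compatible API on port 8000, so any OpenAI client can query it. The example below is a sketch of a plain text request; how video and audio inputs are passed to the server depends on vLLM's multimodal support for this custom architecture and is not shown here.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (no real API key needed)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="sdioteam/deepframe",
    messages=[{"role": "user", "content": "What can this model do with video and audio?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```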

## Training

This version includes supervised fine-tuning (SFT) on audio-visual Q&A tasks using LoRA adapters, which have been merged into the base model for inference efficiency.
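
The published checkpoint already contains the merged weights, so no adapter loading is required at inference time. For reference, the sketch below shows how a LoRA adapter is typically folded into a base model with PEFT; the adapter path is a hypothetical placeholder and this is not the exact training pipeline.

```python
from peft import PeftModel

# `base_model` is an already-loaded base model (see Usage above);
# "path/to/lora-adapter" is a hypothetical placeholder.
merged = PeftModel.from_pretrained(base_model, "path/to/lora-adapter")
merged = merged.merge_and_unload()

# Persist the standalone merged checkpoint for efficient inference
merged.save_pretrained("deepframe-merged")
```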

## Model Details

- **Base Model**: Qwen/Qwen3-VL-8B-Instruct
- **Audio Encoder**: Whisper-large-v3 (integrated)
- **Audio Projector**: QFormer-based alignment module (see the sketch below)
- **Parameters**: ~9.8B total
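
To give an idea of what a QFormer-based projector looks like, the sketch below shows a generic module in which a fixed set of learned queries cross-attends to Whisper encoder features and is projected to the language model's hidden size. The dimensions, layer count, and class name are assumptions for illustration, not the actual DeepFrame+ configuration.

```python
import torch
import torch.nn as nn

class QFormerStyleProjector(nn.Module):
    """Generic QFormer-style audio projector (illustrative, not the real module)."""

    def __init__(self, num_queries=64, audio_dim=1280, llm_dim=4096, num_heads=8):
        super().__init__()
        # Learned query tokens that summarize the audio sequence
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        # Cross-attention: queries attend over Whisper encoder outputs
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)
        # Project the attended queries into the LLM embedding space
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats):  # audio_feats: (batch, time, audio_dim)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, audio_feats, audio_feats)
        return self.proj(self.norm(attended))  # (batch, num_queries, llm_dim)

# Example: project 30 s of Whisper-large-v3 features (1500 frames, dim 1280)
tokens = QFormerStyleProjector()(torch.randn(1, 1500, 1280))
print(tokens.shape)  # torch.Size([1, 64, 4096])
```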

## License

Apache 2.0