Links: arXiv · Website · GitHub

Model Description

QTSplus-7B is a Qwen2.5-VL-based multimodal LLM fine-tuned with the Query-Aware Token Selector (QTSplus), a lightweight visual token selection module that acts as an information gate between the vision encoder and the LLM.

  • Query‑aware selection: scores vision tokens via cross‑attention against the input text query.
  • Adaptive retention: predicts an instance‑specific budget and keeps only the most relevant tokens.
  • Temporal reasoning: a small re‑encoder preserves temporal order with absolute time cues.
  • Efficient long‑video understanding: up to 89% vision token compression and 28% end‑to‑end latency reduction on long videos (see paper for details).
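
For intuition, the sketch below shows the core selection idea in plain PyTorch. It is illustrative only: the function name, shapes, and the fixed retention ratio are assumptions standing in for the learned QTSplus module described in the paper.

import torch

def select_vision_tokens(vision_tokens: torch.Tensor, query_emb: torch.Tensor, rho: float = 0.3,
                         rho_min: float = 0.05, rho_max: float = 0.5,
                         tau_s: float = 0.5, n_max: int = 25600):
    """Toy query-aware selection: score vision tokens against the text query
    and keep only the top-k, where k comes from a clamped retention budget.

    vision_tokens: [num_vision, d]; query_emb: [num_query, d].
    In QTSplus the budget `rho` is predicted per instance; here it is fixed.
    """
    # Cross-attention-style relevance with temperature tau_s, averaged over query tokens.
    scores = (query_emb @ vision_tokens.T) / tau_s      # [num_query, num_vision]
    relevance = scores.softmax(dim=-1).mean(dim=0)      # [num_vision]
    # Clamp the retention ratio and apply the hard cap on selected tokens.
    rho = min(max(rho, rho_min), rho_max)
    k = min(max(1, int(rho * vision_tokens.shape[0])), n_max)
    keep = relevance.topk(k).indices.sort().values      # preserve temporal order
    return vision_tokens[keep], keep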

Intended Uses & Limitations

Intended uses

  • Long‑video question answering and captioning
  • Multi‑image reasoning and story understanding
  • Efficient multimodal chat with reduced latency on long inputs

Limitations

  • May miss fine details if the predicted retention budget is too small.
  • Inherits biases and failure modes from the base Qwen2.5‑VL model and training data.
  • Not a safety‑aligned system; outputs may be inaccurate or unsafe without human oversight.

Quick Start

The repository is designed around a conda‑based Python 3.11 environment with a CUDA‑enabled GPU.

  1. Create and activate the conda environment
conda create -n qtsplus python=3.11 -y
conda activate qtsplus
  2. Install toolchain and CUDA toolkit
conda install conda-forge::gcc=11 conda-forge::gxx=11 -y
conda install nvidia/label/cuda-12.8.1::cuda-toolkit -y
conda install av -c conda-forge -y
  3. Install PyTorch with CUDA 12.8 support
pip3 install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu128
  4. Install core Python libraries
pip install transformers==4.57.1
DS_BUILD_CUTLASS_OPS=0 DS_BUILD_RAGGED_DEVICE_OPS=0 DS_BUILD_EVOFORMER_ATTN=0 pip install deepspeed
pip install accelerate pandas wandb matplotlib scikit-learn datasets evaluate ftfy sentencepiece bitsandbytes
  5. Install FlashAttention (prebuilt wheel)
pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.4.22/flash_attn-2.8.1+cu128torch2.9-cp311-cp311-linux_x86_64.whl

This wheel targets Linux x86_64, CUDA 12.8, PyTorch 2.9.0, and Python 3.11; if your setup differs, install a FlashAttention build compatible with your configuration instead.

  6. Verify installation

After installation, you should be able to run:

python -c "import torch, transformers, deepspeed, accelerate; print(torch.cuda.is_available())"

which should print True on a correctly configured GPU machine.
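
If you installed the FlashAttention wheel, a quick import check confirms it loads against this PyTorch build (assuming the flash_attn package exposes a version string, as recent releases do):

python -c "import flash_attn; print(flash_attn.__version__)"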

Video example

from __future__ import annotations

import argparse
import os
from collections.abc import Mapping
from typing import Any, Dict, Tuple, Union

import torch
from transformers import AutoModelForCausalLM, AutoProcessor


def _uniform_indices(num_frames: int, total: int) -> list[int]:
    if total <= 0 or num_frames <= 0:
        return []
    num_frames = min(int(num_frames), int(total))
    if num_frames == 1:
        return [max(0, (total - 1) // 2)]
    denom = num_frames - 1
    last = total - 1
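    # Evenly spread indices over [0, last] with rounding, e.g. 4 frames from 10 -> [0, 3, 6, 9].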
    return [int((i * last + (denom // 2)) // denom) for i in range(num_frames)]


def _resolve_video_input(path: str, max_frames: int) -> Tuple[Union[str, list[str], list[list[str]]], bool]:
    """Return (videos, is_frames_list)."""
    if os.path.isdir(path):
        exts = (".jpg", ".jpeg", ".png", ".bmp")
        files = sorted([f for f in os.listdir(path) if f.lower().endswith(exts)])
        if not files:
            raise FileNotFoundError(f"No image frames found under: {path}")
        idx = _uniform_indices(max_frames, len(files))
        frames = ["file://" + os.path.join(path, files[i]) for i in idx]
        # For HF video processors: a single "video" represented by a list of frame paths.
        return [frames], True
    return path, False


def _to_device(x: Any, *, device: torch.device, dtype: torch.dtype) -> Any:
    if torch.is_tensor(x):
        if x.is_floating_point():
            return x.to(device=device, dtype=dtype)
        return x.to(device=device)
    if isinstance(x, Mapping):
        return {k: _to_device(v, device=device, dtype=dtype) for k, v in x.items()}
    return x


def main() -> int:
    parser = argparse.ArgumentParser(description="QTSplus-Qwen2.5-VL-7B video QA demo")
    parser.add_argument(
        "--model",
        type=str,
        default="AlpachinoNLP/QTSplus-Qwen2.5-VL-7B",
        help="HuggingFace model ID or path",
    )
    parser.add_argument("--video", type=str, default="your/video/path", help="Path to a video file or a frame dir")
    parser.add_argument(
        "--question",
        type=str,
        default="Summarize the key events in this video.",
        help="Question about the video (if omitted, you will be prompted)",
    )
    parser.add_argument(
        "--system_prompt",
        type=str,
        default="You are a helpful assistant.",
        help="System prompt used by the training chat template",
    )
    parser.add_argument("--fps", type=float, default=1.0, help="Sampling FPS used when `--video` is a video file")
    parser.add_argument("--max_frames", type=int, default=40, help="Max frames used from the video/frames dir")
    parser.add_argument("--min_frames", type=int, default=1, help="Min frames when `--video` is a video file")
    parser.add_argument("--max_pixels", type=int, default=360 * 420, help="Max pixels constraint for video frames")
    parser.add_argument("--max_new_tokens", type=int, default=256, help="Max new tokens to generate")
    args = parser.parse_args()

    question = (args.question or "").strip()
    if not question:
        question = input("Question: ").strip()
    if not question:
        raise SystemExit("Empty question.")

    processor = AutoProcessor.from_pretrained(args.model, trust_remote_code=True)

    # Match training-time I/O:
    # - `processor(...)` must return: input_ids, attention_mask, question_input_ids, vision_input
    # - prompt contains a *single* <|video_pad|> placeholder
    dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
    try:
        model = AutoModelForCausalLM.from_pretrained(
            args.model,
            trust_remote_code=True,
            dtype=dtype,
            device_map="auto" if torch.cuda.is_available() else None,
            low_cpu_mem_usage=True,
        ).eval()
    except TypeError:
        model = AutoModelForCausalLM.from_pretrained(
            args.model,
            trust_remote_code=True,
            torch_dtype=dtype,
            device_map="auto" if torch.cuda.is_available() else None,
            low_cpu_mem_usage=True,
        ).eval()

    videos, is_frames_list = _resolve_video_input(args.video, max_frames=args.max_frames)
    if is_frames_list:
        # When passing a list of frames, Qwen2VLVideoProcessor does not support sampling again.
        videos_kwargs: Dict[str, Any] = {"do_sample_frames": False, "max_pixels": args.max_pixels}
    else:
        videos_kwargs = {
            "fps": float(args.fps),
            "max_frames": int(args.max_frames),
            "min_frames": int(args.min_frames),
            "max_pixels": int(args.max_pixels),
            "do_sample_frames": True,
        }

    inputs: Dict[str, Any] = processor(
        text=question,
        videos=videos,
        return_tensors="pt",
        system_prompt=args.system_prompt,
        videos_kwargs=videos_kwargs,
    )

    # Move nested tensors (vision_input is a dict) to the model device.
    inputs = _to_device(inputs, device=model.device, dtype=model.dtype)

    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=args.max_new_tokens, do_sample=False)

    input_len = int(inputs["input_ids"].shape[1])
    gen_ids = output_ids[0, input_len:]
    answer = processor.tokenizer.decode(gen_ids, skip_special_tokens=True).strip()
    print("\nAnswer:\n" + answer)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

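Assuming the script above is saved as video_demo.py (the filename is arbitrary), a typical invocation looks like:

python video_demo.py \
  --model AlpachinoNLP/QTSplus-Qwen2.5-VL-7B \
  --video /path/to/video.mp4 \
  --question "What are the key events in this video?" \
  --fps 1.0 --max_frames 40 --max_new_tokens 256
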
Multiple images (treated as a video sequence)

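The same script accepts a directory of image frames: _resolve_video_input collects the image files, samples up to --max_frames of them uniformly, and passes them as a single video with do_sample_frames=False so the processor does not resample. For example (paths are placeholders):

python video_demo.py \
  --model AlpachinoNLP/QTSplus-Qwen2.5-VL-7B \
  --video /path/to/frames_dir \
  --question "Describe the story these images tell." \
  --max_frames 16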

Notes

  • The chat template is applied via processor.apply_chat_template inside the processor, so the example above only needs the question text, a system prompt, and the video input.
  • QTSplus expects the vision payload under the vision_input keyword argument during generation.
  • For fully offline use, pass local_files_only=True to from_pretrained calls once the files are cached locally.
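
For example, once the repository files are cached locally (e.g. after a prior online run or a huggingface-cli download), loading can be forced offline like this:

from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "AlpachinoNLP/QTSplus-Qwen2.5-VL-7B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, local_files_only=True)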

Efficiency & Controls

The following QTSplus hyperparameters in config.json control compression and selection behavior:

  • qts_plus_rho_min / qts_plus_rho_max: min/max retention ratio bounds (default: 0.05 / 0.5).
  • qts_plus_tau_s: scoring temperature for cross-attention (default: 0.5).
  • qts_plus_nmax: hard cap on selected tokens per sample (default: 25600).

These settings trade off detail against speed and memory. See the paper for guidance, ablations, and latency/throughput measurements.
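
As a sketch of how you might tighten these at load time (how the repository's remote modeling code consumes such overrides is defined by that code, so treat this as an assumption to verify):

from transformers import AutoConfig, AutoModelForCausalLM

model_id = "AlpachinoNLP/QTSplus-Qwen2.5-VL-7B"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.qts_plus_rho_max = 0.25   # retain at most 25% of vision tokens
config.qts_plus_nmax = 8192      # tighter hard cap on selected tokens
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)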

Safety, Bias, and Limitations

  • Outputs may be factually incorrect, biased, or unsafe. Do not use without human oversight.
  • QTSplus compresses the vision stream; extremely small budgets may drop rare but important details.
  • Inherits safety/bias characteristics from the underlying Qwen2.5‑VL model and training data.

Citation

If you find this work helpful, please cite:

@misc{li2025seeingforesttreesqueryaware,
  title        = {Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models},
  author       = {Siyou Li and Huanan Wu and Juexi Shao and Yinghao Ma and Yujian Gan and Yihao Luo and Yuwei Wang and Dong Nie and Lu Wang and Wengqing Wu and Le Zhang and Massimo Poesio and Juntao Yu},
  year         = {2025},
  eprint       = {2511.11910},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2511.11910}
}