Vista-LLM Semantic Scout (Fine-tuned BLIP-2)

This repository contains the fine-tuned BLIP-2 model used as the Semantic Scout in the paper: [Vista-LLM: Decoupled Query-Guided Visual Token Pruning for Efficient Long-Video Large Language Models].

Official GitHub Repository: lizhenyu-123/Vista-LLM

📖 Model Overview

In Long-Video Large Language Models (Video-LLMs), the sheer number of visual tokens is a significant computational bottleneck. Vista-LLM introduces a decoupled framework that performs query-guided visual token pruning before LLM inference.

Because the pruning step is decoupled from the main model, both the primary Video-LLM and the BLIP-2 Vision Encoder remain fully compatible with optimized attention kernels such as Flash Attention. The Q-Former does structurally require materializing attention matrices to compute cross-modal importance scores, but it is lightweight and executes in parallel, so it introduces no wall-clock latency overhead.

⚙️ Architecture Details

Base Architecture: The model is built upon the Blip2ForImageTextRetrieval class (BLIP-2 Image-Text Matching).
Vision Encoder: Frozen during fine-tuning to retain powerful pre-trained visual representations.
Q-Former: Fine-tuned to extract text-conditioned visual features via cross-attention.
Note on Key Token Projector: As described in the paper, a Key Token Projector was introduced to project visual features into the Image-Text Contrastive embedding space. This module is used only during training, where it serves to compute the alignment loss. The final saved model weights follow the standard Blip2ForImageTextRetrieval architecture, which allows seamless inference.
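
Because the saved weights follow the standard Blip2ForImageTextRetrieval layout, they should load through the stock Hugging Face transformers class. The sketch below shows generic image-text matching with that class; it is not the Vista-LLM pruning pipeline itself. The function name, the `model_dir` argument, and the assumption that processor files sit next to the weights are illustrative. If no processor is bundled, one from a base BLIP-2 ITM checkpoint would be needed instead.

```python
def score_frame_against_query(model_dir, image, query):
    """Return the ITM "match" probability for one PIL image and one text query.

    model_dir is a hypothetical local path to the downloaded weights.
    """
    # Imported lazily so this module can be imported without these packages.
    import torch
    from transformers import AutoProcessor, Blip2ForImageTextRetrieval

    # Assumption: the processor config is stored alongside the weights.
    processor = AutoProcessor.from_pretrained(model_dir)
    model = Blip2ForImageTextRetrieval.from_pretrained(model_dir).eval()

    inputs = processor(images=image, text=query, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, use_image_text_matching_head=True)

    # With the ITM head, logits_per_image has shape (batch, 2):
    # index 0 is "no match", index 1 is "match".
    probs = outputs.logits_per_image.softmax(dim=1)
    return probs[0, 1].item()
```

For repeated scoring, the processor and model would normally be loaded once and reused rather than reloaded on every call.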

🚀 How to Use

To use this Semantic Scout model within the Vista-LLM framework, please follow these steps:

Download the model: Clone or download this Hugging Face repository to your local server.
Setup the main repository: Clone our official Vista-LLM GitHub repository.
Configure the path: Move the downloaded model weights to your designated local model directory.
Run the script: Open the corresponding execution script in the GitHub repository, and update the semantic scout model path argument to point to your local directory containing these weights.

For detailed environment setup and inference commands, please refer to the README.md in the GitHub repository.
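
The download and path-configuration steps above can be sketched in Python with the `huggingface_hub` package. This is only one possible way to fetch the files. The helper names and the `./models` default are hypothetical, and `repo_id` must be replaced by the identifier of this Hugging Face repository, which is not spelled out here.

```python
from pathlib import Path


def local_model_dir(repo_id, local_root="./models"):
    """Build the local directory for a repo, e.g. <local_root>/<repo name>."""
    return Path(local_root) / repo_id.split("/")[-1]


def download_semantic_scout(repo_id, local_root="./models"):
    """Download all files of the given Hub repository into a local directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, local_root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

The returned directory is the value to pass as the semantic scout model path in the execution script.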

📊 Training Details

Dataset

The model was fine-tuned using region descriptions from the Visual Genome dataset. To ensure semantic density, phrases containing fewer than three words were excluded. An online hard negative mining strategy was employed to enhance discriminative capability, sampling negative counterparts based on a multinomial distribution derived from image-text similarities.
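
The filtering rule and the negative sampling described above can be illustrated with a small standard-library sketch. It assumes the sampling weights are a softmax over the similarity scores with the positive pair excluded; the text only says the multinomial distribution is "derived from" the similarities, so the exact weighting used in training may differ. The function names are hypothetical.

```python
import math
import random


def keep_phrase(phrase, min_words=3):
    """Keep a region description only if it has at least min_words words."""
    return len(phrase.split()) >= min_words


def sample_hard_negative(similarities, positive_index, rng=random):
    """Sample one negative index, favoring candidates with high similarity.

    similarities[i] is the image-text similarity to candidate i.
    Assumption: weights are a softmax over the non-positive candidates.
    """
    candidates = [i for i in range(len(similarities)) if i != positive_index]
    # Subtract the maximum before exponentiating for numerical stability.
    max_sim = max(similarities[i] for i in candidates)
    weights = [math.exp(similarities[i] - max_sim) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


phrases = ["red car", "a red car parked outside", "man walking dog"]
print([p for p in phrases if keep_phrase(p)])
```

Candidates that look most similar to the anchor are drawn most often, which is what makes the sampled negatives "hard".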

Training Objectives

The model was optimized using a weighted sum of three loss components:

L_ITM: Image-Text Matching loss.
L_ITC: Image-Text Contrastive loss.
L_Align: A custom soft top-K alignment loss that guides the model to focus on key visual regions.

$L_{total} = L_{ITM} + L_{ITC} + \alpha L_{Align}$

Hyperparameters

Optimizer: AdamW
Initial Learning Rate: 5e-5
Batch Size: 16 (gradient accumulation over 20 steps)
Alignment Weight (α): 2.0
Top-K Value (K): 64
Temperature (T): 0.1
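
As a quick illustration of how these values fit together, the sketch below collects them in a small config object and evaluates the total loss formula. The class and field names are hypothetical. It also assumes that 16 is the per-step batch size, in which case 20 accumulation steps give an effective batch of 320; if 16 already denotes the effective size, that property does not apply. The top-K value and temperature are stored only, since their exact role in the alignment loss is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoutTrainingConfig:
    """Hyperparameters listed above (field names are illustrative)."""
    learning_rate: float = 5e-5
    batch_size: int = 16          # assumed to be the per-step batch size
    grad_accum_steps: int = 20
    alignment_weight: float = 2.0  # alpha
    top_k: int = 64
    temperature: float = 0.1

    @property
    def effective_batch_size(self):
        # Samples contributing to each optimizer update.
        return self.batch_size * self.grad_accum_steps


def total_loss(l_itm, l_itc, l_align, alpha):
    """L_total = L_ITM + L_ITC + alpha * L_Align."""
    return l_itm + l_itc + alpha * l_align


cfg = ScoutTrainingConfig()
print(cfg.effective_batch_size)
print(total_loss(1.0, 0.5, 0.25, cfg.alignment_weight))
```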
