Vista-LLM Semantic Scout (Fine-tuned BLIP-2)
This repository contains the fine-tuned BLIP-2 model used as the Semantic Scout in the paper: [Vista-LLM: Decoupled Query-Guided Visual Token Pruning for Efficient Long-Video Large Language Models].
Official GitHub Repository: lizhenyu-123/Vista-LLM
Model Overview
In the context of Long-Video Large Language Models (Video-LLMs), processing massive visual tokens is a significant computational bottleneck. Vista-LLM introduces a decoupled framework for query-guided visual token pruning prior to LLM inference.
Because pruning runs as a separate stage, the primary Video-LLM and the BLIP-2 Vision Encoder remain fully compatible with optimized attention kernels such as Flash Attention. The Q-Former does have to materialize attention matrices to compute cross-modal importance scores, but it is lightweight and executes in parallel, so it adds no wall-clock latency overhead.
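As a rough illustration of query-guided pruning, the sketch below keeps the highest-scoring visual tokens given one importance score per token. The function name `prune_tokens`, the `keep` budget, and the toy scores are hypothetical; the selection rule Vista-LLM actually applies is defined in the paper and may differ.

```python
from typing import List, Sequence


def prune_tokens(scores: Sequence[float], keep: int) -> List[int]:
    """Return indices of the `keep` highest-scoring tokens, in original order.

    `scores` stands in for per-token cross-modal importance (hypothetical input).
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Rank token indices by descending score; ties favor the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore the original ordering so downstream layers see tokens in sequence.
    return sorted(ranked[:keep])


# Toy example: 6 visual tokens, keep the 3 most relevant to the query.
toy_scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
kept = prune_tokens(toy_scores, keep=3)
print(kept)  # → [1, 3, 5]
```

Returning indices rather than the tokens themselves keeps the sketch independent of any tensor library; the caller can use them to gather rows from whatever feature array it holds.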
Architecture Details
- Base Architecture: The model is built upon the Blip2ForImageTextRetrieval class (BLIP-2 Image-Text Matching).
- Vision Encoder: Frozen during fine-tuning to retain powerful pre-trained visual representations.
- Q-Former: Fine-tuned to extract text-conditioned visual features via cross-attention.
- Note on Key Token Projector: As described in the paper, a Key Token Projector was introduced to project visual features into the Image-Text Contrastive embedding space. However, this module acts purely as a transitional component during the training phase to calculate the alignment loss. The final saved model weights are entirely based on the standard Blip2ForImageTextRetrieval architecture for seamless inference.
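Since the saved weights follow the standard Blip2ForImageTextRetrieval layout, they can presumably be loaded directly with the Hugging Face transformers library. The following is a minimal sketch, not the Vista-LLM pipeline itself: `score_image_text`, `model_dir`, and `image_path` are hypothetical names, and it assumes the local directory also contains a processor configuration. It returns only the image-text matching probability, not the per-token importance scores used for pruning.

```python
def score_image_text(model_dir: str, image_path: str, text: str) -> float:
    """Return the ITM match probability for one image-text pair."""
    # Local imports so this module can be imported without these packages.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Blip2ForImageTextRetrieval

    # Assumption: `model_dir` holds both the weights and a processor config.
    processor = AutoProcessor.from_pretrained(model_dir)
    model = Blip2ForImageTextRetrieval.from_pretrained(model_dir)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=text, return_tensors="pt")

    with torch.no_grad():
        out = model(**inputs, use_image_text_matching_head=True)

    # With the ITM head, logits_per_image has shape (batch, 2);
    # column 1 is the "match" class.
    probs = torch.softmax(out.logits_per_image, dim=1)
    return probs[0, 1].item()
```

A call such as `score_image_text("/path/to/weights", "frame.jpg", "a person opening a door")` would serve as a quick sanity check that the download is intact; both paths here are placeholders.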
How to Use
To use this Semantic Scout model within the Vista-LLM framework, please follow these steps:
- Download the model: Clone or download this Hugging Face repository to your local server.
- Setup the main repository: Clone our official Vista-LLM GitHub repository.
- Configure the path: Move the downloaded model weights to your designated local model directory.
- Run the script: Open the corresponding execution script in the GitHub repository, and update the semantic scout model path argument to point to your local directory containing these weights.
For detailed environmental setup and inference commands, please refer to the README.md in the GitHub repository.
Training Details
Dataset
The model was fine-tuned using region descriptions from the Visual Genome dataset. To ensure semantic density, phrases containing fewer than three words were excluded. An online hard negative mining strategy was employed to enhance discriminative capability, sampling negative counterparts based on a multinomial distribution derived from image-text similarities.
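The two data-side steps above can be sketched in plain Python. This is an illustration under assumptions rather than the training code: the helper names are hypothetical, and the sampling weights are taken to be a softmax over similarities with the positive pair masked out, which is a common convention for this kind of hard negative mining but is not confirmed here.

```python
import math
import random
from typing import List, Sequence


def filter_phrases(phrases: Sequence[str], min_words: int = 3) -> List[str]:
    """Drop region descriptions with fewer than `min_words` words."""
    return [p for p in phrases if len(p.split()) >= min_words]


def sample_hard_negative(similarities: Sequence[float], positive_idx: int,
                         rng: random.Random) -> int:
    """Draw one negative index, favoring candidates similar to the anchor.

    Assumption: weights are softmax(similarity) with the positive zeroed out.
    """
    if len(similarities) < 2:
        raise ValueError("need at least one negative candidate")
    max_sim = max(similarities)  # subtract the max for numerical stability
    weights = [math.exp(s - max_sim) for s in similarities]
    weights[positive_idx] = 0.0  # the true pair can never be drawn
    return rng.choices(range(len(similarities)), weights=weights, k=1)[0]


phrases = ["a dog", "red car parked outside", "tree", "man riding a bike"]
print(filter_phrases(phrases))
# → ['red car parked outside', 'man riding a bike']

# Similarities of one image to four texts; index 0 is its own (positive) text.
negative = sample_hard_negative([0.9, 0.8, 0.1, 0.2], positive_idx=0,
                                rng=random.Random(0))
print(negative != 0)  # → True
```

Sampling instead of always taking the most similar negative keeps some diversity in the batch while still biasing toward confusable pairs.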
Training Objectives
The model was optimized using a weighted sum of three loss components:
- L_ITM: Image-Text Matching loss.
- L_ITC: Image-Text Contrastive loss.
- L_Align: A custom Soft Top-K Alignment loss that guides the model to focus on key visual regions.
Hyperparameters
- Optimizer: AdamW
- Initial Learning Rate: 5e-5
- Batch Size: 16 (gradient accumulation over 20 steps)
- Alignment Weight (α): 2.0
- Top-K Value (K): 64
- Temperature (T): 0.1
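To show how these values might fit together, the sketch below collects them in a small config object and combines the three losses. It rests on two assumptions not stated in this card: that 16 is the per-step batch size (so accumulation multiplies it), and that L_ITM and L_ITC carry unit weights, since only the alignment weight is listed. `ScoutTrainingConfig` and `total_loss` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoutTrainingConfig:
    learning_rate: float = 5e-5
    batch_size: int = 16
    grad_accum_steps: int = 20
    alignment_weight: float = 2.0  # alpha
    top_k: int = 64                # K
    temperature: float = 0.1       # T

    @property
    def effective_batch_size(self) -> int:
        # Assumption: 16 is the per-step batch size.
        return self.batch_size * self.grad_accum_steps


def total_loss(l_itm: float, l_itc: float, l_align: float,
               cfg: ScoutTrainingConfig,
               w_itm: float = 1.0, w_itc: float = 1.0) -> float:
    """Weighted sum of the three objectives.

    w_itm and w_itc are hypothetical unit weights; only alpha is documented.
    """
    return w_itm * l_itm + w_itc * l_itc + cfg.alignment_weight * l_align


cfg = ScoutTrainingConfig()
print(cfg.effective_batch_size)          # → 320
print(total_loss(1.0, 1.0, 0.5, cfg))    # → 3.0
```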