---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---

# UniPixel-3B

<div style="display: flex; gap: 5px;">
<a href="https://arxiv.org/abs/2509.18094" target="_blank"><img src="https://img.shields.io/badge/arXiv-2509.18094-red"></a>
<a href="https://polyu-chenlab.github.io/unipixel/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
<a href="https://github.com/PolyU-ChenLab/UniPixel/blob/main/README.md" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
<a href="https://github.com/PolyU-ChenLab/UniPixel" target="_blank"><img src="https://img.shields.io/github/stars/PolyU-ChenLab/UniPixel"></a>
</div>

UniPixel is a unified MLLM for pixel-level vision-language understanding. It flexibly supports a variety of fine-grained tasks, including image/video segmentation, regional understanding, and a novel PixelQA task that jointly requires object-centric referring, segmentation, and question-answering in videos.

<p align="center"><img width="750" src="https://raw.githubusercontent.com/PolyU-ChenLab/UniPixel/refs/heads/main/.github/method.jpg"></p>

## Model Details

- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause

## Quick Start

### Set up the environment

1. Clone the repository from GitHub.

```shell
git clone https://github.com/PolyU-ChenLab/UniPixel.git
cd UniPixel
```

2. Set up the virtual environment.

```shell
conda create -n unipixel python=3.12 -y
conda activate unipixel

# you may change 'cu128' to match your own CUDA version
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128

# other versions have not been verified
pip install flash_attn==2.8.2 --no-build-isolation
```

3. Install dependencies.

```shell
pip install -r requirements.txt
```
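
After installation, you can optionally run a quick sanity check (a minimal sketch, not part of the UniPixel codebase) to confirm that PyTorch sees your GPU and that `flash_attn` imports correctly:

```python
# minimal environment check (illustrative, not part of the repository)
import torch
import flash_attn

print('torch:', torch.__version__, 'cuda:', torch.version.cuda)  # expect 2.7.1 and 12.8
print('cuda available:', torch.cuda.is_available())  # expect True on a CUDA machine
print('flash_attn:', flash_attn.__version__)  # expect 2.8.2
```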

For NPU users, please install the CPU version of PyTorch and [`torch_npu`](https://github.com/Ascend/pytorch) instead.
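
As a rough sketch of the NPU route (the versions below are illustrative assumptions; follow the `torch_npu` repository for the pairing that matches your CANN toolkit):

```shell
# illustrative only: use the torch / torch_npu pairing recommended by the torch_npu repository
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu
pip install torch_npu
```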

### Quick Inference Demo

Try our [online demo](https://huggingface.co/spaces/PolyU-ChenLab/UniPixel) or the [inference script](https://github.com/PolyU-ChenLab/UniPixel/blob/main/tools/inference.py) below. Please refer to our [GitHub Repository](https://github.com/PolyU-ChenLab/UniPixel) for more details.

```python
import imageio.v3 as iio
import nncore

from unipixel.dataset.utils import process_vision_info
from unipixel.model.builder import build_model
from unipixel.utils.io import load_image, load_video
from unipixel.utils.transforms import get_sam2_transform
from unipixel.utils.visualizer import draw_mask

media_path = '<path-to-jpg-or-mp4-file>'
prompt = 'Please segment the...'
output_dir = 'outputs'

# load UniPixel-3B and its processor
model, processor = build_model('PolyU-ChenLab/UniPixel-3B')
device = next(model.parameters()).device

sam2_transform = get_sam2_transform(model.config.sam2_image_size)

# load a single image or sample 16 frames from a video
if any(media_path.endswith(k) for k in ('jpg', 'png')):
    frames, images = load_image(media_path), [media_path]
else:
    frames, images = load_video(media_path, sample_frames=16)

messages = [{
    'role':
    'user',
    'content': [{
        'type': 'video',
        'video': images,
        'min_pixels': 128 * 28 * 28,
        'max_pixels': 256 * 28 * 28 * int(16 / len(images))
    }, {
        'type': 'text',
        'text': prompt
    }]
}]

text = processor.apply_chat_template(messages, add_generation_prompt=True)

images, videos, kwargs = process_vision_info(messages, return_video_kwargs=True)

data = processor(text=[text], images=images, videos=videos, return_tensors='pt', **kwargs)

data['frames'] = [sam2_transform(frames).to(model.sam2.dtype)]
data['frame_size'] = [frames.shape[1:3]]

# greedy decoding (sampling disabled)
output_ids = model.generate(
    **data.to(device),
    do_sample=False,
    temperature=None,
    top_k=None,
    top_p=None,
    repetition_penalty=None,
    max_new_tokens=512)

# keep only the newly generated tokens and drop the trailing EOS token
assert data.input_ids.size(0) == output_ids.size(0) == 1
output_ids = output_ids[0, data.input_ids.size(1):]

if output_ids[-1] == processor.tokenizer.eos_token_id:
    output_ids = output_ids[:-1]

response = processor.decode(output_ids, clean_up_tokenization_spaces=False)
print(f'Response: {response}')

# visualize the predicted masks (if any) and save them as a PNG or an animated GIF
if len(model.seg) >= 1:
    imgs = draw_mask(frames, model.seg)

    nncore.mkdir(output_dir)

    path = nncore.join(output_dir, f"{nncore.pure_name(media_path)}.{'gif' if len(imgs) > 1 else 'png'}")
    print(f'Output Path: {path}')
    iio.imwrite(path, imgs, duration=100, loop=0)
```
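
To try it locally, one option is to save the snippet above as a script in the repository root (so that the `unipixel` package can be imported), point `media_path` at your own image or video, replace the placeholder `prompt` with a full referring expression, and run it with Python; the file name below is just a placeholder:

```shell
# 'demo.py' is a placeholder name for the snippet above, saved in the repository root
python demo.py
```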

## Citation

Please kindly cite our paper if you find this project helpful.

```bibtex
@inproceedings{liu2025unipixel,
  title={UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning},
  author={Liu, Ye and Ma, Zongyang and Pu, Junfu and Qi, Zhongang and Wu, Yang and Shan, Ying and Chen, Chang Wen},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```