---
base_model: Qwen/Qwen3-VL-8B-Instruct
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: visual-document-retrieval
tags:
- reranker
- rerank
- listwise-reranker
- visual-document-retrieval
- multimodal
- document-understanding
- qwen3-vl
- rankgpt
- mmdocir
---

# ZipRerank

**ZipRerank** is a **listwise reranker for visual documents**, introduced in the paper [Very Efficient Listwise Multimodal Reranking for Long Documents](https://huggingface.co/papers/2605.11864). The official implementation is available on [GitHub](https://github.com/dukesun99/ZipRerank).

Built on top of [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct), ZipRerank is designed for high-efficiency multimodal reranking. Given a text query and a set of document page images (typically rendered from a PDF), the model scores every page and returns them ordered from most to least relevant in a single forward pass.

ZipRerank can be used either as:

- **the second stage** of a two-stage visual-document retrieval pipeline β€” a cheap first-stage
  retriever (dense page embeddings, ColPali / ColQwen family, keyword search, …) proposes the
  top-K pages, and ZipRerank reranks them; or
- **a dedicated sliding-window reranker with overlap**, ranking up to 20 pages per forward pass
  and stitching longer documents together with an overlapping window loop.

## Model details

| | |
|---|---|
| Architecture | `Qwen3VLForConditionalGeneration` |
| Base model | `Qwen/Qwen3-VL-8B-Instruct` |
| Parameters | ~8B |
| Precision | `bfloat16` |
| Max window size | 20 pages per forward pass |
| Training data | MMDocIR training set + RankZephyr data |

## Installation

```bash
pip install "transformers>=4.57" accelerate torch torchvision pillow pymupdf
# Optional but strongly recommended for fast inference:
pip install flash-attn --no-build-isolation
```

## Quick start

The snippet below is self-contained. It:

1. Renders a PDF to page images with PyMuPDF (you can also pass your own `PIL.Image` list).
2. Builds a RankGPT-style prompt that asks the model to rank pages `A`–`T`.
3. Terminates the prompt with a `[` token so the first predicted token is a letter.
4. Reads the logits at that position for each letter and sorts pages by score.

```python
import fitz  # PyMuPDF
import torch
from PIL import Image
from transformers import AutoProcessor
from transformers.models.qwen3_vl import Qwen3VLForConditionalGeneration

MODEL_ID = "mtri-admin/ZipRerank"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # or "sdpa" if flash-attn is unavailable
    trust_remote_code=True,
).eval()
tokenizer = processor.tokenizer


def pdf_to_images(pdf_path: str, max_size: int = 1024):
    """Render every page so the longest edge is at most ``max_size`` pixels."""
    doc = fitz.open(pdf_path)
    images = []
    for page in doc:
        scale = max_size / max(page.rect.width, page.rect.height)
        pix = page.get_pixmap(matrix=fitz.Matrix(scale, scale))
        images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    doc.close()
    return images


def create_ranking_prompt(query: str, num_passages: int) -> str:
    lines = [
        "You are RankGPT, an intelligent assistant that can rank passages "
        "based on their relevancy to the query.",
        "",
        f"I will provide you with {num_passages} passages as images.",
        "Rank the passages based on their relevance to the search query.",
        "",
        "The images are provided in order: "
        + ", ".join(
            f"Picture {i + 1} is passage [{chr(ord('A') + i)}]"
            for i in range(num_passages)
        )
        + ".",
        "",
        f"Search Query: {query}",
        "",
        "Rank the passages above based on their relevance to the search query.",
        "The passages should be listed in descending order using identifiers.",
        "The most relevant passages should be listed first.",
        "The output format should be [A] > [B], etc.",
        "Only output the ranking results, do not say anything else.",
    ]
    return "\n".join(lines)


@torch.no_grad()
def rerank_window(query: str, images):
    """Rank up to 20 page images in a single forward pass.

    Returns a list of 0-based indices into ``images``, ordered best-first.
    """
    assert 1 <= len(images) <= 20, "Window size must be between 1 and 20."
    messages = [{
        "role": "user",
        "content": [{"type": "text", "text": create_ranking_prompt(query, len(images))}]
                   + [{"type": "image", "image": img} for img in images],
    }]
    inputs = processor.apply_chat_template(
        [messages],
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    )
    # Force the first predicted token to be a letter by appending "["
    prompt_ids = inputs["input_ids"][0].tolist()
    prompt_ids.append(tokenizer.encode("[", add_special_tokens=False)[0])
    input_ids = torch.tensor([prompt_ids], dtype=torch.long, device=model.device)

    logits = model(
        input_ids=input_ids,
        attention_mask=torch.ones_like(input_ids),
        pixel_values=inputs["pixel_values"].to(model.device),
        image_grid_thw=inputs["image_grid_thw"].to(model.device),
    ).logits[0, -1, :]

    letter_ids = [
        tokenizer.encode(chr(ord("A") + i), add_special_tokens=False)[0]
        for i in range(len(images))
    ]
    scores = [logits[tid].item() for tid in letter_ids]
    return sorted(range(len(images)), key=lambda i: scores[i], reverse=True)


pages = pdf_to_images("report.pdf", max_size=1024)
ranking = rerank_window("What is the company revenue?", pages[:20])
print("Best-first page indices:", ranking)
```

### Sliding window for long documents

For documents with more than 20 pages, slide a window from the end of the list toward the
beginning, progressively bubbling the most relevant pages to the front. Typical defaults are
`window_size=20`, `stride=10` (50% overlap).

```python
def rerank(query, images, window_size=20, stride=10):
    n = len(images)
    ws = min(window_size, n)
    st = min(stride, n)

    if n <= ws:
        return rerank_window(query, images)

    indices = list(range(n))
    cur = list(images)
    end, start = n, n - ws
    while end > 0 and start + st != 0:
        start = max(start, 0)
        ranked = rerank_window(query, cur[start:end])
        new_indices = [indices[start + p] for p in ranked]
        new_images = [cur[start + p] for p in ranked]
        for i, (idx, img) in enumerate(zip(new_indices, new_images)):
            indices[start + i] = idx
            cur[start + i] = img
        end -= st
        start -= st
    return indices


ranking = rerank("What is the company revenue?", pages)
```

> **Tip.** For maximum throughput on long documents, add a content-addressed LRU cache
> around `model.model.get_image_features(...)` so that overlapping windows reuse ViT
> embeddings across calls β€” each page image then needs to be encoded by the vision tower at
> most once per query.
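
A minimal sketch of such a cache follows. The exact vision-tower call and its signature vary across `transformers` versions, so the encoder is abstracted as an `encode_fn` callable; `ImageFeatureCache` and `encode_fn` are illustrative names, not part of the model's API:

```python
import hashlib
from collections import OrderedDict


class ImageFeatureCache:
    """Content-addressed LRU cache: identical page images hit the cache
    even when they reappear in overlapping sliding windows."""

    def __init__(self, encode_fn, max_entries=256):
        self.encode_fn = encode_fn      # e.g. a wrapper around the vision tower
        self.max_entries = max_entries
        self._store = OrderedDict()

    @staticmethod
    def _key(image):
        # Hash the raw pixel bytes so the key depends only on image content,
        # not on object identity.
        return hashlib.sha256(image.tobytes()).hexdigest()

    def __call__(self, image):
        key = self._key(image)
        if key in self._store:
            self._store.move_to_end(key)     # mark as most recently used
            return self._store[key]
        feats = self.encode_fn(image)
        self._store[key] = feats
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return feats
```

With a 50%-overlap window, wrapping the per-image feature computation this way means each step only encodes the roughly `stride` pages it has not seen yet, rather than the full window.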

### Using your own images

`rerank_window` / `rerank` accept any list of `PIL.Image.Image`. If you already have page
images (e.g. from `pdf2image`, a screenshot pipeline, or a document layout tool), you can skip
`pdf_to_images` entirely:

```python
from PIL import Image

images = [Image.open(p).convert("RGB") for p in ["page1.png", "page2.png", "page3.png"]]
ranking = rerank("architecture diagram", images)
```

## How it works

1. **Prompt construction** β€” a RankGPT-style prompt asks the model to rank the pages
   (labeled `A`–`T`) by relevance to the query.
2. **Logits scoring** β€” one forward pass; the logit for each letter token at the last
   position is the relevance score for that page.
3. **Sliding window** β€” for `n > window_size`, a window slides from the end of the list
   toward the start, progressively reranking overlapping slices.
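
Step 3 can be illustrated with a toy simulation in which the per-window "reranker" is simply a sort by a known ground-truth score (the scores, window size, and stride below are made up for illustration):

```python
def sliding_pass(scores, window_size=4, stride=2):
    """Toy back-to-front sliding pass: sorting each overlapping window
    lets the globally best items 'bubble' toward the front of the list."""
    idx = list(range(len(scores)))
    end, start = len(scores), len(scores) - window_size
    while end > 0 and start + stride != 0:
        start = max(start, 0)
        # Rerank the current window in place (here: sort by known score).
        idx[start:end] = sorted(idx[start:end], key=lambda i: scores[i], reverse=True)
        end -= stride
        start -= stride
    return idx


scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.95, 0.4, 0.05]
print(sliding_pass(scores, window_size=4, stride=2))
# → [5, 1, 3, 0, 6, 2, 4, 7]
```

Note that only the head of the list is guaranteed to be well ordered (the tail `[6, 2, 4, 7]` above is not fully sorted): one sliding pass promotes the most relevant pages to the front, which is what matters when only the top-K results are consumed.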

## Intended use and limitations

**Intended use.** Reranking of candidate document pages for tasks such as visual document
question answering, enterprise document search, and RAG over PDFs. Works either as the
second stage of a retrieve-then-rerank pipeline, or as a standalone sliding-window reranker
over an arbitrary list of page images.

**Out-of-scope.** ZipRerank is not a first-stage retriever: running it over every page of a
large corpus is expensive and unnecessary β€” use a cheap retriever first, then rerank the
top-K pages with ZipRerank.
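
A retrieve-then-rerank loop can be sketched as follows. The first-stage scores come from whatever retriever you already run, and `reranker` is any callable with the same contract as `rerank_window` above (candidate images in, best-first local indices out); all names here are illustrative:

```python
def retrieve_then_rerank(first_stage_scores, pages, reranker, k=20):
    """Pick the top-k pages by first-stage score, rerank them, and map
    the reranker's local ordering back to global page indices."""
    # Top-k candidate page indices by first-stage score, best first.
    top_k = sorted(
        range(len(pages)), key=lambda i: first_stage_scores[i], reverse=True
    )[:k]
    # The reranker returns 0-based positions into the candidate list.
    local_order = reranker([pages[i] for i in top_k])
    # Translate back to indices into the original page list.
    return [top_k[p] for p in local_order]
```

In practice `reranker` would be something like `lambda imgs: rerank_window(query, imgs)`, with `first_stage_scores` coming from dense page embeddings or keyword search.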

**Limitations.**
- Training focused on English documents; multilingual performance has not been evaluated,
  so results on non-English content may vary.
- The window size is capped at 20 pages per forward pass (letters `A`–`T`); longer documents
  rely on the sliding-window procedure described above.