ColQwen3 Architecture

Created by M&K (c)2025 The LibraxisAI Team

Model Origins

ColQwen3 8B is based on the ColBERT late interaction paradigm, adapted for visual document retrieval using Qwen3-VL as the backbone.

Base Models Merged

  1. tomoro-ai/Colqwen3-8B-base - Foundation visual-language model
  2. Custom projection layers - Trained for document embedding
  3. Visual processor - Qwen3-VL image understanding

Late Interaction (MaxSim)

Unlike dense retrievers that produce single vectors, ColBERT-style models produce token-level embeddings:

Query: "financial report"
        ↓
[emb_financial, emb_report]  # N query tokens

Document Page:
        ↓
[emb_Q3, emb_revenue, emb_chart, ...]  # M document tokens

MaxSim Score = Σ_i max_j sim(q_i, d_j)
             = for each query token i, take its best match over all
               document tokens j, then sum those best matches

This enables:

  • Fine-grained matching - individual terms matter
  • Passage-level relevance - not just document-level
  • Interpretable scores - which terms matched
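A minimal sketch of MaxSim scoring in plain NumPy (random arrays stand in for real embeddings; cosine similarity via L2-normalized dot products is assumed here, which is the usual choice for ColBERT-style models):

import numpy as np

def maxsim(query_emb, doc_emb):
    """Sum, over query tokens, of each token's best match among doc tokens."""
    # L2-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=-1, keepdims=True)
    sim = q @ d.T                        # [N query tokens, M doc tokens]
    return float(sim.max(axis=1).sum()) # max over j, then sum over i

query = np.random.randn(2, 128)   # e.g. "financial report" -> 2 token embeddings
doc = np.random.randn(50, 128)    # e.g. 50 page patches
print(maxsim(query, doc))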

Projection Layers

Raw embeddings from Qwen3-VL are 4096-dimensional. We project them down for efficiency:

Layer   Input Dim   Output Dim   Parameters
128D    4096        128          524K
320D    4096        320          1.3M

When to Use Each

  • 128D: Real-time search, memory-constrained
  • 320D: Batch indexing, quality-critical applications
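Functionally, each projection is a single learned linear map. A sketch with placeholder NumPy weights (the trained weights ship separately; see File Format below):

import numpy as np

hidden = np.random.randn(50, 4096)   # [num_tokens, 4096] backbone hidden states
W = np.random.randn(4096, 128)       # placeholder for the trained 128D projection
emb = hidden @ W                     # [num_tokens, 128]
# Parameter count: 4096 * 128 = 524,288 (~524K), matching the table above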

Image Processing Pipeline

PDF Page / Image
      │
      ▼
┌─────────────────────────────┐
│ Resize to 1024×1024 max     │
│ (preserve aspect ratio)     │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Qwen3-VL Vision Encoder     │
│ Patch embedding + attention │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ <|image_pad|> token expand  │
│ → Token-level embeddings    │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Projection Layer            │
│ 4096D → 128D/320D           │
└──────────────┬──────────────┘
               │
         Document Embedding
         [num_patches × dim]
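The first stage is reproducible without the model: in Pillow, an aspect-preserving fit into 1024×1024 is a single call (the file path is a placeholder):

from PIL import Image

img = Image.open("page.png")   # placeholder path to a rendered PDF page
img.thumbnail((1024, 1024))    # in-place: longest side <= 1024, aspect ratio kept
print(img.size)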

Query Processing

Text queries go through the language model only:

Query Text
      │
      ▼
┌─────────────────────────────┐
│ Tokenizer                   │
│ → Token IDs                 │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Qwen3-VL Text Encoder       │
│ → Hidden states             │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Projection Layer            │
│ 4096D → 128D/320D           │
└──────────────┬──────────────┘
               │
         Query Embedding
         [num_tokens × dim]
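A hedged usage sketch tying the two paths together (assuming an embedder object with the embed_image/embed_query methods used in the indexing example below; the real API may differ):

page_emb = embedder.embed_image(page_image)           # [num_patches x dim]
query_emb = embedder.embed_query("financial report")  # [num_tokens x dim]
# Both land in the same projected space, so MaxSim can compare them directly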

Memory Layout

On Apple Silicon (MLX):

┌──────────────────────────────────────┐
│ Unified Memory                       │
├──────────────────────────────────────┤
│ Model weights        ~17GB           │
│ KV Cache             ~1-2GB          │
│ Projection layers    ~5MB            │
│ Working memory       ~1GB            │
├──────────────────────────────────────┤
│ Total                ~18-20GB        │
└──────────────────────────────────────┘

Indexing Strategy

For production deployment:

  1. Pre-compute document embeddings (offline)
  2. Store in vector database (LanceDB, Qdrant, etc.)
  3. Online query embedding (real-time)
  4. MaxSim scoring (can be batched)

A minimal sketch of that flow (embedder and vector_db are assumed interfaces rather than a fixed API):

# Indexing (offline)
for page_num, page in enumerate(pdf_pages):
    embedding = embedder.embed_image(page)    # [num_patches x dim] per page
    vector_db.insert(doc_id, page_num, embedding)  # doc_id identifies the source PDF

# Search (online)
query_emb = embedder.embed_query(query_text)  # [num_tokens x dim]
candidates = vector_db.search(query_emb, k=100)   # fast first-stage retrieval
scores = [embedder.maxsim(query_emb, doc_emb) for doc_emb in candidates]
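Note the two-stage shape of the search path: the vector database handles fast first-stage retrieval (how it matches a multi-vector query, e.g. by pooling or per-token search, depends on the backend), and exact MaxSim rescoring runs only on the top-k candidates. This keeps query latency low without giving up late-interaction quality in the final ranking.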

File Format

Model weights use MLX-compatible safetensors:

model-00001-of-00007.safetensors  # 5.0GB
model-00002-of-00007.safetensors  # 4.9GB
model-00003-of-00007.safetensors  # 4.8GB
model-00004-of-00007.safetensors  # 4.8GB
model-00005-of-00007.safetensors  # 5.0GB
model-00006-of-00007.safetensors  # 5.0GB
model-00007-of-00007.safetensors  # 3.2GB
                                  --------
                           Total: ~33GB

Projection layers are separate safetensors files for flexibility.
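A sketch of inspecting one of those files under MLX (the file name is illustrative; mx.load reads a safetensors file into a dict of arrays):

import mlx.core as mx

weights = mx.load("projection_128d.safetensors")  # hypothetical file name
for name, w in weights.items():
    print(name, w.shape)  # expect a 4096 -> 128 weight matrix (possibly transposed)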


Co-Authored-By: Maciej & Klaudiusz