# ColQwen3 Architecture

**Created by M&K (c)2025 The LibraxisAI Team**
## Model Origins

ColQwen3 8B is based on the ColBERT late-interaction paradigm, adapted for visual document retrieval with Qwen3-VL as the backbone.

### Merged Components

1. **tomoro-ai/Colqwen3-8B-base** - Foundation vision-language model
2. **Custom projection layers** - Trained for document embedding
3. **Visual processor** - Qwen3-VL image understanding
## Late Interaction (MaxSim)

Unlike dense retrievers that produce a single vector per item, ColBERT-style models produce **token-level embeddings**:

```
Query: "financial report"
        ↓
[emb_financial, emb_report]             # N query tokens

Document Page:
        ↓
[emb_Q3, emb_revenue, emb_chart, ...]   # M document tokens

MaxSim(Q, D) = Σ_i max_j sim(q_i, d_j)
             = for each query token i, take its best match over
               document tokens j, then sum over all query tokens
```
This enables:

- **Fine-grained matching** - individual terms matter
- **Passage-level relevance** - not just document-level
- **Interpretable scores** - you can see which terms matched (sketched below)
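
To make the scoring concrete, here is a minimal NumPy sketch of MaxSim over L2-normalized embeddings. The shapes and the toy data are illustrative assumptions, not the production implementation:

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim: for each query token, take its best-matching document
    token, then sum those best matches.

    query_emb: [N, dim] L2-normalized query token embeddings
    doc_emb:   [M, dim] L2-normalized document token embeddings
    """
    sim = query_emb @ doc_emb.T           # [N, M] cosine similarities
    return float(sim.max(axis=1).sum())   # best doc match per query token, summed

# Toy example: 2 query tokens vs. 4 document tokens in a 128D space
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(4, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```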
## Projection Layers

Raw embeddings from Qwen3-VL are 4096-dimensional. We project them down for efficiency:

| Layer | Input Dim | Output Dim | Parameters |
|-------|-----------|------------|------------|
| 128D  | 4096      | 128        | 524K       |
| 320D  | 4096      | 320        | 1.3M       |
### When to Use Each

- **128D**: Real-time search, memory-constrained deployments
- **320D**: Batch indexing, quality-critical applications
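
As a rough sketch of what such a head can look like in MLX, assuming a plain bias-free linear projection followed by L2 normalization (the class name and normalization step are assumptions, not the released weights; note 4096 × 128 ≈ 524K parameters matches the table):

```python
import mlx.core as mx
import mlx.nn as nn

class ProjectionHead(nn.Module):
    """Linear head mapping 4096D hidden states down to the retrieval dim."""

    def __init__(self, in_dim: int = 4096, out_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)  # 4096*128 = 524,288 params

    def __call__(self, hidden: mx.array) -> mx.array:
        emb = self.proj(hidden)  # [num_tokens, out_dim]
        # L2-normalize so MaxSim reduces to cosine similarity (assumption)
        return emb / mx.linalg.norm(emb, axis=-1, keepdims=True)

head = ProjectionHead(out_dim=128)
tokens = mx.random.normal((10, 4096))   # 10 dummy token embeddings
print(head(tokens).shape)               # (10, 128)
```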
## Image Processing Pipeline

```
PDF Page / Image
        │
        ▼
┌─────────────────────────────┐
│ Resize to 1024×1024 max     │
│ (preserve aspect ratio)     │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Qwen3-VL Vision Encoder     │
│ Patch embedding + attention │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ <|image_pad|> token expand  │
│ → Token-level embeddings    │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Projection Layer            │
│ 4096D → 128D/320D           │
└──────────────┬──────────────┘
               │
     Document Embedding
     [num_patches × dim]
```
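
The resize step can be approximated with Pillow. A minimal sketch, assuming only the 1024-pixel cap shown in the diagram (the function name and resampling filter are illustrative):

```python
from PIL import Image

def load_page_image(path: str, max_side: int = 1024) -> Image.Image:
    """Downscale so the longer side is at most max_side, keeping aspect ratio."""
    img = Image.open(path).convert("RGB")
    # thumbnail() resizes in place and never upscales small images
    img.thumbnail((max_side, max_side), Image.Resampling.LANCZOS)
    return img
```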
## Query Processing

Text queries go through the language model only:

```
Query Text
        │
        ▼
┌─────────────────────────────┐
│ Tokenizer                   │
│ → Token IDs                 │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Qwen3-VL Text Encoder       │
│ → Hidden states             │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Projection Layer            │
│ 4096D → 128D/320D           │
└──────────────┬──────────────┘
               │
      Query Embedding
     [num_tokens × dim]
```
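
To show the shape flow of this path end to end, here is a self-contained toy: the tokenizer, embedding table, and projection matrix below are random stand-ins, not the real model, but each line mirrors one stage of the diagram:

```python
import numpy as np

# Toy stand-ins for the real tokenizer, text encoder, and projection head
vocab = {"financial": 0, "report": 1}
embed_table = np.random.default_rng(0).normal(size=(len(vocab), 4096))
proj = np.random.default_rng(1).normal(size=(4096, 128))

def embed_query(text: str) -> np.ndarray:
    ids = [vocab[tok] for tok in text.lower().split()]  # "Tokenizer" stage
    hidden = embed_table[ids]                           # [num_tokens, 4096]
    emb = hidden @ proj                                 # [num_tokens, 128]
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

print(embed_query("financial report").shape)  # (2, 128)
```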
## Memory Layout

On Apple Silicon (MLX):

```
┌───────────────────────────────────────┐
│ Unified Memory                        │
├───────────────────────────────────────┤
│ Model weights        ~17GB            │
│ KV Cache             ~1-2GB           │
│ Projection layers    ~5MB             │
│ Working memory       ~1GB             │
├───────────────────────────────────────┤
│ Total                ~18-20GB         │
└───────────────────────────────────────┘
```
## Indexing Strategy

For production deployment:

1. **Pre-compute document embeddings** (offline)
2. **Store in a vector database** (LanceDB, Qdrant, etc.)
3. **Online query embedding** (real-time)
4. **MaxSim scoring** (can be batched)
```python
# Indexing (offline): embed each page once and store its token embeddings
for page_num, page in enumerate(pdf_pages):
    embedding = embedder.embed_image(page)           # [num_patches x dim]
    vector_db.insert(doc_id, page_num, embedding)

# Search (online): embed the query, fetch candidates, rescore with MaxSim
query_emb = embedder.embed_query(query_text)         # [num_tokens x dim]
candidates = vector_db.search(query_emb, k=100)
scores = [embedder.maxsim(query_emb, doc_emb) for doc_emb in candidates]
```
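
Step 4 notes that MaxSim can be batched. A minimal NumPy sketch of rescoring all candidates in one shot, assuming the candidate pages are stacked into one zero-padded tensor (the shapes and padding scheme are assumptions):

```python
import numpy as np

def maxsim_batch(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    """Score one query against a stack of candidate pages at once.

    query_emb: [N, dim]     L2-normalized query tokens
    doc_embs:  [B, M, dim]  B candidate pages, zero-padded to M tokens
    returns:   [B] MaxSim scores
    """
    sim = np.einsum("nd,bmd->bnm", query_emb, doc_embs)  # [B, N, M] similarities
    # Max over document tokens, sum over query tokens; all-zero pad rows
    # score 0 and only win a max if every real similarity is negative.
    return sim.max(axis=2).sum(axis=1)
```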
## File Format

Model weights use MLX-compatible safetensors:

```
model-00001-of-00007.safetensors  # 5.0GB
model-00002-of-00007.safetensors  # 4.9GB
model-00003-of-00007.safetensors  # 4.8GB
model-00004-of-00007.safetensors  # 4.8GB
model-00005-of-00007.safetensors  # 5.0GB
model-00006-of-00007.safetensors  # 5.0GB
model-00007-of-00007.safetensors  # 3.2GB
--------
Total: ~33GB
```
Projection layers are stored as separate safetensors files, so either head (128D or 320D) can be loaded independently of the backbone.
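
A minimal sketch of loading the shards with MLX, where `mx.load` reads a safetensors file into a dict of arrays and the glob pattern matches the file names above (the projection file name is illustrative):

```python
import glob
import mlx.core as mx

weights = {}
for shard in sorted(glob.glob("model-*-of-00007.safetensors")):
    weights.update(mx.load(shard))  # each shard maps tensor names to arrays

# The projection head ships separately (hypothetical file name):
# projection = mx.load("projection_128.safetensors")
print(len(weights), "tensors loaded")
```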
---

**Co-Authored-By: [Maciej](void@div0.space) & [Klaudiusz](the1st@whoai.am)**