# ColQwen3 Architecture

**Created by M&K (c)2025 The LibraxisAI Team**

## Model Origins

ColQwen3 8B is based on the ColBERT late-interaction paradigm, adapted for visual document retrieval with Qwen3-VL as the backbone.

### Base Models Merged

1. **tomoro-ai/Colqwen3-8B-base** - Foundation vision-language model
2. **Custom projection layers** - Trained for document embedding
3. **Visual processor** - Qwen3-VL image understanding

## Late Interaction (MaxSim)

Unlike dense retrievers, which produce a single vector per document, ColBERT-style models produce **token-level embeddings**:

```
Query: "financial report"
  ↓ [emb_financial, emb_report]            # N query tokens

Document Page:
  ↓ [emb_Q3, emb_revenue, emb_chart, ...]  # M document tokens

MaxSim Score = Σ_i max_j sim(q_i, d_j)
             = for each query token i, take its best match over
               all document tokens j, then sum
```

This enables:

- **Fine-grained matching** - individual terms matter
- **Passage-level relevance** - not just document-level
- **Interpretable scores** - you can see which terms matched

## Projection Layers

Raw embeddings from Qwen3-VL are 4096-dimensional. We project them down for efficiency:

| Layer | Input Dim | Output Dim | Parameters |
|-------|-----------|------------|------------|
| 128D  | 4096      | 128        | 524K       |
| 320D  | 4096      | 320        | 1.3M       |

### When to Use Each

- **128D**: Real-time search, memory-constrained deployments
- **320D**: Batch indexing, quality-critical applications

## Image Processing Pipeline

```
PDF Page / Image
       │
       ▼
┌─────────────────────────────┐
│ Resize to 1024×1024 max     │
│ (preserve aspect ratio)     │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Qwen3-VL Vision Encoder     │
│ Patch embedding + attention │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ <|image_pad|> token expand  │
│ → Token-level embeddings    │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Projection Layer            │
│ 4096D → 128D/320D           │
└──────────────┬──────────────┘
               │
Document Embedding [num_patches × dim]
```

## Query Processing

Text queries go through the language model only:

```
Query Text
       │
       ▼
┌─────────────────────────────┐
│ Tokenizer                   │
│ → Token IDs                 │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Qwen3-VL Text Encoder       │
│ → Hidden states             │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Projection Layer            │
│ 4096D → 128D/320D           │
└──────────────┬──────────────┘
               │
Query Embedding [num_tokens × dim]
```

## Memory Layout

On Apple Silicon (MLX):

```
┌─────────────────────────────────────┐
│           Unified Memory            │
├─────────────────────────────────────┤
│ Model weights       ~17GB           │
│ KV Cache            ~1-2GB          │
│ Projection layers   ~5MB            │
│ Working memory      ~1GB            │
├─────────────────────────────────────┤
│ Total               ~19-20GB        │
└─────────────────────────────────────┘
```

## Indexing Strategy

For production deployment:

1. **Pre-compute document embeddings** (offline)
2. **Store in a vector database** (LanceDB, Qdrant, etc.)
3. **Online query embedding** (real-time)
4. **MaxSim scoring** (can be batched; see the sketches below)
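Step 4 is the only part not shown in the pipelines above, so here is a minimal sketch of MaxSim scoring. Treat the details as assumptions for illustration: NumPy in place of MLX, rows assumed L2-normalized (so dot products are cosine similarities), and a hypothetical `maxsim` helper shaped to match the `embedder.maxsim` call in the snippet that follows.

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one page.

    query_emb: [num_query_tokens, dim]
    doc_emb:   [num_doc_patches, dim]
    Assumes rows are L2-normalized, so dot products are cosine similarities.
    """
    sim = query_emb @ doc_emb.T          # [num_query_tokens, num_doc_patches]
    return float(sim.max(axis=1).sum())  # best doc match per query token, summed
```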
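Because the score is one matrix product followed by a row-wise max and a sum, it batches naturally across candidate pages, which is what "can be batched" refers to. Another sketch, under the added assumption that candidates are zero-padded to the same patch count and stacked into a single array:

```python
import numpy as np

def maxsim_batch(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    """Score one query against a stack of candidate pages at once.

    query_emb: [num_query_tokens, dim]
    doc_embs:  [num_docs, num_doc_patches, dim], zero-padded to equal length
    Returns:   [num_docs] array of MaxSim scores.
    """
    # sims: [num_docs, num_query_tokens, num_doc_patches]
    sims = np.einsum("qd,npd->nqp", query_emb, doc_embs)
    # Caveat: a zero-padding column can win the max if every real
    # similarity for a query token is negative.
    return sims.max(axis=2).sum(axis=1)
```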
The end-to-end flow, with `embedder` and `vector_db` as illustrative stand-ins for a real model wrapper and vector store:

```python
# Indexing (offline)
for page in pdf_pages:
    embedding = embedder.embed_image(page)
    vector_db.insert(doc_id, page_num, embedding)

# Search (online)
query_emb = embedder.embed_query(query_text)
candidates = vector_db.search(query_emb, k=100)
scores = [embedder.maxsim(query_emb, doc_emb) for doc_emb in candidates]
```

## File Format

Model weights use MLX-compatible safetensors:

```
model-00001-of-00007.safetensors  # 5.0GB
model-00002-of-00007.safetensors  # 4.9GB
model-00003-of-00007.safetensors  # 4.8GB
model-00004-of-00007.safetensors  # 4.8GB
model-00005-of-00007.safetensors  # 5.0GB
model-00006-of-00007.safetensors  # 5.0GB
model-00007-of-00007.safetensors  # 3.2GB
--------
Total: ~33GB
```

Projection layers are stored as separate safetensors files for flexibility.

---

**Co-Authored-By: [Maciej](void@div0.space) & [Klaudiusz](the1st@whoai.am)**