# ColQwen3 Architecture
Created by M&K © 2025 The LibraxisAI Team
## Model Origins
ColQwen3 8B is based on the ColBERT late interaction paradigm, adapted for visual document retrieval using Qwen3-VL as the backbone.
### Base Models Merged
- tomoro-ai/Colqwen3-8B-base - Foundation visual-language model
- Custom projection layers - Trained for document embedding
- Visual processor - Qwen3-VL image understanding
## Late Interaction (MaxSim)
Unlike dense retrievers that produce single vectors, ColBERT-style models produce token-level embeddings:
```
Query: "financial report"
        │
        ▼
[emb_financial, emb_report]              # N query tokens

Document Page:
        │
        ▼
[emb_Q3, emb_revenue, emb_chart, ...]    # M document tokens

MaxSim Score = Σ_i max_j sim(q_i, d_j)
             = sum of best matches for each query token
```
This enables:
- Fine-grained matching - individual terms matter
- Passage-level relevance - not just document-level
- Interpretable scores - which terms matched
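The scoring rule above can be sketched in a few lines of NumPy. This is a standalone illustration with made-up toy embeddings, not the model's own scorer:

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction score: for each query token, take its best
    match among the document tokens, then sum over query tokens.

    query_emb: [N, dim] L2-normalized query token embeddings
    doc_emb:   [M, dim] L2-normalized document token embeddings
    """
    sim = query_emb @ doc_emb.T          # [N, M] cosine similarities
    return float(sim.max(axis=1).sum())  # best document match per query token

# Toy example: 2 query tokens, 3 document tokens, dim=2.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]])
score = maxsim(q, d)  # each query token finds an exact match -> 1.0 + 1.0 = 2.0
```

Because the max is taken per query token, you can also inspect `sim.argmax(axis=1)` to see *which* document token each query term matched, which is what makes the scores interpretable.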
## Projection Layers
Raw embeddings from Qwen3-VL are 4096-dimensional. We project them down for efficiency:
| Layer | Input Dim | Output Dim | Parameters |
|---|---|---|---|
| 128D | 4096 | 128 | 524K |
| 320D | 4096 | 320 | 1.3M |
### When to Use Each
- 128D: Real-time search, memory-constrained
- 320D: Batch indexing, quality-critical applications
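The parameter counts in the table match a bias-free linear map (4096 × 128 = 524,288 ≈ 524K; 4096 × 320 = 1,310,720 ≈ 1.3M). A minimal NumPy sketch with random illustrative weights (not the released checkpoint, which ships its own trained projection):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative projection weights: a single bias-free linear map 4096 -> 128.
W = rng.standard_normal((4096, 128)).astype(np.float32)

hidden = rng.standard_normal((7, 4096)).astype(np.float32)  # 7 token embeddings

projected = hidden @ W  # [7, 128]
# L2-normalize so that dot products in MaxSim become cosine similarities.
projected /= np.linalg.norm(projected, axis=-1, keepdims=True)
```

The same code works for the 320D variant by swapping the output dimension; only storage and scoring cost change, not the interface.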
## Image Processing Pipeline
```
PDF Page / Image
        │
        ▼
┌──────────────────────────────┐
│   Resize to 1024×1024 max    │
│   (preserve aspect ratio)    │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│   Qwen3-VL Vision Encoder    │
│  Patch embedding + attention │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│  <|image_pad|> token expand  │
│  → Token-level embeddings    │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│       Projection Layer       │
│      4096D → 128D/320D       │
└──────────────┬───────────────┘
               │
               ▼
       Document Embedding
      [num_patches × dim]
```
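The first stage fits the page inside a 1024×1024 box without distorting it. A minimal sketch of that size computation (function name hypothetical; the actual preprocessor lives in the model's image processor):

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Target size for the resize step: scale the longer side down to
    max_side while preserving aspect ratio; never upscale."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))

fit_within(2048, 1536)  # landscape page -> (1024, 768)
fit_within(800, 600)    # already fits  -> (800, 600), unchanged
```

Preserving aspect ratio matters for documents: stretching a page distorts glyphs and chart geometry, which degrades the vision encoder's token embeddings.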
## Query Processing
Text queries go through the language model only:
```
Query Text
        │
        ▼
┌──────────────────────────────┐
│          Tokenizer           │
│        → Token IDs           │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│    Qwen3-VL Text Encoder     │
│      → Hidden states         │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│       Projection Layer       │
│      4096D → 128D/320D       │
└──────────────┬───────────────┘
               │
               ▼
        Query Embedding
       [num_tokens × dim]
```
## Memory Layout
On Apple Silicon (MLX):
```
┌─────────────────────────────────────┐
│           Unified Memory            │
├─────────────────────────────────────┤
│  Model weights         ~17GB        │
│  KV Cache              ~1-2GB       │
│  Projection layers     ~5MB         │
│  Working memory        ~1GB         │
├─────────────────────────────────────┤
│  Total                 ~18-20GB     │
└─────────────────────────────────────┘
```
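As a sanity check on the weights figure, 8B parameters stored in bf16 occupy about 16 GB, in line with ~17GB once embeddings and runtime buffers are counted:

```python
# Back-of-envelope estimate of resident weight memory,
# assuming the 8B parameters are held in bfloat16 (2 bytes each).
params = 8e9
bytes_per_param = 2  # bfloat16
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB")  # ~16 GB
```

On Apple Silicon this all comes out of one unified pool, so the same budget is shared between the GPU-side weights and CPU-side working memory.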
## Indexing Strategy
For production deployment:
- Pre-compute document embeddings (offline)
- Store in vector database (LanceDB, Qdrant, etc.)
- Online query embedding (real-time)
- MaxSim scoring (can be batched)
```python
# Indexing (offline)
for page in pdf_pages:
    embedding = embedder.embed_image(page)
    vector_db.insert(doc_id, page_num, embedding)

# Search (online)
query_emb = embedder.embed_query(query_text)
candidates = vector_db.search(query_emb, k=100)
scores = [embedder.maxsim(query_emb, doc_emb) for doc_emb in candidates]
```
## File Format
Model weights use MLX-compatible safetensors:
```
model-00001-of-00007.safetensors   # 5.0GB
model-00002-of-00007.safetensors   # 4.9GB
model-00003-of-00007.safetensors   # 4.8GB
model-00004-of-00007.safetensors   # 4.8GB
model-00005-of-00007.safetensors   # 5.0GB
model-00006-of-00007.safetensors   # 5.0GB
model-00007-of-00007.safetensors   # 3.2GB
--------
Total: ~33GB
```
Projection layers are separate safetensors files for flexibility.