# ColQwen3 Architecture
**Created by M&K (c)2025 The LibraxisAI Team**
## Model Origins
ColQwen3 8B is based on the ColBERT late interaction paradigm, adapted for visual document retrieval using Qwen3-VL as the backbone.
### Base Models Merged
1. **tomoro-ai/Colqwen3-8B-base** - Foundation visual-language model
2. **Custom projection layers** - Trained for document embedding
3. **Visual processor** - Qwen3-VL image understanding
## Late Interaction (MaxSim)
Unlike dense retrievers that produce single vectors, ColBERT-style models produce **token-level embeddings**:
```
Query: "financial report"
        ↓
[emb_financial, emb_report]            # N query tokens

Document Page:
        ↓
[emb_Q3, emb_revenue, emb_chart, ...]  # M document tokens

MaxSim Score = Σ_i max_j sim(q_i, d_j)
             = for each query token, take its best match among
               document tokens, then sum over query tokens
```
This enables:
- **Fine-grained matching** - individual terms matter
- **Passage-level relevance** - not just document-level
- **Interpretable scores** - which terms matched
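The MaxSim score above can be sketched in a few lines of NumPy. The shapes and toy vectors are illustrative only, and the token embeddings are assumed L2-normalized so cosine similarity reduces to a dot product:

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim: for each query token, take its best cosine match among
    document tokens, then sum over query tokens.

    query_emb: [N, dim] L2-normalized query token embeddings
    doc_emb:   [M, dim] L2-normalized document token embeddings
    """
    sim = query_emb @ doc_emb.T          # [N, M] pairwise similarities
    return float(sim.max(axis=1).sum())  # best doc match per query token, summed

# Toy example: 2 query tokens, 3 document tokens, dim=4 (all unit-norm)
q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
d = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.6, 0.8, 0.0]])
print(maxsim(q, d))  # 1.0 (exact match) + 0.6 (best partial match) = 1.6
```

Because each query token is scored independently, you can read off which document tokens drove the score, which is what makes the matching interpretable.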
## Projection Layers
Raw embeddings from Qwen3-VL are 4096-dimensional. We project them down for efficiency:
| Layer | Input Dim | Output Dim | Parameters |
|-------|-----------|------------|------------|
| 128D | 4096 | 128 | 524K |
| 320D | 4096 | 320 | 1.3M |
### When to Use Each
- **128D**: Real-time search, memory-constrained
- **320D**: Batch indexing, quality-critical applications
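A minimal NumPy stand-in for the projection step (the real layer is trained; random weights and L2-normalized outputs are assumptions here). The parameter counts in the table fall directly out of the shapes: 4096 × 128 = 524,288 ≈ 524K and 4096 × 320 = 1,310,720 ≈ 1.3M.

```python
import numpy as np

IN_DIM = 4096  # Qwen3-VL hidden size

def make_projection(out_dim: int) -> np.ndarray:
    """Stand-in for the trained projection: one [IN_DIM, out_dim] linear map."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((IN_DIM, out_dim)).astype(np.float32)

def project(token_embs: np.ndarray, W: np.ndarray) -> np.ndarray:
    """[num_tokens, 4096] -> [num_tokens, out_dim], L2-normalized so
    MaxSim reduces to dot products."""
    out = token_embs @ W
    return out / np.linalg.norm(out, axis=-1, keepdims=True)

W128 = make_projection(128)   # 4096 * 128 = 524,288 params (~524K)
hidden = np.random.default_rng(1).standard_normal((10, IN_DIM)).astype(np.float32)
print(project(hidden, W128).shape)  # (10, 128)
```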
## Image Processing Pipeline
```
PDF Page / Image
       │
       ▼
┌─────────────────────────────┐
│ Resize to 1024×1024 max     │
│ (preserve aspect ratio)     │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Qwen3-VL Vision Encoder     │
│ Patch embedding + attention │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ <|image_pad|> token expand  │
│ → Token-level embeddings    │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Projection Layer            │
│ 4096D → 128D/320D           │
└──────────────┬──────────────┘
               │
      Document Embedding
      [num_patches × dim]
```
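The first stage of the pipeline, resizing to at most 1024×1024 while preserving aspect ratio, can be sketched as below; the exact resampling and rounding behavior of the real processor is an assumption:

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side,
    preserving aspect ratio; never upscale."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within(2048, 1536))  # (1024, 768)
print(fit_within(800, 600))    # (800, 600) - already within limits
```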
## Query Processing
Text queries go through the language model only:
```
Query Text
       │
       ▼
┌─────────────────────────────┐
│ Tokenizer                   │
│ → Token IDs                 │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Qwen3-VL Text Encoder       │
│ → Hidden states             │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Projection Layer            │
│ 4096D → 128D/320D           │
└──────────────┬──────────────┘
               │
       Query Embedding
       [num_tokens × dim]
```
## Memory Layout
On Apple Silicon (MLX):
```
┌──────────────────────────────────────┐
│            Unified Memory            │
├──────────────────────────────────────┤
│ Model weights        ~17GB           │
│ KV Cache             ~1-2GB          │
│ Projection layers    ~5MB            │
│ Working memory       ~1GB            │
├──────────────────────────────────────┤
│ Total                ~18-20GB        │
└──────────────────────────────────────┘
```
## Indexing Strategy
For production deployment:
1. **Pre-compute document embeddings** (offline)
2. **Store in vector database** (LanceDB, Qdrant, etc.)
3. **Online query embedding** (real-time)
4. **MaxSim scoring** (can be batched)
```python
# Indexing (offline)
for page in pdf_pages:
    embedding = embedder.embed_image(page)
    vector_db.insert(doc_id, page_num, embedding)

# Search (online)
query_emb = embedder.embed_query(query_text)
candidates = vector_db.search(query_emb, k=100)
scores = [embedder.maxsim(query_emb, doc_emb) for doc_emb in candidates]
```
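The scores from the search step above are then used to rerank the ANN candidates before returning results; a self-contained toy example (the IDs and scores are made up):

```python
# Rerank ANN candidates by their exact MaxSim scores, best first
candidates = ["doc1_p3", "doc2_p1", "doc1_p7"]  # (doc, page) hits from the vector DB
scores = [12.4, 18.9, 7.2]                      # MaxSim score per candidate

ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
top_ids = [doc for doc, _ in ranked]
print(top_ids)  # ['doc2_p1', 'doc1_p3', 'doc1_p7']
```

This two-stage pattern (cheap ANN retrieval of ~100 candidates, then exact MaxSim reranking) keeps latency low without giving up late-interaction quality on the final ranking.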
## File Format
Model weights use MLX-compatible safetensors:
```
model-00001-of-00007.safetensors # 5.0GB
model-00002-of-00007.safetensors # 4.9GB
model-00003-of-00007.safetensors # 4.8GB
model-00004-of-00007.safetensors # 4.8GB
model-00005-of-00007.safetensors # 5.0GB
model-00006-of-00007.safetensors # 5.0GB
model-00007-of-00007.safetensors # 3.2GB
--------
Total: ~33GB
```
Projection layers are separate safetensors files for flexibility.
---
**Co-Authored-By: [Maciej](void@div0.space) & [Klaudiusz](the1st@whoai.am)**