ColQwen3 Architecture

Created by M&K (c)2025 The LibraxisAI Team

Model Origins

ColQwen3 8B is based on the ColBERT late interaction paradigm, adapted for visual document retrieval using Qwen3-VL as the backbone.

Base Models Merged

  1. tomoro-ai/Colqwen3-8B-base - Foundation visual-language model
  2. Custom projection layers - Trained for document embedding
  3. Visual processor - Qwen3-VL image understanding

Late Interaction (MaxSim)

Unlike dense retrievers that produce single vectors, ColBERT-style models produce token-level embeddings:

Query: "financial report"
        ↓
[emb_financial, emb_report]  # N query tokens

Document Page:
        ↓
[emb_Q3, emb_revenue, emb_chart, ...]  # M document tokens

MaxSim Score = Σ_i max_j sim(q_i, d_j)
             = for each query token i, take its best match over all
               document tokens j, then sum those best matches

This enables:

  • Fine-grained matching - individual terms matter
  • Passage-level relevance - not just document-level
  • Interpretable scores - which terms matched
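A minimal sketch of MaxSim scoring in plain NumPy (random arrays stand in for real embeddings; cosine similarity via L2-normalized dot products is assumed here, which is the usual choice for ColBERT-style models):

import numpy as np

def maxsim(query_emb, doc_emb):
    """Sum, over query tokens, of each token's best match among doc tokens."""
    # L2-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=-1, keepdims=True)
    sim = q @ d.T                        # [N query tokens, M doc tokens]
    return float(sim.max(axis=1).sum()) # max over j, then sum over i

query = np.random.randn(2, 128)   # e.g. "financial report" -> 2 token embeddings
doc = np.random.randn(50, 128)    # e.g. 50 page patches
print(maxsim(query, doc))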

Projection Layers

Raw embeddings from Qwen3-VL are 4096-dimensional. We project them down for efficiency:

Layer   Input Dim   Output Dim   Parameters
128D    4096        128          524K
320D    4096        320          1.3M

When to Use Each

  • 128D: Real-time search, memory-constrained
  • 320D: Batch indexing, quality-critical applications
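Functionally, each projection is a single learned linear map. A sketch with placeholder NumPy weights (the trained weights ship separately; see File Format below):

import numpy as np

hidden = np.random.randn(50, 4096)   # [num_tokens, 4096] backbone hidden states
W = np.random.randn(4096, 128)       # placeholder for the trained 128D projection
emb = hidden @ W                     # [num_tokens, 128]
# Parameter count: 4096 * 128 = 524,288 (~524K), matching the table above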

Image Processing Pipeline

PDF Page / Image
      │
      ▼
┌─────────────────────────────┐
│ Resize to 1024×1024 max     │
│ (preserve aspect ratio)     │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Qwen3-VL Vision Encoder     │
│ Patch embedding + attention │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ <|image_pad|> token expand  │
│ → Token-level embeddings    │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Projection Layer            │
│ 4096D → 128D/320D           │
└──────────────┬──────────────┘
               │
         Document Embedding
         [num_patches × dim]
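The first stage is reproducible without the model: in Pillow, an aspect-preserving fit into 1024×1024 is a single call (the file path is a placeholder):

from PIL import Image

img = Image.open("page.png")   # placeholder path to a rendered PDF page
img.thumbnail((1024, 1024))    # in-place: longest side <= 1024, aspect ratio kept
print(img.size)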

Query Processing

Text queries go through the language model only:

Query Text
      │
      ▼
┌─────────────────────────────┐
│ Tokenizer                   │
│ → Token IDs                 │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Qwen3-VL Text Encoder       │
│ → Hidden states             │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Projection Layer            │
│ 4096D → 128D/320D           │
└──────────────┬──────────────┘
               │
         Query Embedding
         [num_tokens × dim]
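A hedged usage sketch tying the two paths together (assuming an embedder object with the embed_image/embed_query methods used in the indexing example below; the real API may differ):

page_emb = embedder.embed_image(page_image)           # [num_patches x dim]
query_emb = embedder.embed_query("financial report")  # [num_tokens x dim]
# Both land in the same projected space, so MaxSim can compare them directly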

Memory Layout

On Apple Silicon (MLX):

┌──────────────────────────────────────┐
│ Unified Memory                       │
├──────────────────────────────────────┤
│ Model weights        ~17GB           │
│ KV Cache             ~1-2GB          │
│ Projection layers    ~5MB            │
│ Working memory       ~1GB            │
├──────────────────────────────────────┤
│ Total                ~18-20GB        │
└──────────────────────────────────────┘

Indexing Strategy

For production deployment:

  1. Pre-compute document embeddings (offline)
  2. Store in vector database (LanceDB, Qdrant, etc.)
  3. Online query embedding (real-time)
  4. MaxSim scoring (can be batched)

A minimal sketch of that flow (embedder and vector_db are assumed interfaces rather than a fixed API):

# Indexing (offline)
for page_num, page in enumerate(pdf_pages):
    embedding = embedder.embed_image(page)    # [num_patches x dim] per page
    vector_db.insert(doc_id, page_num, embedding)  # doc_id identifies the source PDF

# Search (online)
query_emb = embedder.embed_query(query_text)  # [num_tokens x dim]
candidates = vector_db.search(query_emb, k=100)   # fast first-stage retrieval
scores = [embedder.maxsim(query_emb, doc_emb) for doc_emb in candidates]
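Note the two-stage shape of the search path: the vector database handles fast first-stage retrieval (how it matches a multi-vector query, e.g. by pooling or per-token search, depends on the backend), and exact MaxSim rescoring runs only on the top-k candidates. This keeps query latency low without giving up late-interaction quality in the final ranking.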

File Format

Model weights use MLX-compatible safetensors:

model-00001-of-00007.safetensors  # 5.0GB
model-00002-of-00007.safetensors  # 4.9GB
model-00003-of-00007.safetensors  # 4.8GB
model-00004-of-00007.safetensors  # 4.8GB
model-00005-of-00007.safetensors  # 5.0GB
model-00006-of-00007.safetensors  # 5.0GB
model-00007-of-00007.safetensors  # 3.2GB
                                  --------
                           Total: ~33GB

Projection layers are separate safetensors files for flexibility.
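A sketch of inspecting one of those files under MLX (the file name is illustrative; mx.load reads a safetensors file into a dict of arrays):

import mlx.core as mx

weights = mx.load("projection_128d.safetensors")  # hypothetical file name
for name, w in weights.items():
    print(name, w.shape)  # expect a 4096 -> 128 weight matrix (possibly transposed)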


Co-Authored-By: Maciej & Klaudiusz