---
license: apache-2.0
base_model:
- lightonai/GTE-ModernColBERT-v1
pipeline_tag: sentence-similarity
tags:
- SMVE
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
datasets:
- lightonai/ms-marco-en-bge-gemma
language:
- en
---

Looking for production-ready multi-vector search? Check out TopK, a hybrid retrieval engine built on object storage.

# Iso-ModernColBERT

This model is an isotropically corrected version of [GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1). It's built for production use cases where retrieval speed and quality matter. Compared to the original model, this version delivers up to 3x faster inference in `bf16` with almost no loss in accuracy and enables scalable multi-vector retrieval through [Sparse Multi-Vector Encoding (SMVE)](https://www.topk.io/blog/20260311-smve-multi-vector-retrieval) inside [TopK](https://topk.io).

## Usage

Install PyLate for embeddings and the TopK SDK for retrieval.

```
pip install -U pylate topk-sdk
```

### Embed documents

First, load the model into PyLate's `ColBERT` class and encode your documents.

```python
import torch
import numpy as np
from pylate import models

model = models.ColBERT(
    model_name_or_path="topk-io/Iso-ModernColBERT",
    model_kwargs={'torch_dtype': torch.bfloat16},
)

documents = [
    "document 1 text",
    "document 2 text",
    "document 3 text",
]

doc_embeddings = model.encode(
    documents,
    batch_size=32,
    # Set is_query=False to indicate that these are documents, not queries
    is_query=False,
    show_progress_bar=True,
)
```

### Store document embeddings

Index multi-vector document embeddings inside [TopK](https://topk.io), a hybrid retrieval engine built on object storage. To get started, [create an API key](https://console.topk.io).

```python
from topk_sdk import Client
from topk_sdk.schema import matrix, multi_vector_index

# Initialize TopK client
client = Client(
    api_key = "",
    region = "aws-us-east-1-elastica",
)

# Create a collection with multi-vector index
client.collections().create(
    "iso-moderncolbert",
    schema = {
        "token_embeddings": matrix(dimension=128, value_type="f16")
            .index(multi_vector_index(metric="maxsim"))
    }
)

# Upsert document embeddings
client.collection("iso-moderncolbert").upsert([
    {
        "_id": str(i),
        "token_embeddings": emb.astype(np.float16),
        "text": text
    }
    for (i, (text, emb)) in enumerate(zip(documents, doc_embeddings))
])
```

### Retrieve documents for queries

Your documents are now durably persisted in the index and queryable.

```python
from topk_sdk.query import fn, select, field

# Encode query string
query_embedding = model.encode(
    "query for document 3",
    # Set is_query=True for queries
    is_query=True,
    show_progress_bar=False,
)

# Retrieve top-k documents using the query embedding
results = client.collection("iso-moderncolbert").query(
    select(
        "_id",
        "text",
        # Compute maxsim between query and indexed documents
        maxsim_score = fn.multi_vector_distance(
            "token_embeddings",
            query_embedding.astype(np.float16)
        )
    )
    # Get the top 10 matching documents
    .topk(field("maxsim_score"), 10)
)

for r in results:
    print(f"id: {r['_id']}, score: {r['maxsim_score']}, text: {r['text']}")
```

TopK's query language is flexible: you can tune retrieval parameters and combine multi-vector search with metadata filters, keyword search, and more. Check out our [docs](https://docs.topk.io) to learn more.

# Evaluation results

We evaluated our model with an internal evaluation harness on two standard benchmarks, BEIR and NanoBEIR. For baselines, we selected [GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) and evaluated its performance in fp32 and bf16 precision (denoted by `GTE fp32` and `GTE bf16`, respectively).
The last two columns of each table, **Iso bf16** and **Δ vs GTE**, describe Iso-ModernColBERT (ours) in bf16 precision; Δ is the relative change of the Iso bf16 value over the named GTE baseline. In all configurations we used the same SMVE implementation with width 65536 and k=32.

## BEIR

### NDCG@10: ranking quality is robust to bf16

End-to-end ranking quality reported as NDCG@10, using **exact MaxSim** scoring (no approximation). GTE-ModernColBERT-v1 loses ~7 NDCG points on average going from fp32 → bf16, about a 13% relative drop, with the worst-hit datasets (trec-covid, climate-fever, hotpotqa) dropping 12–16 points. Iso-ModernColBERT keeps fp32-level ranking quality in bf16, recovering most of that gap on average and on every dataset.

| dataset       | GTE fp32 N@10 | GTE bf16 N@10 | **Iso bf16 N@10** | **Δ vs GTE bf16** |
|---------------|--------------:|--------------:|------------------:|------------------:|
| arguana       | 35.81% | 30.35% | **34.63%** | **+14.10%** |
| climate-fever | 32.44% | 19.49% | **31.62%** | **+62.24%** |
| cqadupstack   | 40.54% | 38.25% | **40.64%** | **+6.25%** |
| dbpedia       | 53.96% | 48.43% | **52.84%** | **+9.11%** |
| fever         | 88.80% | 80.67% | **87.08%** | **+7.95%** |
| fiqa          | 45.56% | 37.15% | **43.48%** | **+17.04%** |
| hotpotqa      | 78.36% | 66.74% | **75.85%** | **+13.65%** |
| msmarco       | 46.12% | 41.82% | **45.30%** | **+8.32%** |
| nfcorpus      | 37.81% | 35.98% | **37.31%** | **+3.70%** |
| nq            | 62.24% | 52.60% | **60.45%** | **+14.92%** |
| quora         | 86.63% | 79.58% | **85.05%** | **+6.87%** |
| scidocs       | 19.49% | 17.82% | **18.81%** | **+5.56%** |
| scifact       | 75.98% | 71.55% | **75.26%** | **+5.18%** |
| touche2020    | 31.30% | 22.93% | **29.45%** | **+28.43%** |
| trec-covid    | 89.30% | 73.47% | **83.76%** | **+14.01%** |
| **avg**       | **54.96%** | **47.79%** | **53.44%** | **+11.82%** |

### Recall@100: SMVE as a first stage with ~10× overfetch

The following results show model performance when used with [Sparse Multi-Vector Encoder (SMVE)](https://www.topk.io/blog/20260311-smve-multi-vector-retrieval) as a
first stage retriever. For an SMVE first stage to be usable, it needs to surface the candidates that the exact fp32 MaxSim model would have ranked at the top. SMVE on GTE-ModernColBERT-v1 is broken: its compacted latent geometry means random anchors don't separate vectors well. On average, Iso-ModernColBERT's SMVE recovers (and often exceeds) the fp32 MaxSim top-10 within 10× overfetch.

| dataset       | GTE fp32 MaxSim R@10 | GTE fp32 SMVE R@100 | **Iso bf16 SMVE R@100** | **Δ vs GTE fp32 SMVE** |
|---------------|---------------------:|--------------------:|------------------------:|-----------------------:|
| arguana       | 72.81% | 27.69% | **84.51%** | **+205.20%** |
| climate-fever | 39.27% | 0.41% | **48.84%** | **+11,812%** ⚠ |
| cqadupstack   | 50.48% | 11.78% | **37.29%** | **+216.55%** |
| dbpedia       | 30.45% | 8.54% | **36.89%** | **+331.97%** |
| fever         | 94.20% | 10.05% | **94.31%** | **+838.41%** |
| fiqa          | 52.15% | 6.45% | **49.12%** | **+661.55%** |
| hotpotqa      | 80.73% | 12.29% | **66.59%** | **+441.82%** |
| msmarco       | 68.64% | 27.77% | **75.83%** | **+173.07%** |
| nfcorpus      | 18.03% | 16.63% | **25.60%** | **+53.94%** |
| nq            | 82.03% | 14.60% | **78.85%** | **+440.07%** |
| quora         | 94.92% | 43.73% | **82.86%** | **+89.48%** |
| scidocs       | 20.36% | 12.29% | **29.32%** | **+138.57%** |
| scifact       | 87.39% | 60.93% | **90.00%** | **+47.71%** |
| touche2020    | 19.69% | 4.47% | **40.17%** | **+798.66%** |
| trec-covid    | 2.27% | 0.89% | **7.73%** | **+768.54%** |
| **avg**       | **54.23%** | **17.23%** | **56.53%** | **+228.09%** |

> ⚠ The +11,812% on climate-fever is an artifact of a near-zero baseline (0.41%): GTE's SMVE is so broken on that dataset that the ratio explodes. Read it as *"GTE SMVE doesn't work here at all"*, not as a meaningful magnitude.

### Recall@1000: SMVE as a first stage with ~10× overfetch (deeper pool)

Same picture at the next pool depth: Iso-ModernColBERT SMVE R@1000 matches or exceeds fp32 MaxSim R@100 on average and on most datasets, while GTE's SMVE collapses.

| dataset       | GTE fp32 MaxSim R@100 | GTE fp32 SMVE R@1000 | **Iso bf16 SMVE R@1000** | **Δ vs GTE fp32 SMVE** |
|---------------|----------------------:|---------------------:|-------------------------:|-----------------------:|
| arguana       | 95.72% | 68.31% | **97.00%** | **+42.00%** |
| climate-fever | 66.45% | 0.93% | **68.87%** | **+7,305%** ⚠ |
| cqadupstack   | 71.44% | 26.78% | **55.78%** | **+108.29%** |
| dbpedia       | 62.50% | 18.35% | **57.72%** | **+214.55%** |
| fever         | 97.46% | 16.74% | **96.91%** | **+478.91%** |
| fiqa          | 75.64% | 21.09% | **76.70%** | **+263.68%** |
| hotpotqa      | 90.31% | 22.72% | **78.83%** | **+247.05%** |
| msmarco       | 93.14% | 46.57% | **90.97%** | **+95.34%** |
| nfcorpus      | 32.22% | 49.11% | **57.16%** | **+16.39%** |
| nq            | 96.59% | 29.88% | **91.42%** | **+205.96%** |
| quora         | 99.45% | 69.38% | **94.86%** | **+36.72%** |
| scidocs       | 44.07% | 32.62% | **53.43%** | **+63.80%** |
| scifact       | 96.00% | 89.82% | **99.33%** | **+10.59%** |
| touche2020    | 52.60% | 13.91% | **69.63%** | **+400.58%** |
| trec-covid    | 16.02% | 3.85% | **29.57%** | **+668.05%** |
| **avg**       | **72.64%** | **34.00%** | **74.55%** | **+119.26%** |

> ⚠ Again, climate-fever's +7,305% is driven by a near-zero baseline (0.93%): GTE SMVE simply doesn't work on this dataset.

## NanoBEIR

### NDCG@10: ranking quality is robust to bf16

End-to-end ranking quality reported as NDCG@10, using **exact MaxSim** scoring (no approximation). GTE-ModernColBERT-v1 drops ~6 NDCG points on average going from fp32 → bf16, about a 9% relative drop, with some datasets (ArguAna, ClimateFEVER, FiQA, Touche2020) losing 8–13 points. Iso-ModernColBERT keeps fp32-level ranking quality in bf16: the average is within about 0.6 points of fp32, and most per-dataset gaps shrink to a few percent.

| dataset        | GTE fp32 N@10 | GTE bf16 N@10 | **Iso bf16 N@10** | **Δ vs GTE bf16** |
|----------------|--------------:|--------------:|------------------:|------------------:|
| ArguAna        | 51.98% | 43.50% | **54.31%** | **+24.85%** |
| ClimateFEVER   | 40.46% | 27.78% | **38.17%** | **+37.40%** |
| DBPedia        | 72.82% | 70.39% | **71.56%** | **+1.66%** |
| FEVER          | 94.52% | 89.82% | **93.23%** | **+3.80%** |
| FiQA2018       | 56.64% | 44.13% | **55.79%** | **+26.42%** |
| HotpotQA       | 89.95% | 85.64% | **90.47%** | **+5.64%** |
| MSMARCO        | 70.89% | 68.77% | **72.56%** | **+5.51%** |
| NFCorpus       | 39.58% | 39.20% | **38.67%** | **-1.35%** |
| NQ             | 77.19% | 69.01% | **73.64%** | **+6.71%** |
| QuoraRetrieval | 97.08% | 90.60% | **96.53%** | **+6.54%** |
| SCIDOCS        | 39.85% | 38.02% | **38.14%** | **+0.32%** |
| SciFact        | 82.98% | 80.45% | **83.32%** | **+3.57%** |
| Touche2020     | 59.34% | 48.67% | **58.77%** | **+20.75%** |
| **avg**        | **67.18%** | **61.23%** | **66.55%** | **+8.69%** |

### Recall@100: SMVE as a first stage with ~10× overfetch

The following results show model performance when used with [Sparse Multi-Vector Encoder (SMVE)](https://www.topk.io/blog/20260311-smve-multi-vector-retrieval) as a first stage retriever. For an SMVE first stage to be usable, it needs to surface the candidates that the exact fp32 MaxSim model would have ranked at the top. SMVE on GTE-ModernColBERT-v1 is broken: its compacted latent geometry means random anchors don't separate vectors well. Iso-ModernColBERT's SMVE recovers (and often exceeds) fp32 MaxSim's top-10 within 10× overfetch.

| dataset        | GTE fp32 MaxSim R@10 | GTE fp32 SMVE R@100 | **Iso bf16 SMVE R@100** | **Δ vs GTE fp32 SMVE** |
|----------------|---------------------:|--------------------:|------------------------:|-----------------------:|
| ArguAna        | 80.00% | 32.00% | **90.00%** | **+181.25%** |
| ClimateFEVER   | 47.07% | 20.67% | **66.97%** | **+224.00%** |
| DBPedia        | 41.21% | 49.00% | **72.85%** | **+48.67%** |
| FEVER          | 98.00% | 61.33% | **98.00%** | **+59.79%** |
| FiQA2018       | 64.12% | 23.25% | **78.93%** | **+239.48%** |
| HotpotQA       | 92.00% | 46.00% | **90.00%** | **+95.65%** |
| MSMARCO        | 92.00% | 84.00% | **98.00%** | **+16.67%** |
| NFCorpus       | 15.66% | 16.33% | **24.58%** | **+50.52%** |
| NQ             | 88.00% | 70.00% | **95.00%** | **+35.71%** |
| QuoraRetrieval | 98.93% | 87.93% | **96.60%** | **+9.86%** |
| SCIDOCS        | 39.67% | 37.87% | **61.17%** | **+61.53%** |
| SciFact        | 93.00% | 57.50% | **92.00%** | **+60.00%** |
| Touche2020     | 33.52% | 33.55% | **69.86%** | **+108.23%** |
| **avg**        | **67.94%** | **47.65%** | **79.53%** | **+66.91%** |

### Recall@1000: SMVE as a first stage with ~10× overfetch (deeper pool)

Same picture at the next pool depth: Iso-ModernColBERT SMVE R@1000 essentially matches or exceeds fp32 MaxSim R@100 across the board, while GTE's SMVE consistently undershoots.

| dataset        | GTE fp32 MaxSim R@100 | GTE fp32 SMVE R@1000 | **Iso bf16 SMVE R@1000** | **Δ vs GTE fp32 SMVE** |
|----------------|----------------------:|---------------------:|-------------------------:|-----------------------:|
| ArguAna        | 96.00% | 80.00% | **100.00%** | **+25.00%** |
| ClimateFEVER   | 81.17% | 68.80% | **89.03%** | **+29.40%** |
| DBPedia        | 85.58% | 84.85% | **96.20%** | **+13.38%** |
| FEVER          | 100.00% | 94.33% | **99.00%** | **+4.95%** |
| FiQA2018       | 86.82% | 72.61% | **91.35%** | **+25.81%** |
| HotpotQA       | 97.00% | 84.00% | **98.00%** | **+16.67%** |
| MSMARCO        | 100.00% | 98.00% | **100.00%** | **+2.04%** |
| NFCorpus       | 30.55% | 52.82% | **59.33%** | **+12.32%** |
| NQ             | 100.00% | 91.00% | **100.00%** | **+9.89%** |
| QuoraRetrieval | 100.00% | 96.00% | **100.00%** | **+4.17%** |
| SCIDOCS        | 70.67% | 78.93% | **90.80%** | **+15.04%** |
| SciFact        | 96.00% | 93.00% | **100.00%** | **+7.53%** |
| Touche2020     | 77.23% | 80.46% | **93.09%** | **+15.70%** |
| **avg**        | **86.23%** | **82.68%** | **93.60%** | **+13.21%** |
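
The recall tables above treat SMVE as a first stage that overfetches candidates, with exact MaxSim as the reference ranking. As a rough illustration of what exact MaxSim scoring over such a candidate pool could look like, here is a minimal NumPy sketch. It is not TopK's or PyLate's implementation: `maxsim` and `rerank_candidates` are hypothetical helper names, the candidate list is assumed to come from some first-stage retriever, and the toy embeddings are made up for the example.

```python
import numpy as np


def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Exact MaxSim: for each query token, take the highest dot product
    against any document token, then sum over query tokens."""
    # Upcast so that f16 inputs are accumulated in f32.
    sim = query_emb.astype(np.float32) @ doc_emb.astype(np.float32).T  # (n_query, n_doc)
    return float(sim.max(axis=1).sum())


def rerank_candidates(query_emb, candidates, k=10):
    """Score (doc_id, doc_emb) pairs with exact MaxSim and keep the top k.

    `candidates` is assumed to be an overfetched pool (for example ~10x k)
    returned by a first-stage retriever; this helper is hypothetical.
    """
    scored = [(doc_id, maxsim(query_emb, emb)) for doc_id, emb in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy example: two query tokens in a 2-dimensional embedding space.
query = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float16)
candidates = [
    ("doc-a", np.array([[1.0, 0.0]], dtype=np.float16)),              # matches one query token
    ("doc-b", np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float16)),  # matches both query tokens
]

top = rerank_candidates(query, candidates, k=1)
print(top)  # doc-b ranks first with score 2.0
```

In this sketch the first stage only has to place the right documents somewhere in the candidate pool; the final order comes from the exact score, which is why the tables compare SMVE recall at 10× depth against MaxSim recall at the target depth.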