This is a PyLate model fine-tuned from skt/A.X-Encoder-base on the parquet dataset. It maps sentences and paragraphs to sequences of 32-, 64-, 96-, or 128-dimensional dense token vectors and can be used for semantic textual similarity via the MaxSim operator.
Model architecture:

```
ColBERTWrapper(
  (0): Transformer({'max_seq_length': 2047, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
```
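For reference, the MaxSim operator scores a query-document pair by matching each query token embedding against its most similar document token embedding and summing the results. A minimal sketch in plain PyTorch (not PyLate's own implementation):

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction MaxSim: for each query token, take the maximum
    similarity over all document tokens, then sum over query tokens.

    query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim)
    """
    q = F.normalize(query_emb, dim=-1)  # unit-normalize so dot product = cosine
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T                       # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=-1).values.sum()
```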
This model supports Matryoshka embeddings with multiple dimensions (32, 64, 96, 128) using separate projection heads (Jina-ColBERT-v2 style).
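Conceptually, each target dimension gets its own linear projection head on top of the shared encoder, and only the active head is applied at encode time. A hypothetical sketch of that structure (names and defaults are illustrative, not the package's actual internals):

```python
import torch.nn as nn

class MatryoshkaProjection(nn.Module):
    """Illustrative sketch: one linear head per Matryoshka dimension,
    with the active head selected at encode time."""
    def __init__(self, hidden_size: int = 768, dims=(32, 64, 96, 128)):
        super().__init__()
        self.heads = nn.ModuleDict(
            {str(d): nn.Linear(hidden_size, d, bias=False) for d in dims}
        )
        self.active_dim = max(dims)

    def set_active_dim(self, dim: int) -> None:
        assert str(dim) in self.heads, f"unsupported dimension: {dim}"
        self.active_dim = dim

    def forward(self, token_embeddings):
        # token_embeddings: (..., hidden_size) -> (..., active_dim)
        return self.heads[str(self.active_dim)](token_embeddings)
```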
Installation:

```bash
pip install colbert-matryoshka
```
Basic usage:

```python
from colbert_matryoshka import MatryoshkaColBERT

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")

# Set embedding dimension (32, 64, 96, or 128)
model.set_active_dim(128)

# Encode queries and documents
query_embeddings = model.encode(["검색 쿼리"], is_query=True)
doc_embeddings = model.encode(["문서 내용"], is_query=False)

print(f"Query shape: {query_embeddings[0].shape}")  # (num_tokens, 128)
print(f"Doc shape: {doc_embeddings[0].shape}")      # (num_tokens, 128)
```
Use this model with PyLate to index and retrieve documents. The index uses FastPLAID for efficient similarity search.
```python
from colbert_matryoshka import MatryoshkaColBERT
from pylate import indexes, retrieve

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)

# Initialize PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,
)

# Encode and index documents
documents_ids = ["1", "2", "3"]
documents = ["첫번째 문서입니다", "두번째 문서입니다", "세번째 문서입니다"]
documents_embeddings = model.encode(documents, is_query=False)

index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

# Retrieve
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["첫번째 문서 검색"], is_query=True)

scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=3,
)
print(scores)
# [[{'id': '1', 'score': 24.51}, {'id': '2', 'score': 23.54}, {'id': '3', 'score': 23.33}]]
```
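To query an existing index in a later session, point `indexes.PLAID` at the same `index_folder` and `index_name` with `override=False`, so the stored index is loaded instead of rebuilt. Make sure to call `model.set_active_dim` with the same dimension used at indexing time, since embeddings of different dimensions are not comparable.

To score a small set of candidate documents against each query without building an index, use `rank.rerank`: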
```python
from colbert_matryoshka import MatryoshkaColBERT
from pylate import rank

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)

queries = ["인공지능 기술", "한국어 자연어처리"]
documents = [
    ["AI와 머신러닝에 대한 문서", "요리 레시피 문서"],
    ["한국어 NLP 연구", "영어 문법 설명", "프로그래밍 튜토리얼"],
]
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Encode queries
queries_embeddings = model.encode(queries, is_query=True)

# Encode documents (per query)
documents_embeddings = []
for docs in documents:
    documents_embeddings.append(model.encode(docs, is_query=False))

# Rerank
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked_documents)
# Query "인공지능 기술": [{'id': 1, 'score': 3.63}, {'id': 2, 'score': 0.90}]
# Query "한국어 자연어처리": [{'id': 1, 'score': 4.60}, {'id': 3, 'score': 3.88}, {'id': 2, 'score': 1.93}]
```
| Model | AutoRAG | Ko-StrategyQA | NanoBEIR-Ko | Avg |
|---|---|---|---|---|
| dragonkue-colbert-ko-0.1b (149M) | 0.989 | 0.741 | 0.519 | 0.750 |
| BGE-M3-MultiVec (568M) | 0.844 | 0.797 | 0.569 | 0.737 |
| LFM2-ColBERT (353M) | 0.833 | 0.757 | 0.528 | 0.706 |
| colbert-ko-v1 (149M) | 0.966 | 0.713 | 0.476 | 0.718 |
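Scores for this model by active embedding dimension (the 128-dimension row matches the full-size results above):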
| Dimension | AutoRAG | Ko-StrategyQA | NanoBEIR-Ko |
|---|---|---|---|
| 32 | 0.983 | 0.721 | 0.504 |
| 64 | 0.985 | 0.728 | 0.510 |
| 96 | 0.979 | 0.736 | 0.517 |
| 128 | 0.989 | 0.741 | 0.519 |
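Even at 32 dimensions the model retains roughly 98% of its 128-dimension average score ((0.983 + 0.721 + 0.504)/3 ≈ 0.736 versus (0.989 + 0.741 + 0.519)/3 ≈ 0.750) while storing 4× fewer floats per token.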
The model was trained with src.losses.MatryoshkaColBERTLoss using these parameters:

```json
{
    "dims": [32, 64, 96, 128],
    "weights": [0.25, 0.25, 0.25, 0.25],
    "temperature": 1.0
}
```
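The loss combines the same contrastive objective computed at every Matryoshka dimension into a single weighted sum. A minimal sketch of that combination, assuming in-batch cross-entropy over MaxSim scores (the actual src.losses.MatryoshkaColBERTLoss internals may differ):

```python
import torch
import torch.nn.functional as F

def matryoshka_loss(scores_per_dim: dict, labels: torch.Tensor,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    temperature: float = 1.0) -> torch.Tensor:
    """Weighted sum of per-dimension contrastive losses.

    scores_per_dim: maps each dim (32/64/96/128) to a (batch, num_candidates)
    matrix of MaxSim scores produced with that dimension's projection head.
    labels: index of the positive candidate for each query in the batch.
    """
    total = torch.tensor(0.0)
    for w, scores in zip(weights, scores_per_dim.values()):
        total = total + w * F.cross_entropy(scores / temperature, labels)
    return total
```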
Non-default hyperparameters:

- eval_strategy: steps
- per_device_train_batch_size: 64
- learning_rate: 1e-05
- num_train_epochs: 2
- warmup_ratio: 0.1
- fp16: True
- dataloader_drop_last: True
- gradient_checkpointing: True
- gradient_checkpointing_kwargs: {'use_reentrant': False}
- router_mapping: {'anchor': 'query', 'positive': 'document', 'neg_0': 'document'}

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 64
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 1e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: True
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: True
- gradient_checkpointing_kwargs: {'use_reentrant': False}
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
- router_mapping: {'anchor': 'query', 'positive': 'document', 'neg_0': 'document'}
- learning_rate_mapping: {}

Training logs:

| Epoch | Step | Training Loss |
|---|---|---|
| 0.0017 | 10 | 4.1388 |
| 0.0034 | 20 | 4.1142 |
| 0.0051 | 30 | 3.9797 |
| 0.0067 | 40 | 3.8761 |
| 0.0084 | 50 | 3.6167 |
| 0.0101 | 60 | 3.424 |
| 0.0118 | 70 | 3.0256 |
| 0.0135 | 80 | 2.827 |
| 0.0152 | 90 | 2.5787 |
| 0.0169 | 100 | 2.2696 |
| 0.0185 | 110 | 2.0266 |
| 0.0202 | 120 | 1.6815 |
| 0.0219 | 130 | 1.4739 |
| 0.0236 | 140 | 1.2877 |
| 0.0253 | 150 | 1.1474 |
| 0.0270 | 160 | 1.0143 |
| 0.0286 | 170 | 0.9363 |
| 0.0303 | 180 | 0.9189 |
| 0.0320 | 190 | 0.7442 |
| 0.0337 | 200 | 0.6919 |
| 0.0354 | 210 | 0.6251 |
| 0.0371 | 220 | 0.6527 |
| 0.0388 | 230 | 0.5923 |
| 0.0404 | 240 | 0.572 |
| 0.0421 | 250 | 0.5255 |
| 0.0438 | 260 | 0.4407 |
| 0.0455 | 270 | 0.5038 |
| 0.0472 | 280 | 0.3939 |
| 0.0489 | 290 | 0.3938 |
| 0.0506 | 300 | 0.3253 |
| 0.0522 | 310 | 0.335 |
| 0.0539 | 320 | 0.2855 |
| 0.0556 | 330 | 0.2396 |
| 0.0573 | 340 | 0.252 |
| 0.0590 | 350 | 0.2299 |
| 0.0607 | 360 | 0.2133 |
| 0.0624 | 370 | 0.2186 |
| 0.0640 | 380 | 0.1935 |
| 0.0657 | 390 | 0.1743 |
| 0.0674 | 400 | 0.1462 |
| 0.0691 | 410 | 0.1552 |
| 0.0708 | 420 | 0.1491 |
| 0.0725 | 430 | 0.1581 |
| 0.0741 | 440 | 0.1635 |
| 0.0758 | 450 | 0.1383 |
| 0.0775 | 460 | 0.1377 |
| 0.0792 | 470 | 0.1155 |
| 0.0809 | 480 | 0.1184 |
| 0.0826 | 490 | 0.1333 |
| 0.0843 | 500 | 0.1341 |
| 0.0859 | 510 | 0.1259 |
| 0.0876 | 520 | 0.0748 |
| 0.0893 | 530 | 0.1342 |
| 0.0910 | 540 | 0.1058 |
| 0.0927 | 550 | 0.1024 |
| 0.0944 | 560 | 0.0921 |
| 0.0961 | 570 | 0.104 |
| 0.0977 | 580 | 0.1069 |
| 0.0994 | 590 | 0.0925 |
| 0.1011 | 600 | 0.1146 |
| 0.1028 | 610 | 0.0682 |
| 0.1045 | 620 | 0.0711 |
| 0.1062 | 630 | 0.1491 |
| 0.1079 | 640 | 0.0602 |
| 0.1095 | 650 | 0.0753 |
| 0.1112 | 660 | 0.0713 |
| 0.1129 | 670 | 0.0739 |
| 0.1146 | 680 | 0.0783 |
| 0.1163 | 690 | 0.0678 |
| 0.1180 | 700 | 0.0963 |
| 0.1196 | 710 | 0.0677 |
| 0.1213 | 720 | 0.0829 |
| 0.1230 | 730 | 0.0719 |
| 0.1247 | 740 | 0.0646 |
| 0.1264 | 750 | 0.0927 |
| 0.1281 | 760 | 0.0755 |
| 0.1298 | 770 | 0.0799 |
| 0.1314 | 780 | 0.0535 |
| 0.1331 | 790 | 0.0555 |
| 0.1348 | 800 | 0.0804 |
| 0.1365 | 810 | 0.0627 |
| 0.1382 | 820 | 0.0726 |
| 0.1399 | 830 | 0.0685 |
| 0.1416 | 840 | 0.0421 |
| 0.1432 | 850 | 0.0895 |
| 0.1449 | 860 | 0.0964 |
| 0.1466 | 870 | 0.0515 |
| 0.1483 | 880 | 0.0825 |
| 0.1500 | 890 | 0.0801 |
| 0.1517 | 900 | 0.0579 |
| 0.1534 | 910 | 0.0559 |
| 0.1550 | 920 | 0.0432 |
| 0.1567 | 930 | 0.0553 |
| 0.1584 | 940 | 0.0577 |
| 0.1601 | 950 | 0.0451 |
| 0.1618 | 960 | 0.049 |
| 0.1635 | 970 | 0.0459 |
| 0.1651 | 980 | 0.0684 |
| 0.1668 | 990 | 0.0449 |
| 0.1685 | 1000 | 0.0392 |
| 0.1702 | 1010 | 0.071 |
| 0.1719 | 1020 | 0.0511 |
| 0.1736 | 1030 | 0.0501 |
| 0.1753 | 1040 | 0.0464 |
| 0.1769 | 1050 | 0.0678 |
| 0.1786 | 1060 | 0.0597 |
| 0.1803 | 1070 | 0.0569 |
| 0.1820 | 1080 | 0.044 |
| 0.1837 | 1090 | 0.0452 |
| 0.1854 | 1100 | 0.0394 |
| 0.1871 | 1110 | 0.0496 |
| 0.1887 | 1120 | 0.0296 |
| 0.1904 | 1130 | 0.0321 |
| 0.1921 | 1140 | 0.0525 |
| 0.1938 | 1150 | 0.058 |
| 0.1955 | 1160 | 0.0552 |
| 0.1972 | 1170 | 0.035 |
| 0.1989 | 1180 | 0.0468 |
| 0.1999 | 1186 | - |
| 0.2005 | 1190 | 0.0383 |
| 0.2022 | 1200 | 0.0599 |
| 0.2039 | 1210 | 0.0572 |
| 0.2056 | 1220 | 0.0383 |
| 0.2073 | 1230 | 0.0486 |
| 0.2090 | 1240 | 0.0407 |
| 0.2107 | 1250 | 0.044 |
| 0.2123 | 1260 | 0.04 |
| 0.2140 | 1270 | 0.0338 |
| 0.2157 | 1280 | 0.036 |
| 0.2174 | 1290 | 0.0511 |
| 0.2191 | 1300 | 0.0472 |
| 0.2208 | 1310 | 0.031 |
| 0.2224 | 1320 | 0.0614 |
| 0.2241 | 1330 | 0.0388 |
| 0.2258 | 1340 | 0.0403 |
| 0.2275 | 1350 | 0.047 |
| 0.2292 | 1360 | 0.033 |
| 0.2309 | 1370 | 0.0524 |
| 0.2326 | 1380 | 0.0357 |
| 0.2342 | 1390 | 0.0463 |
| 0.2359 | 1400 | 0.0355 |
| 0.2376 | 1410 | 0.0411 |
| 0.2393 | 1420 | 0.028 |
| 0.2410 | 1430 | 0.0386 |
| 0.2427 | 1440 | 0.0553 |
| 0.2444 | 1450 | 0.0353 |
| 0.2460 | 1460 | 0.0462 |
| 0.2477 | 1470 | 0.0399 |
| 0.2494 | 1480 | 0.0319 |
| 0.2511 | 1490 | 0.0456 |
| 0.2528 | 1500 | 0.0302 |
| 0.2545 | 1510 | 0.0366 |
| 0.2562 | 1520 | 0.0409 |
| 0.2578 | 1530 | 0.0337 |
| 0.2595 | 1540 | 0.0362 |
| 0.2612 | 1550 | 0.0318 |
| 0.2629 | 1560 | 0.0433 |
| 0.2646 | 1570 | 0.0379 |
| 0.2663 | 1580 | 0.0419 |
| 0.2679 | 1590 | 0.0225 |
| 0.2696 | 1600 | 0.0269 |
| 0.2713 | 1610 | 0.0295 |
| 0.2730 | 1620 | 0.048 |
| 0.2747 | 1630 | 0.0382 |
| 0.2764 | 1640 | 0.0341 |
| 0.2781 | 1650 | 0.0334 |
| 0.2797 | 1660 | 0.0534 |
| 0.2814 | 1670 | 0.0445 |
| 0.2831 | 1680 | 0.0284 |
| 0.2848 | 1690 | 0.0327 |
| 0.2865 | 1700 | 0.0309 |
| 0.2882 | 1710 | 0.0372 |
| 0.2899 | 1720 | 0.0384 |
| 0.2915 | 1730 | 0.022 |
| 0.2932 | 1740 | 0.0266 |
| 0.2949 | 1750 | 0.0399 |
| 0.2966 | 1760 | 0.0342 |
| 0.2983 | 1770 | 0.0391 |
| 0.3000 | 1780 | 0.0349 |
| 0.3017 | 1790 | 0.0365 |
| 0.3033 | 1800 | 0.0322 |
| 0.3050 | 1810 | 0.0414 |
| 0.3067 | 1820 | 0.0297 |
| 0.3084 | 1830 | 0.0446 |
| 0.3101 | 1840 | 0.0312 |
| 0.3118 | 1850 | 0.0379 |
| 0.3134 | 1860 | 0.0252 |
| 0.3151 | 1870 | 0.0424 |
| 0.3168 | 1880 | 0.0367 |
| 0.3185 | 1890 | 0.0226 |
| 0.3202 | 1900 | 0.0319 |
| 0.3219 | 1910 | 0.0189 |
| 0.3236 | 1920 | 0.0219 |
| 0.3252 | 1930 | 0.0341 |
| 0.3269 | 1940 | 0.0505 |
| 0.3286 | 1950 | 0.0176 |
| 0.3303 | 1960 | 0.0328 |
| 0.3320 | 1970 | 0.0276 |
| 0.3337 | 1980 | 0.0251 |
| 0.3354 | 1990 | 0.0603 |
| 0.3370 | 2000 | 0.0243 |
| 0.3387 | 2010 | 0.0316 |
| 0.3404 | 2020 | 0.0294 |
| 0.3421 | 2030 | 0.025 |
| 0.3438 | 2040 | 0.0255 |
| 0.3455 | 2050 | 0.0318 |
| 0.3472 | 2060 | 0.025 |
| 0.3488 | 2070 | 0.0273 |
| 0.3505 | 2080 | 0.0338 |
| 0.3522 | 2090 | 0.0299 |
| 0.3539 | 2100 | 0.0275 |
| 0.3556 | 2110 | 0.0184 |
| 0.3573 | 2120 | 0.0244 |
| 0.3589 | 2130 | 0.0432 |
| 0.3606 | 2140 | 0.0325 |
| 0.3623 | 2150 | 0.0525 |
| 0.3640 | 2160 | 0.0329 |
| 0.3657 | 2170 | 0.0236 |
| 0.3674 | 2180 | 0.0309 |
| 0.3691 | 2190 | 0.0195 |
| 0.3707 | 2200 | 0.0318 |
| 0.3724 | 2210 | 0.0229 |
| 0.3741 | 2220 | 0.0312 |
| 0.3758 | 2230 | 0.0186 |
| 0.3775 | 2240 | 0.0231 |
| 0.3792 | 2250 | 0.0262 |
| 0.3809 | 2260 | 0.0287 |
| 0.3825 | 2270 | 0.0299 |
| 0.3842 | 2280 | 0.0302 |
| 0.3859 | 2290 | 0.0281 |
| 0.3876 | 2300 | 0.0252 |
| 0.3893 | 2310 | 0.0362 |
| 0.3910 | 2320 | 0.0266 |
| 0.3927 | 2330 | 0.0304 |
| 0.3943 | 2340 | 0.0259 |
| 0.3960 | 2350 | 0.0276 |
| 0.3977 | 2360 | 0.0219 |
| 0.3994 | 2370 | 0.0361 |
| 0.3997 | 2372 | - |
Citation (BibTeX):

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

```bibtex
@misc{PyLate,
    title = {PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author = {Chaffin, Antoine and Sourty, Raphaël},
    url = {https://github.com/lightonai/pylate},
    year = {2024},
}
```

```bibtex
@article{jina-colbert-v2,
    title = {Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever},
    author = {Rohan Jha and Bo Wang and Michael Günther and Saba Sturua and Mohammad Kalim Akram and Han Xiao},
    journal = {arXiv preprint arXiv:2408.16672},
    year = {2024},
    url = {https://arxiv.org/abs/2408.16672},
}
```