dragonkue/colbert-ko-0.1b

This is a PyLate model finetuned from skt/A.X-Encoder-base on the parquet dataset. It maps sentences and paragraphs to sequences of dense token vectors (32, 64, 96, or 128 dimensions) and can be used for semantic textual similarity via the MaxSim operator.

Model Details

Model Description

  • Model Type: PyLate model
  • Base model: skt/A.X-Encoder-base
  • Document Length: 2048 tokens
  • Query Length: 32 tokens
  • Output Dimensionality: 128 dimensions (32, 64, 96 also available via Matryoshka heads)
  • Similarity Function: MaxSim
  • Training Dataset:
    • parquet

Model Sources

  • Repository: https://github.com/lightonai/pylate (PyLate)

Full Model Architecture

ColBERTWrapper(
  (0): Transformer({'max_seq_length': 2047, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
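
The MaxSim similarity above scores a query against a document by taking, for each query token vector, the maximum dot product with any document token vector, then summing over query tokens. A minimal sketch in PyTorch (illustrative only; PyLate computes this internally):

import torch

def maxsim(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> torch.Tensor:
    # query_embeddings: (num_query_tokens, dim); doc_embeddings: (num_doc_tokens, dim)
    similarities = query_embeddings @ doc_embeddings.T  # (num_query_tokens, num_doc_tokens)
    # Best-matching document token per query token, summed over the query
    return similarities.max(dim=1).values.sum()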

Usage

This model supports Matryoshka embeddings at four dimensions (32, 64, 96, 128), each produced by its own projection head rather than by truncation (Jina-ColBERT-v2 style).

Installation

pip install colbert-matryoshka

Quick Start (Matryoshka)

from colbert_matryoshka import MatryoshkaColBERT

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")

# Set embedding dimension (32, 64, 96, or 128)
model.set_active_dim(128)

# Encode queries and documents
query_embeddings = model.encode(["검색 쿼리"], is_query=True)   # "search query"
doc_embeddings = model.encode(["문서 내용"], is_query=False)    # "document content"

print(f"Query shape: {query_embeddings[0].shape}")  # (num_tokens, 128)
print(f"Doc shape: {doc_embeddings[0].shape}")      # (num_tokens, 128)

Retrieval (PyLate Index)

Use this model with PyLate to index and retrieve documents. The index uses FastPLAID for efficient similarity search.

from colbert_matryoshka import MatryoshkaColBERT
from pylate import indexes, retrieve

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)

# Initialize PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,
)

# Encode and index documents
documents_ids = ["1", "2", "3"]
documents = ["첫번째 문서입니다", "두번째 문서입니다", "세번째 문서입니다"]

documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

# Retrieve
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["첫번째 문서 검색"], is_query=True)  # "search for the first document"

scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=3,
)
print(scores)
# [[{'id': '1', 'score': 24.51}, {'id': '2', 'score': 23.54}, {'id': '3', 'score': 23.33}]]
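
To query an existing index in a later session, reload it by pointing at the same folder and name with override=False instead of rebuilding it (a sketch, assuming the index built above):

from colbert_matryoshka import MatryoshkaColBERT
from pylate import indexes, retrieve

model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)

# Reopen the existing index; override=False keeps its contents
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=False,
)
retriever = retrieve.ColBERT(index=index)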

Reranking

from colbert_matryoshka import MatryoshkaColBERT
from pylate import rank

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)

queries = ["인공지능 기술", "한국어 자연어처리"]

documents = [
    ["AI와 머신러닝에 대한 문서", "요리 레시피 문서"],  # a doc about AI and ML, a cooking recipe doc
    ["한국어 NLP 연구", "영어 문법 설명", "프로그래밍 튜토리얼"],  # Korean NLP research, English grammar, a programming tutorial
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Encode queries
queries_embeddings = model.encode(queries, is_query=True)

# Encode documents (per query)
documents_embeddings = []
for docs in documents:
    documents_embeddings.append(model.encode(docs, is_query=False))

# Rerank
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked_documents)
# Query "인공지능 기술": [{'id': 1, 'score': 3.63}, {'id': 2, 'score': 0.90}]
# Query "한국어 자연어처리": [{'id': 1, 'score': 4.60}, {'id': 3, 'score': 3.88}, {'id': 2, 'score': 1.93}]

Evaluation Results (NDCG@10)

Comparison with Other Models (dim128)

Model                              AutoRAG   Ko-StrategyQA   NanoBEIR-Ko   Avg
dragonkue/colbert-ko-0.1b (149M)   0.989     0.741           0.519         0.750
BGE-M3-MultiVec (568M)             0.844     0.797           0.569         0.737
LFM2-ColBERT (353M)                0.833     0.757           0.528         0.706
colbert-ko-v1 (149M)               0.966     0.713           0.476         0.718

Performance by Embedding Dimension

Dimension   AutoRAG   Ko-StrategyQA   NanoBEIR-Ko
32          0.983     0.721           0.504
64          0.985     0.728           0.510
96          0.979     0.736           0.517
128         0.989     0.741           0.519
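
Going from 128 down to 32 dimensions costs roughly one to two NDCG@10 points while shrinking per-token storage by 4x. A back-of-the-envelope sketch (fp16 vectors, hypothetical corpus of 10M tokens):

# Per-token embedding storage at each Matryoshka dimension (fp16 = 2 bytes/value)
for dim in [32, 64, 96, 128]:
    bytes_per_token = dim * 2
    total_gb = bytes_per_token * 10_000_000 / 1e9
    print(f"dim={dim}: {bytes_per_token} B/token, ~{total_gb:.2f} GB for 10M tokens")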
Training Loss

  • Loss: src.losses.MatryoshkaColBERTLoss with these parameters:
    {
        "dims": [32, 64, 96, 128],
        "weights": [0.25, 0.25, 0.25, 0.25],
        "temperature": 1.0
    }
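
src.losses.MatryoshkaColBERTLoss is project-internal, but the parameters above indicate its shape: an equally weighted sum of a contrastive ColBERT loss evaluated at each Matryoshka dimension. A hedged sketch (scores_per_dim and its layout are assumptions, not the actual implementation):

import torch
import torch.nn.functional as F

def matryoshka_colbert_loss(scores_per_dim, labels,
                            dims=(32, 64, 96, 128),
                            weights=(0.25, 0.25, 0.25, 0.25),
                            temperature=1.0):
    # scores_per_dim[d]: (batch, num_candidates) MaxSim scores from the d-dim head;
    # labels: index of the positive document for each query in the batch
    total = 0.0
    for dim, weight in zip(dims, weights):
        total = total + weight * F.cross_entropy(scores_per_dim[dim] / temperature, labels)
    return total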
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • learning_rate: 1e-05
  • num_train_epochs: 2
  • warmup_ratio: 0.1
  • fp16: True
  • dataloader_drop_last: True
  • gradient_checkpointing: True
  • gradient_checkpointing_kwargs: {'use_reentrant': False}
  • router_mapping: {'anchor': 'query', 'positive': 'document', 'neg_0': 'document'}

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: True
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: True
  • gradient_checkpointing_kwargs: {'use_reentrant': False}
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {'anchor': 'query', 'positive': 'document', 'neg_0': 'document'}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss
0.0017 10 4.1388
0.0034 20 4.1142
0.0051 30 3.9797
0.0067 40 3.8761
0.0084 50 3.6167
0.0101 60 3.424
0.0118 70 3.0256
0.0135 80 2.827
0.0152 90 2.5787
0.0169 100 2.2696
0.0185 110 2.0266
0.0202 120 1.6815
0.0219 130 1.4739
0.0236 140 1.2877
0.0253 150 1.1474
0.0270 160 1.0143
0.0286 170 0.9363
0.0303 180 0.9189
0.0320 190 0.7442
0.0337 200 0.6919
0.0354 210 0.6251
0.0371 220 0.6527
0.0388 230 0.5923
0.0404 240 0.572
0.0421 250 0.5255
0.0438 260 0.4407
0.0455 270 0.5038
0.0472 280 0.3939
0.0489 290 0.3938
0.0506 300 0.3253
0.0522 310 0.335
0.0539 320 0.2855
0.0556 330 0.2396
0.0573 340 0.252
0.0590 350 0.2299
0.0607 360 0.2133
0.0624 370 0.2186
0.0640 380 0.1935
0.0657 390 0.1743
0.0674 400 0.1462
0.0691 410 0.1552
0.0708 420 0.1491
0.0725 430 0.1581
0.0741 440 0.1635
0.0758 450 0.1383
0.0775 460 0.1377
0.0792 470 0.1155
0.0809 480 0.1184
0.0826 490 0.1333
0.0843 500 0.1341
0.0859 510 0.1259
0.0876 520 0.0748
0.0893 530 0.1342
0.0910 540 0.1058
0.0927 550 0.1024
0.0944 560 0.0921
0.0961 570 0.104
0.0977 580 0.1069
0.0994 590 0.0925
0.1011 600 0.1146
0.1028 610 0.0682
0.1045 620 0.0711
0.1062 630 0.1491
0.1079 640 0.0602
0.1095 650 0.0753
0.1112 660 0.0713
0.1129 670 0.0739
0.1146 680 0.0783
0.1163 690 0.0678
0.1180 700 0.0963
0.1196 710 0.0677
0.1213 720 0.0829
0.1230 730 0.0719
0.1247 740 0.0646
0.1264 750 0.0927
0.1281 760 0.0755
0.1298 770 0.0799
0.1314 780 0.0535
0.1331 790 0.0555
0.1348 800 0.0804
0.1365 810 0.0627
0.1382 820 0.0726
0.1399 830 0.0685
0.1416 840 0.0421
0.1432 850 0.0895
0.1449 860 0.0964
0.1466 870 0.0515
0.1483 880 0.0825
0.1500 890 0.0801
0.1517 900 0.0579
0.1534 910 0.0559
0.1550 920 0.0432
0.1567 930 0.0553
0.1584 940 0.0577
0.1601 950 0.0451
0.1618 960 0.049
0.1635 970 0.0459
0.1651 980 0.0684
0.1668 990 0.0449
0.1685 1000 0.0392
0.1702 1010 0.071
0.1719 1020 0.0511
0.1736 1030 0.0501
0.1753 1040 0.0464
0.1769 1050 0.0678
0.1786 1060 0.0597
0.1803 1070 0.0569
0.1820 1080 0.044
0.1837 1090 0.0452
0.1854 1100 0.0394
0.1871 1110 0.0496
0.1887 1120 0.0296
0.1904 1130 0.0321
0.1921 1140 0.0525
0.1938 1150 0.058
0.1955 1160 0.0552
0.1972 1170 0.035
0.1989 1180 0.0468
0.1999 1186 -
0.2005 1190 0.0383
0.2022 1200 0.0599
0.2039 1210 0.0572
0.2056 1220 0.0383
0.2073 1230 0.0486
0.2090 1240 0.0407
0.2107 1250 0.044
0.2123 1260 0.04
0.2140 1270 0.0338
0.2157 1280 0.036
0.2174 1290 0.0511
0.2191 1300 0.0472
0.2208 1310 0.031
0.2224 1320 0.0614
0.2241 1330 0.0388
0.2258 1340 0.0403
0.2275 1350 0.047
0.2292 1360 0.033
0.2309 1370 0.0524
0.2326 1380 0.0357
0.2342 1390 0.0463
0.2359 1400 0.0355
0.2376 1410 0.0411
0.2393 1420 0.028
0.2410 1430 0.0386
0.2427 1440 0.0553
0.2444 1450 0.0353
0.2460 1460 0.0462
0.2477 1470 0.0399
0.2494 1480 0.0319
0.2511 1490 0.0456
0.2528 1500 0.0302
0.2545 1510 0.0366
0.2562 1520 0.0409
0.2578 1530 0.0337
0.2595 1540 0.0362
0.2612 1550 0.0318
0.2629 1560 0.0433
0.2646 1570 0.0379
0.2663 1580 0.0419
0.2679 1590 0.0225
0.2696 1600 0.0269
0.2713 1610 0.0295
0.2730 1620 0.048
0.2747 1630 0.0382
0.2764 1640 0.0341
0.2781 1650 0.0334
0.2797 1660 0.0534
0.2814 1670 0.0445
0.2831 1680 0.0284
0.2848 1690 0.0327
0.2865 1700 0.0309
0.2882 1710 0.0372
0.2899 1720 0.0384
0.2915 1730 0.022
0.2932 1740 0.0266
0.2949 1750 0.0399
0.2966 1760 0.0342
0.2983 1770 0.0391
0.3000 1780 0.0349
0.3017 1790 0.0365
0.3033 1800 0.0322
0.3050 1810 0.0414
0.3067 1820 0.0297
0.3084 1830 0.0446
0.3101 1840 0.0312
0.3118 1850 0.0379
0.3134 1860 0.0252
0.3151 1870 0.0424
0.3168 1880 0.0367
0.3185 1890 0.0226
0.3202 1900 0.0319
0.3219 1910 0.0189
0.3236 1920 0.0219
0.3252 1930 0.0341
0.3269 1940 0.0505
0.3286 1950 0.0176
0.3303 1960 0.0328
0.3320 1970 0.0276
0.3337 1980 0.0251
0.3354 1990 0.0603
0.3370 2000 0.0243
0.3387 2010 0.0316
0.3404 2020 0.0294
0.3421 2030 0.025
0.3438 2040 0.0255
0.3455 2050 0.0318
0.3472 2060 0.025
0.3488 2070 0.0273
0.3505 2080 0.0338
0.3522 2090 0.0299
0.3539 2100 0.0275
0.3556 2110 0.0184
0.3573 2120 0.0244
0.3589 2130 0.0432
0.3606 2140 0.0325
0.3623 2150 0.0525
0.3640 2160 0.0329
0.3657 2170 0.0236
0.3674 2180 0.0309
0.3691 2190 0.0195
0.3707 2200 0.0318
0.3724 2210 0.0229
0.3741 2220 0.0312
0.3758 2230 0.0186
0.3775 2240 0.0231
0.3792 2250 0.0262
0.3809 2260 0.0287
0.3825 2270 0.0299
0.3842 2280 0.0302
0.3859 2290 0.0281
0.3876 2300 0.0252
0.3893 2310 0.0362
0.3910 2320 0.0266
0.3927 2330 0.0304
0.3943 2340 0.0259
0.3960 2350 0.0276
0.3977 2360 0.0219
0.3994 2370 0.0361
0.3997 2372 -

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 5.1.1
  • PyLate: 1.3.4
  • Transformers: 4.56.2
  • PyTorch: 2.8.0+cu128
  • Accelerate: 1.12.0
  • Datasets: 4.4.1
  • Tokenizers: 0.22.2-rc0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}

PyLate

@misc{PyLate,
    title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author={Chaffin, Antoine and Sourty, Raphaël},
    url={https://github.com/lightonai/pylate},
    year={2024}
}

Jina ColBERT v2

@article{jina-colbert-v2,
    title={Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever},
    author={Rohan Jha and Bo Wang and Michael Günther and Saba Sturua and Mohammad Kalim Akram and Han Xiao},
    year={2024},
    journal={arXiv preprint arXiv:2408.16672},
    url={https://arxiv.org/abs/2408.16672}
}