How to use steerrec/yuan-embedding-2.0-zh-query-note with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("steerrec/yuan-embedding-2.0-zh-query-note")
sentences = [
"小狗憋不住尿",
"狗憋不住尿是什么问题\n1、发育不完善\n\n如果是年龄较小的狗狗,由于排尿中枢发育不完善,就有可能会出现憋不住尿的情况。不过这种情况一般随着狗狗年龄的增长,憋尿时间也会得到进一步延长,所以主人不用太过担心。\n\n\t\n\n\n \n\n2、喝太多水\n\n如果狗狗近期喝了太多的水,或者吃了含水量较多的食物 ,狗狗的膀胱就会时刻饱满充盈着尿液,从而法控制尿液。这种情况也是正常的,只要狗狗正常吃喝,精神状态好,就不用太过紧张。\n\n\t\n\n\n \n\n3、受到惊吓\n\n有一些胆子比较小的狗狗,在受到惊吓之后也会出现憋不住尿或者漏尿的情况。这种情况属于机体正常的生理反馈,一般在狗狗惊吓过后就可以自行恢复,不需要特殊的处理,但主人平时也要尽量避免这种情况发生。\n\n\t\n\n\n \n\n4、年龄大\n\n随着狗狗的年龄不断增长,狗狗的神经系统的反应能力也会降低,这时狗狗对于自己尿液的控制能力也会下降,所以就会出现憋不住尿的情况。这种情况几乎没有很好的解决办法,可以尝试性进行针灸或激光理疗,但效果难以保证。\n\n \n\n\t\n\n\n5、患有泌尿系统疾病\n\n如果狗狗除了出现憋不住尿的问题之外,还伴有尿频、尿痛、尿淋漓等症状,那就很可能是患有泌尿系统疾病,比如尿道炎、膀胱炎、尿结石等等。由于结石和炎症长时间刺激泌尿系统相关区域的粘膜,进而导致狗狗无法长时间憋尿,出现尿频的情况。这种问题十分严重,建议主人一旦发现就及时带狗狗到宠物医院治疗。 #新手养狗[话题]# #养狗经验分享[话题]# #养狗攻略[话题]# ",
"离得肝病的人远点,别被传染?\n离得肝病的人远点,别被传染?\n\n我们要知道并不是所有的疾病都具有传染性,肝病也是这样。肝病有很多种,常见的有甲肝、乙肝、丙肝这种病毒感染导致的病毒性肝炎,也有脂肪肝酒精肝这种不良生活习惯导致的肝病,还有一些自身免疫性的肝病等等。但是具有传染性的就只有病毒性肝炎,其他的是不具有传染性的。现在传染性最强也最广泛的就是乙肝,目前国内80%肝病患者都是因为感染了乙肝病毒,现阶段医疗水平还没有达到能完全消灭乙肝的水平,所以乙肝病毒的攻克依然是一个严峻的问题。对于肝病患者来说我们要注意防护,注意日常生活饮食,尽可能地让我们的身体处于一个较好的水平,等待新技术,新药的应用。对于健康人群,我们要做好日常的防护,碗筷消毒,不共用毛巾,水杯等生活用品,我们也不要去歧视和贬低肝病患者,己所不欲勿施于人! #山东[话题]# #肝病[话题]# #健康科普[话题]# \n\n",
"快抄作业✍️结婚必拍爆🔥的喜嫁风婚纱照\n喜嫁婚纱照|新中式婚纱照|婚纱照风格\n📸拍摄:@玛萨婚纱照 \n-\n小红书热拍🔥喜嫁婚纱照合集\n各种类型的喜嫁婚纱照\n不挑人 i人e人轻松驾驭🤏\n东方新娘结婚必拍📣\n-\n主纱 新中式 复古 俏皮都有的喜嫁婚纱照\n🔔满足各种风格的新人们\n✅建议备婚的收藏哦~\n-\n25年的新娘们备婚不走弯路🤫🤫\n必选好看不踩雷的喜嫁婚纱照\n百搭且特别实用👋\n-\n更多类型婚纱照戳主页哦‼️\n咨询拍摄🉑戳@玛萨婚纱照 \n抢先获取 >>惊喜优惠[红包R]\n[打卡R]了解套餐/预约进店 评论🔢\n-\n#芜湖婚纱照[话题]#|#芜湖婚纱照推荐[话题]#|#婚纱摄影[话题]#\n#婚纱照风格[话题]#|#芜湖婚纱摄影[话题]#|#喜嫁婚纱照[话题]#\n#新中式婚纱照[话题]#|#汉服婚纱照[话题]#|#复古婚纱照[话题]#"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from IEITYuan/Yuan-embedding-2.0-zh. It maps sentences and paragraphs to a 1792-dimensional dense vector space and can be used for retrieval.
SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'BertModel'})
  (1): Pooling({'embedding_dimension': 1024, 'pooling_mode': 'mean', 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 1792, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity', 'module_input_name': 'sentence_embedding', 'module_output_name': 'sentence_embedding'})
)
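The module list above reads as a three-stage pipeline: token embeddings from the BERT encoder, mean pooling over tokens, then a linear projection with Identity activation from 1024 to 1792 dimensions. The following is a minimal sketch of the last two stages only, using toy sizes (3 token positions, 4 to 6 dimensions) and made-up weights. The names `mean_pool` and `dense` are illustrative and are not library functions.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def dense(vector, weight, bias):
    """Linear layer with Identity activation: y = W x + b."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


# Toy stand-ins: 3 token positions (the last one is padding), hidden size 4.
token_embeddings = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, ignored through the mask
]
attention_mask = [1, 1, 0]

# Hypothetical 6 x 4 weight matrix standing in for the real 1792 x 1024 one.
weight = [[1.0 if i % 4 == j else 0.0 for j in range(4)] for i in range(6)]
bias = [0.5] * 6

pooled = mean_pool(token_embeddings, attention_mask)
sentence_embedding = dense(pooled, weight, bias)
print(pooled)
print(len(sentence_embedding))
```

Because the Dense layer sits after pooling, the output dimension (1792) can differ from the encoder's hidden size (1024).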
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("steerrec/yuan-embedding-2.0-zh-query-note")
# Run inference
queries = [
'上学文具分享',
]
documents = [
'准初三生的书包里有啥😉👉🏻💗\n都是一些很真实的东西哈哈哈 \n我就问有谁懂…?\n#笔袋[话题]# #笔袋介绍[话题]# #我的文具分享[话题]# \n#晒晒我的书桌[话题]# #我的日常[话题]# \n#whatsinmybag[话题]# #书包里面装什么[话题]# \n#书包[话题]# ',
'下辈子我也要当仓鼠\n傻傻的胖胖的不知道悲伤……#珍藏的宠物照[话题]# #侏儒仓鼠[话题]# #鼠鼠教[话题]# #宠物[话题]# #宠物日常[话题]# ',
'云南丽江~ 束河古镇风景(上)\n拍于2023年11.19哦~\n从白沙坐公交去的束河古镇\n进门的时候忘记拍牌楼了[笑哭R]\n刚进门不久走到了茶马古道博物馆但是是关闭状态的\n不走人多的地方风景很不错\n图上那只松鼠的表情真的笑死\n看起来比香格里拉和雨崩的可爱\n建议沿着水流走 风景不会太差\n#云南[话题]# #云南游[话题]# #云南丽江[话题]# #束河古镇[话题]# ',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 1792] [3, 1792]
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.6374, 0.0472, -0.0356]])
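The scores above can be read as cosine similarities between the query vector and each document vector, assuming the model's similarity function is cosine (consistent with the `cosine_*` metrics reported in this card). A small sketch with made-up 3-dimensional vectors shows how such scores turn into a ranking; `cosine_similarity` and `rank_documents` are illustrative helpers, not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by descending similarity, plus raw scores."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Made-up vectors; real embeddings from this model have 1792 dimensions.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [[2.0, 0.0, 2.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]

order, scores = rank_documents(query_vec, doc_vecs)
print(order)
print([round(s, 4) for s in scores])
```

Cosine similarity ignores vector length, so the first toy document scores highest even though it is twice as long as the query vector.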
Datasets: query_note_test and query_note_val
Evaluated with InformationRetrievalEvaluator

| Metric | query_note_test | query_note_val |
|---|---|---|
| cosine_accuracy@1 | 0.2839 | 0.3002 |
| cosine_accuracy@3 | 0.5495 | 0.5971 |
| cosine_accuracy@5 | 0.6792 | 0.7274 |
| cosine_accuracy@10 | 0.8055 | 0.8523 |
| cosine_precision@1 | 0.2839 | 0.3002 |
| cosine_precision@3 | 0.2471 | 0.2702 |
| cosine_precision@5 | 0.2212 | 0.2335 |
| cosine_precision@10 | 0.1722 | 0.1784 |
| cosine_recall@1 | 0.111 | 0.1289 |
| cosine_recall@3 | 0.2748 | 0.3301 |
| cosine_recall@5 | 0.3833 | 0.4495 |
| cosine_recall@10 | 0.5335 | 0.6154 |
| cosine_ndcg@10 | 0.4031 | 0.4517 |
| cosine_ndcg@100 | 0.5231 | 0.5634 |
| cosine_mrr@10 | 0.4463 | 0.475 |
| cosine_mrr@100 | 0.4544 | 0.4818 |
| cosine_map@10 | 0.2898 | 0.334 |
| cosine_map@100 | 0.3372 | 0.3809 |
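To make the metric names in the table concrete, here is a sketch of how accuracy, precision, recall, and MRR at a cutoff k are conventionally computed for a single query. The note IDs and relevance set are invented for illustration, and the functions follow the textbook definitions rather than the evaluator's source code.

```python
def accuracy_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k that is relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant item within the top k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking for one query; three notes are relevant in total.
ranked = ["n7", "n2", "n9", "n4", "n1"]
relevant = {"n2", "n4", "n8"}

print(accuracy_at_k(ranked, relevant, 1), accuracy_at_k(ranked, relevant, 3))
print(precision_at_k(ranked, relevant, 5), recall_at_k(ranked, relevant, 5))
print(mrr_at_k(ranked, relevant, 5))
```

Under these definitions accuracy@1 and precision@1 are always equal, which matches the table (0.2839 for both on the test set). Recall@1 being much lower than precision@1 suggests that many queries have more than one relevant note.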
Datasets: query_note_test_256 and query_note_val_256
Evaluated with InformationRetrievalEvaluator with these parameters:
{
  "truncate_dim": 256
}

| Metric | query_note_test_256 | query_note_val_256 |
|---|---|---|
| cosine_accuracy@1 | 0.2763 | 0.2939 |
| cosine_accuracy@3 | 0.5444 | 0.5918 |
| cosine_accuracy@5 | 0.6674 | 0.7245 |
| cosine_accuracy@10 | 0.8007 | 0.8479 |
| cosine_precision@1 | 0.2763 | 0.2939 |
| cosine_precision@3 | 0.2432 | 0.2667 |
| cosine_precision@5 | 0.2151 | 0.2339 |
| cosine_precision@10 | 0.1695 | 0.1766 |
| cosine_recall@1 | 0.1077 | 0.1255 |
| cosine_recall@3 | 0.2702 | 0.3267 |
| cosine_recall@5 | 0.3715 | 0.4489 |
| cosine_recall@10 | 0.5229 | 0.607 |
| cosine_ndcg@10 | 0.3951 | 0.4467 |
| cosine_ndcg@100 | 0.5137 | 0.5581 |
| cosine_mrr@10 | 0.4397 | 0.4712 |
| cosine_mrr@100 | 0.4477 | 0.4781 |
| cosine_map@10 | 0.2826 | 0.3295 |
| cosine_map@100 | 0.3287 | 0.376 |
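The `_256` results above come from embeddings truncated to their first 256 dimensions. A minimal sketch of that truncation, assuming the usual Matryoshka recipe of keeping the leading dimensions and re-normalizing to unit length; the 6-dimensional vector and the cutoff of 2 are toy stand-ins for 1792 and 256, and `truncate_and_normalize` is an illustrative helper.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values and rescale them to unit L2 norm."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.5, -0.5, 1.0]  # toy stand-in for a 1792-dim embedding
short = truncate_and_normalize(full, 2)  # toy stand-in for truncate_dim=256

print(short)
print(math.sqrt(sum(x * x for x in short)))
```

With the library itself, the same effect should be obtainable by passing `truncate_dim=256` when constructing `SentenceTransformer`. Going from 1792 to 256 dimensions shrinks each vector by a factor of 7, while test ndcg@100 in the tables above moves from 0.5231 to 0.5137.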
Columns: anchor and positive

| | anchor | positive |
|---|---|---|
| type | string | string |
| modality | text | text |
| details | | |

Samples:

| anchor | positive |
|---|---|
| jeep冲锋衣属于档次 | 🧥吉普冲锋衣,秋季必备🍂 |
| 喝酒文案微醺 | 秋冬时刻需要威士忌 \| 见面吗?! |
| 长沙酒吧 | 长沙一定要去的6️⃣大夜店!!! |
MatryoshkaLoss with these parameters:
{
  "loss": "MultipleNegativesRankingLoss",
  "matryoshka_dims": [
    1024,
    256
  ],
  "matryoshka_weights": [
    1,
    1
  ],
  "n_dims_per_step": -1
}
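The configuration above combines an in-batch-negatives ranking loss with Matryoshka truncation: the inner loss is evaluated once per listed dimension on the truncated embeddings, and the weighted results are summed. The sketch below illustrates that idea in plain Python and is not the library's implementation. It assumes cosine similarity with a scale of 20 (the card does not list these values), and it uses toy 4-dimensional vectors with `dims=[4, 2]` standing in for `[1024, 256]`.

```python
import math


def normalize(v):
    """Rescale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy where anchor i must rank positive i above the other positives."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, av in enumerate(a):
        logits = [scale * sum(x * y for x, y in zip(av, pv)) for pv in p]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the inner loss over truncated copies of the embeddings."""
    loss = 0.0
    for dim, w in zip(dims, weights):
        loss += w * mnr_loss([v[:dim] for v in anchors], [v[:dim] for v in positives])
    return loss


# Toy batch of three (anchor, positive) pairs; values are invented.
anchors = [[1.0, 0.1, 0.2, 0.0], [0.1, 1.0, 0.0, 0.2], [-1.0, -0.9, 0.1, 0.1]]
positives = [[0.9, 0.0, 0.3, 0.1], [0.0, 0.8, 0.1, 0.3], [-0.8, -1.0, 0.0, 0.2]]

aligned = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
shuffled = matryoshka_loss(anchors, positives[1:] + positives[:1], dims=[4, 2], weights=[1, 1])
print(aligned < shuffled)
```

Correctly paired batches give a much lower loss than mismatched ones at both the full and the truncated size, which is what pushes the leading dimensions to carry most of the ranking signal.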
Non-Default Hyperparameters

per_device_train_batch_size: 256
learning_rate: 0.0001
weight_decay: 0.01
num_train_epochs: 1
lr_scheduler_type: cosine
warmup_ratio: 0.05
seed: 3407
bf16: True
batch_sampler: no_duplicates

All Hyperparameters

overwrite_output_dir: False
do_predict: False
prediction_loss_only: True
per_device_train_batch_size: 256
per_device_eval_batch_size: 8
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 0.0001
weight_decay: 0.01
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: cosine
lr_scheduler_kwargs: {}
warmup_ratio: 0.05
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 3407
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: True
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
router_mapping: {}
learning_rate_mapping: {}

| Epoch | Step | Training Loss | query_note_test_cosine_ndcg@100 | query_note_test_256_cosine_ndcg@100 | query_note_val_cosine_ndcg@100 | query_note_val_256_cosine_ndcg@100 |
|---|---|---|---|---|---|---|
| -1 | -1 | - | 0.4211 | 0.4014 | - | - |
| 0.0776 | 50 | 1.8314 | - | - | - | - |
| 0.1553 | 100 | 0.7712 | - | - | - | - |
| 0.2329 | 150 | 0.7237 | - | - | - | - |
| 0.3106 | 200 | 0.7116 | - | - | - | - |
| 0.3882 | 250 | 0.6376 | - | - | - | - |
| 0.4658 | 300 | 0.6284 | - | - | - | - |
| 0.5435 | 350 | 0.6425 | - | - | - | - |
| 0.6211 | 400 | 0.6198 | - | - | - | - |
| 0.6988 | 450 | 0.6409 | - | - | - | - |
| 0.7764 | 500 | 0.5954 | - | - | - | - |
| 0.8540 | 550 | 0.5891 | - | - | - | - |
| 0.9317 | 600 | 0.5833 | - | - | - | - |
| 1.0 | 644 | - | - | - | 0.5634 | 0.5581 |
| -1 | -1 | - | 0.5231 | 0.5137 | - | - |
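The run above spans 644 optimizer steps with `learning_rate: 0.0001`, `lr_scheduler_type: cosine`, and `warmup_ratio: 0.05`. As a rough sketch of what that schedule looks like, assuming the conventional shape of linear warmup followed by half-cosine decay to zero, and assuming warmup steps are rounded up (both are assumptions, not taken from this card), with `cosine_lr` as an illustrative helper:

```python
import math


def cosine_lr(step, base_lr=1e-4, total_steps=644, warmup_ratio=0.05):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # rounding is an assumption
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 33, 300, 644):
    print(step, cosine_lr(step))
```

Under these assumptions the rate peaks at 0.0001 around step 33 and decays smoothly to zero by step 644, so the later rows of the training log were produced at progressively smaller learning rates.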
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{oord2019representationlearningcontrastivepredictive,
title={Representation Learning with Contrastive Predictive Coding},
author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
year={2019},
eprint={1807.03748},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1807.03748},
}
Base model: IEITYuan/Yuan-embedding-2.0-zh