SentenceTransformer based on IEITYuan/Yuan-embedding-2.0-zh

This is a sentence-transformers model finetuned from IEITYuan/Yuan-embedding-2.0-zh. It maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for retrieval.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: IEITYuan/Yuan-embedding-2.0-zh
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1792 dimensions
  • Similarity Function: Cosine Similarity
  • Supported Modality: Text

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'BertModel'})
  (1): Pooling({'embedding_dimension': 1024, 'pooling_mode': 'mean', 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 1792, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity', 'module_input_name': 'sentence_embedding', 'module_output_name': 'sentence_embedding'})
)
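
The three modules above can be read as a pipeline: token embeddings from the transformer are mean-pooled into one vector, which a bias-enabled linear layer with identity activation then projects from 1024 to 1792 dimensions. The following is a minimal, dependency-free sketch of that data flow with toy sizes. The helper names `mean_pool` and `dense` are illustrative and not part of the library, and the weights are made up.

```python
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def dense(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Linear layer with identity activation: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


# Toy sizes (assumption): 3 tokens, the last one padding; hidden size 2; output size 3.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)  # padding token does not contribute

weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up projection matrix
bias = [0.0, 0.0, 0.5]
sentence_embedding = dense(pooled, weight, bias)
print(sentence_embedding)  # [2.0, 3.0, 5.5]
```

In the real model the same steps run on batched tensors, with a hidden size of 1024 and an output size of 1792.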

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("steerrec/yuan-embedding-2.0-zh-query-note")
# Run inference
queries = [
    '上学文具分享',
]
documents = [
    '准初三生的书包里有啥😉👉🏻💗\n都是一些很真实的东西哈哈哈 \n我就问有谁懂…?\n#笔袋[话题]# #笔袋介绍[话题]# #我的文具分享[话题]# \n#晒晒我的书桌[话题]# #我的日常[话题]# \n#whatsinmybag[话题]# #书包里面装什么[话题]# \n#书包[话题]# ',
    '下辈子我也要当仓鼠\n傻傻的胖胖的不知道悲伤……#珍藏的宠物照[话题]# #侏儒仓鼠[话题]# #鼠鼠教[话题]# #宠物[话题]# #宠物日常[话题]# ',
    '云南丽江~ 束河古镇风景(上)\n拍于2023年11.19哦~\n从白沙坐公交去的束河古镇\n进门的时候忘记拍牌楼了[笑哭R]\n刚进门不久走到了茶马古道博物馆但是是关闭状态的\n不走人多的地方风景很不错\n图上那只松鼠的表情真的笑死\n看起来比香格里拉和雨崩的可爱\n建议沿着水流走 风景不会太差\n#云南[话题]# #云南游[话题]# #云南丽江[话题]# #束河古镇[话题]# ',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 1792) (3, 1792)

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.6374,  0.0472, -0.0356]])
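
Since the similarity function for this model is cosine similarity, each score is the dot product of two embeddings divided by the product of their norms. The sketch below, which uses only the standard library, shows that computation and how the scores turn into a ranking. `cosine_similarity` and `rank_documents` are hypothetical helper names, not library functions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_documents(scores: Sequence[float]) -> List[int]:
    """Return document indices sorted from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Scores copied from the printed tensor in the usage example.
scores = [0.6374, 0.0472, -0.0356]
print(rank_documents(scores))  # [0, 1, 2]
```

The first document, a post about what is in a student's school bag, ranks highest for the stationery query, while the hamster and travel posts score near zero.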

Evaluation

Metrics

Information Retrieval

Metric query_note_test query_note_val
cosine_accuracy@1 0.2839 0.3002
cosine_accuracy@3 0.5495 0.5971
cosine_accuracy@5 0.6792 0.7274
cosine_accuracy@10 0.8055 0.8523
cosine_precision@1 0.2839 0.3002
cosine_precision@3 0.2471 0.2702
cosine_precision@5 0.2212 0.2335
cosine_precision@10 0.1722 0.1784
cosine_recall@1 0.111 0.1289
cosine_recall@3 0.2748 0.3301
cosine_recall@5 0.3833 0.4495
cosine_recall@10 0.5335 0.6154
cosine_ndcg@10 0.4031 0.4517
cosine_ndcg@100 0.5231 0.5634
cosine_mrr@10 0.4463 0.475
cosine_mrr@100 0.4544 0.4818
cosine_map@10 0.2898 0.334
cosine_map@100 0.3372 0.3809
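
To make the metric names concrete, here is a small sketch of the usual per-query definitions of accuracy@k, precision@k, recall@k, and the reciprocal rank used by MRR@k. It assumes binary relevance and is not the evaluator's own code; details such as tie handling may differ. The function name and document IDs are illustrative.

```python
from typing import Dict, List, Set


def retrieval_metrics(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> Dict[str, float]:
    """Compute per-query retrieval metrics over the top-k ranked documents."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant document")
    top_k = ranked_ids[:k]
    hits = [doc_id in relevant_ids for doc_id in top_k]
    num_hits = sum(hits)
    # Rank (1-based) of the first relevant document within the top k, if any.
    first_hit_rank = next((rank for rank, hit in enumerate(hits, start=1) if hit), None)
    return {
        "accuracy": 1.0 if num_hits > 0 else 0.0,   # any relevant doc in top k
        "precision": num_hits / k,                   # share of top k that is relevant
        "recall": num_hits / len(relevant_ids),      # share of relevant docs retrieved
        "mrr": 1.0 / first_hit_rank if first_hit_rank else 0.0,
    }


# Made-up ranking for one query with three relevant notes.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d8"}
metrics_at_3 = retrieval_metrics(ranked, relevant, k=3)
print(metrics_at_3["accuracy"], metrics_at_3["mrr"])  # 1.0 0.5
```

Under these definitions accuracy@1 and precision@1 are the same quantity, which matches the table (0.2839 on query_note_test). Recall@1 is lower than accuracy@1 whenever queries have more than one relevant note, because a single retrieved hit covers only part of the relevant set.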

Information Retrieval

  • Datasets: query_note_test_256 and query_note_val_256
  • Evaluated with InformationRetrievalEvaluator with these parameters:
    {
        "truncate_dim": 256
    }
    
Metric query_note_test_256 query_note_val_256
cosine_accuracy@1 0.2763 0.2939
cosine_accuracy@3 0.5444 0.5918
cosine_accuracy@5 0.6674 0.7245
cosine_accuracy@10 0.8007 0.8479
cosine_precision@1 0.2763 0.2939
cosine_precision@3 0.2432 0.2667
cosine_precision@5 0.2151 0.2339
cosine_precision@10 0.1695 0.1766
cosine_recall@1 0.1077 0.1255
cosine_recall@3 0.2702 0.3267
cosine_recall@5 0.3715 0.4489
cosine_recall@10 0.5229 0.607
cosine_ndcg@10 0.3951 0.4467
cosine_ndcg@100 0.5137 0.5581
cosine_mrr@10 0.4397 0.4712
cosine_mrr@100 0.4477 0.4781
cosine_map@10 0.2826 0.3295
cosine_map@100 0.3287 0.376
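
The `truncate_dim` setting means that only the first 256 components of each embedding are used when scoring. A minimal sketch of that idea follows, assuming truncation is followed by L2 normalization so that a dot product equals cosine similarity. The helper names and the 4-dimensional vectors are made up for illustration.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(embedding: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components, then rescale to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding size")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dimensional embeddings truncated to 2 dimensions.
query = [3.0, 4.0, 0.0, 12.0]
doc = [4.0, 3.0, 12.0, 0.0]
q2 = truncate_and_normalize(query, 2)
d2 = truncate_and_normalize(doc, 2)
print(round(dot(q2, d2), 4))  # 0.96
```

Truncated scores generally differ from full-dimension scores, as in this toy case. For this model the tables show only a small drop at 256 dimensions (for example, cosine_ndcg@10 of 0.3951 versus 0.4031 on the test set), in exchange for much smaller vectors to store and compare.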

Training Details

Training Dataset

Unnamed Dataset

  • Size: 164,633 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 100 samples:
    column type modality min mean max
    anchor string text 3 tokens 8.78 tokens 17 tokens
    positive string text 38 tokens 308.37 tokens 512 tokens
  • Samples:
    anchor positive
    anchor: jeep冲锋衣属于档次
    positive: 🧥吉普冲锋衣,秋季必备🍂
    Jeep吉普官方正品,品质有保障👍。

    卡其色/M码,适合90-110斤,款式时尚百搭。

    防风防水设计,适合户外活动,保暖又舒适😊。

    秋季穿搭必备,你值得拥有! #小红书秋日焕新[话题]# #jeep[话题]# #冲锋衣[话题]# #秋季外套[话题]# #正品保证[话题]# #防风防水[话题]# #户外穿搭[话题]# #时尚百搭[话题]# #品质选择[话题]#
    anchor: 喝酒文案微醺
    positive: 秋冬时刻需要威士忌 | 见面吗?!
    🍽️
    属于我和先生的浪漫约会日常
    便是放下孩子后的小酌怡情
    刚好最近气温骤降,多了些凉意
    想来一杯Whisky的心情达到上限
    🍽️
    听闻格兰菲迪最近和MU·慕有出合作套餐
    与先生的couple vacation时刻瞬间启动

    在轻奢环境下更适合开一瓶格兰菲迪22年单一麦芽苏格兰威士忌
    @格兰菲迪 Glenfiddich
    作为在全球屡获殊荣的单一麦芽苏格兰威士忌品牌
    自诞生起就从未停止探索的脚步
    不断探究威士忌与餐食搭配的新风味
    🍽️
    当杯中深红琥珀色的酒体
    在灯光下衬得更加迷人时
    黑巧克力以及葡萄果干的浓郁香气也同时散发
    象拔蚌和波士顿龙虾的上场
    将视觉的色与口感的香瞬间突出
    薄如刀片的爽脆蚌肉和Q弹龙虾肉
    在酒精的碰撞下太立体太美好

    余韵的温润柑橘果香和馥郁水果蛋糕香
    像是激情生活之后的源远流长
    搭配新西兰羊排嫩多汁的奶香气息
    像是被酒精注入新的灵感和层次
    🤎
    婚后的微醺节奏
    同样能拉近彼此的距离
    大家也可以打卡格兰菲迪合作餐厅哦
    用一杯格兰菲迪杯找寻浪漫~找寻生活吧
    #广州探店[话题]# #福鹿结伴山海寻鲜[话题]# #威士忌推荐[话题]# #格兰菲迪[话题]# #格兰菲迪单一麦芽威士忌[话题]#
    anchor: 长沙酒吧
    positive: 长沙一定要去的6️⃣大夜店!!!
    长沙夜生活 蹦迪💃酒吧 再多的攻略都不及去亲身体验一次!!

    ✅长沙一定要去的夜店❗️❗️
    1️⃣X-sta:长沙最火爆的高空edm场 声光电体验感强 非常具有科技感 总投资1.8亿[赞R] 四五六楼是包厢 娱乐综合体
    2️⃣one9:小厅bounce场 韩国江南风夜店 前身是火爆长沙的NBEK 喜欢小厅的朋友不要错过啦[派对R]
    3️⃣FTF:新开业中型edm/bounce酒吧 位置在海信广场 以前的超级猴子[派对R]
    4️⃣kok:网红爆品小厅 每晚都有男团演出与玩家们互动 人气也一直居高不下[自拍R]
    5️⃣猩猩地堡:连锁品牌 打卡的人很多 长沙年轻人的聚集地
    6️⃣试音:长沙hiphop吧天花板 嘉宾众多气氛火爆 值得打卡

    🌿住宿推荐
    建议住五一广场/开福区万达广场(去哪都方便)

    ⚠️温馨提示:想去长沙夜店玩的朋友记得提前做好攻略,提前找销售预约好门店避免排队!

    希望这些对您有所帮助!
    图片来源于网络 侵删!
    #长沙酒吧[话题]# #长沙酒吧推荐[话题]# #长沙酒吧订台[话题]# #长沙旅游[话题]# #长沙蹦迪[话题]#
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            1024,
            256
        ],
        "matryoshka_weights": [
            1,
            1
        ],
        "n_dims_per_step": -1
    }
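
As a rough illustration of this configuration, the sketch below computes an in-batch ranking loss (each anchor should score its own positive above every other positive in the batch) on embeddings truncated to each listed dimension, then sums the results with the given weights. It is not the library implementation. The scale of 20 and the use of cosine similarity are assumptions about the inner loss's defaults, and the function names and toy vectors are made up.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def in_batch_ranking_loss(anchors: List[Sequence[float]],
                          positives: List[Sequence[float]],
                          scale: float = 20.0) -> float:
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


def matryoshka_loss(anchors: List[Sequence[float]],
                    positives: List[Sequence[float]],
                    dims: Sequence[int],
                    weights: Sequence[float],
                    scale: float = 20.0) -> float:
    """Weighted sum of the ranking loss over embeddings truncated to each dim."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [a[:dim] for a in anchors]
        truncated_positives = [p[:dim] for p in positives]
        total += weight * in_batch_ranking_loss(truncated_anchors, truncated_positives, scale)
    return total


# Toy batch of two pairs with 4-dimensional embeddings, truncated to 4 and 2 dims.
anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
positives = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aligned_loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
swapped_loss = matryoshka_loss(anchors, positives[::-1], dims=[2], weights=[1])
```

With perfectly matched pairs the loss is close to zero, and with the positives swapped it is close to the scale value, which is the behavior a contrastive objective should show. In the configuration above the dimensions would be 1024 and 256 with equal weights.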
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 256
  • learning_rate: 0.0001
  • weight_decay: 0.01
  • num_train_epochs: 1
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.05
  • seed: 3407
  • bf16: True
  • batch_sampler: no_duplicates
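
These settings determine the learning-rate curve. The sketch below assumes a single device, no gradient accumulation, linear warmup, and a half-cosine decay to zero, which is how a cosine schedule with warmup is commonly defined. The function names are illustrative, and rounding the warmup length up is an assumption about the trainer's behavior.

```python
import math


def total_training_steps(num_samples: int, batch_size: int, epochs: int = 1) -> int:
    """Optimizer steps when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size) * epochs


def cosine_lr_with_warmup(step: int, total_steps: int, base_lr: float, warmup_ratio: float) -> float:
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values taken from this card: 164,633 samples, batch size 256, 1 epoch.
steps = total_training_steps(164_633, 256)
print(steps)  # 644

peak_lr = cosine_lr_with_warmup(33, steps, base_lr=1e-4, warmup_ratio=0.05)
final_lr = cosine_lr_with_warmup(steps, steps, base_lr=1e-4, warmup_ratio=0.05)
```

The computed 644 steps agree with the final step in the training logs, and step 50 corresponds to 50 / 644 ≈ 0.0776 of an epoch, matching the first logged row.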

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • prediction_loss_only: True
  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 0.0001
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 3407
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss query_note_test_cosine_ndcg@100 query_note_test_256_cosine_ndcg@100 query_note_val_cosine_ndcg@100 query_note_val_256_cosine_ndcg@100
-1 -1 - 0.4211 0.4014 - -
0.0776 50 1.8314 - - - -
0.1553 100 0.7712 - - - -
0.2329 150 0.7237 - - - -
0.3106 200 0.7116 - - - -
0.3882 250 0.6376 - - - -
0.4658 300 0.6284 - - - -
0.5435 350 0.6425 - - - -
0.6211 400 0.6198 - - - -
0.6988 450 0.6409 - - - -
0.7764 500 0.5954 - - - -
0.8540 550 0.5891 - - - -
0.9317 600 0.5833 - - - -
1.0 644 - - - 0.5634 0.5581
-1 -1 - 0.5231 0.5137 - -

Training Time

  • Training: 54.3 minutes

Framework Versions

  • Python: 3.12.3
  • Sentence Transformers: 5.5.1
  • Transformers: 4.56.2
  • PyTorch: 2.11.0+cu128
  • Accelerate: 1.13.0
  • Datasets: 4.3.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}