SentenceTransformer based on AITeamVN/Vietnamese_Embedding_v2

This is a sentence-transformers model finetuned from AITeamVN/Vietnamese_Embedding_v2 on the tay-vietnamese-nmt dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
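
The stack above reads as three steps: the transformer produces one vector per token, the pooling layer keeps only the CLS (first) token vector, and the final layer scales it to unit length. The sketch below mimics the last two steps in plain Python on toy 4-dimensional vectors instead of the real 1024-dimensional output. The helper names (`cls_pool`, `l2_normalize`, `dot`) are hypothetical and not part of the Sentence Transformers API.

```python
import math

def cls_pool(token_embeddings):
    """Keep only the first (CLS) token vector of a sequence."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy token embeddings (illustrative values only): two tokens per input.
tokens_a = [[3.0, 4.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
tokens_b = [[0.0, 4.0, 3.0, 0.0], [1.0, 1.0, 1.0, 1.0]]

emb_a = l2_normalize(cls_pool(tokens_a))
emb_b = l2_normalize(cls_pool(tokens_b))

# For unit-length vectors the dot product equals the cosine similarity.
cosine = dot(emb_a, emb_b)
print(cosine)
```

Because the output is already normalized, a plain dot product and cosine similarity should give the same ranking for this model's embeddings.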

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("HeyDunaX/Tay_Embedding")
# Run inference
sentences = [
    'Các',
    'bắc',
    'chân tay mập',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.3147, -0.0254],
#         [ 0.3147,  1.0000, -0.1489],
#         [-0.0254, -0.1489,  1.0000]])
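
For semantic search, the similarity matrix is usually turned into a ranking. The following sketch does that in plain Python using the scores printed above, copied into a nested list (with the real tensor, `similarities.tolist()` would give the same shape). `rank_neighbors` is a hypothetical helper, not a library function.

```python
def rank_neighbors(similarity_row, query_index):
    """Return indices of the other items, most similar first."""
    candidates = [
        (score, index)
        for index, score in enumerate(similarity_row)
        if index != query_index  # skip the item itself (score 1.0)
    ]
    candidates.sort(reverse=True)
    return [index for _, index in candidates]

# Scores copied from the example output above.
similarities = [
    [1.0000, 0.3147, -0.0254],
    [0.3147, 1.0000, -0.1489],
    [-0.0254, -0.1489, 1.0000],
]

ranking = {q: rank_neighbors(row, q) for q, row in enumerate(similarities)}
print(ranking)
```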

Training Details

Training Dataset

tay-vietnamese-nmt

  • Dataset: tay-vietnamese-nmt at 2b04e13
  • Size: 20,554 training samples
  • Columns: sentence1 and sentence2
  • Approximate statistics based on the first 1000 samples:
    column type min mean max
    sentence1 string 3 tokens 6.77 tokens 21 tokens
    sentence2 string 3 tokens 5.85 tokens 17 tokens
  • Samples:
    sentence1       sentence2
    me              bà cô
    noọng ấc cải    em ngực bự
    noọng           em gái
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
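
As a rough illustration of what these parameters mean: for each `sentence1` in a batch, the loss treats its own `sentence2` as the positive and every other `sentence2` in the batch as a negative, then applies cross-entropy over cosine similarities multiplied by `scale`. The sketch below is a plain-Python approximation of that idea on toy 2-dimensional vectors, not the library implementation; `cos_sim` and `mnr_loss` are hypothetical names.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all in-batch candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (illustrative only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
mismatched = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, mismatched)
```

With a scale of 20, well-separated pairs drive the loss close to zero, while swapped pairs are penalized heavily.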
    

Evaluation Dataset

tay-vietnamese-nmt

  • Dataset: tay-vietnamese-nmt at 2b04e13
  • Size: 2,295 evaluation samples
  • Columns: sentence1 and sentence2
  • Approximate statistics based on the first 1000 samples:
    column type min mean max
    sentence1 string 3 tokens 7.24 tokens 26 tokens
    sentence2 string 3 tokens 6.02 tokens 22 tokens
  • Samples:
    sentence1                  sentence2
    Hết fiệc ác                làm việc khoẻ
    slấc ác                    giặc độc ác
    ái chin mác rèo năm mạy    Muốn ăn quả thì phải trồng cây
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • gradient_accumulation_steps: 4
  • learning_rate: 1e-05
  • num_train_epochs: 10
  • warmup_ratio: 0.1
  • warmup_steps: 0.1
  • fp16: True
  • load_best_model_at_end: True
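
These settings, together with the per-device batch size of 8 from the full hyperparameter list and the 20,554 training samples, determine the step counts seen in the training logs. The sketch below works through the arithmetic, assuming a single device (the card does not state the device count), and shows a linear warmup followed by linear decay. It is an approximation of the schedule's shape, not the Trainer's own code, and all variable names are illustrative.

```python
import math

train_samples = 20554      # from the training dataset section
per_device_batch = 8       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
epochs = 10                # num_train_epochs
warmup_ratio = 0.1
base_lr = 1e-5
num_devices = 1            # assumption: not stated in the card

batches_per_epoch = math.ceil(train_samples / (per_device_batch * num_devices))
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * warmup_ratio)
effective_batch = per_device_batch * grad_accum * num_devices

def linear_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the effective batch size is 32, each epoch takes 643 optimizer steps, and warmup covers roughly the first epoch.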

All Hyperparameters

  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 8
  • gradient_accumulation_steps: 4
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_ratio: 0.1
  • warmup_steps: 0.1
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • enable_jit_checkpoint: False
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • use_cpu: False
  • seed: 42
  • data_seed: None
  • bf16: False
  • fp16: True
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: -1
  • ddp_backend: None
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • auto_find_batch_size: False
  • full_determinism: False
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • use_cache: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss Validation Loss
0.1556 100 1.7414 -
0.3113 200 1.3566 -
0.4669 300 1.1332 -
0.6226 400 1.0198 -
0.7782 500 0.8943 -
0.9339 600 0.7909 -
1.0 643 - 0.7135
1.0887 700 0.7070 -
1.2444 800 0.6029 -
1.4 900 0.6095 -
1.5556 1000 0.5436 -
1.7113 1100 0.5534 -
1.8669 1200 0.5363 -
2.0 1286 - 0.5121
2.0218 1300 0.4886 -
2.1774 1400 0.3853 -
2.3331 1500 0.3940 -
2.4887 1600 0.3859 -
2.6444 1700 0.4035 -
2.8 1800 0.3686 -
2.9556 1900 0.3662 -
3.0 1929 - 0.4505
3.1105 2000 0.3276 -
3.2661 2100 0.2877 -
3.4218 2200 0.2991 -
3.5774 2300 0.2898 -
3.7331 2400 0.2704 -
3.8887 2500 0.2807 -
4.0 2572 - 0.4247
4.0436 2600 0.2879 -
4.1992 2700 0.2300 -
4.3549 2800 0.2233 -
4.5105 2900 0.2169 -
4.6661 3000 0.2273 -
4.8218 3100 0.2149 -
4.9774 3200 0.2277 -
5.0 3215 - 0.4163
5.1323 3300 0.1973 -
5.2879 3400 0.1856 -
5.4436 3500 0.1686 -
5.5992 3600 0.1797 -
5.7549 3700 0.1830 -
5.9105 3800 0.1701 -
6.0 3858 - 0.4066
6.0654 3900 0.1620 -
6.2210 4000 0.1453 -
6.3767 4100 0.1593 -
6.5323 4200 0.1481 -
6.6879 4300 0.1506 -
6.8436 4400 0.1534 -
6.9992 4500 0.1554 -
7.0 4501 - 0.3907
7.1541 4600 0.1284 -
7.3097 4700 0.1266 -
7.4654 4800 0.1392 -
7.6210 4900 0.1292 -
7.7767 5000 0.1309 -
7.9323 5100 0.1318 -
8.0 5144 - 0.3922
8.0872 5200 0.1263 -
8.2428 5300 0.1136 -
8.3984 5400 0.1161 -
8.5541 5500 0.1137 -
8.7097 5600 0.1231 -
8.8654 5700 0.1187 -
9.0 5787 - 0.3875
9.0202 5800 0.1182 -
9.1759 5900 0.1059 -
9.3315 6000 0.1062 -
9.4872 6100 0.1044 -
9.6428 6200 0.0992 -
9.7984 6300 0.1057 -
9.9541 6400 0.1048 -
10.0 6430 - 0.3878
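
Since `load_best_model_at_end` is enabled, the checkpoint kept at the end is presumably the one with the best evaluation metric. Assuming that metric is the validation loss (the card does not name it), the sketch below picks the best epoch from the validation rows of the table above.

```python
# (epoch, step, validation loss) rows copied from the training logs.
validation_log = [
    (1, 643, 0.7135),
    (2, 1286, 0.5121),
    (3, 1929, 0.4505),
    (4, 2572, 0.4247),
    (5, 3215, 0.4163),
    (6, 3858, 0.4066),
    (7, 4501, 0.3907),
    (8, 5144, 0.3922),
    (9, 5787, 0.3875),
    (10, 6430, 0.3878),
]

# Lowest validation loss wins.
best_epoch, best_step, best_loss = min(validation_log, key=lambda row: row[2])
print(best_epoch, best_step, best_loss)
```

Validation loss flattens after epoch 7 while training loss keeps falling, so the later epochs add little on the evaluation split.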

Framework Versions

  • Python: 3.12.12
  • Sentence Transformers: 5.2.2
  • Transformers: 5.0.0
  • PyTorch: 2.9.0+cu126
  • Accelerate: 1.12.0
  • Datasets: 4.0.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 0.6B params (Safetensors, tensor type F32)

Model tree for HeyDunaX/Tay_Embedding

Base model

BAAI/bge-m3
Finetuned
(4)
this model
