bkai-fine-tuned-legal

This is a sentence-transformers model fine-tuned from bkai-foundation-models/vietnamese-bi-encoder on the json dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: bkai-foundation-models/vietnamese-bi-encoder
Maximum Sequence Length: 256 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- json
Language: vi
License: apache-2.0

Model Sources

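Documentation: Sentence Transformers Documentation (https://www.sbert.net)
Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)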

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Điều 3 Quyết định 22/2010/QĐ-UBND quản lý sử dụng Hệ thống giao ban điện tử Lào Cai có nội dung như sau:\n\nĐiều 3. Chánh Văn phòng UBND tỉnh; Giám đốc Sở Thông tin và Truyền thông; Thủ trưởng các sở, ban, ngành, đơn vị; Chủ tịch UBND các huyện, thành phố; Giám đốc Doanh nghiệp cung cấp dịch vụ viễn thông và các tổ chức, cá nhân có liên quan chịu trách nhiệm thi hành Quyết định này',
    'Điều 3 Quyết định 22/2010/QĐ-UBND quản lý sử dụng Hệ thống giao ban điện tử Lào Cai',
    'Điều 1 Nghị định 65/2007/NĐ-CP điều chỉnh địa giới hành chính thị xã Cam Ranh Trường Sa Cam Nghĩa huyện Diên Khánh lập huyện xã tỉnh khánh Hòa mới nhất',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.9331, 0.1638],
#         [0.9331, 1.0000, 0.1802],
#         [0.1638, 0.1802, 1.0000]])
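
For semantic search, the similarity matrix above reduces to ranking candidate passages by cosine similarity against a query embedding. The following is a minimal sketch in plain Python: the toy 3-dimensional vectors stand in for real `model.encode` outputs, and `cosine` and `rank_passages` are hypothetical helper names for illustration, not part of the library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar passages."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real embeddings (assumption: illustrative only).
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_passages(query, passages, top_k=2)
print(ranking)
```

With real data, `query` would be the embedding of a short question or article title and `passages` the embeddings of the legal texts to search.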

Evaluation

Metrics

Information Retrieval

Dataset: dim_768
Evaluated with InformationRetrievalEvaluator with these parameters:
```
{
    "truncate_dim": 768
}
```

Metric	Value
cosine_accuracy@1	0.6074
cosine_accuracy@3	0.7005
cosine_accuracy@5	0.742
cosine_accuracy@10	0.7921
cosine_precision@1	0.6074
cosine_precision@3	0.2335
cosine_precision@5	0.1484
cosine_precision@10	0.0792
cosine_recall@1	0.6074
cosine_recall@3	0.7005
cosine_recall@5	0.742
cosine_recall@10	0.7921
cosine_ndcg@10	0.6953
cosine_mrr@10	0.6649
cosine_map@100	0.6708
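
In the table above, accuracy@k equals recall@k and precision@k equals recall@k divided by k (for example 0.7005 / 3 ≈ 0.2335). That pattern is what one would expect if every query has exactly one relevant passage; this is an inference from the numbers, not something the card states. Under that assumption, the per-query metrics can be sketched as follows. `ir_metrics` and `mean_metrics` are hypothetical helpers, not the `InformationRetrievalEvaluator` implementation.

```python
def ir_metrics(ranked_ids, relevant_id, k):
    """Metrics for one query with a single relevant document (assumption)."""
    top_k = ranked_ids[:k]
    hit = relevant_id in top_k
    accuracy = 1.0 if hit else 0.0      # at least one relevant doc in the top k
    precision = accuracy / k            # relevant docs in the top k, divided by k
    recall = accuracy                   # relevant docs found / total relevant (1)
    # Reciprocal rank: 1 / position of the relevant doc, or 0 if outside the top k
    rr = 1.0 / (top_k.index(relevant_id) + 1) if hit else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "rr": rr}

def mean_metrics(queries, k):
    """Average per-query metrics over (ranked_ids, relevant_id) pairs."""
    rows = [ir_metrics(ranked, rel, k) for ranked, rel in queries]
    return {name: sum(r[name] for r in rows) / len(rows) for name in rows[0]}

# Toy rankings with hypothetical document ids, one relevant id per query.
queries = [
    (["d1", "d7", "d3"], "d1"),   # relevant doc ranked 1st
    (["d4", "d2", "d9"], "d2"),   # relevant doc ranked 2nd
    (["d5", "d6", "d8"], "d0"),   # relevant doc not retrieved
]
result = mean_metrics(queries, k=3)
print(result)
```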

Information Retrieval

Dataset: dim_512
Evaluated with InformationRetrievalEvaluator with these parameters:
```
{
    "truncate_dim": 512
}
```

Metric	Value
cosine_accuracy@1	0.6046
cosine_accuracy@3	0.6955
cosine_accuracy@5	0.7413
cosine_accuracy@10	0.7893
cosine_precision@1	0.6046
cosine_precision@3	0.2318
cosine_precision@5	0.1483
cosine_precision@10	0.0789
cosine_recall@1	0.6046
cosine_recall@3	0.6955
cosine_recall@5	0.7413
cosine_recall@10	0.7893
cosine_ndcg@10	0.6923
cosine_mrr@10	0.6618
cosine_map@100	0.6678

Information Retrieval

Dataset: dim_256
Evaluated with InformationRetrievalEvaluator with these parameters:
```
{
    "truncate_dim": 256
}
```

Metric	Value
cosine_accuracy@1	0.6037
cosine_accuracy@3	0.6931
cosine_accuracy@5	0.7345
cosine_accuracy@10	0.7874
cosine_precision@1	0.6037
cosine_precision@3	0.231
cosine_precision@5	0.1469
cosine_precision@10	0.0787
cosine_recall@1	0.6037
cosine_recall@3	0.6931
cosine_recall@5	0.7345
cosine_recall@10	0.7874
cosine_ndcg@10	0.69
cosine_mrr@10	0.6595
cosine_map@100	0.6654

Information Retrieval

Dataset: dim_128
Evaluated with InformationRetrievalEvaluator with these parameters:
```
{
    "truncate_dim": 128
}
```

Metric	Value
cosine_accuracy@1	0.5916
cosine_accuracy@3	0.6764
cosine_accuracy@5	0.7231
cosine_accuracy@10	0.7738
cosine_precision@1	0.5916
cosine_precision@3	0.2255
cosine_precision@5	0.1446
cosine_precision@10	0.0774
cosine_recall@1	0.5916
cosine_recall@3	0.6764
cosine_recall@5	0.7231
cosine_recall@10	0.7738
cosine_ndcg@10	0.6769
cosine_mrr@10	0.6465
cosine_map@100	0.6528

Information Retrieval

Dataset: dim_64
Evaluated with InformationRetrievalEvaluator with these parameters:
```
{
    "truncate_dim": 64
}
```

Metric	Value
cosine_accuracy@1	0.577
cosine_accuracy@3	0.6593
cosine_accuracy@5	0.6989
cosine_accuracy@10	0.7556
cosine_precision@1	0.577
cosine_precision@3	0.2198
cosine_precision@5	0.1398
cosine_precision@10	0.0756
cosine_recall@1	0.577
cosine_recall@3	0.6593
cosine_recall@5	0.6989
cosine_recall@10	0.7556
cosine_ndcg@10	0.6605
cosine_mrr@10	0.6308
cosine_map@100	0.637
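
The `truncate_dim` settings in the evaluations above correspond to keeping only the first k components of each embedding before computing cosine similarity. A minimal sketch of that operation in plain Python follows; the 8-dimensional toy vectors stand in for 768-dimensional embeddings, and `truncate_and_normalize` and `cosine_at_dim` are hypothetical helper names. In Sentence Transformers itself, the same effect is typically obtained by passing `truncate_dim` when constructing `SentenceTransformer`.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine_at_dim(u, v, dim):
    """Cosine similarity computed on the first `dim` components only."""
    a = truncate_and_normalize(u, dim)
    b = truncate_and_normalize(v, dim)
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dimensional vectors (assumption: illustrative values only).
u = [0.6, 0.4, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05]
v = [0.5, 0.5, 0.2, 0.3, 0.1, 0.0, 0.10, 0.05]

for dim in (8, 4, 2):
    print(dim, round(cosine_at_dim(u, v, dim), 4))
```

Shorter vectors reduce index size and search cost; the tables above show how much retrieval quality this model gives up at each size.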

Training Details

Training Dataset

json

Dataset: json
Size: 25,860 training samples
Columns: positive and anchor
Approximate statistics based on the first 1000 samples:
	positive	anchor
type	string	string
details	min: 19 tokens mean: 178.76 tokens max: 256 tokens	min: 6 tokens mean: 20.25 tokens max: 59 tokens

Samples:

positive	anchor
`Điều 475. Trách nhiệm của người sử dụng lao động trong việc giải quyết tranh chấp lao động khi sa thải người lao động. 1. Người sử dụng lao động có trách nhiệm giải quyết tranh chấp lao động khi sa thải người lao động theo quy định của pháp luật. 2. Người sử dụng lao động có nghĩa vụ bồi thường thiệt hại cho người lao động nếu sa thải trái pháp luật.`	`Người sử dụng lao động có trách nhiệm gì trong việc giải quyết tranh chấp lao động khi sa thải người lao động?`
`Điều 69 Nghị định 154/2005/NĐ-CP thủ tục hải quan, kiểm tra, giám sát hải quan hướng dẫn Luật Hải quan có nội dung như sau: Điều 69. Xử lý kết quả kiểm tra 1. Kết quả kiểm tra được cập nhật vào hệ thống thông tin hải quan để phân tích, đánh giá việc chấp hành pháp luật của chủ hàng, mức độ rủi ro vi phạm pháp luật, làm căn cứ cho việc kiểm tra khi làm thủ tục hải quan, xác định doanh nghiệp có quá trình chấp hành tốt pháp luật hải quan và phục vụ cho hoạt động của cơ quan hải quan trong công tác chống buôn lậu. 2. Kết luận kiểm tra, giải trình của đơn vị được kiểm tra (nếu có), biên bản vi phạm pháp luật đối với đơn vị được kiểm tra là căn cứ để cơ quan hải quan quyết định việc truy thu thuế, hoàn thuế, xử lý vi phạm pháp luật về thuế theo quy định của pháp luật. 3. Việc truy thu thuế, hoàn thuế, xử lý vi phạm pháp luật về thuế thực hiện theo quy định của pháp luật về thuế và pháp luật có liên quan.`	`Điều 69 Nghị định 154/2005/NĐ-CP thủ tục hải quan, kiểm tra, giám sát hải quan hướng dẫn Luật Hải quan`
`Điều 4 Quyết định 15/2008/QĐ-UBND thành lập Phòng Quản lý đô thị quận Tân Bình có nội dung như sau: Điều 4. Quyết định này có hiệu lực thi hành sau 7 ngày, kể từ ngày ký và thay thế các quyết định trước đây trái với Quyết định này.`	`Điều 4 Quyết định 15/2008/QĐ-UBND thành lập Phòng Quản lý đô thị quận Tân Bình`

Loss: MatryoshkaLoss with these parameters:

{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
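
The parameters above describe a loss that applies `MultipleNegativesRankingLoss` to the embeddings truncated to each listed dimension and adds the results with the given weights. The following plain-Python sketch illustrates that idea on toy 4-dimensional vectors with dims `[4, 2]`. It is a simplified illustration, not the library's implementation: the function names are hypothetical, and the similarity scale of 20 is an assumption.

```python
import math

def normalize(vec):
    """Rescale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy: positives[i] is the target for anchors[i]."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Scaled cosine similarity of this anchor to every positive in the batch
        logits = [scale * sum(x * y for x, y in zip(anchor, pos)) for pos in p]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]   # -log softmax of the matching pair
    return total / len(a)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over embeddings truncated to each dim."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        loss += weight * mnr_loss([v[:dim] for v in anchors],
                                  [v[:dim] for v in positives])
    return loss

# Toy batch of two (anchor, positive) pairs; values are illustrative only.
anchors = [[1.0, 0.1, 0.0, 0.2], [0.0, 1.0, 0.3, 0.1]]
positives = [[0.9, 0.2, 0.1, 0.1], [0.1, 0.8, 0.2, 0.0]]
loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(loss)
```

Because every other positive in the batch acts as a negative, duplicate texts within a batch would be penalized incorrectly, which is consistent with the `no_duplicates` batch sampler listed in the hyperparameters.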

Evaluation Dataset

json

Dataset: json
Size: 3,233 evaluation samples
Columns: positive and anchor
Approximate statistics based on the first 1000 samples:
	positive	anchor
type	string	string
details	min: 26 tokens mean: 179.0 tokens max: 256 tokens	min: 8 tokens mean: 20.02 tokens max: 45 tokens

Samples:

positive	anchor
`Điều 1 Quyết định 1791/QĐ-BKHCN 2021 tiếp nhận xử lý văn bản điện tử trên Hệ thống quản lý văn bản có nội dung như sau: Điều 1. Ban hành kèm theo Quyết định này Quy chế tiếp nhận, xử lý, phát hành và quản lý văn bản điện tử trên Hệ thống quản lý văn bản và điều hành của Bộ Khoa học và Công nghệ.`	`Điều 1 Quyết định 1791/QĐ-BKHCN 2021 tiếp nhận xử lý văn bản điện tử trên Hệ thống quản lý văn bản`
`Tôi xin hỏi, việc hủy thầu trong trường hợp hồ sơ dự thầu của các nhà thầu tham gia dự thầu không đáp ứng hồ sơ mời thầu thì có phải thẩm định kết quả lựa chọn nhà thầu không?Bộ Kế hoạch và Đầu tư trả lời vấn đề này như sau:Khoản 5 và Khoản 2, Điều 20 Nghị định số63/2014/NĐ-CPcủa Chính phủ quy định kết quả lựa chọn nhà thầu phải được thẩm định theo quy định tại Khoản 1 và Khoản 4, Điều 106 của Nghị định này trước khi phê duyệt.Trường hợp hủy thầu theo quy định tại Khoản 1, Điều 17 củaLuật Đấu thầu, trong văn bản phê duyệt kết quả lựa chọn nhà thầu hoặc văn bản quyết định hủy thầu phải nêu rõ lý do hủy thầu và trách nhiệm của các bên liên quan khi hủy thầu.Đối với vấn đề của ông Tường, việc hủy thầu được thực hiện theo quy định nêu trên.`	`Hủy thầu thực hiện thế nào?`
`Tôi xin hỏi, theo phương án thi năm nay thì thí sinh có thể đăng ký nhiều ngành trong cùng một trường được không?Bộ Giáo dục và Đào tạo trả lời vấn đề này như sau:Theo quy định của Quy chế tuyển sinh đại học hệ chính quy: Thí sinh có thể đăng ký không hạn chế nguyện vọng và phải sắp xếp nguyện vọng theo thứ tự ưu tiên từ cao xuống thấp. Mỗi nguyện vọng thí sinh phải đăng ký mã trường, mã ngành, mã tổ hợp để xét tuyển. Như vậy, thí sinh có thể đăng ký nhiều ngành trong một trường.`	`Thí sinh được đăng ký nhiều ngành trong một trường`

Loss: MatryoshkaLoss with these parameters:

{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: epoch
per_device_train_batch_size: 64
per_device_eval_batch_size: 64
gradient_accumulation_steps: 12
learning_rate: 3e-05
weight_decay: 0.2
max_grad_norm: 0.65
num_train_epochs: 5
lr_scheduler_type: cosine
warmup_ratio: 0.15
fp16: True
load_best_model_at_end: True
group_by_length: True
batch_sampler: no_duplicates

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: epoch
prediction_loss_only: True
per_device_train_batch_size: 64
per_device_eval_batch_size: 64
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 12
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 3e-05
weight_decay: 0.2
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 0.65
num_train_epochs: 5
max_steps: -1
lr_scheduler_type: cosine
lr_scheduler_kwargs: {}
warmup_ratio: 0.15
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
adafactor: False
group_by_length: True
length_column_name: length
project: huggingface
trackio_space_id: trackio
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: no
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: True
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
router_mapping: {}
learning_rate_mapping: {}

Training Logs

Epoch	Step	Training Loss	Validation Loss	dim_768_cosine_ndcg@10	dim_512_cosine_ndcg@10	dim_256_cosine_ndcg@10	dim_128_cosine_ndcg@10	dim_64_cosine_ndcg@10
1.0	34	1.6861	0.6655	0.6172	0.6148	0.6059	0.5862	0.5547
2.0	68	0.5426	0.4693	0.6877	0.6889	0.6830	0.6684	0.6508
3.0	102	0.3528	0.4305	0.6939	0.6919	0.6855	0.6752	0.6595
4.0	136	0.268	0.4048	0.6953	0.6921	0.6898	0.6767	0.6607
5.0	170	0.2341	0.4039	0.6953	0.6923	0.69	0.6769	0.6605

The saved checkpoint corresponds to the final row (epoch 5.0, step 170), whose ndcg@10 values match those reported under Evaluation.
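
With `load_best_model_at_end: True`, the trainer reloads whichever evaluated checkpoint scored best on a selection metric. The card does not list that metric, so the sketch below assumes the lowest validation loss is used. It applies that rule to the rows of the training log (validation loss and `dim_768_cosine_ndcg@10` only).

```python
# (epoch, step, validation_loss, dim_768_cosine_ndcg@10) copied from the training log
log_rows = [
    (1.0, 34, 0.6655, 0.6172),
    (2.0, 68, 0.4693, 0.6877),
    (3.0, 102, 0.4305, 0.6939),
    (4.0, 136, 0.4048, 0.6953),
    (5.0, 170, 0.4039, 0.6953),
]

# Assumption: the best checkpoint is the one with the lowest validation loss.
best = min(log_rows, key=lambda row: row[2])
print(best)
```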

Framework Versions

Python: 3.12.12
Sentence Transformers: 5.1.2
Transformers: 4.57.3
PyTorch: 2.9.0+cu126
Accelerate: 1.12.0
Datasets: 4.0.0
Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model tree for bqbbao6/embedding

Base model: bkai-foundation-models/vietnamese-bi-encoder
Finetuned (45): this model
