metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:110773
  - loss:ContrastiveLoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
widget:
  - source_sentence: >-
      average monthly net wage/salary, employees, by province and occupation
      (rupiah), 2018
    sentences:
      - >-
        [Seri 2000] Laju Pertumbuhan PDB Triwulanan Atas Dasar Harga Konstan
        2000 Terhadap Triwulan Sebelumnya, 2001-2014
      - >-
        IHK dan Rata-rata Upah per Bulan Buruh Industri di Bawah Mandor
        (Supervisor), 2012-2014 (2012=100)
      - >-
        Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut
        Kelompok Umur dan Lapangan Pekerjaan Utama di 9 Sektor (Rupiah), 2017
  - source_sentence: >-
      data belanja dan konsumsi per orang di jambi, 2020: fokus pada makanan dan
      tingkat pengeluaran
    sentences:
      - >-
        Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi
        Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi
        Sulawesi Tenggara, 2018-2023
      - >-
        Rata-rata Pendapatan Bersih Pekerja Bebas Menurut Provinsi dan
        Pendidikan Tertinggi yang Ditamatkan (ribu rupiah), 2017
      - >-
        Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi
        Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi Jawa
        Timur, 2018-2023
  - source_sentence: 'ALIRAN DANA RUPIAH: Q1 2008'
    sentences:
      - >-
        Sistem Neraca Sosial Ekonomi Indonesia Tahun 2022 dalam Format SNA 1968
        (65x65)
      - >-
        Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut
        Provinsi dan Jenis Pekerjaan Utama, 2024
      - Impor Besi dan Baja Menurut Negara Asal Utama, 2017-2023
  - source_sentence: 'Aliran Wdana Rupiah: Q1 2008'
    sentences:
      - Ekspor Karet Remah Menurut Negara Tujuan Utama, 2012-2023
      - >-
        Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut
        Kelompok Umur dan Lapangan Pekerjaan Utama di 17 Sektor (Rupiah), 2018
      - >-
        Sistem Neraca Sosial Ekonomi Indonesia Tahun 2022 dalam Format SNA 1968
        (65x65)
  - source_sentence: 'Aliran dana Rupiah: Q1 2008'
    sentences:
      - Ringkasan Neraca Arus Dana, Triwulan II, 2011*), (Miliar Rupiah)
      - Ringkasan Neraca Arus Dana, 2012 (Miliar Rupiah)
      - >-
        IHK dan Rata-rata Upah per Bulan Buruh Industri di Bawah Mandor
        (Supervisor), 2012-2014 (2012=100)
datasets:
  - yahyaabd/query-pos-neg-doc-pairs-statictable
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy
  - cosine_accuracy_threshold
  - cosine_f1
  - cosine_f1_threshold
  - cosine_precision
  - cosine_recall
  - cosine_ap
  - cosine_mcc
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    results:
      - task:
          type: binary-classification
          name: Binary Classification
        dataset:
          name: allstats semantic mini v1 test
          type: allstats-semantic-mini-v1_test
        metrics:
          - type: cosine_accuracy
            value: 0.9678628590683177
            name: Cosine Accuracy
          - type: cosine_accuracy_threshold
            value: 0.7482147812843323
            name: Cosine Accuracy Threshold
          - type: cosine_f1
            value: 0.9677936769237264
            name: Cosine F1
          - type: cosine_f1_threshold
            value: 0.7444144487380981
            name: Cosine F1 Threshold
          - type: cosine_precision
            value: 0.9595714405290031
            name: Cosine Precision
          - type: cosine_recall
            value: 0.976158038147139
            name: Cosine Recall
          - type: cosine_ap
            value: 0.9921512853632306
            name: Cosine Ap
          - type: cosine_mcc
            value: 0.9358669477790009
            name: Cosine Mcc
      - task:
          type: binary-classification
          name: Binary Classification
        dataset:
          name: allstats semantic mini v1 dev
          type: allstats-semantic-mini-v1_dev
        metrics:
          - type: cosine_accuracy
            value: 0.9678491772924294
            name: Cosine Accuracy
          - type: cosine_accuracy_threshold
            value: 0.7902499437332153
            name: Cosine Accuracy Threshold
          - type: cosine_f1
            value: 0.9673587968896863
            name: Cosine F1
          - type: cosine_f1_threshold
            value: 0.7874833345413208
            name: Cosine F1 Threshold
          - type: cosine_precision
            value: 0.9616887529731566
            name: Cosine Precision
          - type: cosine_recall
            value: 0.9730960976448341
            name: Cosine Recall
          - type: cosine_ap
            value: 0.9930288231258318
            name: Cosine Ap
          - type: cosine_mcc
            value: 0.9357491510325107
            name: Cosine Mcc

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the query-pos-neg-doc-pairs-statictable dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

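Model Description

  • Model type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum sequence length: 128 tokens
  • Output dimensionality: 384 dimensions
  • Training dataset: query-pos-neg-doc-pairs-statictable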

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
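
The Pooling module above has only pooling_mode_mean_tokens enabled, so a sentence embedding is the average of its token embeddings. The following is a minimal NumPy sketch of that idea, not the library's own code; the function name mean_pool and the toy 2-dimensional input are illustrative assumptions (the real model produces 384-dimensional token vectors).

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over non-padding positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy input: one sentence, three positions, the last one is padding.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled)  # [[2. 3.]]
```

The padded position is ignored, so the result depends only on the two real tokens.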

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-miniLM-v1-7")
# Run inference
sentences = [
    'Aliran dana Rupiah: Q1 2008',
    'IHK dan Rata-rata Upah per Bulan Buruh Industri di Bawah Mandor (Supervisor), 2012-2014 (2012=100)',
    'Ringkasan Neraca Arus Dana, 2012 (Miliar Rupiah)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
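
For semantic search, the usual pattern is to embed one query and many table titles, then sort the titles by cosine similarity to the query. The sketch below shows only the ranking step in NumPy so that it runs without downloading the model; rank_documents is a hypothetical helper, and the toy 2-dimensional vectors stand in for what model.encode would return.

```python
import numpy as np

def rank_documents(query_embedding, doc_embeddings, top_k=3):
    """Return (index, cosine score) pairs for the top_k documents, best first."""
    q = np.asarray(query_embedding, dtype=float)
    d = np.asarray(doc_embeddings, dtype=float)
    # Normalise so that the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors in place of real embeddings (assumption for illustration).
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
ranking = rank_documents(query, docs, top_k=2)
print([i for i, _ in ranking])  # [1, 0]
```

With the real model, query would be model.encode(query_text) and docs would be model.encode(list_of_titles).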

Evaluation

Metrics

Binary Classification

| Metric | allstats-semantic-mini-v1_test | allstats-semantic-mini-v1_dev |
|:--------------------------|:-------|:-------|
| cosine_accuracy | 0.9679 | 0.9678 |
| cosine_accuracy_threshold | 0.7482 | 0.7902 |
| cosine_f1 | 0.9678 | 0.9674 |
| cosine_f1_threshold | 0.7444 | 0.7875 |
| cosine_precision | 0.9596 | 0.9617 |
| cosine_recall | 0.9762 | 0.9731 |
| cosine_ap | 0.9922 | 0.9930 |
| cosine_mcc | 0.9359 | 0.9357 |
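
The threshold rows give the cosine-similarity cut-off at which each metric was measured: a query-document pair scoring at or above the cut-off is predicted relevant (label 1). The sketch below applies the reported test-split accuracy threshold to a few made-up scores and recomputes precision, recall and F1. The helper names, the scores and the labels are illustrative assumptions, and treating a score exactly equal to the threshold as relevant is also an assumption.

```python
import numpy as np

ACCURACY_THRESHOLD = 0.7482  # cosine_accuracy_threshold on the test split

def predict_relevant(cosine_scores, threshold=ACCURACY_THRESHOLD):
    """Label a pair 1 (relevant) when its cosine score reaches the threshold."""
    return (np.asarray(cosine_scores) >= threshold).astype(int)

def precision_recall_f1(labels, preds):
    labels, preds = np.asarray(labels), np.asarray(preds)
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up scores and gold labels, for illustration only.
scores = [0.91, 0.80, 0.74, 0.30]
labels = [1, 0, 1, 0]
preds = predict_relevant(scores)
print(preds.tolist())                      # [1, 1, 0, 0]
print(precision_recall_f1(labels, preds))  # (0.5, 0.5, 0.5)
```

Note that the dev split reports a higher threshold (0.7902), so the cut-off should be chosen on data that resembles the intended queries.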

Training Details

Training Dataset

query-pos-neg-doc-pairs-statictable

  • Dataset: query-pos-neg-doc-pairs-statictable at a31b58d
  • Size: 110,773 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query | doc | label |
    |:--------|:------|:----|:------|
    | type    | string | string | int |
    | details | min: 9 tokens, mean: 21.22 tokens, max: 50 tokens | min: 6 tokens, mean: 28.24 tokens, max: 50 tokens | 0: ~43.90%, 1: ~56.10% |
  • Samples:

    | query | doc | label |
    |:------|:----|:------|
    | Data orang yang naik/turun kapal, di pelabuhan yang dikelola maupun tidak, sekitar 2015 | Tabel Input-Output Indonesia Transaksi Total Atas Dasar Harga Dasar (185 Produk), 2016 (Juta Rupiah) | 0 |
    | data orang yang naik/turun kapal, di pelabuhan yang dikelola maupun tidak, sekitar 2015 | Tabel Input-Output Indonesia Transaksi Total Atas Dasar Harga Dasar (185 Produk), 2016 (Juta Rupiah) | 0 |
    | DATA ORANG YANG NAIK/TURUN KAPAL, DI PELABUHAN YANG DIKELOLA MAUPUN TIDAK, SEKITAR 2015 | Tabel Input-Output Indonesia Transaksi Total Atas Dasar Harga Dasar (185 Produk), 2016 (Juta Rupiah) | 0 |
  • Loss: ContrastiveLoss with these parameters:
    {
        "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
        "margin": 0.5,
        "size_average": true
    }
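
With a cosine distance metric and a margin of 0.5, the loss pulls matching pairs (label 1) together and pushes non-matching pairs (label 0) apart until their cosine distance reaches 0.5. The following is a minimal NumPy sketch of that formulation, following Hadsell et al. (2006); the 0.5 scaling factor and the function name contrastive_loss are assumptions of this sketch, and size_average is interpreted as taking the mean over the batch.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, labels, margin=0.5):
    """Mean contrastive loss over pairs, using cosine distance = 1 - cosine similarity."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    y = np.asarray(labels, dtype=float)
    cos_sim = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    dist = 1.0 - cos_sim
    positive_term = y * dist ** 2                                  # penalise distant positives
    negative_term = (1.0 - y) * np.maximum(margin - dist, 0.0) ** 2  # penalise close negatives
    return float(np.mean(0.5 * (positive_term + negative_term)))

# Two identical pairs: the first labelled as a match, the second as a non-match.
a = [[1.0, 0.0], [1.0, 0.0]]
b = [[1.0, 0.0], [1.0, 0.0]]
print(contrastive_loss(a, b, labels=[1, 0]))  # 0.0625
```

The matching pair contributes nothing because its distance is 0, while the non-matching pair is penalised for lying inside the margin. A negative pair whose distance already exceeds 0.5 contributes zero loss.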
    

Evaluation Dataset

query-pos-neg-doc-pairs-statictable

  • Dataset: query-pos-neg-doc-pairs-statictable at a31b58d
  • Size: 23,763 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query | doc | label |
    |:--------|:------|:----|:------|
    | type    | string | string | int |
    | details | min: 7 tokens, mean: 20.75 tokens, max: 57 tokens | min: 6 tokens, mean: 27.44 tokens, max: 43 tokens | 0: ~50.20%, 1: ~49.80% |
  • Samples:

    | query | doc | label |
    |:------|:----|:------|
    | Cek penghasilan bulanan (gaji bersih) buruh/pegawai, per provinsi dan jenis pekerjaannya, 2019 | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama, 2021 | 1 |
    | cek penghasilan bulanan (gaji bersih) buruh/pegawai, per provinsi dan jenis pekerjaannya, 2019 | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama, 2021 | 1 |
    | CEK PENGHASILAN BULANAN (GAJI BERSIH) BURUH/PEGAWAI, PER PROVINSI DAN JENIS PEKERJAANNYA, 2019 | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama, 2021 | 1 |
  • Loss: ContrastiveLoss with these parameters:
    {
        "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
        "margin": 0.5,
        "size_average": true
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 1
  • warmup_ratio: 0.2
  • fp16: True
  • load_best_model_at_end: True
  • eval_on_start: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.2
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch | Step | Training Loss | Validation Loss | allstats-semantic-mini-v1_test_cosine_ap | allstats-semantic-mini-v1_dev_cosine_ap |
|:-------|:-----|:-------|:-------|:-------|:-------|
| -1 | -1 | - | - | 0.8699 | - |
| 0 | 0 | - | 0.0489 | - | 0.8658 |
| 0.0578 | 100 | 0.0222 | 0.0101 | - | 0.9458 |
| 0.1155 | 200 | 0.0087 | 0.0073 | - | 0.9631 |
| 0.1733 | 300 | 0.0070 | 0.0059 | - | 0.9710 |
| 0.2311 | 400 | 0.0056 | 0.0049 | - | 0.9828 |
| 0.2889 | 500 | 0.0045 | 0.0044 | - | 0.9837 |
| 0.3466 | 600 | 0.0042 | 0.0041 | - | 0.9862 |
| 0.4044 | 700 | 0.0038 | 0.0038 | - | 0.9888 |
| 0.4622 | 800 | 0.0037 | 0.0037 | - | 0.9890 |
| 0.5199 | 900 | 0.0029 | 0.0036 | - | 0.9889 |
| 0.5777 | 1000 | 0.0031 | 0.0034 | - | 0.9907 |
| 0.6355 | 1100 | 0.0029 | 0.0033 | - | 0.9923 |
| 0.6932 | 1200 | 0.0025 | 0.0034 | - | 0.9922 |
| 0.7510 | 1300 | 0.0025 | 0.0033 | - | 0.9929 |
| 0.8088 | 1400 | 0.0024 | 0.0033 | - | 0.9928 |
| 0.8666 | 1500 | 0.0022 | 0.0033 | - | 0.9926 |
| 0.9243 | 1600 | 0.0023 | 0.0033 | - | 0.9929 |
| **0.9821** | **1700** | **0.0022** | **0.0032** | **-** | **0.9930** |
| -1 | -1 | - | - | 0.9922 | - |

  • The bold row denotes the saved checkpoint.
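
Because load_best_model_at_end is enabled, the checkpoint that is kept is the one with the best evaluation score. The short sketch below picks that row from the log, assuming the selection metric is validation loss (lower is better), since the card does not list a different one. The (step, validation loss) pairs are copied from the training log.

```python
# (step, validation loss) pairs from the training log
validation_log = [
    (0, 0.0489), (100, 0.0101), (200, 0.0073), (300, 0.0059), (400, 0.0049),
    (500, 0.0044), (600, 0.0041), (700, 0.0038), (800, 0.0037), (900, 0.0036),
    (1000, 0.0034), (1100, 0.0033), (1200, 0.0034), (1300, 0.0033),
    (1400, 0.0033), (1500, 0.0033), (1600, 0.0033), (1700, 0.0032),
]

# Assumption: the best checkpoint is the one with the lowest validation loss.
best_step, best_loss = min(validation_log, key=lambda row: row[1])
print(best_step, best_loss)  # 1700 0.0032
```

Step 1700 is also the only row whose dev cosine_ap (0.993) matches the value reported in the evaluation metrics.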

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

ContrastiveLoss

@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title={Dimensionality Reduction by Learning an Invariant Mapping},
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}