SentenceTransformer based on LazarusNLP/congen-indobert-lite-base

This is a sentence-transformers model finetuned from LazarusNLP/congen-indobert-lite-base. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: LazarusNLP/congen-indobert-lite-base
  • Maximum Sequence Length: 32 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
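
Because the maximum sequence length is 32 tokens, longer inputs are truncated before encoding. One common workaround is to split long passages into overlapping windows and embed each window separately. The sketch below is only an illustration: `split_into_windows` is a hypothetical helper, and it counts whitespace-separated words as a rough stand-in for tokens, whereas the real tokenizer produces subword tokens and reserves positions for special tokens, so a smaller window may be needed in practice.

```python
def split_into_windows(words, window=32, stride=16):
    """Split a word list into overlapping windows of at most `window` words.

    Hypothetical helper: words are only a rough proxy for tokenizer tokens.
    """
    if not words:
        return []
    windows = []
    start = 0
    while True:
        windows.append(words[start:start + window])
        if start + window >= len(words):
            break  # the current window already reaches the end of the text
        start += stride
    return windows


# Toy passage of 50 placeholder words.
passage = [f"w{i}" for i in range(50)]
chunks = split_into_windows(passage)
chunk_lengths = [len(chunk) for chunk in chunks]
```

Each window could then be joined back into a string and passed to the model, with the per-window scores combined by taking the maximum, for example.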

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 32, 'do_lower_case': False, 'architecture': 'AlbertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
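
The Pooling and Dense modules listed above can be illustrated with a small pure-Python sketch. It assumes toy 2-dimensional vectors instead of the real 768 dimensions, and the helper names `mean_pool` and `dense_tanh` are hypothetical; the actual modules operate on batched tensors with learned weights.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]


def dense_tanh(vector, weight, bias):
    """Compute tanh(W x + b), with W given as a list of rows."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weight, bias)
    ]


# Three token vectors; the last position is padding and must be ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)

# Identity weights and zero bias stand in for the learned Dense parameters.
identity = [[1.0, 0.0], [0.0, 1.0]]
sentence_embedding = dense_tanh(pooled, identity, [0.0, 0.0])
```

Because the final activation is Tanh, every component of the output embedding lies strictly between -1 and 1.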

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yosriku/congen-indobert-lite-base")
# Run inference
sentences = [
    'Apakah penyidik PPNS memiliki kewenangan untuk memeriksa laporan?',
    'berwenang: a. melakukan pemeriksaan atas kebenaran laporan atau keterangan berkenaan di dengan bidang perlindungan pengelolaan lingkungan hidup; tindak pidana dan',
    'lingkungan hidup adalah kesatuan ruang dengan semua benda, daya, keadaan, dan makhluk hidup, termasuk manusia dan perilakunya, yang mempengaruhi alam itu sendiri, kelangsungan perikehidupan, dan kesejahteraan manusia serta makhluk hidup lain.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.5783, -0.0924],
#         [ 0.5783,  1.0000,  0.0538],
#         [-0.0924,  0.0538,  1.0000]])
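
The scores printed above are cosine similarities, so semantic search reduces to ranking candidate passages by their cosine similarity to the query embedding. The following is a minimal sketch with toy 3-dimensional vectors standing in for real 768-dimensional embeddings; `cosine_similarity` and `rank_passages` are hypothetical helper names, not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (index, score) pairs, most similar passage first."""
    scored = [
        (index, cosine_similarity(query_vec, vec))
        for index, vec in enumerate(passage_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors; real inputs would be rows of model.encode(...) output.
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_passages(query, passages)
best_index = ranking[0][0]
```

Cosine similarity ignores vector length, which is why the second passage (a scaled copy of the query) ranks first.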

Evaluation

Metrics

Triplet

  • Datasets: retrieval-validation and ai-faq-validation
  • Evaluated with TripletEvaluator

    Metric            retrieval-validation   ai-faq-validation
    cosine_accuracy   1.0                    1.0
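
As a rough illustration of what cosine_accuracy measures: a triplet counts as correct when the anchor is more similar to its positive than to its negative. The sketch below assumes a margin of zero and works on precomputed similarity pairs; `triplet_accuracy` is a hypothetical helper. The first pair reuses the scores from the usage example (treating sentence 1 as anchor, sentence 2 as positive, sentence 3 as negative), and the second pair is made up to show a failing case.

```python
def triplet_accuracy(score_pairs, margin=0.0):
    """Fraction of (sim_positive, sim_negative) pairs where the positive wins."""
    correct = sum(1 for pos, neg in score_pairs if pos > neg + margin)
    return correct / len(score_pairs)


pairs = [
    (0.5783, -0.0924),  # from the usage example: positive scores higher
    (0.31, 0.42),       # invented failing case: negative scores higher
]
accuracy = triplet_accuracy(pairs)
```

An accuracy of 1.0, as reported for both validation sets, means every evaluated triplet was ordered correctly under this criterion.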

Training Details

Training Dataset

Unnamed Dataset

  • Size: 14,321 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
               anchor         positive       negative
    type       string         string         string
    min        4 tokens       4 tokens       7 tokens
    mean       12.17 tokens   28.56 tokens   28.47 tokens
    max        27 tokens      32 tokens      32 tokens
  • Samples:
    1. anchor:   Apa maksud dari paragraf 4?
       positive: berlaku terhadap paragraf 4 hak gugat pemerintah dan pemerintah daerah pasal 90
       negative: berkunjung ke objek wisata bantul tahun 2019 menurut statistik kepariwisataan d.i.yogyakarta tahun 2019 mencapai 8 juta wisatawan, sekitar 2,7 juta diantaranya berkunjung ke pantai parangtritis dan 52 ribu
    2. anchor:   Bolehkah HPP meminta bantuan ahli untuk menyelidiki?
       positive: terdapat bukti, f. meminta bantuan ahli dalam pelaksanaan tugas penyidikan tindak pidana di bidang pengelolaan sampah.
       negative: kawasan komersial berupa, antara lain, pusat perdagangan, pasar, pertokoan, hotel, perkantoran, restoran, dan tempat hiburan.
    3. anchor:   Apa arti lainnya dari simbol "45" pada nama koperasi itu?
       positive: . kedua sebagai untuk mengenang jasa pahlawan kemerdekaan di tahun 1945
       negative: data sekunder mengambil informasi kondisi eksisting dan pengelolaan sampah pada dinas pariwisata kabupaten bantul. 2. berdasarkan tempat, pengambilan data penelitian adalah penelitian lapangan.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
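
To make the loss parameters concrete, here is a minimal pure-Python sketch of a multiple-negatives ranking objective: each anchor's cosine similarities to all candidates are multiplied by the scale (20.0 here) and fed to a softmax cross-entropy whose target is the anchor's own positive. It assumes a precomputed similarity matrix in which row i holds anchor i's similarities to every positive followed by every negative; `mnr_loss` is a hypothetical name, and this is not the library's implementation.

```python
import math


def mnr_loss(similarities, scale=20.0):
    """Mean cross-entropy where candidate i is the correct match for anchor i.

    similarities[i][j] is the cosine similarity between anchor i and
    candidate j (positives first, then any explicit negatives).
    """
    losses = []
    for i, row in enumerate(similarities):
        logits = [scale * value for value in row]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_total = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_total - logits[i])
    return sum(losses) / len(losses)


# Two anchors, two positives, two explicit negatives (4 candidates per row).
well_separated = [
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]
indistinguishable = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
low_loss = mnr_loss(well_separated)
high_loss = mnr_loss(indistinguishable)
```

When the model cannot tell candidates apart, the loss equals the log of the number of candidates; a larger scale sharpens the softmax so that small similarity gaps are penalized less once the positive is clearly ahead.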
    

Evaluation Dataset

Unnamed Dataset

  • Size: 4,092 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
               anchor         positive       negative
    type       string         string         string
    min        6 tokens       3 tokens       7 tokens
    mean       12.17 tokens   28.92 tokens   28.44 tokens
    max        26 tokens      32 tokens      32 tokens
  • Samples:
    anchor positive negative
    Gambar 4.9 menunjukkan kegiatan seperti itu? gambar 4. 9 tpl pantai depok 31 4.1.3 pantai goa cemara . masyarakat hukum adat adalah kelompok masyarakat yang secara turun temurun bermukim di wilayah geografis tertentu karena adanya ikatan pada asal usul leluhur, adanya hubungan yang kuat dengan lingkungan hidup, serta adanya sistem nilai yang
    Apa arti dari Pasal 47 ayat 11? paragraf 11 analisis risiko lingkungan hidup pasal 47 penerapan teknologi yang diperkirakan mempunyai besar untuk potensi mempengaruhi lingkungan hidup.
    Bagaimana dengan daya dukung lingkungan hidup? fungsi 7. daya adalah lingkungan kemampuan lingkungan hidup untuk mendukung perikehidupan manusia, makhluk hidup lain, dan keseimbangan antarkeduanya. dukung hidup 8. daya 78.8 10.83 51.34 8.16 11 76.21 48.13 19.95 51 2.4 5.53 13.28 13 0.9 0.82 48.66 3.15 0.08 3.68 1 5.56 1.21 17.82 36.23 5 19 9 6 85
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 2e-05
  • num_train_epochs: 0.5
  • warmup_ratio: 0.1
  • load_best_model_at_end: True
  • batch_sampler: no_duplicates
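
The interaction between the batch size, the half-epoch budget, and the warmup ratio can be worked out with a short calculation. The sketch below assumes a single device, one optimizer step per batch, and a final partial batch that is kept; under those assumptions it derives the step counts and evaluates a linear warmup and decay schedule. `linear_schedule_lr` is a hypothetical helper, not the trainer's own code.

```python
import math


def linear_schedule_lr(step, base_lr, total_steps, warmup_steps):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


train_samples = 14321   # training set size
batch_size = 128        # per_device_train_batch_size
num_train_epochs = 0.5
warmup_ratio = 0.1
base_lr = 2e-05

steps_per_epoch = math.ceil(train_samples / batch_size)        # 112
total_steps = math.ceil(num_train_epochs * steps_per_epoch)    # 56
warmup_steps = math.ceil(warmup_ratio * total_steps)           # 6

peak_lr = linear_schedule_lr(warmup_steps, base_lr, total_steps, warmup_steps)
final_lr = linear_schedule_lr(total_steps, base_lr, total_steps, warmup_steps)
```

Under these assumptions the learning rate peaks after only 6 steps and decays over the remaining 50, so most of the short run happens on the decaying part of the schedule.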

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 0.5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch    Step   Training Loss   Validation Loss   retrieval-validation_cosine_accuracy   ai-faq-validation_cosine_accuracy
-1       -1     -               -                 0.9990                                 -
0.0893   10     2.2994          0.3824            0.9995                                 -
0.1786   20     1.8925          0.2965            1.0                                    -
0.2679   30     1.5729          0.2591            1.0                                    -
0.3571   40     1.2261          0.2386            1.0                                    -
0.4464   50     0.9373          0.2293            1.0                                    -
-1       -1     -               -                 -                                      1.0
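
The Epoch column in the log can be reproduced from the Step column, which is a useful sanity check when reading these tables. The sketch below copies the logged rows and assumes one optimizer step per batch of 128 over the 14,321 training samples; it also picks the row with the lowest validation loss, which is presumably the checkpoint selected when load_best_model_at_end tracks the loss.

```python
import math

# Rows copied from the training log: (step, training loss, validation loss).
log_rows = [
    (10, 2.2994, 0.3824),
    (20, 1.8925, 0.2965),
    (30, 1.5729, 0.2591),
    (40, 1.2261, 0.2386),
    (50, 0.9373, 0.2293),
]

# Assumption: one optimizer step per batch, last partial batch kept.
steps_per_epoch = math.ceil(14321 / 128)

# Fraction of an epoch completed at each logged step, rounded like the log.
epochs = [round(step / steps_per_epoch, 4) for step, _, _ in log_rows]

# Row with the lowest validation loss.
best_step, _, best_val_loss = min(log_rows, key=lambda row: row[2])
```

The computed fractions match the logged Epoch values, and the validation loss is still decreasing at the last logged step.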

Framework Versions

  • Python: 3.12.12
  • Sentence Transformers: 5.1.2
  • Transformers: 4.57.1
  • PyTorch: 2.8.0+cu126
  • Accelerate: 1.11.0
  • Datasets: 4.0.0
  • Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model Size

  • Parameters: 11.7M
  • Tensor type: F32 (Safetensors)