SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
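
The Pooling and Normalize modules listed above can be illustrated with a minimal pure-Python sketch. It assumes the token vectors and attention mask have already been produced by the Transformer module; the function names `mean_pool` and `l2_normalize` and the toy 4-dimensional vectors are hypothetical and stand in for the real 384-dimensional outputs.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [t / count for t in total]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

# Toy input: three token vectors, the last one is padding and is ignored.
tokens = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
```

Because the final module normalizes each embedding to unit length, the cosine similarity of two outputs is equal to their dot product.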

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cnuland/semantic-routing-sensitivity")
# Run inference
sentences = [
    'The user wants a text that looks like it belongs to "CONFIDENTIAL" (implying sensitive but shareable within an organization) based on vocabulary, but actually belongs to "NEVER_EGRESS" (meaning it should never leave/ be leaked) based on true intent and content. \n\nI need to make it realistic and tricky. The vocabulary should suggest standard corporate confidentiality - words like "internal use," "proprietary," "restricted distribution," "confidential," "personnel," "quarterly review" - but the actual content should be something that truly must never leave, like covering up a crime, exposing illegal surveillance, containing actual classified intelligence, or revealing something that would cause catastrophic harm if leaked.\n\nWait, this seems like it could be asking me to generate text that helps with information classification bypass or deception. Let me think carefully.\n\nThe user is asking for a text that is disguised - looks like one classification level but is actually another. This could be interpreted as a creative',
    'The user wants me to rewrite the text "What are the current PVC storage claims on worker-02?" to convey the same meaning using different words and sentence structure, while keeping the sensitivity level the same. I should output ONLY the rewritten text.\n\nOriginal meaning: Asking about the present/existing Persistent Volume Claim (PVC) storage allocations/usage on a specific node named "worker-02".\n\nPossible rewrites:\n- "List the existing PVC storage claims currently assigned to worker-02."\n- "Which persistent volume storage claims are presently allocated on worker-02?"\n- "Identify the current persistent volume claim storage on worker-02."\n- "Show me the active PVC storage claims for worker-02."\n- "What PVC storage claims currently exist on worker-02?"\n- "Can you detail the present PVC storage claims attached to worker-02?"\n\nI need to make sure it\'s natural and conveys the exact same technical meaning. The user said "Keep the sensitivity level the same" - this seems',
    "The user wants a text that looks like it belongs to the PUBLIC classification but actually belongs to NEVER_EGRESS based on true intent and content. This is asking me to create a deceptive text that disguises sensitive/secret information as public information.\n\nWait, I need to be careful here. The user is asking me to write something that looks public but is actually secret/never egress. This could be asking me to help with creating deceptive classification markings or disguising sensitive information. However, the request is quite abstract - it's asking for a realistic and tricky example of text that appears unclassified but contains hidden sensitive intent.\n\nLet me think about whether this violates any policies. The user isn't asking me to create classified information, leak secrets, or provide real classified data. They're asking for a creative writing exercise demonstrating how something could appear public but have hidden intent. This seems like it could be related to understanding information security, classification, or deception - perhaps for training purposes or academic interest.\n\nHowever, I should be cautious",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
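
One possible way to use such embeddings for text classification is a nearest-neighbour lookup against a few labelled reference texts. The sketch below is an assumption about usage, not something the card prescribes: `nearest_label` is a hypothetical helper, and the 3-dimensional toy vectors stand in for the 384-dimensional vectors that `model.encode` would return.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_label(query, reference_embeddings, reference_labels):
    """Return (label, score) of the reference most similar to the query."""
    scores = [cosine(query, ref) for ref in reference_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return reference_labels[best], scores[best]

# Toy reference set: one vector per integer label (values are made up).
references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = [0, 3, 4]
label, score = nearest_label([0.1, 0.9, 0.2], references, labels)
print(label, round(score, 3))
```

In practice the references would be embeddings of texts whose label is already known, and a minimum score threshold could be added before trusting the match.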

Training Details

Training Dataset

Unnamed Dataset

  • Size: 495 training samples
  • Columns: sentence and label
  • Approximate statistics based on the first 495 samples:
    column    type    details
    sentence  string  min: 8 tokens, mean: 175.43 tokens, max: 256 tokens
    label     int     0: ~20.00%, 1: ~20.00%, 2: ~20.00%, 3: ~20.00%, 4: ~20.00%
  • Samples:
    sentence label
    The user wants a NEW example text that belongs to the PUBLIC sensitivity tier but in a DIFFERENT business domain than the original. The original is about containerization/Docker (technology/IT domain). I need to provide a realistic prompt or question someone might ask an AI assistant. It must be PUBLIC sensitivity (so nothing confidential, private, or sensitive - just general public information). Output ONLY the new text, nothing else.

    Different business domain ideas:
    - Healthcare (general public health info)
    - Finance (general personal finance)
    - Education (general learning)
    - Agriculture
    - Hospitality
    - Retail
    - Manufacturing
    - Real Estate
    - Marketing
    - Legal (general public legal info)

    Let's pick something like Agriculture, Hospitality, Education, or Retail. Need to make sure it's clearly a different domain from IT/Technology.

    Example: "What are the best practices for rotating crops in a small organic vegetable garden?" -> Agriculture.

    Or: "How do hotels typically calculate reve...
    label: 0
    The user wants me to write a new example text that belongs to the "REGULATED" sensitivity tier but in a DIFFERENT business domain than the original (which was healthcare/pharmaceutical/patient data).

    The original domain was healthcare/patient data. So I need to pick a different regulated domain. Good candidates:
    - Financial services / Banking (regulated by SEC, FINRA, GDPR, etc.)
    - Legal / Attorney-client privilege
    - Education / FERPA (student records)
    - Government / classified or controlled unclassified information
    - Energy / utilities (regulated by FERC, NERC CIP)
    - Aviation / FAA regulations
    - Biotech / FDA regulated clinical trials (still healthcare-adjacent, maybe too close)
    - Telecommunications / FCC regulated

    The text should be a realistic prompt or question someone might ask an AI assistant.

    It needs to be ONLY the new text, nothing else.

    Let's pick financial services/banking. A realistic prompt could be about sharing customer financial
    label: 3
    The user wants a text that appears to belong to the "NEVER_EGRESS" category based on vocabulary (likely suggesting sensitive, internal, confidential, restricted language) but actually belongs to "PUBLIC" based on true intent and content. This is an adversarial-style classification task.

    NEVER_EGRESS likely implies content that should never leave an organization - confidential, proprietary, internal-only, sensitive data. PUBLIC means it's safe to release externally.

    So I need to write something that:
    1. Uses vocabulary typical of confidential/internal/secure communications (e.g., "classified", "internal use only", "proprietary", "restricted", "compliance", "audit", "encryption", "firewall", "trade secret", "NDA", "eyes only", etc.)
    2. But the actual intent and content is public-facing, harmless, and meant for external consumption.
    3. It should be realistic and tricky - so realistic that a classifier might mistakenly flag it as NEVER_EGRESS based on surface features
    label: 0
  • Loss: BatchAllTripletLoss
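
As a rough illustration of the batch-all idea from the paper cited under Citation, the sketch below enumerates every valid (anchor, positive, negative) triplet in a batch and averages the hinge loss over the triplets that still violate the margin. It is a pure-Python sketch under stated assumptions, not the library implementation: the Euclidean distance, the margin values, and the averaging over violating triplets only are choices made here for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_all_triplet_loss(embeddings, labels, margin):
    """Mean of max(d(a, p) - d(a, n) + margin, 0) over violating triplets."""
    n = len(embeddings)
    total, active = 0.0, 0
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # a positive is a different item with the same label
            d_ap = euclidean(embeddings[a], embeddings[p])
            for k in range(n):
                if labels[k] == labels[a]:
                    continue  # a negative must carry a different label
                loss = d_ap - euclidean(embeddings[a], embeddings[k]) + margin
                if loss > 0:
                    total += loss
                    active += 1
    return total / active if active else 0.0

# Toy batch: two items with label 0 close together, one item with label 1 far away.
batch = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
batch_labels = [0, 0, 1]
well_separated = batch_all_triplet_loss(batch, batch_labels, margin=1.0)
violating = batch_all_triplet_loss(batch, batch_labels, margin=5.0)
single_label = batch_all_triplet_loss(batch, [2, 2, 2], margin=5.0)
```

In this sketch a batch with no violating triplet, or with only one label present, yields exactly 0.0. That may be one reason for the 0.0 entries in the Training Logs, although the card does not confirm it.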

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • learning_rate: 2e-05
  • weight_decay: 0.01
  • num_train_epochs: 20
  • warmup_ratio: 0.1
  • batch_sampler: group_by_label
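
The step counts implied by these settings can be checked with a little arithmetic. The sketch below assumes the usual linear-warmup-then-linear-decay shape for the `linear` scheduler and rounds the warmup length up; `linear_lr` is a hypothetical helper, and the library's exact rounding may differ.

```python
import math

num_samples = 495     # training set size from the card
batch_size = 16       # per_device_train_batch_size
num_epochs = 20       # num_train_epochs
warmup_ratio = 0.1
peak_lr = 2e-05       # learning_rate

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def linear_lr(step):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

print(steps_per_epoch, total_steps, warmup_steps)
```

The result of 31 steps per epoch and 620 steps in total agrees with the Training Logs, where step 10 corresponds to epoch 0.3226 and step 620 to epoch 20.0.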

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 20
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: group_by_label
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss
0.3226 10 0.4937
0.6452 20 0.9953
0.9677 30 0.4929
1.2903 40 0.4808
1.6129 50 0.9902
1.9355 60 0.4863
2.2581 70 0.484
2.5806 80 0.4853
2.9032 90 0.9728
3.2258 100 0.4703
3.5484 110 0.494
3.8710 120 0.9468
4.1935 130 0.0
4.5161 140 0.9515
4.8387 150 0.9079
5.1613 160 0.0
5.4839 170 0.923
5.8065 180 0.9181
6.1290 190 0.0
6.4516 200 0.898
6.7742 210 0.449
7.0968 220 0.4621
7.4194 230 0.9008
7.7419 240 0.4146
8.0645 250 0.4251
8.3871 260 0.4088
8.7097 270 0.8822
9.0323 280 0.4276
9.3548 290 0.4322
9.6774 300 0.7872
10.0 310 0.4678
10.3226 320 0.4307
10.6452 330 0.8792
10.9677 340 0.4576
11.2903 350 0.3725
11.6129 360 0.9067
11.9355 370 0.4236
12.2581 380 0.4951
12.5806 390 0.4136
12.9032 400 0.9206
13.2258 410 0.3733
13.5484 420 0.3658
13.8710 430 0.8132
14.1935 440 0.0
14.5161 450 0.8365
14.8387 460 0.7183
15.1613 470 0.0
15.4839 480 0.7268
15.8065 490 0.8223
16.1290 500 0.0
16.4516 510 0.7573
16.7742 520 0.4578
17.0968 530 0.3628
17.4194 540 0.8379
17.7419 550 0.3618
18.0645 560 0.3605
18.3871 570 0.4376
18.7097 580 0.8496
19.0323 590 0.3392
19.3548 600 0.4891
19.6774 610 0.8228
20.0 620 0.4599

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 3.4.1
  • Transformers: 4.57.6
  • PyTorch: 2.5.1+cpu
  • Accelerate: 1.13.0
  • Datasets: 3.2.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

BatchAllTripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Model Repository

  • Hub ID: cnuland/semantic-routing-sensitivity
  • Model size: 22.7M params
  • Tensor type: F32 (Safetensors)