SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
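The architecture above is a mean-pooling pipeline: token embeddings from the BERT encoder are averaged over non-padding positions, then L2-normalized. A minimal sketch of the Pooling and Normalize steps in plain PyTorch, using dummy tensors with this model's 384-dimensional embeddings (not the actual model weights):

```python
import torch
import torch.nn.functional as F

def mean_pool(token_embeddings, attention_mask):
    # Average token embeddings over real tokens only, ignoring padding
    # (this mirrors pooling_mode_mean_tokens=True above).
    mask = attention_mask.unsqueeze(-1).float()       # [batch, seq, 1]
    summed = (token_embeddings * mask).sum(dim=1)     # [batch, dim]
    counts = mask.sum(dim=1).clamp(min=1e-9)          # [batch, 1]
    return summed / counts

# Dummy encoder output: batch of 2 sentences, 4 tokens each, 384 dims.
tokens = torch.randn(2, 4, 384)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])     # second sentence is shorter

pooled = mean_pool(tokens, mask)
normalized = F.normalize(pooled, p=2, dim=1)          # the Normalize() module
print(normalized.shape)        # torch.Size([2, 384])
print(normalized.norm(dim=1))  # each row has unit length
```

Because of the final Normalize() module, every sentence embedding this model produces is unit-length, which is why cosine similarity is the natural similarity function.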

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("AShi846/all-MiniLM-L6-v2_rag_ft_e-3")
# Run inference
sentences = [
    'The data contains information about submissions to a prestigious machine learning conference called ICLR. Columns:\nyear, paper, authors, ratings, decisions, institution, csranking, categories, authors_citations, authors_publications, authors_hindex, arxiv. The data is stored in a pandas.DataFrame format. \n\nCreate two fields called has_top_company and has_top_institution. The field has_top_company equals 1 if the article contains an author in the following list of companies ["Facebook", "Google", "Microsoft", "Deepmind"], and 0 otherwise. The field has_top_institution equals 1 if the article contains an author in the top 10 institutions according to CSRankings.',
    "Recall that, in the Hedge algorithm we learned in class, the total loss over time is upper bounded by $\\sum_{t = 1}^T m_i^t + \\frac{\\ln N}{\\epsilon} + \\epsilon T$. In the case of investments, we want to do almost as good as the best investment. Let $g_i^t$ be the fractional change of the value of $i$'th investment at time $t$. I.e., $g_i^t = (100 + change(i))/100$, and $p_i^{t+1} = p_i^{t} \\cdot g_i^t$. Thus, after time $T$, $p_i^{T+1} = p_i^1 \\prod_{t = 1}^T g_i^t$. To get an analogous bound to that of the Hedge algorithm, we take the logarithm. The logarithm of the total gain would be $\\sum_{t=1}^T \\ln g_i^t$. To convert this into a loss, we multiply this by $-1$, which gives a loss of $\\sum_{t=1}^T (- \\ln g_i^t)$. Hence, to do almost as good as the best investment, we make our cost vectors to be $m_i^t = - \\ln g_i^t$. Now, from the analysis of Hedge algorithm in the lecture, it follows that for all $i \\in [N]$, $$\\sum_{t = 1}^T p^{(t)}_i \\cdot m^{(t)} \\leq \\sum_{t = 1}^{T} m^{(t)}_i + \\frac{\\ln N}{\\epsilon} + \\epsilon T.$$ Taking the exponent in both sides, We have that \\begin{align*} \\exp \\left( \\sum_{t = 1}^T p^{(t)}_i \\cdot m^{(t)} \\right) &\\leq \\exp \\left( \\sum_{t = 1}^{T} m^{(t)}_i + \\frac{\\ln N}{\\epsilon} + \\epsilon T \\right)\\\\ \\prod_{t = 1}^T \\exp( p^{(t)}_i \\cdot m^{(t)} ) &\\leq \\exp( \\ln N / \\epsilon + \\epsilon T) \\prod_{t = 1}^T \\exp(m^t_i) \\\\ \\prod_{t = 1}^T \\prod_{i \\in [N]} (1 / g_i^t)^{p^{(t)}_i} &\\leq \\exp( \\ln N / \\epsilon + \\epsilon T) \\prod_{t = 1}^{T} (1/g^{(t)}_i) \\end{align*} Taking the $T$-th root on both sides, \\begin{align*} \\left(\\prod_{t = 1}^T \\prod_{i \\in [N]} (1 / g_i^t)^{p^{(t)}_i} \\right)^{(1/T)} &\\leq \\exp( \\ln N / \\epsilon  T + \\epsilon ) \\left( \\prod_{t = 1}^{T} (1/g^{(t)}_i) \\right)^{(1/T)}. \\end{align*} This can be interpreted as the weighted geometric mean of the loss is not much worse than the loss of the best performing investment.",
    '1',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
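Since the embeddings are already unit-length, cosine similarity reduces to a plain dot product. A small sketch with dummy normalized vectors standing in for `model.encode()` output (assumed shapes only, no model download):

```python
import torch
import torch.nn.functional as F

# Stand-ins for three 384-dim sentence embeddings; real encode() output
# is already normalized, so we normalize the dummies too.
emb = F.normalize(torch.randn(3, 384), p=2, dim=1)

cosine = emb @ emb.T   # for unit vectors, dot product == cosine similarity
print(cosine.shape)    # torch.Size([3, 3])
print(cosine.diag())   # ~1.0: each sentence is maximally similar to itself
```

This also means a simple matrix multiply is enough for semantic search over a precomputed corpus: encode the query, take dot products against the corpus matrix, and sort.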

Training Details

Training Dataset

Unnamed Dataset

  • Size: 475 training samples
  • Columns: sentence_0, sentence_1, and label
  • Approximate statistics based on the first 475 samples:
      • sentence_0 (string): min 5 tokens, mean 135.81 tokens, max 256 tokens
      • sentence_1 (string): min 3 tokens, mean 110.0 tokens, max 256 tokens
      • label (float): min 0.1, mean 0.1, max 0.1
  • Samples:
    • Sample 1:
      sentence_0: Assume that your team is discussing the following java code:

      public final class DataStructure {
          public void add(int val) { /.../ }
          private boolean isFull() { /.../ }
      }

      Your colleagues were changing the parameter type of "add" to an "Integer". Explain whether this breaks backward compatibility and why or why not (also without worrying about whether this is a good or a bad thing).
      sentence_1: D(cat,dog)=2, D(cat,pen)=6, D(cat,table)=6, D(dog,pen)=6, D(dog,table)=6, D(pen,table)=2
      label: 0.1
    • Sample 2:
      sentence_0: If several elements are ready in a reservation station, which one do you think should be selected? Very briefly discuss the options.
      sentence_1: Obama SLOP/1 Election returns document 3 Obama SLOP/2 Election returns documents 3 and T Obama SLOP/5 Election returns documents 3,1, and 2 Thus the values are X=1, x=2, and x=5 Obama = (4 : {1 - [3}, {2 - [6]}, {3 [2,17}, {4 - [1]}) Election = (4: {1 - [4)}, (2 - [1, 21), {3 - [3]}, {5 - [16,22, 51]})
      label: 0.1
    • Sample 3:
      sentence_0: If process i fails, then eventually all processes j≠i fail. Is the following true? If no process j≠i fails, then process i has failed.
      sentence_1: No, it is almost certain that it would not work. On a dynamically-scheduled processor, the user is not supposed to see the returned value from a speculative load because it will never be committed; the whole idea of the attack is to make speculative use of the result and leave a microarchitectural trace of the value before the instruction is squashed. In Itanium, the returned value of the speculative load instruction is architecturally visible and checking whether the load is valid is left to the compiler which, in fact, might or might not perform such a check. In this context, it would have been a major implementation mistake if the value loaded speculatively under a memory access violation were the true one that the current user is not allowed to access; clearly, the implementa...
      label: 0.1
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
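CosineSimilarityLoss with an MSE loss function regresses the cosine similarity of each sentence pair toward its gold score. A minimal sketch of that objective on dummy embedding pairs (the batch size and the 0.1 labels below just echo this card's dataset; no real model is involved):

```python
import torch
import torch.nn.functional as F

def cosine_similarity_loss(emb_a, emb_b, labels):
    # CosineSimilarityLoss with loss_fct=MSELoss: push cos(u, v)
    # toward the gold similarity score for each pair.
    cos = F.cosine_similarity(emb_a, emb_b, dim=1)  # [batch]
    return F.mse_loss(cos, labels)

# Dummy batch of 4 embedding pairs with this model's 384 dims.
a = F.normalize(torch.randn(4, 384), dim=1)
b = F.normalize(torch.randn(4, 384), dim=1)
labels = torch.full((4,), 0.1)  # every label in this dataset is 0.1

loss = cosine_similarity_loss(a, b, labels)
print(loss.item())  # non-negative scalar MSE
```

Note that with a constant label of 0.1, this objective pulls all pairs toward the same low similarity rather than teaching a graded ranking.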
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Framework Versions

  • Python: 3.12.8
  • Sentence Transformers: 3.4.1
  • Transformers: 4.51.3
  • PyTorch: 2.6.0+cu126
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
Model Size

  • Parameters: 22.7M
  • Tensor type: F32 (Safetensors)