metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:269012
  - loss:CoSENTLoss
base_model: intfloat/e5-large-v2
widget:
  - source_sentence: smart cutting machine for crafts
    sentences:
      - HyperX Cloud Alpha Wireless Gaming Headset
      - Rubbermaid Brilliance 20-Piece Food Storage Set
      - >-
        Men's Wick Short Sleeve Crew - Light Merino Wool Camo Hunting Shirt, UV
        Protection Moisture Management Base Layer
  - source_sentence: high capacity portable hard drive
    sentences:
      - Mr. Heater Big Buddy Portable Propane Heater
      - Samsung Galaxy Watch 5 Pro
      - >-
        Sun Bum Original SPF 45 Sunscreen Mist - Broad Spectrum Moisturizing
        Facial Sunscreen Spray with Vitamin E - Hawaii 104 Act Compliant (Made
        without Octinoxate & Oxybenzone) - Travel Friendly - 3.4 oz
  - source_sentence: fluid acrylics for pouring art
    sentences:
      - >-
        Linen Suit for Men 2 Pieces Slim Fit Casual Suits Groomsmen Tuxedos
        Wedding Party Blazer Pants Set Beige
      - Mejuri Small Hoop Earrings in Gold
      - Singer Start 1304 Sewing Machine
  - source_sentence: premium wireless gaming headset
    sentences:
      - Vornado MVH Whole Room Heater
      - >-
        Westinghouse 11000 Peak Watt Tri-Fuel Portable Inverter Generator,
        Remote Start, Transfer Switch Ready, Gas/Propane/Natural Gas Powered,
        Low THD, Safe for Electronics, Parallel Capable, CO Sensor
      - >-
        Rattaner Patio Wicker Furniture Set 6 Pieces Outdoor HDPE Wicker
        Conversation Couch Sectional Chair Sofa Set with Grey Cushions
  - source_sentence: travel system with stroller and car seat
    sentences:
      - Chemex Classic Series Pour-Over Glass Coffeemaker
      - David Yurman Cable Classic Bracelet
      - Legion Stonehenge Paper Pad
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on intfloat/e5-large-v2

This is a sentence-transformers model fine-tuned from intfloat/e5-large-v2. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: intfloat/e5-large-v2
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
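
The Pooling and Normalize modules listed above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper names `mean_pool` and `l2_normalize` and the toy 2-dimensional token vectors are assumptions made for illustration (the real model produces 1024-dimensional vectors). It shows mean pooling over non-padding tokens followed by scaling to unit length.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length, the effect of a Normalize() step."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy input (hypothetical): three token vectors of width 2; the last is padding.
tokens = [[1.0, 3.0], [5.0, 5.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_embedding = l2_normalize(mean_pool(tokens, mask))
print(sentence_embedding)  # [0.6, 0.8]
```

Because the final vector has unit length, the cosine similarity between two embeddings reduces to their dot product.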

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub (replace the placeholder with this model's Hub ID)
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'travel system with stroller and car seat',
    'Chemex Classic Series Pour-Over Glass Coffeemaker',
    'Legion Stonehenge Paper Pad',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5295, 0.5210],
#         [0.5295, 1.0000, 0.5429],
#         [0.5210, 0.5429, 1.0000]])
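
A common next step is to rank candidate texts against a single query by their similarity score. The sketch below shows the idea on toy data, assuming unit-length embeddings (the architecture ends with a Normalize() module, so a dot product equals cosine similarity). The function `rank_candidates` and the 2-dimensional vectors are hypothetical stand-ins for real `model.encode(...)` output.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_candidates(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, dot(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical unit-length vectors standing in for model.encode(...) output.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]

ranking = rank_candidates(query, candidates)
print(ranking)  # candidate indices in order of decreasing similarity
```

With real embeddings, the same ordering could be obtained from one row of the matrix returned by `model.similarity`.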

Training Details

Training Dataset

Unnamed Dataset

  • Size: 269,012 training samples
  • Columns: sentence_0, sentence_1, and label
  • Approximate statistics based on the first 1000 samples:
    • sentence_0 (type: string): min 5 tokens, mean 9.64 tokens, max 21 tokens
    • sentence_1 (type: string): min 6 tokens, mean 20.82 tokens, max 99 tokens
    • label (type: float): min -1.0, mean 0.04, max 0.99
  • Samples:
    • sentence_0: razor set with handle and blades
      sentence_1: Hahnemühle Watercolor Journal
      label: -0.8008412511835391
    • sentence_0: mini perfume atomizer for refillable travel scent
      sentence_1: LISAPACK Perfume Travel Refillable Bottle - Atomizer Cologne Spray for Men Portable - Mini Sprayer Empty for Refill - Small Size 8ML Striped (Grey, Black, Silver)
      label: 0.85625
    • sentence_0: pour-over glass coffeemaker
      sentence_1: Shark Navigator Lift-Away Professional NV356E Vacuum
      label: 0.131319533933279
  • Loss: CoSENTLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "pairwise_cos_sim"
    }
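
To make the role of the `scale` parameter concrete, here is a minimal pure-Python sketch of the CoSENT objective as it is usually described: for every two training pairs in a batch where one has a lower label than the other, the loss penalizes the lower-labelled pair for having a higher cosine similarity. This is an illustrative re-derivation, not the library code. The function name `cosent_loss` and the toy inputs are assumptions.

```python
import math

def cosent_loss(cos_sims, labels, scale=20.0):
    """log(1 + sum of exp(scale * (s_low - s_high))) over label-ordered pairs."""
    terms = [0.0]  # exp(0) = 1 supplies the "1 +" inside the logarithm
    for s_low, y_low in zip(cos_sims, labels):
        for s_high, y_high in zip(cos_sims, labels):
            if y_low < y_high:
                terms.append(scale * (s_low - s_high))
    # Numerically stable log-sum-exp over all collected terms.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Toy batch (hypothetical): predicted cosine similarities and gold labels.
well_ordered = cosent_loss([0.9, 0.1], [1.0, 0.0])
mis_ordered = cosent_loss([0.1, 0.9], [1.0, 0.0])
print(well_ordered, mis_ordered)
```

Only the ordering of the labels matters here, not their absolute values, so the float labels in the dataset act as a ranking signal.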
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 1
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
  • router_mapping: {}
  • learning_rate_mapping: {}
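
With `lr_scheduler_type: linear`, `learning_rate: 5e-05`, and no warmup, the learning rate would be expected to decay in a straight line from its starting value to zero over the run. The sketch below reproduces that shape. The function `linear_lr` is a hypothetical helper, and the total of 8407 steps is an assumption derived from 269,012 samples at batch size 32 on a single device.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 8407  # assumption: ceil(269012 / 32), single device

for step in (0, 4000, 8000, TOTAL_STEPS):
    print(step, linear_lr(step, TOTAL_STEPS))
```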

Training Logs

Epoch Step Training Loss
0.0595 500 5.606
0.1189 1000 5.5059
0.1784 1500 5.4614
0.2379 2000 5.4299
0.2974 2500 5.415
0.3568 3000 5.4104
0.4163 3500 5.3718
0.4758 4000 5.3755
0.5353 4500 5.3545
0.5947 5000 5.3498
0.6542 5500 5.3392
0.7137 6000 5.3521
0.7732 6500 5.3248
0.8326 7000 5.3044
0.8921 7500 5.2916
0.9516 8000 5.2891
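
The Epoch column can be cross-checked against the dataset size and batch size reported above. The short calculation below assumes a single device and uses `gradient_accumulation_steps: 1` and `dataloader_drop_last: False` from the hyperparameter list. The helper `epoch_at` is hypothetical.

```python
import math

NUM_SAMPLES = 269_012  # training samples reported in this card
BATCH_SIZE = 32        # per_device_train_batch_size (one device assumed)

# The last partial batch is kept, so round up.
steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)

def epoch_at(step):
    """Fraction of the epoch completed at a given optimizer step."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch)  # 8407
for step in (500, 1000, 8000):
    print(step, epoch_at(step))
```

The results match the logged rows (0.0595, 0.1189, and 0.9516). The rows are 500 steps apart, so the last logged row is step 8000 rather than the final step of the epoch.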

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.1.0
  • Transformers: 4.56.0
  • PyTorch: 2.8.0+cu126
  • Accelerate: 1.10.1
  • Datasets: 4.0.0
  • Tokenizers: 0.22.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}