metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:4338
  - loss:CosineSimilarityLoss
  - loss:MultipleNegativesRankingLoss
base_model: sentence-transformers/all-MiniLM-L6-v2
widget:
  - source_sentence: >-
      What are the main climatic factors influencing water level fluctuations in
      lakes, particularly in semi-arid regions?
    sentences:
      - >-
        The main climatic factors influencing water level fluctuations in lakes
        in semi-arid regions include potential evapotranspiration,
        precipitation, temperature, and vapor pressure.
      - >-
        Bias correction improves the accuracy of satellite precipitation data,
        enhancing its effectiveness in streamflow simulation.
      - >-
        Climate change is associated with an increase in the frequency and
        intensity of extreme rainfall events, although regional variations can
        complicate the detection of consistent trends.
  - source_sentence: What is the purpose of the WATYIELD model in hydrology?
    sentences:
      - >-
        Different precipitation datasets can lead to significant variations in
        the simulation of blue and green water resources, impacting water
        resource assessment and management.
      - >-
        The WATYIELD model quantifies the impact of land use changes on stream
        discharge, facilitating predictions based on alterations in vegetation
        cover.
      - >-
        Antecedent wetness conditions influence the timing and magnitude of DOC
        mobilization, with wetter conditions leading to faster and higher DOC
        export compared to drier conditions, which cause delays and reduced
        export.
  - source_sentence: >-
      How does deep groundwater discharge influence solute budgets in
      mountainous watersheds?
    sentences:
      - >-
        Deep groundwater discharge contributes significant solute loads to
        streams, affecting water quality and ecological health.
      - >-
        Strategies include adaptive cooperation, information sharing, water
        conservation, development of alternative water sources, and flexible
        water allocation policies.
      - >-
        Groundwater storage depletion can be influenced by land use changes,
        groundwater abstraction, and decreases in precipitation due to climate
        change.
  - source_sentence: >-
      How can uncertainty in predictive modeling of seawater intrusion be
      effectively quantified and managed in coastal aquifers?
    sentences:
      - >-
        By employing optimized sampling strategies and methods like Null Space
        Monte Carlo to explore parameter spaces while integrating diverse
        measurement data.
      - >-
        Factors include operational costs, potential losses from dam breaches,
        benefits provided by the dam, and social impacts on local communities.
      - >-
        The relative permeability is influenced by phase saturation, wettability
        conditions, capillary number, and the interfacial area between the two
        fluids.
  - source_sentence: What is the relationship between groundwater and streamflow?
    sentences:
      - >-
        Long-chain alkanes and their stable hydrogen isotopes reflect variations
        in vegetation types and moisture sources, providing insights into
        historical precipitation patterns and climatic conditions.
      - >-
        A floating vegetation canopy alters flow dynamics and increases near-bed
        turbulent kinetic energy, which can lead to sediment resuspension and
        reduced deposition beneath the canopy.
      - >-
        Groundwater can sustain streamflow during dry periods, while streams can
        also contribute water back to groundwater through infiltration.
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
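
The Pooling and Normalize modules listed above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the function name `mean_pool_and_normalize` and the toy vectors are assumptions made for illustration, the vectors use 4 dimensions instead of 384, and the sketch assumes at least one non-padding token per sequence.

```python
import math

def mean_pool_and_normalize(token_vectors, attention_mask):
    """Average the vectors of non-padding tokens, then scale the result to unit length."""
    dim = len(token_vectors[0])
    # Keep only positions whose mask entry is 1 (real tokens, not padding).
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    # Mean pooling: per-dimension average over the kept tokens.
    pooled = [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    # Normalize: divide by the L2 norm so the embedding has length 1.
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

# Toy sequence (hypothetical values): 3 real tokens plus 1 padding position.
token_vectors = [
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, ignored because its mask entry is 0
]
attention_mask = [1, 1, 1, 0]
embedding = mean_pool_and_normalize(token_vectors, attention_mask)
print(embedding)
```

Because the final module normalizes every embedding to unit length, the cosine similarity of two embeddings equals their dot product.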

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("HydroEmbed/HydroEmbed-OpenQA-MiniLM-DualLoss")
# Run inference
sentences = [
    'What is the relationship between groundwater and streamflow?',
    'Groundwater can sustain streamflow during dry periods, while streams can also contribute water back to groundwater through infiltration.',
    'A floating vegetation canopy alters flow dynamics and increases near-bed turbulent kinetic energy, which can lead to sediment resuspension and reduced deposition beneath the canopy.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])

Training Details

Training Datasets

Unnamed Dataset

  • Size: 2,169 training samples
  • Columns: sentence_0, sentence_1, and label
  • Approximate statistics based on the first 1000 samples:
    | | sentence_0 | sentence_1 | label |
    | type | string | string | float |
    | details | min: 11 tokens; mean: 23.44 tokens; max: 45 tokens | min: 16 tokens; mean: 33.55 tokens; max: 71 tokens | min: 1.0; mean: 1.0; max: 1.0 |
  • Samples:
    | sentence_0 | sentence_1 | label |
    | How can deep learning technologies improve the identification and management of unregulated private pumping wells in groundwater systems? | Deep learning technologies can accurately detect and map private pumping wells using image data, enhancing groundwater management by providing spatial distribution insights and reducing the labor-intensive nature of traditional investigations. | 1.0 |
    | How does solar-induced chlorophyll fluorescence relate to vegetation transpiration across different land cover types and environmental conditions? | Solar-induced chlorophyll fluorescence exhibits a robust linear correlation with vegetation transpiration, which is influenced by land cover types and various environmental factors, showing higher sensitivity in C4 compared to C3 vegetation. | 1.0 |
    | How does soil salinity affect the accuracy of soil moisture measurements from different sensing technologies and satellite products? | Soil salinity introduces significant errors in dielectric-based soil moisture measurements, with L-band products being more affected than C-band products. | 1.0 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
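
As a rough illustration of what this loss computes, the sketch below takes the mean squared error between the cosine similarity of each sentence pair and its label, matching the `MSELoss` setting above. It is a pure-Python approximation for intuition, not the library code, and the two-dimensional vectors are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the target label of each pair."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Hypothetical embeddings: one identical pair and one orthogonal pair, both labeled 1.0.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 1.0]
loss = cosine_similarity_loss(pairs, labels)
print(loss)  # 0.5: the identical pair contributes 0, the orthogonal pair contributes 1
```

Since every label in this dataset is 1.0, this objective only pulls paired sentences together. Separating unrelated sentences is presumably left to the second loss.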
    

Unnamed Dataset

  • Size: 2,169 training samples
  • Columns: sentence_0, sentence_1, and label
  • Approximate statistics based on the first 1000 samples:
    | | sentence_0 | sentence_1 | label |
    | type | string | string | float |
    | details | min: 11 tokens; mean: 23.58 tokens; max: 47 tokens | min: 15 tokens; mean: 33.32 tokens; max: 63 tokens | min: 1.0; mean: 1.0; max: 1.0 |
  • Samples:
    | sentence_0 | sentence_1 | label |
    | How does climate change impact agricultural water supply and demand in arid and semi-arid regions? | Climate change exacerbates agricultural water scarcity by increasing evaporation rates and altering precipitation patterns, leading to a higher agricultural water demand while potentially reducing the available water supply. | 1.0 |
    | How do changes in land use and climate affect river discharge dynamics in Mediterranean catchments? | Changes in land use and climate primarily influence river discharge dynamics by altering vegetation cover and its associated water consumption, leading to significant reductions in discharge despite minor changes in precipitation. | 1.0 |
    | Why is it important to regularly update rating curves in hydrological studies? | Regular updates ensure that changes in river bed profiles or other environmental factors are accurately reflected in discharge estimations. | 1.0 |
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
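
To make the `scale` and `cos_sim` parameters concrete, here is a simplified pure-Python sketch of an in-batch-negatives ranking loss. For each anchor, the cosine similarities to all positives in the batch are multiplied by the scale and treated as logits, and the loss is the cross-entropy of picking the anchor's own positive. The function names and toy vectors are assumptions for illustration, and this is not the library implementation.

```python
import math

def cos_sim(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled similarities; positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Every other positive in the batch acts as a negative for this anchor.
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Hypothetical batch of two question/answer embedding pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = multiple_negatives_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = multiple_negatives_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is close to zero when each anchor is most similar to its own positive and grows large when the pairs are mismatched. A larger scale sharpens that contrast.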
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 20
  • fp16: True
  • multi_dataset_batch_sampler: round_robin
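
The `round_robin` setting means batches are drawn from the two training datasets in alternation, so each batch comes from a single dataset and is scored by that dataset's loss. The sketch below illustrates the idea in plain Python. It assumes the sampler stops as soon as one dataset is exhausted and omits shuffling. The helper names and the toy data are hypothetical.

```python
def batches(rows, batch_size):
    """Yield consecutive slices of rows, keeping a final partial batch."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def round_robin_batches(datasets, batch_size):
    """Alternate between datasets, yielding (name, batch) until any dataset runs out."""
    iterators = [(name, batches(rows, batch_size)) for name, rows in datasets.items()]
    while True:
        for name, iterator in iterators:
            batch = next(iterator, None)
            if batch is None:
                return  # one dataset is exhausted, so the epoch ends
            yield name, batch

# Toy stand-ins for the two training datasets (5 rows each, batch size 2).
toy_datasets = {"cosine": list(range(5)), "mnrl": list(range(5))}
schedule = [name for name, _ in round_robin_batches(toy_datasets, batch_size=2)]
print(schedule)
```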

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 20
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss
7.3529 500 0.094
14.7059 1000 0.0339
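
The fractional epoch values in the log appear to follow from the dataset sizes and batch size reported above. The short calculation below reproduces them, assuming a single training device. `gradient_accumulation_steps` is 1 and `dataloader_drop_last` is False, as listed in the hyperparameters.

```python
import math

samples_per_dataset = 2169  # size of each of the two training datasets
num_datasets = 2
batch_size = 64             # per_device_train_batch_size
num_epochs = 20             # num_train_epochs

# The last partial batch is kept, so round up.
batches_per_dataset = math.ceil(samples_per_dataset / batch_size)
# Round-robin sampling alternates between the datasets (single device assumed).
steps_per_epoch = batches_per_dataset * num_datasets

for step in (500, 1000):
    print(step, round(step / steps_per_epoch, 4))

total_steps = steps_per_epoch * num_epochs
print(total_steps)
```

Under these assumptions an epoch is 68 steps, which gives epoch 7.3529 at step 500 and 14.7059 at step 1000. A 20-epoch run would then end at step 1360, before a third log entry at step 1500.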

Framework Versions

  • Python: 3.11.1
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.7.0+cu118
  • Accelerate: 1.6.0
  • Datasets: 3.5.1
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}