HydroEmbed-OpenQA-MiniLM-DualLoss / README.md

rsajja


tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:4338
  - loss:CosineSimilarityLoss
  - loss:MultipleNegativesRankingLoss
base_model: sentence-transformers/all-MiniLM-L6-v2
widget:
  - source_sentence: >-
      What are the main climatic factors influencing water level fluctuations in
      lakes, particularly in semi-arid regions?
    sentences:
      - >-
        The main climatic factors influencing water level fluctuations in lakes
        in semi-arid regions include potential evapotranspiration,
        precipitation, temperature, and vapor pressure.
      - >-
        Bias correction improves the accuracy of satellite precipitation data,
        enhancing its effectiveness in streamflow simulation.
      - >-
        Climate change is associated with an increase in the frequency and
        intensity of extreme rainfall events, although regional variations can
        complicate the detection of consistent trends.
  - source_sentence: What is the purpose of the WATYIELD model in hydrology?
    sentences:
      - >-
        Different precipitation datasets can lead to significant variations in
        the simulation of blue and green water resources, impacting water
        resource assessment and management.
      - >-
        The WATYIELD model quantifies the impact of land use changes on stream
        discharge, facilitating predictions based on alterations in vegetation
        cover.
      - >-
        Antecedent wetness conditions influence the timing and magnitude of DOC
        mobilization, with wetter conditions leading to faster and higher DOC
        export compared to drier conditions, which cause delays and reduced
        export.
  - source_sentence: >-
      How does deep groundwater discharge influence solute budgets in
      mountainous watersheds?
    sentences:
      - >-
        Deep groundwater discharge contributes significant solute loads to
        streams, affecting water quality and ecological health.
      - >-
        Strategies include adaptive cooperation, information sharing, water
        conservation, development of alternative water sources, and flexible
        water allocation policies.
      - >-
        Groundwater storage depletion can be influenced by land use changes,
        groundwater abstraction, and decreases in precipitation due to climate
        change.
  - source_sentence: >-
      How can uncertainty in predictive modeling of seawater intrusion be
      effectively quantified and managed in coastal aquifers?
    sentences:
      - >-
        By employing optimized sampling strategies and methods like Null Space
        Monte Carlo to explore parameter spaces while integrating diverse
        measurement data.
      - >-
        Factors include operational costs, potential losses from dam breaches,
        benefits provided by the dam, and social impacts on local communities.
      - >-
        The relative permeability is influenced by phase saturation, wettability
        conditions, capillary number, and the interfacial area between the two
        fluids.
  - source_sentence: What is the relationship between groundwater and streamflow?
    sentences:
      - >-
        Long-chain alkanes and their stable hydrogen isotopes reflect variations
        in vegetation types and moisture sources, providing insights into
        historical precipitation patterns and climatic conditions.
      - >-
        A floating vegetation canopy alters flow dynamics and increases near-bed
        turbulent kinetic energy, which can lead to sediment resuspension and
        reduced deposition beneath the canopy.
      - >-
        Groundwater can sustain streamflow during dry periods, while streams can
        also contribute water back to groundwater through infiltration.
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: sentence-transformers/all-MiniLM-L6-v2
Maximum Sequence Length: 256 tokens
Output Dimensionality: 384 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
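The last two modules in the architecture above (mean pooling followed by Normalize) can be illustrated with a small NumPy sketch. This is not the library's implementation; the function name `mean_pool_and_normalize` and the random toy inputs are assumptions made for illustration only.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Masked mean over tokens, then L2 normalization.

    token_embeddings: (batch, seq_len, dim) float array
    attention_mask:   (batch, seq_len) array of 0/1, where 1 marks a real token
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)      # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # avoid division by zero
    pooled = summed / counts                            # mean over real tokens only
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)         # unit-length sentence vectors

# Toy input (hypothetical): 2 sentences, 4 token positions, 384 dimensions.
# The second sentence has two padding positions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 384))
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])
sentence_vectors = mean_pool_and_normalize(tokens, mask)
print(sentence_vectors.shape)
```

Because the output vectors have unit length, the cosine similarity between two of them equals their dot product.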

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("HydroEmbed/HydroEmbed-OpenQA-MiniLM-DualLoss")
# Run inference
sentences = [
    'What is the relationship between groundwater and streamflow?',
    'Groundwater can sustain streamflow during dry periods, while streams can also contribute water back to groundwater through infiltration.',
    'A floating vegetation canopy alters flow dynamics and increases near-bed turbulent kinetic energy, which can lead to sediment resuspension and reduced deposition beneath the canopy.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
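For semantic search, the embeddings can be ranked against a query by cosine similarity. The sketch below is a minimal illustration, not part of the library: the helper `rank_by_cosine` is a hypothetical name, and the small 3-dimensional arrays stand in for real output of `model.encode` so the example runs without downloading the model.

```python
import numpy as np

def rank_by_cosine(query_embedding, corpus_embeddings, top_k=3):
    """Return (index, score) pairs for the top_k corpus rows most similar to the query."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    scores = c @ q                          # cosine similarity per corpus row
    order = np.argsort(-scores)[:top_k]     # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Stand-in vectors (hypothetical); in practice use model.encode(...) outputs.
query = np.array([1.0, 0.0, 0.0])
corpus = np.array([
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [2.0, 0.1, 0.0],    # nearly parallel to the query
    [-1.0, 0.0, 0.0],   # opposite direction
])
ranking = rank_by_cosine(query, corpus, top_k=2)
print(ranking)
```

With real embeddings, the query would be a hydrology question and the corpus rows candidate answers or passages.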

Training Details

Training Datasets

Unnamed Dataset

Size: 2,169 training samples
Columns: sentence_0, sentence_1, and label
Approximate statistics based on the first 1000 samples:
	sentence_0	sentence_1	label
type	string	string	float
details	min: 11 tokens mean: 23.44 tokens max: 45 tokens	min: 16 tokens mean: 33.55 tokens max: 71 tokens	min: 1.0 mean: 1.0 max: 1.0

Samples:

sentence_0	sentence_1	label
`How can deep learning technologies improve the identification and management of unregulated private pumping wells in groundwater systems?`	`Deep learning technologies can accurately detect and map private pumping wells using image data, enhancing groundwater management by providing spatial distribution insights and reducing the labor-intensive nature of traditional investigations.`	`1.0`
`How does solar-induced chlorophyll fluorescence relate to vegetation transpiration across different land cover types and environmental conditions?`	`Solar-induced chlorophyll fluorescence exhibits a robust linear correlation with vegetation transpiration, which is influenced by land cover types and various environmental factors, showing higher sensitivity in C4 compared to C3 vegetation.`	`1.0`
`How does soil salinity affect the accuracy of soil moisture measurements from different sensing technologies and satellite products?`	`Soil salinity introduces significant errors in dielectric-based soil moisture measurements, with L-band products being more affected than C-band products.`	`1.0`

Loss: CosineSimilarityLoss with these parameters:

{
    "loss_fct": "torch.nn.modules.loss.MSELoss"
}
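As a rough sketch of what this loss computes: the cosine similarity of each sentence pair is compared with its label using mean squared error. The NumPy code below is an illustration under that reading, not the library's code, and `cosine_similarity_mse` is a hypothetical name. Since every label in this dataset is 1.0, the loss here reduces to the mean of (1 - cos)^2 over the pairs.

```python
import numpy as np

def cosine_similarity_mse(emb_a, emb_b, labels):
    """Mean squared error between pairwise cosine similarity and target labels."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = (a * b).sum(axis=1)                # one cosine score per pair
    return float(np.mean((cos - labels) ** 2))

# Toy pairs (hypothetical): the first pair is identical, the second is orthogonal.
emb_a = np.array([[1.0, 0.0], [1.0, 0.0]])
emb_b = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([1.0, 1.0])                # all labels are 1.0, as in this dataset
loss = cosine_similarity_mse(emb_a, emb_b, labels)
print(loss)
```

The identical pair contributes zero error and the orthogonal pair contributes an error of 1, so the mean over the two pairs is 0.5.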

Unnamed Dataset

Size: 2,169 training samples
Columns: sentence_0, sentence_1, and label
Approximate statistics based on the first 1000 samples:
	sentence_0	sentence_1	label
type	string	string	float
details	min: 11 tokens mean: 23.58 tokens max: 47 tokens	min: 15 tokens mean: 33.32 tokens max: 63 tokens	min: 1.0 mean: 1.0 max: 1.0

Samples:

sentence_0	sentence_1	label
`How does climate change impact agricultural water supply and demand in arid and semi-arid regions?`	`Climate change exacerbates agricultural water scarcity by increasing evaporation rates and altering precipitation patterns, leading to a higher agricultural water demand while potentially reducing the available water supply.`	`1.0`
`How do changes in land use and climate affect river discharge dynamics in Mediterranean catchments?`	`Changes in land use and climate primarily influence river discharge dynamics by altering vegetation cover and its associated water consumption, leading to significant reductions in discharge despite minor changes in precipitation.`	`1.0`
`Why is it important to regularly update rating curves in hydrological studies?`	`Regular updates ensure that changes in river bed profiles or other environmental factors are accurately reflected in discharge estimations.`	`1.0`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
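The idea behind this loss can be sketched as follows: within a batch, each question's paired answer is the positive and every other answer in the batch serves as a negative; scaled cosine similarities are treated as logits for a cross-entropy loss. The NumPy code is a simplified illustration, not the library's implementation, and the function name and toy vectors are assumptions.

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch candidates; row i's correct candidate is column i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # scaled cos_sim matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch (hypothetical): three anchors with perfectly matching positives.
anchors = np.eye(3)
aligned = multiple_negatives_ranking_loss(anchors, np.eye(3))
# Shifting the positives by one row pairs every anchor with a wrong answer.
misaligned = multiple_negatives_ranking_loss(anchors, np.roll(np.eye(3), 1, axis=0))
print(aligned, misaligned)
```

With a batch size of 64, each pair is contrasted against 63 in-batch negatives under this scheme, so larger batches generally give a harder ranking task.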

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 64
per_device_eval_batch_size: 64
num_train_epochs: 20
fp16: True
multi_dataset_batch_sampler: round_robin

All Hyperparameters


overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 64
per_device_eval_batch_size: 64
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 20
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
tp_size: 0
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
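With `lr_scheduler_type: linear`, `warmup_steps: 0`, and `learning_rate: 5e-05`, the learning rate decays linearly from its initial value to zero over training. The sketch below shows that schedule in plain Python; `linear_decay_lr` is a hypothetical helper, and the total of 1360 steps is an estimate (20 epochs at 68 steps per epoch), not a value reported in this card.

```python
def linear_decay_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumed total number of optimizer steps (an estimate, see the note above).
total_steps = 1360
for step in (0, 500, 1000, total_steps):
    print(step, linear_decay_lr(step, total_steps))
```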

Training Logs

Epoch	Step	Training Loss
7.3529	500	0.094
14.7059	1000	0.0339
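The fractional epoch values in the log are consistent with the dataset sizes and batch size reported above. The short calculation below reproduces them, assuming a single device, `gradient_accumulation_steps: 1`, and a round-robin sampler that takes one batch from each of the two datasets in turn.

```python
import math

samples_per_dataset = 2169   # size of each of the two training datasets
num_datasets = 2
batch_size = 64              # per_device_train_batch_size
num_epochs = 20              # num_train_epochs

# dataloader_drop_last is False, so the final partial batch is kept.
batches_per_dataset = math.ceil(samples_per_dataset / batch_size)   # 34
steps_per_epoch = batches_per_dataset * num_datasets                # 68
total_steps = steps_per_epoch * num_epochs                          # 1360

for step in (500, 1000):
    print(step, round(step / steps_per_epoch, 4))
# 500 7.3529
# 1000 14.7059
```

Under these assumptions the run has 1360 steps in total, so the two logged rows fall at roughly 37% and 74% of training.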

Framework Versions

Python: 3.11.1
Sentence Transformers: 4.1.0
Transformers: 4.51.3
PyTorch: 2.7.0+cu118
Accelerate: 1.6.0
Datasets: 3.5.1
Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}