---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:4338
- loss:CosineSimilarityLoss
- loss:MultipleNegativesRankingLoss
base_model: sentence-transformers/all-MiniLM-L6-v2
widget:
- source_sentence: What are the main climatic factors influencing water level fluctuations
in lakes, particularly in semi-arid regions?
sentences:
- The main climatic factors influencing water level fluctuations in lakes in semi-arid
regions include potential evapotranspiration, precipitation, temperature, and
vapor pressure.
- Bias correction improves the accuracy of satellite precipitation data, enhancing
its effectiveness in streamflow simulation.
- Climate change is associated with an increase in the frequency and intensity of
extreme rainfall events, although regional variations can complicate the detection
of consistent trends.
- source_sentence: What is the purpose of the WATYIELD model in hydrology?
sentences:
- Different precipitation datasets can lead to significant variations in the simulation
of blue and green water resources, impacting water resource assessment and management.
- The WATYIELD model quantifies the impact of land use changes on stream discharge,
facilitating predictions based on alterations in vegetation cover.
- Antecedent wetness conditions influence the timing and magnitude of DOC mobilization,
with wetter conditions leading to faster and higher DOC export compared to drier
conditions, which cause delays and reduced export.
- source_sentence: How does deep groundwater discharge influence solute budgets in
mountainous watersheds?
sentences:
- Deep groundwater discharge contributes significant solute loads to streams, affecting
water quality and ecological health.
- Strategies include adaptive cooperation, information sharing, water conservation,
development of alternative water sources, and flexible water allocation policies.
- Groundwater storage depletion can be influenced by land use changes, groundwater
abstraction, and decreases in precipitation due to climate change.
- source_sentence: How can uncertainty in predictive modeling of seawater intrusion
be effectively quantified and managed in coastal aquifers?
sentences:
- By employing optimized sampling strategies and methods like Null Space Monte Carlo
to explore parameter spaces while integrating diverse measurement data.
- Factors include operational costs, potential losses from dam breaches, benefits
provided by the dam, and social impacts on local communities.
- The relative permeability is influenced by phase saturation, wettability conditions,
capillary number, and the interfacial area between the two fluids.
- source_sentence: What is the relationship between groundwater and streamflow?
sentences:
- Long-chain alkanes and their stable hydrogen isotopes reflect variations in vegetation
types and moisture sources, providing insights into historical precipitation patterns
and climatic conditions.
- A floating vegetation canopy alters flow dynamics and increases near-bed turbulent
kinetic energy, which can lead to sediment resuspension and reduced deposition
beneath the canopy.
- Groundwater can sustain streamflow during dry periods, while streams can also
contribute water back to groundwater through infiltration.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---
# SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) <!-- at revision c9745ed1d9f207416be6d2e6f8de32d1f16199bf -->
- **Maximum Sequence Length:** 256 tokens
- **Output Dimensionality:** 384 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
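As a rough illustration of what the `Pooling` (mean tokens) and `Normalize` modules above compute, here is a minimal NumPy sketch. The token embeddings and attention mask are made-up toy values (dimension 6 instead of the real 384), not actual model outputs:

```python
import numpy as np

# Toy "token embeddings" standing in for the BertModel output:
# 4 tokens, embedding dim 6 (the real model uses dim 384).
token_embeddings = np.arange(24, dtype=np.float64).reshape(4, 6)
attention_mask = np.array([1, 1, 1, 0])  # last token is padding

# (1) Pooling with pooling_mode_mean_tokens=True: average the
# embeddings of non-padding tokens only.
mask = attention_mask[:, None]
pooled = (token_embeddings * mask).sum(axis=0) / mask.sum()

# (2) Normalize(): scale the sentence embedding to unit L2 norm,
# so dot products between embeddings equal cosine similarities.
embedding = pooled / np.linalg.norm(pooled)

print(embedding.shape)             # (6,)
print(np.linalg.norm(embedding))   # ~1.0
```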
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load the model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("HydroEmbed/HydroEmbed-OpenQA-MiniLM-DualLoss")

# Run inference
sentences = [
    'What is the relationship between groundwater and streamflow?',
    'Groundwater can sustain streamflow during dry periods, while streams can also contribute water back to groundwater through infiltration.',
    'A floating vegetation canopy alters flow dynamics and increases near-bed turbulent kinetic energy, which can lead to sediment resuspension and reduced deposition beneath the canopy.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
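Because the model ends in a `Normalize()` module, cosine similarity between its embeddings reduces to a plain matrix product of unit-length vectors, which is essentially what `model.similarity` returns here. A self-contained sketch with toy unit vectors (not real model outputs):

```python
import numpy as np

# Three toy unit-length "sentence embeddings" (dim 3, not 384).
embeddings = np.array([
    [0.6, 0.8, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])

# For unit vectors, the cosine-similarity matrix is just E @ E.T.
similarities = embeddings @ embeddings.T
print(similarities.shape)  # (3, 3)

# The diagonal is 1.0 (each sentence vs. itself); off-diagonal
# entries are the cosine similarities between sentence pairs.
print(similarities[0, 1])  # 0.96 — the first two vectors are close
```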
<!--
### Direct Usage (Transformers)
<details><summary>Click to see the direct usage in Transformers</summary>
</details>
-->
<!--
### Downstream Usage (Sentence Transformers)
You can finetune this model on your own dataset.
<details><summary>Click to expand</summary>
</details>
-->
<!--
### Out-of-Scope Use
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->
<!--
## Bias, Risks and Limitations
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->
<!--
### Recommendations
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->
## Training Details
### Training Datasets
#### Unnamed Dataset
* Size: 2,169 training samples
* Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
| | sentence_0 | sentence_1 | label |
|:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:--------------------------------------------------------------|
| type | string | string | float |
| details | <ul><li>min: 11 tokens</li><li>mean: 23.44 tokens</li><li>max: 45 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 33.55 tokens</li><li>max: 71 tokens</li></ul> | <ul><li>min: 1.0</li><li>mean: 1.0</li><li>max: 1.0</li></ul> |
* Samples:
| sentence_0 | sentence_1 | label |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------|
| <code>How can deep learning technologies improve the identification and management of unregulated private pumping wells in groundwater systems?</code> | <code>Deep learning technologies can accurately detect and map private pumping wells using image data, enhancing groundwater management by providing spatial distribution insights and reducing the labor-intensive nature of traditional investigations.</code> | <code>1.0</code> |
| <code>How does solar-induced chlorophyll fluorescence relate to vegetation transpiration across different land cover types and environmental conditions?</code> | <code>Solar-induced chlorophyll fluorescence exhibits a robust linear correlation with vegetation transpiration, which is influenced by land cover types and various environmental factors, showing higher sensitivity in C4 compared to C3 vegetation.</code> | <code>1.0</code> |
| <code>How does soil salinity affect the accuracy of soil moisture measurements from different sensing technologies and satellite products?</code> | <code>Soil salinity introduces significant errors in dielectric-based soil moisture measurements, with L-band products being more affected than C-band products.</code> | <code>1.0</code> |
* Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
```json
{
    "loss_fct": "torch.nn.modules.loss.MSELoss"
}
```
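As a rough sketch (not the library's implementation), `CosineSimilarityLoss` with an `MSELoss` objective penalizes the squared gap between a pair's cosine similarity and its gold label. The vectors below are toy values:

```python
import numpy as np

def cosine_similarity_loss(u, v, label):
    """Squared error between cos(u, v) and the gold label (single pair)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return (cos - label) ** 2

u = np.array([1.0, 0.0])
v = np.array([1.0, 0.0])
# Identical vectors with label 1.0: cosine similarity matches the
# label exactly, so the loss is zero.
print(cosine_similarity_loss(u, v, label=1.0))  # 0.0
```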
#### Unnamed Dataset
* Size: 2,169 training samples
* Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
| | sentence_0 | sentence_1 | label |
|:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:--------------------------------------------------------------|
| type | string | string | float |
| details | <ul><li>min: 11 tokens</li><li>mean: 23.58 tokens</li><li>max: 47 tokens</li></ul> | <ul><li>min: 15 tokens</li><li>mean: 33.32 tokens</li><li>max: 63 tokens</li></ul> | <ul><li>min: 1.0</li><li>mean: 1.0</li><li>max: 1.0</li></ul> |
* Samples:
| sentence_0 | sentence_1 | label |
|:-----------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------|
| <code>How does climate change impact agricultural water supply and demand in arid and semi-arid regions?</code> | <code>Climate change exacerbates agricultural water scarcity by increasing evaporation rates and altering precipitation patterns, leading to a higher agricultural water demand while potentially reducing the available water supply.</code> | <code>1.0</code> |
| <code>How do changes in land use and climate affect river discharge dynamics in Mediterranean catchments?</code> | <code>Changes in land use and climate primarily influence river discharge dynamics by altering vegetation cover and its associated water consumption, leading to significant reductions in discharge despite minor changes in precipitation.</code> | <code>1.0</code> |
| <code>Why is it important to regularly update rating curves in hydrological studies?</code> | <code>Regular updates ensure that changes in river bed profiles or other environmental factors are accurately reflected in discharge estimations.</code> | <code>1.0</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
```
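A minimal sketch of what `MultipleNegativesRankingLoss` optimizes with these parameters: the in-batch cosine similarities, multiplied by `scale=20`, are treated as logits, and each anchor's paired positive (the diagonal) is the target class of a softmax cross-entropy. The similarity matrix below uses toy values, not real model outputs:

```python
import numpy as np

def mnr_loss(sim_matrix, scale=20.0):
    """Cross-entropy over scaled in-batch similarities; row i's target is column i."""
    logits = scale * sim_matrix                     # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # diagonal = paired positives

# Each anchor is most similar to its own positive, so the loss is near zero.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.0, 0.1, 0.95]])
print(mnr_loss(sim))  # small value, close to 0
```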
### Training Hyperparameters
#### Non-Default Hyperparameters
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 64
- `num_train_epochs`: 20
- `fp16`: True
- `multi_dataset_batch_sampler`: round_robin
#### All Hyperparameters
<details><summary>Click to expand</summary>
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 64
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 20
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `tp_size`: 0
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin
</details>
### Training Logs
| Epoch | Step | Training Loss |
|:-------:|:----:|:-------------:|
| 7.3529 | 500 | 0.094 |
| 14.7059 | 1000 | 0.0339 |
### Framework Versions
- Python: 3.11.1
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu118
- Accelerate: 1.6.0
- Datasets: 3.5.1
- Tokenizers: 0.21.1
## Citation
### BibTeX
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
<!--
## Glossary
*Clearly define terms in order to be accessible across audiences.*
-->
<!--
## Model Card Authors
*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->
<!--
## Model Card Contact
*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->