	---
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- dense
	- generated_from_trainer
	- dataset_size:262023
	- loss:MultipleNegativesRankingLoss
	base_model: intfloat/e5-base-v2
	widget:
	- source_sentence: 'query: Ezekiel Prophecies of Ezekiel'
	  sentences:
	  - 'passage: Then he went to the east gate. He climbed its steps and measured the
	    threshold of the gate; it was one rod deep.'
	  - 'passage: But if you do not obey the Lord, and if you rebel against his commands,
	    his hand will be against you, as it was against your ancestors.'
	  - 'passage: When you were dead in your sins and in the uncircumcision of your flesh,
	    God made you alive with Christ. He forgave us all our sins,'
	- source_sentence: 'query: The event ''Prophecies of Nahum'' as recorded in Scripture,
	    involving Nahum.'
	  sentences:
	  - "passage: Nothing can heal you;\n your wound is fatal.\nAll who hear the news\
	    \ about you\n clap their hands at your fall,\nfor who has not felt\n your\
	    \ endless cruelty?"
	  - 'passage: When David was told of this, he gathered all Israel and crossed the
	    Jordan; he advanced against them and formed his battle lines opposite them. David
	    formed his lines to meet the Arameans in battle, and they fought against him.'
	  - 'passage: Then the king of Assyria sent his field commander with a large army
	    from Lachish to King Hezekiah at Jerusalem. When the commander stopped at the
	    aqueduct of the Upper Pool, on the road to the Launderer’s Field,'
	- source_sentence: 'query: what happened to Job'
	  sentences:
	  - "passage: If I hold my head high, you stalk me like a lion\n and again display\
	    \ your awesome power against me."
	  - "passage: But Job has not marshaled his words against me,\n and I will not\
	    \ answer him with your arguments."
	  - "passage: I will pronounce my judgments on my people\n because of their wickedness\
	    \ in forsaking me,\nin burning incense to other gods\n and in worshiping what\
	    \ their hands have made."
	- source_sentence: 'query: what happened at peter meets cornelius'
	  sentences:
	  - 'passage: From the descendants of Bani:

	    Maadai, Amram, Uel,'
	  - 'passage: until I come and take you to a land like your own—a land of grain and
	    new wine, a land of bread and vineyards.'
	  - 'passage: So get up and go downstairs. Do not hesitate to go with them, for I
	    have sent them.”'
	- source_sentence: 'query: Ahaz'
	  sentences:
	  - 'passage: We boarded a ship from Adramyttium about to sail for ports along the
	    coast of the province of Asia, and we put out to sea. Aristarchus, a Macedonian
	    from Thessalonica, was with us.'
	  - 'passage: This is what the Lord says: “If those who do not deserve to drink the
	    cup must drink it, why should you go unpunished? You will not go unpunished, but
	    must drink it.'
	  - 'passage: Ahaz sent messengers to say to Tiglath-Pileser king of Assyria, “I am
	    your servant and vassal. Come up and save me out of the hand of the king of Aram
	    and of the king of Israel, who are attacking me.”'
	pipeline_tag: sentence-similarity
	library_name: sentence-transformers
	---

	# SentenceTransformer based on intfloat/e5-base-v2

	This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

	## Model Details

	### Model Description
	- Model Type: Sentence Transformer
	- Base model: [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) <!-- at revision f52bf8ec8c7124536f0efb74aca902b2995e5bcd -->
	- Maximum Sequence Length: 256 tokens
	- Output Dimensionality: 768 dimensions
	- Similarity Function: Cosine Similarity
	<!-- - Training Dataset: Unknown -->
	<!-- - Language: Unknown -->
	<!-- - License: Unknown -->
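
Because the architecture ends with a `Normalize()` layer, the embeddings should have unit length, in which case cosine similarity reduces to a plain dot product. The sketch below illustrates that identity with two hand-picked 2-dimensional unit vectors instead of real 768-dimensional embeddings; the helper names `dot` and `cosine_similarity` are illustrative and not part of the library.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Two toy vectors that already have (approximately) unit length,
# standing in for normalized sentence embeddings.
a = [0.6, 0.8]
b = [0.8, 0.6]

cos = cosine_similarity(a, b)
raw_dot = dot(a, b)
print(round(cos, 4), round(raw_dot, 4))
```

For unit-length vectors both numbers agree, so a vector index configured for inner product would be expected to give the same ranking as one configured for cosine similarity.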

	### Model Sources

	- Documentation: [Sentence Transformers Documentation](https://sbert.net)
	- Repository: [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
	- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

	### Full Model Architecture

	```
	SentenceTransformer(
	  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
	  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
	  (2): Normalize()
	)
	```
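
The pipeline above applies mean pooling over token embeddings (`pooling_mode_mean_tokens: True`) and then L2 normalization. A minimal pure-Python sketch of those two steps follows, assuming padding tokens are excluded through an attention mask; `mean_pool` and `normalize` are hypothetical helper names, and the tiny 2-dimensional vectors stand in for real 768-dimensional token embeddings.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    return [t / max(count, 1) for t in total]


def normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_embedding = normalize(pooled)
print(pooled, sentence_embedding)
```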

	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	# "sentence_transformers_model_id" is a placeholder; replace it with this model's Hub ID or a local path
	model = SentenceTransformer("sentence_transformers_model_id")
	# Run inference
	sentences = [
	    'query: Ahaz',
	    'passage: Ahaz sent messengers to say to Tiglath-Pileser king of Assyria, “I am your servant and vassal. Come up and save me out of the hand of the king of Aram and of the king of Israel, who are attacking me.”',
	    'passage: We boarded a ship from Adramyttium about to sail for ports along the coast of the province of Asia, and we put out to sea. Aristarchus, a Macedonian from Thessalonica, was with us.',
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# (3, 768)

	# Get the similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities)
	# tensor([[1.0000, 0.5851, 0.2630],
	#         [0.5851, 1.0000, 0.3747],
	#         [0.2630, 0.3747, 1.0000]])
	```
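
The examples in this card follow the E5 convention of prefixing search inputs with `query: ` and documents with `passage: `. The sketch below shows one way to apply those prefixes and to rank passages by score. It is an illustration only: `as_query`, `as_passage`, and `rank_passages` are hypothetical helpers, the passages are shortened, and the scores are copied from the first row of the similarity matrix printed above rather than computed by the model.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_query(text):
    """Prefix a search string the way the queries in this card are written."""
    return QUERY_PREFIX + text


def as_passage(text):
    """Prefix a document string the way the passages in this card are written."""
    return PASSAGE_PREFIX + text


def rank_passages(passages, scores):
    """Return (passage, score) pairs sorted from most to least similar."""
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in order]


query = as_query("Ahaz")
passages = [
    as_passage("We boarded a ship from Adramyttium"),       # shortened
    as_passage("Ahaz sent messengers to Tiglath-Pileser"),  # shortened
]
# Query-to-passage scores taken from the example output (not recomputed here).
scores = [0.2630, 0.5851]

ranked = rank_passages(passages, scores)
for passage, score in ranked:
    print(f"{score:.4f}  {passage}")
```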

	<!--
	### Direct Usage (Transformers)

	<details><summary>Click to see the direct usage in Transformers</summary>

	</details>
	-->

	<!--
	### Downstream Usage (Sentence Transformers)

	You can finetune this model on your own dataset.

	<details><summary>Click to expand</summary>

	</details>
	-->

	<!--
	### Out-of-Scope Use

	List how the model may foreseeably be misused and address what users ought not to do with the model.
	-->

	<!--
	## Bias, Risks and Limitations

	What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.
	-->

	<!--
	### Recommendations

	What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.
	-->

	## Training Details

	### Training Dataset

	#### Unnamed Dataset

	* Size: 262,023 training samples
	* Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
	* Approximate statistics based on the first 1000 samples:
	|         | sentence_0 | sentence_1 | label |
	|:--------|:-----------|:-----------|:------|
	| type    | string | string | float |
	| details | <ul><li>min: 5 tokens</li><li>mean: 26.46 tokens</li><li>max: 256 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 34.73 tokens</li><li>max: 82 tokens</li></ul> | <ul><li>min: 1.0</li><li>mean: 1.0</li><li>max: 1.0</li></ul> |
	* Samples:
	| sentence_0 | sentence_1 | label |
	|:-----------|:-----------|:------|
	| <code>query: Gilead</code> | <code>passage: Now Elijah the Tishbite, from Tishbe in Gilead, said to Ahab, “As the Lord, the God of Israel, lives, whom I serve, there will be neither dew nor rain in the next few years except at my word.”</code> | <code>1.0</code> |
	| <code>query: Canaanites: The descendants of Canaan, the son of Ham. Migrating from their original home, they seem to have reached the Persian Gulf, and to have there sojourned for some time. They thence “spread to the west, across the mountain chain of Lebanon to the very edge of the Mediterranean Sea, occupying all the land which later became Palestine, also to the north-west as far as the mountain chain of Taurus.</code> | <code>passage: She makes linen garments and sells them,<br> and supplies the merchants with sashes.</code> | <code>1.0</code> |
	| <code>query: who is God</code> | <code>passage: “‘Observe my Sabbaths and have reverence for my sanctuary. I am the Lord.</code> | <code>1.0</code> |
	* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
	```json
	{
	    "scale": 20.0,
	    "similarity_fct": "cos_sim",
	    "gather_across_devices": false
	}
	```
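
To make the parameters above concrete, here is a minimal pure-Python sketch of an in-batch-negatives ranking loss with `scale = 20.0` and cosine similarity. It is a simplified illustration of the idea, not the library's implementation: each anchor is scored against every positive in the batch, the scores are multiplied by the scale, and cross-entropy is taken with the matching positive as the target. The function names and the toy 2-dimensional vectors are assumptions made for the example.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])  # correct pairs match
swapped = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # pairs are mismatched
print(aligned, swapped)
```

The loss is close to zero when each anchor is most similar to its own positive and grows toward the scale value when the pairing is inverted, which is why a larger `scale` sharpens the penalty on misranked pairs.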

	### Training Hyperparameters
	#### Non-Default Hyperparameters

	- `per_device_train_batch_size`: 32
	- `per_device_eval_batch_size`: 32
	- `num_train_epochs`: 1
	- `max_steps`: 150
	- `multi_dataset_batch_sampler`: round_robin
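
A positive `max_steps` overrides `num_train_epochs` in the Hugging Face trainer, so training here stops after 150 optimizer steps. The arithmetic below estimates how much of the 262,023-sample dataset that covers. It assumes a single training device (the device count is not stated in this card) and uses `gradient_accumulation_steps = 1` from the full hyperparameter list.

```python
import math

dataset_size = 262_023
per_device_train_batch_size = 32
max_steps = 150
gradient_accumulation_steps = 1  # value from the full hyperparameter list
num_devices = 1                  # assumption: not stated in this card

effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
samples_seen = max_steps * effective_batch
fraction_of_dataset = samples_seen / dataset_size
steps_per_full_epoch = math.ceil(dataset_size / effective_batch)

print(f"samples seen: {samples_seen}")
print(f"fraction of dataset: {fraction_of_dataset:.2%}")
print(f"steps for one full epoch: {steps_per_full_epoch}")
```

Under these assumptions the run covers 4,800 samples, roughly 1.8% of the data, whereas a full epoch would take 8,189 steps.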

	#### All Hyperparameters
	<details><summary>Click to expand</summary>

	- `overwrite_output_dir`: False
	- `do_predict`: False
	- `eval_strategy`: no
	- `prediction_loss_only`: True
	- `per_device_train_batch_size`: 32
	- `per_device_eval_batch_size`: 32
	- `per_gpu_train_batch_size`: None
	- `per_gpu_eval_batch_size`: None
	- `gradient_accumulation_steps`: 1
	- `eval_accumulation_steps`: None
	- `torch_empty_cache_steps`: None
	- `learning_rate`: 5e-05
	- `weight_decay`: 0.0
	- `adam_beta1`: 0.9
	- `adam_beta2`: 0.999
	- `adam_epsilon`: 1e-08
	- `max_grad_norm`: 1
	- `num_train_epochs`: 1
	- `max_steps`: 150
	- `lr_scheduler_type`: linear
	- `lr_scheduler_kwargs`: None
	- `warmup_ratio`: 0.0
	- `warmup_steps`: 0
	- `log_level`: passive
	- `log_level_replica`: warning
	- `log_on_each_node`: True
	- `logging_nan_inf_filter`: True
	- `save_safetensors`: True
	- `save_on_each_node`: False
	- `save_only_model`: False
	- `restore_callback_states_from_checkpoint`: False
	- `no_cuda`: False
	- `use_cpu`: False
	- `use_mps_device`: False
	- `seed`: 42
	- `data_seed`: None
	- `jit_mode_eval`: False
	- `bf16`: False
	- `fp16`: False
	- `fp16_opt_level`: O1
	- `half_precision_backend`: auto
	- `bf16_full_eval`: False
	- `fp16_full_eval`: False
	- `tf32`: None
	- `local_rank`: 0
	- `ddp_backend`: None
	- `tpu_num_cores`: None
	- `tpu_metrics_debug`: False
	- `debug`: []
	- `dataloader_drop_last`: False
	- `dataloader_num_workers`: 0
	- `dataloader_prefetch_factor`: None
	- `past_index`: -1
	- `disable_tqdm`: False
	- `remove_unused_columns`: True
	- `label_names`: None
	- `load_best_model_at_end`: False
	- `ignore_data_skip`: False
	- `fsdp`: []
	- `fsdp_min_num_params`: 0
	- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
	- `fsdp_transformer_layer_cls_to_wrap`: None
	- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
	- `parallelism_config`: None
	- `deepspeed`: None
	- `label_smoothing_factor`: 0.0
	- `optim`: adamw_torch_fused
	- `optim_args`: None
	- `adafactor`: False
	- `group_by_length`: False
	- `length_column_name`: length
	- `project`: huggingface
	- `trackio_space_id`: trackio
	- `ddp_find_unused_parameters`: None
	- `ddp_bucket_cap_mb`: None
	- `ddp_broadcast_buffers`: False
	- `dataloader_pin_memory`: True
	- `dataloader_persistent_workers`: False
	- `skip_memory_metrics`: True
	- `use_legacy_prediction_loop`: False
	- `push_to_hub`: False
	- `resume_from_checkpoint`: None
	- `hub_model_id`: None
	- `hub_strategy`: every_save
	- `hub_private_repo`: None
	- `hub_always_push`: False
	- `hub_revision`: None
	- `gradient_checkpointing`: False
	- `gradient_checkpointing_kwargs`: None
	- `include_inputs_for_metrics`: False
	- `include_for_metrics`: []
	- `eval_do_concat_batches`: True
	- `fp16_backend`: auto
	- `push_to_hub_model_id`: None
	- `push_to_hub_organization`: None
	- `mp_parameters`:
	- `auto_find_batch_size`: False
	- `full_determinism`: False
	- `torchdynamo`: None
	- `ray_scope`: last
	- `ddp_timeout`: 1800
	- `torch_compile`: False
	- `torch_compile_backend`: None
	- `torch_compile_mode`: None
	- `include_tokens_per_second`: False
	- `include_num_input_tokens_seen`: no
	- `neftune_noise_alpha`: None
	- `optim_target_modules`: None
	- `batch_eval_metrics`: False
	- `eval_on_start`: False
	- `use_liger_kernel`: False
	- `liger_kernel_config`: None
	- `eval_use_gather_object`: False
	- `average_tokens_across_devices`: True
	- `prompts`: None
	- `batch_sampler`: batch_sampler
	- `multi_dataset_batch_sampler`: round_robin
	- `router_mapping`: {}
	- `learning_rate_mapping`: {}

	</details>
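
With `lr_scheduler_type: linear`, `warmup_steps: 0`, `learning_rate: 5e-05`, and `max_steps: 150`, the learning rate would be expected to decay linearly from 5e-05 to 0 over the run. The sketch below reproduces that shape in plain Python; `linear_lr` is a hypothetical helper written for illustration, not the trainer's scheduler.

```python
def linear_lr(step, base_lr=5e-5, total_steps=150, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, midpoint, and end of the 150-step run.
for step in (0, 75, 150):
    print(step, linear_lr(step))
```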

	### Framework Versions
	- Python: 3.11.14
	- Sentence Transformers: 5.2.0
	- Transformers: 4.57.6
	- PyTorch: 2.10.0+cpu
	- Accelerate: 1.12.0
	- Datasets: 4.5.0
	- Tokenizers: 0.22.2

	## Citation

	### BibTeX

	#### Sentence Transformers
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
	    author = "Reimers, Nils and Gurevych, Iryna",
	    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
	    month = "11",
	    year = "2019",
	    publisher = "Association for Computational Linguistics",
	    url = "https://arxiv.org/abs/1908.10084",
	}
	```

	#### MultipleNegativesRankingLoss
	```bibtex
	@misc{henderson2017efficient,
	    title={Efficient Natural Language Response Suggestion for Smart Reply},
	    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
	    year={2017},
	    eprint={1705.00652},
	    archivePrefix={arXiv},
	    primaryClass={cs.CL}
	}
	```

	<!--
	## Glossary

	Clearly define terms in order to be accessible across audiences.
	-->

	<!--
	## Model Card Authors

	Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.
	-->

	<!--
	## Model Card Contact

	Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.
	-->