	---
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- dense
	- generated_from_trainer
	- dataset_size:5749
	- loss:CosineSimilarityLoss
	widget:
	- source_sentence: >-
	Nterprise Linux Services is expected to be available before then end of this
	year.
	sentences:
	- >-
	Beta versions of Nterprise Linux Services are expected to be available on
	certain HP ProLiant servers in July.
	- Spain turning back the clock on siestas
	- I don't like many flavored drinks.
	- source_sentence: Iran hopes nuclear talks will yield 'roadmap'
	sentences:
	- Iran Nuclear Talks in Geneva Spur High Hopes
	- A black pet dog runs around in the garden of a house.
	- >-
	The witness was a 27-year-old Kosovan parking attendant, who was paid by the
	News of the World, the court heard.
	- source_sentence: Hamas Urges Hizbullah to Pull Fighters Out of Syria
	sentences:
	- >-
	"This was a persistent problem which has not been solved, mechanically and
	physically," said board member Steven Wallace.
	- A small dog jumps over a yellow beam.
	- Hamas calls on Hezbollah to pull forces out of Syria
	- source_sentence: Licensing revenue slid 21 percent, however, to $107.6 million.
	sentences:
	- Britain loses bid to deport radical cleric Abu Qatada
	- A man sits on a bed very close to a small television.
	- License sales, a key measure of demand, fell 21 percent to $107.6 million.
	- source_sentence: >-
	Comcast Class A shares were up 8 cents at $30.50 in morning trading on the
	Nasdaq Stock Market.
	sentences:
	- The stock rose 48 cents to $30 yesterday in Nasdaq Stock Market trading.
	- 'Malaysia: Chinese satellite found object in ocean'
	- A boy in a robe sits in a chair.
	pipeline_tag: sentence-similarity
	library_name: sentence-transformers
	metrics:
	- pearson_cosine
	- spearman_cosine
	model-index:
	- name: SentenceTransformer
	results:
	- task:
	type: semantic-similarity
	name: Semantic Similarity
	metrics:
	- type: pearson_cosine
	value: 0.4639747212598005
	name: Pearson Cosine
	- type: spearman_cosine
	value: 0.4595105448711385
	name: Spearman Cosine
	license: gemma
	---

	# SentenceTransformer

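	This is a trained [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

	## Model Details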

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/67761d25fb96b78ed6839812/09eHVFfvEDC4ChNzD_n6K.png)

	### Model Description
	- Model type: Sentence Transformer
	- Maximum sequence length: 2048 tokens
	- Output dimensionality: 256 dimensions
	- Similarity function: cosine similarity

	### Model Sources

	- Documentation: [Sentence Transformers Documentation](https://sbert.net)
	- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
	- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

	### Full Model Architecture

	```
	SentenceTransformer(
	(0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
	(1): Pooling({'word_embedding_dimension': 256, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
	)
	```
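
The `Pooling` settings above enable only `pooling_mode_mean_tokens`, so the sentence vector is the average of the token vectors. As a rough illustration, the sketch below performs masked mean pooling on plain Python lists. The function name `mean_pool` and the toy inputs are hypothetical, and the library itself operates on batched tensors rather than lists.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy input: three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector is ignored because its mask entry is 0, which is why the large values in the third row do not affect the result.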

	## Usage

	### Direct Usage (Sentence Transformers)

	First, install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("sentence_transformers_model_id")
	# Run inference
	sentences = [
	    'Comcast Class A shares were up 8 cents at $30.50 in morning trading on the Nasdaq Stock Market.',
	    'The stock rose 48 cents to $30 yesterday in Nasdaq Stock Market trading.',
	    'Malaysia: Chinese satellite found object in ocean',
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# (3, 256)

	# Get the similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities)
	# tensor([[1.0000, 0.5752, 0.2980],
	#         [0.5752, 1.0000, 0.2161],
	#         [0.2980, 0.2161, 1.0000]])
	```
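
The scores printed above are pairwise cosine similarities, which is the similarity function listed for this model. For readers who want to see what that computation amounts to, here is a minimal pure-Python sketch. The helper names `cosine_similarity` and `similarity_matrix` and the two-dimensional toy vectors are hypothetical; real embeddings from this model have 256 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarity of every vector with every other vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors standing in for sentence embeddings.
toy = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
matrix = similarity_matrix(toy)
for row in matrix:
    print([round(value, 4) for value in row])
```

The matrix is symmetric with ones on the diagonal (up to floating-point error), matching the structure of the tensor shown in the usage example.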

	## Evaluation

	### Metrics

	#### Semantic Similarity

	* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

	| Metric          | Value  |
	|:----------------|:-------|
	| pearson_cosine  | 0.464  |
	| spearman_cosine | 0.4595 |
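
Both metrics compare the model's cosine similarities with the gold labels: Pearson measures linear correlation of the raw values, while Spearman measures correlation of their ranks. The sketch below shows the two calculations on made-up numbers. The function names and the `predicted` and `gold` lists are hypothetical, and the evaluator's own implementation may differ in detail.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up cosine similarities and gold labels for four sentence pairs.
predicted = [0.9, 0.4, 0.6, 0.1]
gold = [1.0, 0.2, 0.3, 0.0]
print(round(pearson(predicted, gold), 4))
print(round(spearman(predicted, gold), 4))
```

In this toy case the two lists have the same ordering, so the Spearman value is 1 even though the Pearson value is lower.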

	## Training Details

	### Training Dataset

	#### Unnamed Dataset

	* Size: 5,749 training samples
	* Columns: `sentence_0`, `sentence_1`, `label`
	* Approximate statistics based on the first 1000 samples:
	  |         | sentence_0 | sentence_1 | label |
	  |:--------|:-----------|:-----------|:------|
	  | type    | string | string | float |
	  | details | <ul><li>min: 6 tokens</li><li>mean: 14.76 tokens</li><li>max: 55 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 14.73 tokens</li><li>max: 57 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.55</li><li>max: 1.0</li></ul> |
	* Samples:
	  | sentence_0 | sentence_1 | label |
	  |:-----------|:-----------|:------|
	  | `Forecasters said warnings might go up for Cuba later Thursday.` | `Watches or warnings could be issued for eastern Cuba later on Thursday.` | `0.8` |
	  | `Death toll in Lebanon bombings rises to 47` | `1 suspect arrested after Lebanon car bombings kill 45` | `0.5599999904632569` |
	  | `Three dogs running on a racetrack.` | `Three dogs round a bend at a racetrack.` | `0.9600000381469727` |
	* Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
	  ```json
	  {
	      "loss_fct": "torch.nn.modules.loss.MSELoss"
	  }
	  ```
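
With `MSELoss` as `loss_fct`, the objective is the mean squared difference between the cosine similarity of each sentence pair and its gold label. A minimal sketch on toy vectors follows. The helper `cosine_mse_loss` and the example pairs are hypothetical, and the real loss runs on batched embeddings produced by the model.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs, labels):
    """Mean squared error between pairwise cosine similarity and the label."""
    errors = [
        (cosine_similarity(u, v) - label) ** 2
        for (u, v), label in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)


# Two toy pairs: identical vectors (cosine 1.0) and orthogonal vectors (cosine 0.0).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [0.8, 0.5]
loss = cosine_mse_loss(pairs, labels)
print(round(loss, 4))  # 0.145
```

The squared errors are 0.04 and 0.25, so their mean is 0.145. Training pushes each pair's cosine similarity toward its label.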

	### Training Hyperparameters
	#### Non-Default Hyperparameters

	- `eval_strategy`: steps
	- `per_device_train_batch_size`: 16
	- `per_device_eval_batch_size`: 16
	- `multi_dataset_batch_sampler`: round_robin

	#### All Hyperparameters
	<details><summary>Click to expand</summary>

	- `overwrite_output_dir`: False
	- `do_predict`: False
	- `eval_strategy`: steps
	- `prediction_loss_only`: True
	- `per_device_train_batch_size`: 16
	- `per_device_eval_batch_size`: 16
	- `per_gpu_train_batch_size`: None
	- `per_gpu_eval_batch_size`: None
	- `gradient_accumulation_steps`: 1
	- `eval_accumulation_steps`: None
	- `torch_empty_cache_steps`: None
	- `learning_rate`: 5e-05
	- `weight_decay`: 0.0
	- `adam_beta1`: 0.9
	- `adam_beta2`: 0.999
	- `adam_epsilon`: 1e-08
	- `max_grad_norm`: 1
	- `num_train_epochs`: 3
	- `max_steps`: -1
	- `lr_scheduler_type`: linear
	- `lr_scheduler_kwargs`: {}
	- `warmup_ratio`: 0.0
	- `warmup_steps`: 0
	- `log_level`: passive
	- `log_level_replica`: warning
	- `log_on_each_node`: True
	- `logging_nan_inf_filter`: True
	- `save_safetensors`: True
	- `save_on_each_node`: False
	- `save_only_model`: False
	- `restore_callback_states_from_checkpoint`: False
	- `no_cuda`: False
	- `use_cpu`: False
	- `use_mps_device`: False
	- `seed`: 42
	- `data_seed`: None
	- `jit_mode_eval`: False
	- `use_ipex`: False
	- `bf16`: False
	- `fp16`: False
	- `fp16_opt_level`: O1
	- `half_precision_backend`: auto
	- `bf16_full_eval`: False
	- `fp16_full_eval`: False
	- `tf32`: None
	- `local_rank`: 0
	- `ddp_backend`: None
	- `tpu_num_cores`: None
	- `tpu_metrics_debug`: False
	- `debug`: []
	- `dataloader_drop_last`: False
	- `dataloader_num_workers`: 0
	- `dataloader_prefetch_factor`: None
	- `past_index`: -1
	- `disable_tqdm`: False
	- `remove_unused_columns`: True
	- `label_names`: None
	- `load_best_model_at_end`: False
	- `ignore_data_skip`: False
	- `fsdp`: []
	- `fsdp_min_num_params`: 0
	- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
	- `fsdp_transformer_layer_cls_to_wrap`: None
	- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
	- `parallelism_config`: None
	- `deepspeed`: None
	- `label_smoothing_factor`: 0.0
	- `optim`: adamw_torch_fused
	- `optim_args`: None
	- `adafactor`: False
	- `group_by_length`: False
	- `length_column_name`: length
	- `ddp_find_unused_parameters`: None
	- `ddp_bucket_cap_mb`: None
	- `ddp_broadcast_buffers`: False
	- `dataloader_pin_memory`: True
	- `dataloader_persistent_workers`: False
	- `skip_memory_metrics`: True
	- `use_legacy_prediction_loop`: False
	- `push_to_hub`: False
	- `resume_from_checkpoint`: None
	- `hub_model_id`: None
	- `hub_strategy`: every_save
	- `hub_private_repo`: None
	- `hub_always_push`: False
	- `hub_revision`: None
	- `gradient_checkpointing`: False
	- `gradient_checkpointing_kwargs`: None
	- `include_inputs_for_metrics`: False
	- `include_for_metrics`: []
	- `eval_do_concat_batches`: True
	- `fp16_backend`: auto
	- `push_to_hub_model_id`: None
	- `push_to_hub_organization`: None
	- `mp_parameters`:
	- `auto_find_batch_size`: False
	- `full_determinism`: False
	- `torchdynamo`: None
	- `ray_scope`: last
	- `ddp_timeout`: 1800
	- `torch_compile`: False
	- `torch_compile_backend`: None
	- `torch_compile_mode`: None
	- `include_tokens_per_second`: False
	- `include_num_input_tokens_seen`: False
	- `neftune_noise_alpha`: None
	- `optim_target_modules`: None
	- `batch_eval_metrics`: False
	- `eval_on_start`: False
	- `use_liger_kernel`: False
	- `liger_kernel_config`: None
	- `eval_use_gather_object`: False
	- `average_tokens_across_devices`: False
	- `prompts`: None
	- `batch_sampler`: batch_sampler
	- `multi_dataset_batch_sampler`: round_robin
	- `router_mapping`: {}
	- `learning_rate_mapping`: {}

	</details>
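
With `lr_scheduler_type` set to linear and both `warmup_ratio` and `warmup_steps` at zero, the learning rate would be expected to decay in a straight line from 5e-05 to zero over the run. The sketch below uses the usual linear-decay formula and assumes 1080 total optimizer steps, the final step count reported in the training log. `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step, base_lr=5e-05, total_steps=1080, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr (unused here because warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 540, 1080):
    print(step, linear_lr(step))
```

Halfway through the run (step 540) the rate is half the initial value, and it reaches zero at the last step.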

	### Training Logs
	| Epoch  | Step | Training Loss | spearman_cosine |
	|:------:|:----:|:-------------:|:---------------:|
	| 1.0    | 360  | -             | 0.2967          |
	| 1.3889 | 500  | 0.11          | 0.3338          |
	| 2.0    | 720  | -             | 0.3665          |
	| 2.7778 | 1000 | 0.0857        | 0.4101          |
	| 3.0    | 1080 | -             | 0.4595          |
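
The step and epoch columns in the log are consistent with the dataset size and batch size reported in this card. The arithmetic below reproduces them. It assumes training ran on a single device, which the card does not state, and relies on the listed `gradient_accumulation_steps` of 1 and `dataloader_drop_last` of False.

```python
import math

train_samples = 5749  # size of the training dataset
batch_size = 16       # per_device_train_batch_size
epochs = 3            # num_train_epochs

# The last partial batch is kept, so the division is rounded up.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs


def epoch_at(step):
    """Fractional epoch reached after a given number of optimizer steps."""
    return round(step / steps_per_epoch, 4)


print(steps_per_epoch)  # 360
print(total_steps)      # 1080
print(epoch_at(500))    # 1.3889
print(epoch_at(1000))   # 2.7778
```

The intermediate rows at steps 500 and 1000 therefore fall partway through the second and third epochs.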


	### Framework Versions
	- Python: 3.12.11
	- Sentence Transformers: 5.1.0
	- Transformers: 4.56.1
	- PyTorch: 2.8.0+cu126
	- Accelerate: 1.10.1
	- Datasets: 4.0.0
	- Tokenizers: 0.22.0

	## Citation

	### BibTeX

	#### Sentence Transformers
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
	author = "Reimers, Nils and Gurevych, Iryna",
	booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
	month = "11",
	year = "2019",
	publisher = "Association for Computational Linguistics",
	url = "https://arxiv.org/abs/1908.10084",
	}
	```