Duplicate from thierrydamiba/splade-ecommerce-multidomain

	---
	language:
	- en
	license: apache-2.0
	tags:
	- sentence-transformers
	- sparse-encoder
	- sparse
	- splade
	- e-commerce
	- product-search
	- information-retrieval
	- multi-domain
	- dataset_size:99712
	- loss:SpladeLoss
	- loss:SparseMultipleNegativesRankingLoss
	- loss:FlopsLoss
	base_model: distilbert/distilbert-base-uncased
	datasets:
	- tasksource/esci
	- wayfair/wands
	widget:
	- text: '[KIDS TOYLAND] Wooden Dessert Play Set for Kids, Pretend Play Food Sets for
	Birthday Party ,Great for 3, 4, 5, and 6 Year Olds Girls and Boys Wooden Pretend
	Play Food Desserts Set,Wood Dessert Tower and Cakes,Educational Play Food Toys
	for 2 years old kids Birthday Gift<br> <br> <b>Packing Includ:</b><br> cake stand
	1 chocolates and cakes12 <br> <br> <b>Pretend Play Wooden Food Set Features:</b><br>
	This high-quality wooden toy is designed for kids three and up, can be used as
	educational toys for shape matching, counting and concepts of reconstruction.
	<br> <br> 1. size: 9.179.172.2 inch, this beautifully decorated multi shaped
	c'
	- text: mathematical compass
	- text: '[NYX PROFESSIONAL MAKEUP] NYX PROFESSIONAL MAKEUP Lip Lingerie Matte Liquid
	Lipstick - Beauty Mark, Chocolate Brown'
	- text: '[Aladdin] Mrs. Frisby and the Rats of NIMH'
	- text: '[Office Chairs] ginata salon beauty drafting chair'
	pipeline_tag: feature-extraction
	library_name: sentence-transformers
	---

	# SPLADE Multi-Domain E-Commerce Search

	A SPLADE sparse encoder fine-tuned on multiple e-commerce datasets (Amazon ESCI, Wayfair WANDS, and Home Depot) for better cross-domain generalization. Compared with a single-domain model, it gives up about 4% on in-domain ESCI in exchange for gains of about 3% on WANDS and 7% on Home Depot.

	## Benchmark Results

	### Cross-Domain Performance (vs Single-Domain Model)

	| Dataset | Single-Domain | Multi-Domain | Relative Change |
	|---------|---------------|--------------|-----------------|
	| ESCI (in-domain) | 0.389 | 0.372 | -4% |
	| WANDS (Wayfair) | 0.355 | 0.366 | +3% |
	| Home Depot | 0.384 | 0.410 | +7% |

	### vs BM25 Baseline

	| Dataset | BM25 | This Model | Relative Change |
	|---------|------|------------|-----------------|
	| ESCI | 0.305 | 0.372 | +22% |
	| WANDS | 0.329 | 0.366 | +11% |
	| Home Depot | 0.349 | 0.410 | +17% |
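
The percentage columns in both tables are relative changes of the multi-domain score over the respective baseline, rounded to the nearest whole percent. A minimal sketch that reproduces them from the scores listed above (the helper name `relative_change_pct` is illustrative):

```python
# Scores copied from the two benchmark tables above.
scores = {
    "ESCI": {"bm25": 0.305, "single_domain": 0.389, "multi_domain": 0.372},
    "WANDS": {"bm25": 0.329, "single_domain": 0.355, "multi_domain": 0.366},
    "Home Depot": {"bm25": 0.349, "single_domain": 0.384, "multi_domain": 0.410},
}


def relative_change_pct(baseline: float, candidate: float) -> int:
    """Relative change of candidate over baseline, as a rounded percentage."""
    return round(100 * (candidate - baseline) / baseline)


for dataset, s in scores.items():
    vs_single = relative_change_pct(s["single_domain"], s["multi_domain"])
    vs_bm25 = relative_change_pct(s["bm25"], s["multi_domain"])
    print(f"{dataset}: {vs_single:+d}% vs single-domain, {vs_bm25:+d}% vs BM25")
# ESCI: -4% vs single-domain, +22% vs BM25
# WANDS: +3% vs single-domain, +11% vs BM25
# Home Depot: +7% vs single-domain, +17% vs BM25
```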

	## Model Description

	This is a [SPLADE Sparse Encoder](https://www.sbert.net/docs/sparse_encoder/usage/usage.html) model fine-tuned from [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) using the [sentence-transformers](https://www.SBERT.net) library. It maps sentences and paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

	## Model Details

	### Model Description
	- Model Type: SPLADE Sparse Encoder
	- Base model: [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) <!-- at revision 12040accade4e8a0f71eabdb258fecc2e7e948be -->
	- Maximum Sequence Length: 512 tokens
	- Output Dimensionality: 30522 dimensions
	- Similarity Function: Dot Product
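
Because most of the 30522 dimensions of a sparse embedding are zero, a vector can be stored as a mapping from active dimension to weight, and the dot product only needs the dimensions both vectors share. The sketch below illustrates this with made-up weights keyed by token; the values and listing names are hypothetical and are not output from this model.

```python
def sparse_dot(query: dict[str, float], document: dict[str, float]) -> float:
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    # Iterate over the smaller vector; dimensions missing on either side contribute 0.
    if len(query) > len(document):
        query, document = document, query
    return sum(weight * document.get(dim, 0.0) for dim, weight in query.items())


# Hypothetical weights, keyed by vocabulary token for readability.
query = {"wireless": 1.2, "earbuds": 2.0}
documents = {
    "earbuds_listing": {"wireless": 0.9, "earbuds": 1.5, "bluetooth": 1.1},
    "sweatpants_listing": {"sweatpants": 2.1, "polyester": 0.8},
}

# Rank documents by descending dot-product score against the query.
ranking = sorted(documents, key=lambda name: sparse_dot(query, documents[name]), reverse=True)
print(ranking)
# ['earbuds_listing', 'sweatpants_listing']
```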
	<!-- - Training Dataset: Unknown -->
	<!-- - Language: Unknown -->
	<!-- - License: Unknown -->

	### Model Sources

	- Documentation: [Sentence Transformers Documentation](https://sbert.net)
	- Documentation: [Sparse Encoder Documentation](https://www.sbert.net/docs/sparse_encoder/usage/usage.html)
	- Repository: [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
	- Hugging Face: [Sparse Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=sparse-encoder)

	### Full Model Architecture

	```
	SparseEncoder(
	  (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'DistilBertForMaskedLM'})
	  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
	)
	```
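
The `SpladePooling` settings above (`max` pooling, `relu` activation) match the usual SPLADE formulation: each token's masked-language-model logits pass through a ReLU and a log saturation, and the maximum over token positions is kept for every vocabulary entry. The pure-Python sketch below uses a toy vocabulary of 4 entries instead of 30522. The `log(1 + x)` step follows the cited SPLADE paper and is an assumption about what the library does internally; `splade_pool` is a hypothetical helper, not a library function.

```python
import math


def splade_pool(token_logits: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Max-pool log(1 + relu(logit)) over unmasked tokens, per vocabulary entry."""
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for logits, keep in zip(token_logits, attention_mask):
        if not keep:  # skip padding positions
            continue
        for v, logit in enumerate(logits):
            activation = math.log1p(max(logit, 0.0))  # relu, then log saturation
            pooled[v] = max(pooled[v], activation)
    return pooled


# Toy input: 3 token positions (the last one is padding), vocabulary of 4 entries.
token_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 4.0],
    [9.0, 9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0]

embedding = splade_pool(token_logits, attention_mask)
active = [v for v, w in enumerate(embedding) if w > 0]
print(active)
# [0, 3]
```

Entries whose logits are never positive stay at exactly zero, which is what makes the resulting vector sparse.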

	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.

	```python
	from sentence_transformers import SparseEncoder

	# Download from the 🤗 Hub.
	# "sparse_encoder_model_id" is a placeholder; replace it with this model's Hub ID.
	model = SparseEncoder("sparse_encoder_model_id")

	# Run inference on one query and two product listings
	sentences = [
	    'mpow',
	    '[Mpow] Wireless Earbuds Active Noise Cancelling, Mpow X3 ANC Bluetooth Earphones w/4 Mics Noise Cancelling, Stereo Earbuds w/Deep Bass, 30Hrs ANC Earbuds w/USB-C Charge, Smart Touch Control, IPX8 Waterproof',
	    '[Jerzees] Jerzees Dri-Power Poly Pocketed Open-Bottom Sweatpants, Large - Black 100% Polyester Pre-shrunk Jersey',
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# [3, 30522]

	# Get the pairwise dot-product similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities)
	# tensor([[ 69.1663,  66.0022,  51.6937],
	#         [ 66.0022, 238.3157,  60.5486],
	#         [ 51.6937,  60.5486, 174.3004]])
	```

	## Training Details

	### Training Dataset

	#### Unnamed Dataset

	* Size: 99,712 training samples
	* Columns: <code>anchor</code> and <code>positive</code>
	* Approximate statistics based on the first 1000 samples:
	  |         | anchor | positive |
	  |:--------|:-------|:---------|
	  | type    | string | string |
	  | details | <ul><li>min: 3 tokens</li><li>mean: 6.2 tokens</li><li>max: 22 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 99.84 tokens</li><li>max: 494 tokens</li></ul> |
	* Samples:
	  | anchor | positive |
	  |:-------|:---------|
	  | <code>bird feeder pole station</code> | <code>[EXCMARK] EXCMARK 2 Pack Shepherd Hook 32 inch 1/2 inch Thick Use at Weddings, Hanging Solar Lights, Lanterns, Bird Feeders, Metal Hanger Hook (Bronze, 32 inch) <p><b>Create the garden of your dreams with our Shepherds Hooks!</b></p> <p>These amazing hooks with the perfect balance of tradition and versatility are the perfect accessory to any outdoor space! A super easy and convenient way to tackle any outdoor gardening party or event! It will make any hanging object stand out with ultimate beauty. Hang your decorative lights, bird feeders, lanterns, and more!</p> <p>Each hook includes 2 extenders for three height options. The hooks can measure up to 32”</code> |
	  | <code>chrome bath lighting</code> | <code>Progress Lighting Archie Collection 2-Light Chrome Bath Light Archie is a standout in any room and provides a fun and fashionable way to light your home. The authentic, prismatic style glass shade diffuses light to provide functional and stylish illumination. This fixture can be installed with the glass facing up or down to suit your preference.California residents: see Proposition 65 informationChrome finishClear prismatic glass17 in. W x 8-3/4 in. HUses (2) 100-Watt medium base bulbs (not included)Fixture can be installed facing upwards or downwards</code> |
	  | <code>sex toys kinky for female</code> | <code>[Knaughty Knickers] Knaughty Knickers Daddys Little Lil Fuck Toy Fucktoy DDLG BDSM Owned Boyshort Black 95% combed and ringspun cotton/5% spandex --- Low rise shortie boyshort style panty --- Satin trim fold over elastic waistband --- Custom embelished on quality Bella product --- Super soft and comfortable --- Funny or rude underwear</code> |
	* Loss: [<code>SpladeLoss</code>](https://sbert.net/docs/package_reference/sparse_encoder/losses.html#spladeloss) with these parameters:
	```json
	{
	    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
	    "document_regularizer_weight": 3e-05,
	    "query_regularizer_weight": 5e-05
	}
	```
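
These parameters can be read as a weighted sum: an in-batch contrastive ranking loss on dot-product scores, plus a FLOPS sparsity penalty on the query and document embeddings, weighted by 5e-05 and 3e-05 respectively. The sketch below follows the formulations in the cited papers on a toy batch. The helper names are hypothetical, the library's implementation may differ in detail, and any scheduling of the regularizer weights during training is ignored.

```python
import math


def flops(embeddings: list[list[float]]) -> float:
    """FLOPS regularizer: sum over dimensions of the squared mean activation."""
    batch = len(embeddings)
    return sum((sum(column) / batch) ** 2 for column in zip(*embeddings))


def in_batch_ranking_loss(queries, documents, scale: float = 1.0) -> float:
    """Cross-entropy where document i is the positive for query i; other documents are negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * sum(a * b for a, b in zip(q, d)) for d in documents]
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))  # stable log-sum-exp
        total += log_norm - scores[i]
    return total / len(queries)


def splade_loss(queries, documents, query_weight: float = 5e-05, document_weight: float = 3e-05) -> float:
    """Ranking loss plus weighted FLOPS penalties on both sides."""
    return (
        in_batch_ranking_loss(queries, documents)
        + query_weight * flops(queries)
        + document_weight * flops(documents)
    )


# Toy batch: 2 (query, positive document) pairs over a 3-entry vocabulary.
queries = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
documents = [[3.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
print(splade_loss(queries, documents))
```

The penalty grows with the square of each dimension's average activation, so dimensions that fire for many inputs are discouraged more strongly than rarely used ones.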

	### Training Hyperparameters
	#### Non-Default Hyperparameters

	- `per_device_train_batch_size`: 32
	- `learning_rate`: 2e-05
	- `num_train_epochs`: 1
	- `warmup_ratio`: 0.1
	- `fp16`: True
	- `batch_sampler`: no_duplicates
	- `router_mapping`: {'anchor': 'query', 'positive': 'document'}
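
With `warmup_ratio` 0.1 and the `linear` scheduler listed under All Hyperparameters, the learning rate ramps from 0 up to 2e-05 over the first tenth of training and then decays linearly back to 0. The sketch below shows that schedule under the assumption of a single device, so that 99,712 samples at batch size 32 give 3,116 optimizer steps; `linear_lr` is an illustrative name.

```python
import math

TRAIN_SAMPLES = 99_712
BATCH_SIZE = 32
BASE_LR = 2e-05
WARMUP_RATIO = 0.1

total_steps = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)   # 3116, assuming one device
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)  # 312


def linear_lr(step: int) -> float:
    """Linear warmup to BASE_LR, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    return BASE_LR * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


for step in (0, 100, warmup_steps, 1000, total_steps):
    print(step, f"{linear_lr(step):.2e}")
```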

	#### All Hyperparameters
	<details><summary>Click to expand</summary>

	- `overwrite_output_dir`: False
	- `do_predict`: False
	- `eval_strategy`: no
	- `prediction_loss_only`: True
	- `per_device_train_batch_size`: 32
	- `per_device_eval_batch_size`: 8
	- `per_gpu_train_batch_size`: None
	- `per_gpu_eval_batch_size`: None
	- `gradient_accumulation_steps`: 1
	- `eval_accumulation_steps`: None
	- `torch_empty_cache_steps`: None
	- `learning_rate`: 2e-05
	- `weight_decay`: 0.0
	- `adam_beta1`: 0.9
	- `adam_beta2`: 0.999
	- `adam_epsilon`: 1e-08
	- `max_grad_norm`: 1.0
	- `num_train_epochs`: 1
	- `max_steps`: -1
	- `lr_scheduler_type`: linear
	- `lr_scheduler_kwargs`: {}
	- `warmup_ratio`: 0.1
	- `warmup_steps`: 0
	- `log_level`: passive
	- `log_level_replica`: warning
	- `log_on_each_node`: True
	- `logging_nan_inf_filter`: True
	- `save_safetensors`: True
	- `save_on_each_node`: False
	- `save_only_model`: False
	- `restore_callback_states_from_checkpoint`: False
	- `no_cuda`: False
	- `use_cpu`: False
	- `use_mps_device`: False
	- `seed`: 42
	- `data_seed`: None
	- `jit_mode_eval`: False
	- `bf16`: False
	- `fp16`: True
	- `fp16_opt_level`: O1
	- `half_precision_backend`: auto
	- `bf16_full_eval`: False
	- `fp16_full_eval`: False
	- `tf32`: None
	- `local_rank`: 0
	- `ddp_backend`: None
	- `tpu_num_cores`: None
	- `tpu_metrics_debug`: False
	- `debug`: []
	- `dataloader_drop_last`: False
	- `dataloader_num_workers`: 0
	- `dataloader_prefetch_factor`: None
	- `past_index`: -1
	- `disable_tqdm`: False
	- `remove_unused_columns`: True
	- `label_names`: None
	- `load_best_model_at_end`: False
	- `ignore_data_skip`: False
	- `fsdp`: []
	- `fsdp_min_num_params`: 0
	- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
	- `fsdp_transformer_layer_cls_to_wrap`: None
	- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
	- `parallelism_config`: None
	- `deepspeed`: None
	- `label_smoothing_factor`: 0.0
	- `optim`: adamw_torch_fused
	- `optim_args`: None
	- `adafactor`: False
	- `group_by_length`: False
	- `length_column_name`: length
	- `project`: huggingface
	- `trackio_space_id`: trackio
	- `ddp_find_unused_parameters`: None
	- `ddp_bucket_cap_mb`: None
	- `ddp_broadcast_buffers`: False
	- `dataloader_pin_memory`: True
	- `dataloader_persistent_workers`: False
	- `skip_memory_metrics`: True
	- `use_legacy_prediction_loop`: False
	- `push_to_hub`: False
	- `resume_from_checkpoint`: None
	- `hub_model_id`: None
	- `hub_strategy`: every_save
	- `hub_private_repo`: None
	- `hub_always_push`: False
	- `hub_revision`: None
	- `gradient_checkpointing`: False
	- `gradient_checkpointing_kwargs`: None
	- `include_inputs_for_metrics`: False
	- `include_for_metrics`: []
	- `eval_do_concat_batches`: True
	- `fp16_backend`: auto
	- `push_to_hub_model_id`: None
	- `push_to_hub_organization`: None
	- `mp_parameters`:
	- `auto_find_batch_size`: False
	- `full_determinism`: False
	- `torchdynamo`: None
	- `ray_scope`: last
	- `ddp_timeout`: 1800
	- `torch_compile`: False
	- `torch_compile_backend`: None
	- `torch_compile_mode`: None
	- `include_tokens_per_second`: False
	- `include_num_input_tokens_seen`: no
	- `neftune_noise_alpha`: None
	- `optim_target_modules`: None
	- `batch_eval_metrics`: False
	- `eval_on_start`: False
	- `use_liger_kernel`: False
	- `liger_kernel_config`: None
	- `eval_use_gather_object`: False
	- `average_tokens_across_devices`: True
	- `prompts`: None
	- `batch_sampler`: no_duplicates
	- `multi_dataset_batch_sampler`: proportional
	- `router_mapping`: {'anchor': 'query', 'positive': 'document'}
	- `learning_rate_mapping`: {}

	</details>

	### Training Logs
	| Epoch | Step | Training Loss |
	|:------:|:----:|:-------------:|
	| 0.0321 | 100 | 329.7303 |
	| 0.0642 | 200 | 1.9189 |
	| 0.0963 | 300 | 0.4059 |
	| 0.1284 | 400 | 0.3173 |
	| 0.1605 | 500 | 0.2776 |
	| 0.1926 | 600 | 0.2812 |
	| 0.2246 | 700 | 0.2648 |
	| 0.2567 | 800 | 0.2821 |
	| 0.2888 | 900 | 0.254 |
	| 0.3209 | 1000 | 0.2789 |
	| 0.3530 | 1100 | 0.2163 |
	| 0.3851 | 1200 | 0.2375 |
	| 0.4172 | 1300 | 0.2165 |
	| 0.4493 | 1400 | 0.2254 |
	| 0.4814 | 1500 | 0.2105 |
	| 0.5135 | 1600 | 0.2147 |
	| 0.5456 | 1700 | 0.2468 |
	| 0.5777 | 1800 | 0.2438 |
	| 0.6098 | 1900 | 0.209 |
	| 0.6418 | 2000 | 0.2327 |
	| 0.6739 | 2100 | 0.2475 |
	| 0.7060 | 2200 | 0.227 |
	| 0.7381 | 2300 | 0.1992 |
	| 0.7702 | 2400 | 0.2258 |
	| 0.8023 | 2500 | 0.1676 |
	| 0.8344 | 2600 | 0.2081 |
	| 0.8665 | 2700 | 0.1966 |
	| 0.8986 | 2800 | 0.218 |
	| 0.9307 | 2900 | 0.1998 |
	| 0.9628 | 3000 | 0.2157 |
	| 0.9949 | 3100 | 0.2011 |
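
The Epoch column is the logged step divided by the number of steps in one epoch. A short check, assuming a single device so that the step count is the dataset size divided by the batch size (`epoch_at` is an illustrative helper):

```python
STEPS_PER_EPOCH = 99_712 // 32  # 3116; divides evenly, assuming one device


def epoch_at(step: int) -> float:
    """Fraction of the epoch completed after `step` optimizer steps."""
    return round(step / STEPS_PER_EPOCH, 4)


print(epoch_at(100), epoch_at(1600), epoch_at(3100))
# 0.0321 0.5135 0.9949
```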


	### Framework Versions
	- Python: 3.11.10
	- Sentence Transformers: 5.2.0
	- Transformers: 4.57.3
	- PyTorch: 2.9.1+cu128
	- Accelerate: 1.12.0
	- Datasets: 4.4.1
	- Tokenizers: 0.22.1

	## Citation

	### BibTeX

	#### Sentence Transformers
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
	author = "Reimers, Nils and Gurevych, Iryna",
	booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
	month = "11",
	year = "2019",
	publisher = "Association for Computational Linguistics",
	url = "https://arxiv.org/abs/1908.10084",
	}
	```

	#### SpladeLoss
	```bibtex
	@misc{formal2022distillationhardnegativesampling,
	title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
	author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
	year={2022},
	eprint={2205.04733},
	archivePrefix={arXiv},
	primaryClass={cs.IR},
	url={https://arxiv.org/abs/2205.04733},
	}
	```

	#### SparseMultipleNegativesRankingLoss
	```bibtex
	@misc{henderson2017efficient,
	title={Efficient Natural Language Response Suggestion for Smart Reply},
	author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
	year={2017},
	eprint={1705.00652},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```

	#### FlopsLoss
	```bibtex
	@article{paria2020minimizing,
	title={Minimizing flops to learn efficient sparse representations},
	author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
	journal={arXiv preprint arXiv:2004.05665},
	year={2020}
	}
	```