	---
	license: mit
	language:
	- en
	tags:
	- roberta
	- ecommerce
	- classification
	datasets:
	- amazon-esci
	base_model:
	- FacebookAI/roberta-base
	pipeline_tag: text-classification
	library_name: transformers
	---

	E-commerce Search Query Router
	==============================

	Model Description
	-----------------

	This model is a classifier fine-tuned from `roberta-base` to determine the optimal search strategy for e-commerce queries. It classifies a given query into one of two labels:

	- `lexical_search`: Indicates that the query is best handled by a traditional, keyword-based search engine like Lucene using BM25. These are typically specific queries like SKUs, exact product names, or part numbers.

	- `vector_search`: Indicates that the query is better suited for a semantic, vector-based search. These are often ambiguous, conceptual, or "long-tail" queries where user intent is more important than specific keywords (e.g., "a gift for my dad who likes fishing").

	The model is intended to be used as an intelligent "query router" in a hybrid search system, dynamically weighting the results from lexical and vector search engines to improve relevance.

	Intended Use & Use Case
	-----------------------

	The primary use case for this model is to power a hybrid search relevance system. The intended workflow is as follows:

	1. A user enters a search query.
	2. The query is sent to this classifier model.
	3. The model returns probabilities for the `lexical_search` and `vector_search` classes.
	4. These probabilities are used as weights to blend the relevance scores from two separate search backends (e.g., Solr and Qdrant).
	5. The final, blended scores are used to rank the products shown to the user.
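
Steps 4 and 5 can be sketched as a weighted sum of the two backends' scores. This is a minimal illustration rather than the author's implementation: the helper names `min_max_normalize` and `blend_scores`, the min-max normalization step, and the sample scores are all assumptions. Normalization is included because raw BM25 scores and vector similarities are usually on different scales.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to [0, 1] so both backends are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blend_scores(lexical_scores, vector_scores, w_lexical, w_vector):
    """Weighted sum of normalized backend scores; a missing document scores 0."""
    lexical = min_max_normalize(lexical_scores)
    vector = min_max_normalize(vector_scores)
    blended = {
        doc_id: w_lexical * lexical.get(doc_id, 0.0) + w_vector * vector.get(doc_id, 0.0)
        for doc_id in set(lexical) | set(vector)
    }
    # Highest blended score first.
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores from the two backends for one query.
lexical_hits = {"sku-1": 12.0, "sku-2": 8.0, "sku-3": 4.0}
vector_hits = {"sku-2": 0.90, "sku-4": 0.80, "sku-1": 0.50}

# The weights would come from the classifier probabilities.
ranked = blend_scores(lexical_hits, vector_hits, w_lexical=0.46, w_vector=0.54)
```

Products returned by only one backend still receive a score, so a strong lexical match is not lost when the router leans toward vector search.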

	### How to Use

	The model can be loaded with the `pipeline` API from the `transformers` library:

	```python
	from transformers import pipeline

	# top_k=None returns the scores for every label
	# (it replaces the deprecated return_all_scores=True).
	router_pipeline = pipeline(
	    "text-classification",
	    model="timofeyk/roberta-query-router-ecommerce",
	    top_k=None,
	)

	# Example of a conceptual query
	conceptual_query = "father day gift"
	# Example of a specific query
	specific_query = "16x16 pillow cover"

	queries = [conceptual_query, specific_query]

	# Passing a list returns one list of {label, score} dicts per query.
	for q, results in zip(queries, router_pipeline(queries)):
	    print(f"Predicting label for query: {q}")
	    print(results)
	    # Output might look like:
	    # [{'label': 'vector_search', 'score': 0.5374137163162231}, {'label': 'lexical_search', 'score': 0.46258628368377686}]

	    scores = {item["label"]: item["score"] for item in results}
	    w_vector = scores["vector_search"]
	    w_lexical = scores["lexical_search"]

	    print(f"Vector Search Weight: {w_vector:.2f}")
	    print(f"Lexical Search Weight: {w_lexical:.2f}")
	```

	Training Data
	-------------

	This model was trained on a custom dataset of anonymized, real-world e-commerce queries, using the Amazon ESCI dataset as the source. The labels were generated programmatically from search performance, which gives the model a signal to learn from:

	1. A large set of user queries and their corresponding ground-truth "ideal" product results (based on user engagement) were collected.
	2. Each query was executed against both a Solr (lexical) and a Qdrant (vector) search engine.
	3. The nDCG relevance score was calculated for both result sets against the ground truth.
	4. The query was labeled `lexical_search` if Solr achieved a higher nDCG score, and `vector_search` otherwise.
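
The labeling rule in steps 3 and 4 can be sketched as below. This is an illustration under stated assumptions, not the original labeling script: the function names, the cutoff `k=10`, the linear gain (some nDCG variants use `2**rel - 1`), and the sample rankings are all hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains, in ranked order."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k of a ranked list of product IDs against graded relevance labels."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def label_query(lexical_ranking, vector_ranking, relevance, k=10):
    """Label a query by whichever backend ranks the relevant products better."""
    lexical_ndcg = ndcg_at_k(lexical_ranking, relevance, k)
    vector_ndcg = ndcg_at_k(vector_ranking, relevance, k)
    # Ties go to vector_search: lexical wins only with a strictly higher nDCG.
    return "lexical_search" if lexical_ndcg > vector_ndcg else "vector_search"


# Hypothetical ground truth and rankings for one query.
relevance = {"p1": 3.0, "p2": 1.0}
solr_ranking = ["p1", "p2", "p9"]
qdrant_ranking = ["p9", "p2", "p1"]
label = label_query(solr_ranking, qdrant_ranking, relevance)
```

Because the rule is a strict comparison, queries on which both engines perform equally well (or equally badly) all land in the `vector_search` class, which may contribute to class imbalance.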

	Training Procedure
	------------------

	The model was fine-tuned using the Hugging Face `Trainer`. To account for potential class imbalance in the training data, a custom `Trainer` subclass with a weighted `CrossEntropyLoss` was used, so that errors on the minority class are penalized more heavily and the model is discouraged from favoring the majority class.
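
To show what the class weights do, here is a pure-Python sketch of a weighted cross-entropy with the same mean reduction as `torch.nn.CrossEntropyLoss(weight=...)`. The inverse-frequency weighting, the function names, and the toy batch are assumptions; the card does not state how the actual weights were chosen. With the `Trainer`, a loss like this is commonly plugged in by overriding `compute_loss` in a subclass.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Class weights n_samples / (n_classes * count); rarer classes weigh more.

    Assumes every class occurs at least once in `labels`.
    """
    counts = Counter(labels)
    return [len(labels) / (num_classes * counts[c]) for c in range(num_classes)]


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-example cross-entropy, normalized by the sum of weights."""
    total, weight_sum = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(z - m) for z in row))
        nll = log_norm - row[y]  # negative log-softmax of the true class
        total += class_weights[y] * nll
        weight_sum += class_weights[y]
    return total / weight_sum


# Hypothetical imbalanced batch: 0 = lexical_search, 1 = vector_search.
labels = [0, 0, 0, 1]
# A model that always favors class 0, the majority class.
logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]

weights = inverse_frequency_weights(labels, num_classes=2)
unweighted = weighted_cross_entropy(logits, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(logits, labels, weights)
```

In this toy batch the weighted loss is higher than the unweighted one, because the single misclassified minority example counts as much as the three majority examples combined.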

	### Training params
	```python
	from transformers import TrainingArguments

	training_args = TrainingArguments(
	    output_dir="./router-checkpoints",  # placeholder path, not from the original run
	    learning_rate=1e-05,
	    lr_scheduler_type="cosine",
	    max_grad_norm=1.0,
	    num_train_epochs=3,
	    optim="adamw_torch_fused",
	    optim_args=None,
	    per_device_eval_batch_size=128,
	    per_device_train_batch_size=32,
	    prediction_loss_only=False,
	    warmup_ratio=0.05,
	    weight_decay=0.01,
	)
	```
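
For intuition about how `learning_rate`, `warmup_ratio`, and the cosine scheduler interact, the sketch below computes the learning rate at a given optimizer step: linear warmup over the first 5% of steps, then a half-cosine decay to zero. It is meant to mirror the shape of the cosine schedule in `transformers`, not to reproduce its code; the function name and the `total_steps` value are hypothetical, since the card does not report the number of training steps.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr=1e-5, warmup_ratio=0.05):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 3000  # hypothetical; depends on dataset size, batch size, and epochs
checkpoints = (0, 75, 150, 1575, 3000)
schedule = [cosine_lr_with_warmup(s, total_steps) for s in checkpoints]

for step, lr in zip(checkpoints, schedule):
    print(f"step {step:>4}: lr = {lr:.2e}")
```

The peak value of `1e-05` is reached only at the end of warmup, and the rate is already halved by the midpoint of the decay phase.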

	Citation
	--------

	If you use this model in your work, please consider citing it:

	```
	@misc{timofeyk_roberta-query-router-ecommerce,
	  author = {Timofey Klyubin},
	  title = {E-commerce Search Query Router},
	  year = {2025},
	  publisher = {Hugging Face},
	  journal = {Hugging Face repository},
	  howpublished = {\url{https://huggingface.co/timofeyk/roberta-query-router-ecommerce}}
	}
	```