---
license: mit
language:
- en
tags:
- roberta
- ecommerce
- classification
datasets:
- amazon-esci
base_model:
- FacebookAI/roberta-base
pipeline_tag: zero-shot-classification
library_name: transformers
---

E-commerce Search Query Router
==============================

Model Description
-----------------

This model is a classifier fine-tuned from `roberta-base` to determine the optimal search strategy for e-commerce queries. It assigns each query one of two labels:

- **`lexical_search`**: The query is best handled by a traditional keyword-based search engine, such as Lucene with BM25 ranking. These are typically specific queries like SKUs, exact product names, or part numbers.
- **`vector_search`**: The query is better suited to semantic, vector-based search. These are often ambiguous, conceptual, or "long-tail" queries where user intent matters more than specific keywords (e.g., "a gift for my dad who likes fishing").

The model is intended to serve as a "query router" in a hybrid search system, dynamically weighting the results from the lexical and vector search engines to improve relevance.

Intended Use & Use Case
-----------------------

The primary use case for this model is to power a hybrid search relevance system. The intended workflow is as follows:

1. A user enters a search query.
2. The query is sent to this classifier model.
3. The model returns probabilities for the `lexical_search` and `vector_search` classes.
4. These probabilities are used as weights to blend the relevance scores from two separate search backends (e.g., Solr and Qdrant).
5. The final, blended scores are used to rank the products shown to the user.

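Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the helper names `min_max_normalize` and `blend_scores`, the document IDs, and the scores are all hypothetical, and the min-max normalization step is an assumption added here because raw BM25 scores and vector similarities usually live on different scales.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one backend's scores to [0, 1] (an assumption, not from the model card)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Every document scored the same, so treat them as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blend_scores(
    lexical: Dict[str, float],
    vector: Dict[str, float],
    w_lexical: float,
    w_vector: float,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a document missing from a backend contributes 0."""
    lex = min_max_normalize(lexical)
    vec = min_max_normalize(vector)
    blended = {
        doc_id: w_lexical * lex.get(doc_id, 0.0) + w_vector * vec.get(doc_id, 0.0)
        for doc_id in set(lex) | set(vec)
    }
    # Highest blended score first.
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)


# Made-up backend results: document ID -> raw relevance score.
lexical_hits = {"sku-1": 12.0, "sku-2": 8.0, "sku-3": 4.0}
vector_hits = {"sku-2": 0.90, "sku-4": 0.80, "sku-3": 0.50}

# Weights as they might come from the classifier's two class probabilities.
ranking = blend_scores(lexical_hits, vector_hits, w_lexical=0.46, w_vector=0.54)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

A linear blend is only one option; rank-based fusion would avoid the normalization step but would use the classifier's probabilities differently.
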
### How to Use

Here's how to use the model with the `transformers` `pipeline` API:

```python
from transformers import pipeline

router_pipeline = pipeline(
    "text-classification",
    model="timofeyk/roberta-query-router-ecommerce",
    # top_k=None returns a score for every label
    # (it replaces the deprecated return_all_scores=True).
    top_k=None,
)

# Example of a conceptual query
conceptual_query = "father day gift"
# Example of a specific query
specific_query = "16x16 pillow cover"

queries = [conceptual_query, specific_query]

for q in queries:
    print(f"Predicting label for query: {q}")
    results = router_pipeline(q)
    # Depending on the transformers version, a single string yields either a
    # flat list of label dicts or that list wrapped in an outer list.
    label_scores = results[0] if isinstance(results[0], list) else results
    print(label_scores)
    # Example output (label order may differ):
    # [{'label': 'lexical_search', 'score': 0.46258628368377686}, {'label': 'vector_search', 'score': 0.5374137163162231}]

    scores = {item['label']: item['score'] for item in label_scores}
    w_vector = scores['vector_search']
    w_lexical = scores['lexical_search']

    print(f"Vector Search Weight: {w_vector:.2f}")
    print(f"Lexical Search Weight: {w_lexical:.2f}")
```

Training Data
-------------

This model was trained on a custom dataset of anonymized, real-world e-commerce queries, built from the Amazon ESCI dataset. The labels were generated programmatically from measured search performance, which gives the model its training signal:

1. A large set of user queries and their corresponding ground-truth "ideal" product results (based on user engagement) was collected.
2. Each query was executed against both a Solr (lexical) and a Qdrant (vector) search engine.
3. The **nDCG** relevance score was calculated for both result sets against the ground truth.
4. The query was labeled `lexical_search` if Solr achieved a higher nDCG score, and `vector_search` otherwise.

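The labeling rule in steps 3 and 4 can be sketched as below. The model card does not give the nDCG variant, the cutoff, or the relevance grades, so this sketch assumes linear gain with a log2 rank discount and a hypothetical cutoff of `k=10`; the function names, product IDs, and grades are made up for illustration. Ties go to `vector_search`, following the "otherwise" in step 4.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k of a ranking against graded ground-truth relevance."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def label_query(
    solr_ranking: List[str],
    qdrant_ranking: List[str],
    relevance: Dict[str, float],
    k: int = 10,
) -> str:
    """Label a query by whichever backend scored the higher nDCG."""
    lexical_ndcg = ndcg(solr_ranking, relevance, k)
    vector_ndcg = ndcg(qdrant_ranking, relevance, k)
    # Strictly higher lexical nDCG wins; everything else, ties included, is vector_search.
    return "lexical_search" if lexical_ndcg > vector_ndcg else "vector_search"


# Made-up ground truth (product ID -> relevance grade) and backend rankings.
relevance = {"p1": 3.0, "p2": 2.0, "p3": 1.0}
solr_ranking = ["p1", "p2", "p9"]
qdrant_ranking = ["p3", "p9", "p1"]

print(label_query(solr_ranking, qdrant_ranking, relevance))
```
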
Training Procedure
------------------

The model was fine-tuned using the Hugging Face `Trainer`. To account for potential class imbalance in the training data, a custom `Trainer` with a **weighted CrossEntropyLoss** was used, which discourages the model from simply favoring the majority class.

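To show what class weighting does numerically, here is a dependency-free sketch of weighted cross-entropy. It follows the reduction that PyTorch's `CrossEntropyLoss(weight=...)` applies with its default mean reduction (the weighted sum of per-example losses divided by the sum of the target-class weights). It is not the author's custom `Trainer`: the actual class weights are not stated in this card, and the weights, logits, and class-index assignment below are made up.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one example's logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def weighted_cross_entropy(
    batch_logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    class_weights: Sequence[float],
) -> float:
    """Weighted mean of per-example negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target in zip(batch_logits, targets):
        w = class_weights[target]
        weighted_sum += -w * log_softmax(logits)[target]
        weight_total += w
    return weighted_sum / weight_total


# Hypothetical index assignment: 0 = lexical_search, 1 = vector_search.
# Both examples get the same logits, which favor class 0, so the second
# example (true class 1) is misclassified.
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]

unweighted = weighted_cross_entropy(batch_logits, targets, [1.0, 1.0])
# Up-weighting class 1 (imagined here as the minority class) makes the
# mistake on the second example count for more.
weighted = weighted_cross_entropy(batch_logits, targets, [1.0, 3.0])

print(f"unweighted loss: {unweighted:.4f}")
print(f"weighted loss:   {weighted:.4f}")
```

With a larger weight on the under-represented class, errors on that class dominate the loss, so always predicting the majority class becomes more costly.
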
### Training params

```
TrainingArguments(
    learning_rate=1e-05,
    lr_scheduler_type=SchedulerType.COSINE,
    max_grad_norm=1.0,
    num_train_epochs=3,
    optim=OptimizerNames.ADAMW_TORCH_FUSED,
    optim_args=None,
    per_device_eval_batch_size=128,
    per_device_train_batch_size=32,
    prediction_loss_only=False,
    warmup_ratio=0.05,
    weight_decay=0.01,
)
```

Citation
--------

If you use this model in your work, please consider citing it:

```
@misc{timofeyk_roberta-query-router-ecommerce,
  author = {Timofey Klyubin},
  title = {E-commerce Search Query Router},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/timofeyk/roberta-query-router-ecommerce}}
}
```