# eCommerce Query Volume Classifier
A fine-tuned DeBERTa v3 base model that predicts the search volume class of ecommerce product queries. Trained on 39.6 million unique queries from the Amazon Shopping Queries dataset spanning 395.5 million search sessions.
Blog post: [Is Query Length a Reliable Predictor of Search Volume?](https://dejan.ai/blog/query-length-vs-volume/)
## Model Description
This model classifies ecommerce search queries into five volume tiers based on their expected search popularity:
| Label | Class | Occurrences | Description |
|---|---|---|---|
| 0 | very_high | 10,000+ | Head terms, major brands (e.g. "airpods", "laptop") |
| 1 | high | 1,000–9,999 | Popular product categories and well-known items |
| 2 | medium | 100–999 | Moderately specific queries |
| 3 | low | 10–99 | Niche or qualified queries |
| 4 | very_low | <10 | Long-tail, highly specific queries |
The model learns semantic signals — brand recognition, category head terms, specificity markers — rather than superficial features like query length. Simple character/word-count heuristics achieve only ~25% accuracy on this task (barely above the 20% random baseline), while this model achieves 72.1% accuracy.
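For comparison, the word-count baseline mentioned above can be sketched as a one-function heuristic. The bin edges below are illustrative assumptions (the exact binning used in the published comparison is not documented); the point is that any mapping from length to volume tier performs near chance.

```python
def length_heuristic(query: str) -> str:
    """Map word count to a volume class: shorter queries -> higher volume.
    Bin edges are illustrative, not the exact ones from the evaluation."""
    n = len(query.split())
    if n <= 1:
        return "very_high"
    elif n == 2:
        return "high"
    elif n == 3:
        return "medium"
    elif n <= 5:
        return "low"
    return "very_low"

print(length_heuristic("airpods"))        # very_high
print(length_heuristic("organic flurb capsules"))  # medium
```

A heuristic like this gets "airpods" right by accident, but it has no way to separate a popular two-word query ("wireless mouse") from an obscure one of the same length, which is why its accuracy stays near the random baseline.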
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "dejanseo/ecommerce-query-volume-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

labels = ["very_high", "high", "medium", "low", "very_low"]

queries = [
    "airpods",
    "wireless mouse",
    "organic flurb capsules",
    "replacement gasket for instant pot duo 8 quart",
]

inputs = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=32)
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
preds = torch.argmax(probs, dim=-1)

for query, pred, prob in zip(queries, preds, probs):
    label = labels[pred.item()]
    confidence = prob[pred.item()].item() * 100
    print(f"{query:50s} → {label:>10s} ({confidence:.1f}%)")
```
## Performance
### Evaluation (25K balanced sample, 5K per class)
| Method | Accuracy | Spearman ρ |
|---|---|---|
| This model | 72.1% | 0.896 |
| Word count heuristic | 25.4% | -0.345 |
| Char count heuristic | 24.9% | -0.336 |
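The Spearman ρ column treats the five classes as an ordinal scale (0 = very_high … 4 = very_low). As a minimal sketch with made-up toy labels (not the actual 25K evaluation sample), accuracy and a tie-aware Spearman correlation can be computed like this:

```python
def average_ranks(xs):
    """1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Pearson correlation of the rank vectors (handles ties)."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Toy ordinal labels, illustrative only: 0 = very_high ... 4 = very_low.
y_true = [0, 0, 1, 2, 3, 4, 4, 2]
y_pred = [0, 1, 1, 2, 4, 4, 3, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}  spearman={spearman(y_true, y_pred):.3f}")
```

Because neighboring-class confusions (e.g. low vs. medium) barely hurt rank correlation, a high ρ alongside moderate accuracy indicates the model's errors are mostly off by one tier rather than random.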
### Per-Class F1 Scores (best validation checkpoint)
| Class | Precision | Recall | F1 |
|---|---|---|---|
| very_high | 0.892 | 0.980 | 0.934 |
| high | 0.727 | 0.921 | 0.813 |
| medium | 0.625 | 0.790 | 0.698 |
| low | 0.496 | 0.335 | 0.400 |
| very_low | 0.610 | 0.579 | 0.594 |
The model performs best on the extremes (very high and very low volume) and struggles most with the low class, which sits in an ambiguous zone between medium and very_low.
## Training Details

### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | microsoft/deberta-v3-base |
| Epochs | 20 |
| Batch size | 128 |
| Learning rate | 3e-5 |
| Max sequence length | 32 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Label smoothing | 0.1 |
| Scheduler | Linear with warmup |
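The linear-with-warmup schedule in the table can be sketched as a simple function of training step. This is an illustrative reimplementation of the schedule's shape, not the internals of the Transformers `get_linear_schedule_with_warmup` helper:

```python
def linear_warmup_lr(step, total_steps, base_lr=3e-5, warmup_ratio=0.1):
    """Linear warmup from 0 to base_lr over the first warmup_ratio of steps,
    then linear decay back to 0 (the schedule listed in the table)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 1000
print(linear_warmup_lr(50, total))    # mid-warmup: half of base LR
print(linear_warmup_lr(100, total))   # warmup peak: full base LR
print(linear_warmup_lr(1000, total))  # end of training: 0.0
```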
### Sampling Strategy
Balanced sampling per epoch with different random seeds:
| Class | Samples per epoch |
|---|---|
| very_low | 100,000 |
| low | 100,000 |
| medium | 100,000 |
| high | 30,000 |
| very_high | 30,000 |
Total per epoch: 324,000 train / 36,000 validation
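One way to realize this scheme is to draw a fresh, quota-capped sample per class each epoch, seeding the RNG with the epoch index so the majority classes contribute a different subset every time. This is a hypothetical sketch; `sample_epoch`, `QUOTAS`, and `pool_by_class` are illustrative names, not the author's actual training code:

```python
import random

# Per-epoch quota per class, from the table above.
QUOTAS = {"very_low": 100_000, "low": 100_000, "medium": 100_000,
          "high": 30_000, "very_high": 30_000}

def sample_epoch(pool_by_class, epoch, base_seed=42):
    """Draw a balanced sample for one epoch. Seeding with base_seed + epoch
    gives each epoch a different random subset of the large classes, while
    small classes (below quota) are simply used in full."""
    rng = random.Random(base_seed + epoch)
    batch = []
    for cls, quota in QUOTAS.items():
        pool = pool_by_class[cls]
        batch.extend(rng.sample(pool, min(quota, len(pool))))
    rng.shuffle(batch)
    return batch
```

Resampling per epoch lets the model eventually see far more of the 34.7M very_low queries than a single fixed subsample would, without letting that class dominate any one epoch.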
### Training and Validation Curves

*(Figures omitted.)*
### Hardware
- GPU: NVIDIA GeForce RTX 4090 (24 GB)
- RAM: 128 GB
- OS: Windows 11
- Training time: ~2 hours 16 minutes
- Framework: PyTorch + Transformers 4.57.1
## Dataset
Amazon Shopping Queries (AmazonQAC) — 395.5 million sessions, 39.6 million unique queries. Volume classes derived from raw occurrence counts across sessions.
| Class | Unique Queries |
|---|---|
| very_high | ~18K |
| high | ~30K |
| medium | ~321K |
| low | ~4.6M |
| very_low | ~34.7M |
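Deriving a label from a raw occurrence count follows directly from the thresholds in the class table above; a minimal sketch:

```python
def volume_class(occurrences: int) -> str:
    """Map a query's raw session occurrence count to its volume tier,
    using the thresholds from the class table (10 / 100 / 1,000 / 10,000)."""
    if occurrences >= 10_000:
        return "very_high"
    if occurrences >= 1_000:
        return "high"
    if occurrences >= 100:
        return "medium"
    if occurrences >= 10:
        return "low"
    return "very_low"

print(volume_class(25_000))  # very_high
print(volume_class(7))       # very_low
```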
## What the Model Learns
The model captures semantic patterns rather than surface-level features like query length:
- Brand recognition: "airpods" → very high, regardless of character count
- Category head terms: "laptop", "headphones", "dog food" → recognized as high-volume entry points
- Specificity markers: Size specs, compatibility constraints, and material callouts signal niche demand
- Nonsense detection: Gibberish queries like "blorf" and "wireless blorf adapter" are correctly classified as very low volume, confirming the model isn't just counting characters
## Limitations
- Trained exclusively on Amazon product search queries — may not generalize well to Google web search, informational queries, or non-English markets
- The low volume class is the weakest (F1 ≈ 0.40), reflecting genuine ambiguity in the boundary between medium and very low volume queries
- Volume thresholds are based on the AmazonQAC dataset's session counts, which may not map directly to other volume scales (e.g. Google Keyword Planner)
- Product trends shift over time; queries that were high volume in the training data may not remain so
## Citation
```bibtex
@article{petrovic2026querylength,
  title={Is Query Length a Reliable Predictor of Search Volume?},
  author={Petrovic, Dan},
  year={2026},
  month={March},
  url={https://dejan.ai/blog/query-length-vs-volume/}
}
```
## Author
Dan Petrovic — DEJAN AI