# eCommerce Query Volume Classifier
A fine-tuned DeBERTa v3 base model that predicts the search volume class of ecommerce product queries. Trained on 39.6 million unique queries from the Amazon Shopping Queries dataset spanning 395.5 million search sessions.
Blog post: [Is Query Length a Reliable Predictor of Search Volume?](https://dejan.ai/blog/query-length-vs-volume/)
## Model Description
This model classifies ecommerce search queries into five volume tiers based on their expected search popularity:
| Label | Class | Occurrences | Description |
|---|---|---|---|
| 0 | very_high | 10,000+ | Head terms, major brands (e.g. "airpods", "laptop") |
| 1 | high | 1,000–9,999 | Popular product categories and well-known items |
| 2 | medium | 100–999 | Moderately specific queries |
| 3 | low | 10–99 | Niche or qualified queries |
| 4 | very_low | <10 | Long-tail, highly specific queries |
The model learns semantic signals — brand recognition, category head terms, specificity markers — rather than superficial features like query length. Simple character/word-count heuristics achieve only ~25% accuracy on this task (barely above the 20% random baseline), while this model achieves 72.1% accuracy.
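For comparison, the word-count baseline mentioned above can be sketched as a one-function heuristic. The bin edges below are illustrative assumptions (the exact binning used in the published comparison is not documented); the point is that any mapping from length to volume tier performs near chance.

```python
def length_heuristic(query: str) -> str:
    """Map word count to a volume class: shorter queries -> higher volume.
    Bin edges are illustrative, not the exact ones from the evaluation."""
    n = len(query.split())
    if n <= 1:
        return "very_high"
    elif n == 2:
        return "high"
    elif n == 3:
        return "medium"
    elif n <= 5:
        return "low"
    return "very_low"

print(length_heuristic("airpods"))        # very_high
print(length_heuristic("organic flurb capsules"))  # medium
```

A heuristic like this gets "airpods" right by accident, but it has no way to separate a popular two-word query ("wireless mouse") from an obscure one of the same length, which is why its accuracy stays near the random baseline.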
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "dejanseo/ecommerce-query-volume-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

labels = ["very_high", "high", "medium", "low", "very_low"]

queries = [
    "airpods",
    "wireless mouse",
    "organic flurb capsules",
    "replacement gasket for instant pot duo 8 quart",
]

inputs = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=32)
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
preds = torch.argmax(probs, dim=-1)

for query, pred, prob in zip(queries, preds, probs):
    label = labels[pred.item()]
    confidence = prob[pred.item()].item() * 100
    print(f"{query:50s} → {label:>10s} ({confidence:.1f}%)")
```
## Performance
### Evaluation (25K balanced sample, 5K per class)
| Method | Accuracy | Spearman ρ |
|---|---|---|
| This model | 72.1% | 0.896 |
| Word count heuristic | 25.4% | -0.345 |
| Char count heuristic | 24.9% | -0.336 |
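The Spearman ρ column treats the five classes as an ordinal scale (0 = very_high … 4 = very_low). As a minimal sketch with made-up toy labels (not the actual 25K evaluation sample), accuracy and a tie-aware Spearman correlation can be computed like this:

```python
def average_ranks(xs):
    """1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Pearson correlation of the rank vectors (handles ties)."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Toy ordinal labels, illustrative only: 0 = very_high ... 4 = very_low.
y_true = [0, 0, 1, 2, 3, 4, 4, 2]
y_pred = [0, 1, 1, 2, 4, 4, 3, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}  spearman={spearman(y_true, y_pred):.3f}")
```

Because neighboring-class confusions (e.g. low vs. medium) barely hurt rank correlation, a high ρ alongside moderate accuracy indicates the model's errors are mostly off by one tier rather than random.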
### Per-Class F1 Scores (best validation checkpoint)
| Class | Precision | Recall | F1 |
|---|---|---|---|
| very_high | 0.892 | 0.980 | 0.934 |
| high | 0.727 | 0.921 | 0.813 |
| medium | 0.625 | 0.790 | 0.698 |
| low | 0.496 | 0.335 | 0.400 |
| very_low | 0.610 | 0.579 | 0.594 |
The model performs best on the extremes (very high and very low volume) and struggles most with the low class, which sits in an ambiguous zone between medium and very_low.
## Training Details

### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | microsoft/deberta-v3-base |
| Epochs | 20 |
| Batch size | 128 |
| Learning rate | 3e-5 |
| Max sequence length | 32 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Label smoothing | 0.1 |
| Scheduler | Linear with warmup |
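The linear-with-warmup schedule in the table can be sketched as a simple function of training step. This is an illustrative reimplementation of the schedule's shape, not the internals of the Transformers `get_linear_schedule_with_warmup` helper:

```python
def linear_warmup_lr(step, total_steps, base_lr=3e-5, warmup_ratio=0.1):
    """Linear warmup from 0 to base_lr over the first warmup_ratio of steps,
    then linear decay back to 0 (the schedule listed in the table)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 1000
print(linear_warmup_lr(50, total))    # mid-warmup: half of base LR
print(linear_warmup_lr(100, total))   # warmup peak: full base LR
print(linear_warmup_lr(1000, total))  # end of training: 0.0
```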
### Sampling Strategy
Balanced sampling per epoch with different random seeds:
| Class | Samples per epoch |
|---|---|
| very_low | 100,000 |
| low | 100,000 |
| medium | 100,000 |
| high | 30,000 |
| very_high | 30,000 |
Total per epoch: 324,000 train / 36,000 validation
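One way to realize this scheme is to draw a fresh, quota-capped sample per class each epoch, seeding the RNG with the epoch index so the majority classes contribute a different subset every time. This is a hypothetical sketch; `sample_epoch`, `QUOTAS`, and `pool_by_class` are illustrative names, not the author's actual training code:

```python
import random

# Per-epoch quota per class, from the table above.
QUOTAS = {"very_low": 100_000, "low": 100_000, "medium": 100_000,
          "high": 30_000, "very_high": 30_000}

def sample_epoch(pool_by_class, epoch, base_seed=42):
    """Draw a balanced sample for one epoch. Seeding with base_seed + epoch
    gives each epoch a different random subset of the large classes, while
    small classes (below quota) are simply used in full."""
    rng = random.Random(base_seed + epoch)
    batch = []
    for cls, quota in QUOTAS.items():
        pool = pool_by_class[cls]
        batch.extend(rng.sample(pool, min(quota, len(pool))))
    rng.shuffle(batch)
    return batch
```

Resampling per epoch lets the model eventually see far more of the 34.7M very_low queries than a single fixed subsample would, without letting that class dominate any one epoch.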
### Training and Validation Curves

*(Figures omitted.)*
### Hardware
- GPU: NVIDIA GeForce RTX 4090 (24 GB)
- RAM: 128 GB
- OS: Windows 11
- Training time: ~2 hours 16 minutes
- Framework: PyTorch + Transformers 4.57.1
## Dataset
Amazon Shopping Queries (AmazonQAC) — 395.5 million sessions, 39.6 million unique queries. Volume classes derived from raw occurrence counts across sessions.
| Class | Unique Queries |
|---|---|
| very_high | ~18K |
| high | ~30K |
| medium | ~321K |
| low | ~4.6M |
| very_low | ~34.7M |
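Deriving a label from a raw occurrence count follows directly from the thresholds in the class table above; a minimal sketch:

```python
def volume_class(occurrences: int) -> str:
    """Map a query's raw session occurrence count to its volume tier,
    using the thresholds from the class table (10 / 100 / 1,000 / 10,000)."""
    if occurrences >= 10_000:
        return "very_high"
    if occurrences >= 1_000:
        return "high"
    if occurrences >= 100:
        return "medium"
    if occurrences >= 10:
        return "low"
    return "very_low"

print(volume_class(25_000))  # very_high
print(volume_class(7))       # very_low
```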
## What the Model Learns
The model captures semantic patterns rather than surface-level features like query length:
- Brand recognition: "airpods" → very high, regardless of character count
- Category head terms: "laptop", "headphones", "dog food" → recognized as high-volume entry points
- Specificity markers: Size specs, compatibility constraints, and material callouts signal niche demand
- Nonsense detection: Gibberish queries like "blorf" and "wireless blorf adapter" are correctly classified as very low volume, confirming the model isn't just counting characters
## Limitations
- Trained exclusively on Amazon product search queries — may not generalize well to Google web search, informational queries, or non-English markets
- The low volume class is the weakest (F1 ≈ 0.40), reflecting genuine ambiguity in the boundary between medium and very low volume queries
- Volume thresholds are based on the AmazonQAC dataset's session counts, which may not map directly to other volume scales (e.g. Google Keyword Planner)
- Product trends shift over time; queries that were high volume in the training data may not remain so
## Citation
```bibtex
@article{petrovic2026querylength,
  title={Is Query Length a Reliable Predictor of Search Volume?},
  author={Petrovic, Dan},
  year={2026},
  month={March},
  url={https://dejan.ai/blog/query-length-vs-volume/}
}
```
## Author
Dan Petrovic — DEJAN AI