eCommerce Query Volume Classifier

A fine-tuned DeBERTa v3 base model that predicts the search volume class of ecommerce product queries. Trained on 39.6 million unique queries from the Amazon Shopping Queries dataset spanning 395.5 million search sessions.

Blog post: Is Query Length a Reliable Predictor of Search Volume?

Model Description

This model classifies ecommerce search queries into five volume tiers based on their expected search popularity:

| Label | Class | Occurrences | Description |
|-------|-------|-------------|-------------|
| 0 | very_high | 10,000+ | Head terms, major brands (e.g. "airpods", "laptop") |
| 1 | high | 1,000–9,999 | Popular product categories and well-known items |
| 2 | medium | 100–999 | Moderately specific queries |
| 3 | low | 10–99 | Niche or qualified queries |
| 4 | very_low | <10 | Long-tail, highly specific queries |
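The occurrence thresholds in the table map to tiers with a simple bucketing function. This is a sketch for illustration; the actual label-derivation script is not published:

```python
def volume_class(occurrences: int) -> str:
    """Map a raw session occurrence count to its volume tier,
    using the thresholds from the table above."""
    if occurrences >= 10_000:
        return "very_high"
    if occurrences >= 1_000:
        return "high"
    if occurrences >= 100:
        return "medium"
    if occurrences >= 10:
        return "low"
    return "very_low"
```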

The model learns semantic signals — brand recognition, category head terms, specificity markers — rather than superficial features like query length. Simple character/word-count heuristics achieve only ~25% accuracy on this task (barely above the 20% random baseline), while this model achieves 72.1% accuracy.
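For reference, a length-based baseline of the kind dismissed here might look like the following. The cutoffs are illustrative, not the ones used in the published evaluation:

```python
def word_count_baseline(query: str) -> str:
    """Naive heuristic: assume shorter queries are higher volume.
    Cutoffs are illustrative only."""
    n = len(query.split())
    if n <= 1:
        return "very_high"
    if n == 2:
        return "high"
    if n == 3:
        return "medium"
    if n == 4:
        return "low"
    return "very_low"
```

A rule like this misclassifies short but rare queries (e.g. gibberish single words) as high volume, which is why length heuristics barely beat the random baseline.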

Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "dejanseo/ecommerce-query-volume-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

labels = ["very_high", "high", "medium", "low", "very_low"]

queries = [
    "airpods",
    "wireless mouse",
    "organic flurb capsules",
    "replacement gasket for instant pot duo 8 quart",
]

inputs = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=32)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    preds = torch.argmax(probs, dim=-1)

for query, pred, prob in zip(queries, preds, probs):
    label = labels[pred.item()]
    confidence = prob[pred.item()].item() * 100
    print(f"{query:50s}{label:>10s}  ({confidence:.1f}%)")
```
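Since the five classes are ordinal, the probabilities can also be collapsed into a single score instead of taking the argmax, e.g. a probability-weighted mean of the tier indices (0 = very_high … 4 = very_low). This is a post-processing suggestion, not part of the model:

```python
def expected_tier(probs: list[float]) -> float:
    """Probability-weighted mean of tier indices (0 = very_high ... 4 = very_low).
    Lower values mean higher expected search volume."""
    return sum(i * p for i, p in enumerate(probs))

# Example: probability mass split mostly between "high" (1) and "medium" (2)
score = expected_tier([0.05, 0.50, 0.35, 0.07, 0.03])
```

This gives a continuous ranking signal, which is handy when sorting a keyword list by expected volume rather than bucketing it.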

Performance

Evaluation (25K balanced sample, 5K per class)

| Method | Accuracy | Spearman ρ |
|--------|----------|------------|
| This model | 72.1% | 0.896 |
| Word count heuristic | 25.4% | -0.345 |
| Char count heuristic | 24.9% | -0.336 |
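The Spearman ρ values above compare predicted and true tier indices. A minimal pure-Python version with average ranks for ties (the standard tie treatment, since five classes produce many ties) is:

```python
def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks,
    with tied values assigned their average 1-based rank."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank of the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice `scipy.stats.spearmanr` does the same computation; the hand-rolled version just makes the ranking step explicit.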

Per-Class Metrics (best validation checkpoint)

| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| very_high | 0.892 | 0.980 | 0.934 |
| high | 0.727 | 0.921 | 0.813 |
| medium | 0.625 | 0.790 | 0.698 |
| low | 0.496 | 0.335 | 0.400 |
| very_low | 0.610 | 0.579 | 0.594 |
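The unweighted mean of the per-class F1 scores recovers the macro F1 reported for this model:

```python
per_class_f1 = {
    "very_high": 0.934,
    "high": 0.813,
    "medium": 0.698,
    "low": 0.400,
    "very_low": 0.594,
}

# Macro F1: unweighted mean over classes
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)  # ≈ 0.688
```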

The model performs best on the extremes (very high and very low volume) and struggles most with the low class, which sits in an ambiguous zone between medium and very_low.

Training Details

Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | microsoft/deberta-v3-base |
| Epochs | 20 |
| Batch size | 128 |
| Learning rate | 3e-5 |
| Max sequence length | 32 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Label smoothing | 0.1 |
| Scheduler | Linear with warmup |
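Assuming the Hugging Face Trainer API was used (the card lists PyTorch + Transformers; the actual training script is not published), these hyperparameters map to a `TrainingArguments` configuration roughly like the following. `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameter table above.
args = TrainingArguments(
    output_dir="ecommerce-query-volume-classifier",
    num_train_epochs=20,
    per_device_train_batch_size=128,
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    label_smoothing_factor=0.1,   # smoothed cross-entropy
    lr_scheduler_type="linear",   # linear decay after warmup
)
```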

Sampling Strategy

Balanced sampling per epoch with different random seeds:

| Class | Samples per epoch |
|-------|-------------------|
| very_low | 100,000 |
| low | 100,000 |
| medium | 100,000 |
| high | 30,000 |
| very_high | 30,000 |

Total: 360,000 queries sampled per epoch, split 324,000 train / 36,000 validation (90/10)
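The per-epoch resampling can be sketched as follows, assuming each class has a pool of query strings. The real data pipeline is not published; the epoch-as-seed trick below is one simple way to get a different balanced draw each epoch:

```python
import random

SAMPLES_PER_CLASS = {
    "very_low": 100_000, "low": 100_000, "medium": 100_000,
    "high": 30_000, "very_high": 30_000,
}

def epoch_sample(pools: dict[str, list[str]], epoch: int) -> list[str]:
    """Draw a fresh balanced sample each epoch via an epoch-dependent seed."""
    rng = random.Random(epoch)
    batch: list[str] = []
    for cls, n in SAMPLES_PER_CLASS.items():
        pool = pools[cls]
        # Small classes are capped at their pool size.
        batch.extend(rng.sample(pool, min(n, len(pool))))
    rng.shuffle(batch)
    return batch
```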

Training Curves

(Figures: training loss; training loss per class)

Validation Curves

(Figures: validation loss; validation F1)

Hardware

  • GPU: NVIDIA GeForce RTX 4090 (24 GB)
  • RAM: 128 GB
  • OS: Windows 11
  • Training time: ~2 hours 16 minutes
  • Framework: PyTorch + Transformers 4.57.1

Dataset

Amazon Shopping Queries (AmazonQAC) — 395.5 million sessions, 39.6 million unique queries. Volume classes derived from raw occurrence counts across sessions.

| Class | Unique Queries |
|-------|----------------|
| very_high | ~18K |
| high | ~30K |
| medium | ~321K |
| low | ~4.6M |
| very_low | ~34.7M |

What the Model Learns

The model captures semantic patterns rather than surface-level features like query length:

  • Brand recognition: "airpods" → very high, regardless of character count
  • Category head terms: "laptop", "headphones", "dog food" → recognized as high-volume entry points
  • Specificity markers: Size specs, compatibility constraints, and material callouts signal niche demand
  • Nonsense detection: Gibberish queries like "blorf" and "wireless blorf adapter" are correctly classified as very low volume, confirming the model isn't just counting characters

Limitations

  • Trained exclusively on Amazon product search queries — may not generalize well to Google web search, informational queries, or non-English markets
  • The low volume class is the weakest (F1 ≈ 0.40), reflecting genuine ambiguity in the boundary between medium and very low volume queries
  • Volume thresholds are based on the Amazon QAC dataset's session counts, which may not map directly to other volume scales (e.g. Google Keyword Planner)
  • Product trends shift over time; queries that were high volume in the training data may not remain so

Citation

```bibtex
@article{petrovic2026querylength,
  title={Is Query Length a Reliable Predictor of Search Volume?},
  author={Petrovic, Dan},
  year={2026},
  month={March},
  url={https://dejan.ai/blog/query-length-vs-volume/}
}
```

Author

Dan Petrovic (DEJAN AI)

Model size: 0.2B params (F32, safetensors)

Evaluation Results (self-reported, on Amazon Shopping Queries / AmazonQAC)

  • Accuracy: 0.721
  • Macro F1: 0.688
  • Spearman correlation: 0.896