---
license: other
license_name: link-attribution
license_link: https://dejan.ai/blog/query-length-vs-volume/
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- deberta-v2
- deberta-v3
- ecommerce
- search
- query-volume
- seo
- keyword-research
- amazon
base_model: microsoft/deberta-v3-base
datasets:
- amazon/AmazonQAC
metrics:
- accuracy
- f1
model-index:
- name: ecommerce-query-volume-classifier
  results:
  - task:
      type: text-classification
      name: Search Query Volume Classification
    dataset:
      name: Amazon Shopping Queries (AmazonQAC)
      type: amazon/AmazonQAC
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.721
    - name: Macro F1
      type: f1
      value: 0.6877
    - name: Spearman Correlation
      type: spearmanr
      value: 0.896
---

# eCommerce Query Volume Classifier

A fine-tuned [DeBERTa v3 base](https://huggingface.co/microsoft/deberta-v3-base) model that predicts the search volume class of ecommerce product queries. Trained on 39.6 million unique queries from the [Amazon Shopping Queries](https://huggingface.co/datasets/amazon/AmazonQAC) dataset spanning 395.5 million search sessions.

**Blog post:** [Is Query Length a Reliable Predictor of Search Volume?](https://dejan.ai/blog/query-length-vs-volume/)

## Model Description

This model classifies ecommerce search queries into five volume tiers based on their expected search popularity:

| Label | Class | Occurrences | Description |
|-------|-------|-------------|-------------|
| 0 | `very_high` | 10,000+ | Head terms, major brands (e.g. "airpods", "laptop") |
| 1 | `high` | 1,000–9,999 | Popular product categories and well-known items |
| 2 | `medium` | 100–999 | Moderately specific queries |
| 3 | `low` | 10–99 | Niche or qualified queries |
| 4 | `very_low` | <10 | Long-tail, highly specific queries |

The model learns semantic signals — brand recognition, category head terms, specificity markers — rather than superficial features like query length.
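The tier boundaries above can be expressed as a simple binning function. The sketch below maps a raw occurrence count to the model's label id; the function name `volume_class` is illustrative, not part of the released code.

```python
def volume_class(occurrences: int) -> int:
    """Map a raw session occurrence count to a label id (0-4),
    using the thresholds from the tier table above."""
    if occurrences >= 10_000:
        return 0  # very_high
    if occurrences >= 1_000:
        return 1  # high
    if occurrences >= 100:
        return 2  # medium
    if occurrences >= 10:
        return 3  # low
    return 4  # very_low
```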
Simple character/word-count heuristics achieve only ~25% accuracy on this task (barely above the 20% random baseline), while this model achieves **72.1% accuracy**.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "dejanseo/ecommerce-query-volume-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

labels = ["very_high", "high", "medium", "low", "very_low"]

queries = [
    "airpods",
    "wireless mouse",
    "organic flurb capsules",
    "replacement gasket for instant pot duo 8 quart",
]

inputs = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=32)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    preds = torch.argmax(probs, dim=-1)

for query, pred, prob in zip(queries, preds, probs):
    label = labels[pred.item()]
    confidence = prob[pred.item()].item() * 100
    print(f"{query:50s} → {label:>10s} ({confidence:.1f}%)")
```

## Performance

### Evaluation (25K balanced sample, 5K per class)

| Method | Accuracy | Spearman ρ |
|--------|----------|------------|
| **This model** | **72.1%** | **0.896** |
| Word count heuristic | 25.4% | -0.345 |
| Char count heuristic | 24.9% | -0.336 |

### Per-Class F1 Scores (best validation checkpoint)

| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| very_high | 0.892 | 0.980 | 0.934 |
| high | 0.727 | 0.921 | 0.813 |
| medium | 0.625 | 0.790 | 0.698 |
| low | 0.496 | 0.335 | 0.400 |
| very_low | 0.610 | 0.579 | 0.594 |

The model performs best on the extremes (very high and very low volume) and struggles most with the `low` class, which sits in an ambiguous zone between `medium` and `very_low`.
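For intuition on why the length heuristics in the table score so poorly, here is one way such a baseline can be set up: assign shorter queries to higher-volume classes and score against gold labels. Both the word-count mapping and the tiny labelled sample below are illustrative assumptions, not the exact baseline or data from the blog post.

```python
def word_count_heuristic(query: str) -> int:
    """Naive baseline: fewer words -> higher assumed volume (label ids 0-4)."""
    n = len(query.split())
    return min(n - 1, 4) if n >= 1 else 4

def accuracy(preds, golds):
    """Fraction of exact label matches."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Toy labelled sample (gold labels are made up for illustration)
sample = [
    ("airpods", 0),
    ("wireless mouse", 1),
    ("usb c hub", 2),
    ("organic flurb capsules", 4),
    ("replacement gasket for instant pot duo 8 quart", 4),
]
preds = [word_count_heuristic(q) for q, _ in sample]
golds = [g for _, g in sample]
print(f"heuristic accuracy on toy sample: {accuracy(preds, golds):.2f}")
```

On real data the correlation between length and volume is weak and partly inverted, which is why the heuristic lands near the random baseline while the model does not.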
## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | `microsoft/deberta-v3-base` |
| Epochs | 20 |
| Batch size | 128 |
| Learning rate | 3e-5 |
| Max sequence length | 32 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Label smoothing | 0.1 |
| Scheduler | Linear with warmup |

### Sampling Strategy

Balanced sampling per epoch with different random seeds:

| Class | Samples per epoch |
|-------|-------------------|
| very_low | 100,000 |
| low | 100,000 |
| medium | 100,000 |
| high | 30,000 |
| very_high | 30,000 |

**Total per epoch:** 324,000 train / 36,000 validation

### Training Curves

![Training Loss](https://huggingface.co/dejanseo/ecommerce-query-volume-classifier/resolve/main/assets/train1.png)
![Training Loss per Class](https://huggingface.co/dejanseo/ecommerce-query-volume-classifier/resolve/main/assets/train2.png)

### Validation Curves

![Validation Loss](https://huggingface.co/dejanseo/ecommerce-query-volume-classifier/resolve/main/assets/val1.png)
![Validation F1](https://huggingface.co/dejanseo/ecommerce-query-volume-classifier/resolve/main/assets/val2.png)

### Hardware

- **GPU:** NVIDIA GeForce RTX 4090 (24 GB)
- **RAM:** 128 GB
- **OS:** Windows 11
- **Training time:** ~2 hours 16 minutes
- **Framework:** PyTorch + Transformers 4.57.1

### Dataset

[Amazon Shopping Queries (AmazonQAC)](https://huggingface.co/datasets/amazon/AmazonQAC) — 395.5 million sessions, 39.6 million unique queries. Volume classes derived from raw occurrence counts across sessions.
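The per-epoch balanced sampling described above can be sketched as follows. This is an outline under stated assumptions: `pools` holds per-class lists of training examples, and the epoch-dependent seed reseeds the draw so each epoch sees a different balanced subset; the helper names are illustrative, not from the released training code.

```python
import random

# Per-epoch sample quotas from the Sampling Strategy table
EPOCH_QUOTAS = {
    "very_low": 100_000,
    "low": 100_000,
    "medium": 100_000,
    "high": 30_000,
    "very_high": 30_000,
}

def sample_epoch(pools, epoch, seed=42):
    """Draw a fresh class-balanced sample for one epoch.

    Uses an epoch-dependent seed so every epoch sees a different
    random subset while remaining reproducible for a fixed seed.
    """
    rng = random.Random(seed + epoch)
    batch = []
    for cls, quota in EPOCH_QUOTAS.items():
        pool = pools[cls]
        batch.extend(rng.sample(pool, min(quota, len(pool))))
    rng.shuffle(batch)
    return batch
```

Capping the head classes at 30,000 while drawing 100,000 from the tail classes keeps the classifier from simply memorizing the (hugely imbalanced) raw class distribution.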
| Class | Unique Queries |
|-------|---------------|
| very_high | ~18K |
| high | ~30K |
| medium | ~321K |
| low | ~4.6M |
| very_low | ~34.7M |

## What the Model Learns

The model captures semantic patterns rather than surface-level features like query length:

- **Brand recognition:** "airpods" → very high, regardless of character count
- **Category head terms:** "laptop", "headphones", "dog food" → recognized as high-volume entry points
- **Specificity markers:** Size specs, compatibility constraints, and material callouts signal niche demand
- **Nonsense detection:** Gibberish queries like "blorf" and "wireless blorf adapter" are correctly classified as very low volume, confirming the model isn't just counting characters

## Limitations

- Trained exclusively on Amazon product search queries — may not generalize well to Google web search, informational queries, or non-English markets
- The `low` volume class is the weakest (F1 ≈ 0.39), reflecting genuine ambiguity in the boundary between medium and very low volume queries
- Volume thresholds are based on the Amazon QAC dataset's session counts, which may not map directly to other volume scales (e.g. Google Keyword Planner)
- Product trends shift over time; queries that were high volume in the training data may not remain so

## Citation

```bibtex
@article{petrovic2026querylength,
  title={Is Query Length a Reliable Predictor of Search Volume?},
  author={Petrovic, Dan},
  year={2026},
  month={March},
  url={https://dejan.ai/blog/query-length-vs-volume/}
}
```

## Author

**Dan Petrovic** — [DEJAN AI](https://dejan.ai/)