---
license: other
license_name: link-attribution
license_link: https://dejan.ai/blog/query-length-vs-volume/
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- deberta-v2
- deberta-v3
- ecommerce
- search
- query-volume
- seo
- keyword-research
- amazon
base_model: microsoft/deberta-v3-base
datasets:
- amazon/AmazonQAC
metrics:
- accuracy
- f1
model-index:
- name: ecommerce-query-volume-classifier
  results:
  - task:
      type: text-classification
      name: Search Query Volume Classification
    dataset:
      name: Amazon Shopping Queries (AmazonQAC)
      type: amazon/AmazonQAC
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.721
    - name: Macro F1
      type: f1
      value: 0.6877
    - name: Spearman Correlation
      type: spearmanr
      value: 0.896
---

# eCommerce Query Volume Classifier

A fine-tuned [DeBERTa v3 base](https://huggingface.co/microsoft/deberta-v3-base) model that predicts the search volume class of ecommerce product queries. Trained on 39.6 million unique queries from the [Amazon Shopping Queries](https://huggingface.co/datasets/amazon/AmazonQAC) dataset spanning 395.5 million search sessions.

**Blog post:** [Is Query Length a Reliable Predictor of Search Volume?](https://dejan.ai/blog/query-length-vs-volume/)

## Model Description

This model classifies ecommerce search queries into five volume tiers based on their expected search popularity:

| Label | Class | Occurrences | Description |
|-------|-------|-------------|-------------|
| 0 | `very_high` | 10,000+ | Head terms, major brands (e.g. "airpods", "laptop") |
| 1 | `high` | 1,000–9,999 | Popular product categories and well-known items |
| 2 | `medium` | 100–999 | Moderately specific queries |
| 3 | `low` | 10–99 | Niche or qualified queries |
| 4 | `very_low` | <10 | Long-tail, highly specific queries |

The model learns semantic signals — brand recognition, category head terms, specificity markers — rather than superficial features like query length.
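The tier boundaries above can be expressed as a simple binning function. The sketch below maps a raw occurrence count to the model's label id; the function name `volume_class` is illustrative, not part of the released code.

```python
def volume_class(occurrences: int) -> int:
    """Map a raw session occurrence count to a label id (0-4),
    using the thresholds from the tier table above."""
    if occurrences >= 10_000:
        return 0  # very_high
    if occurrences >= 1_000:
        return 1  # high
    if occurrences >= 100:
        return 2  # medium
    if occurrences >= 10:
        return 3  # low
    return 4  # very_low
```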
Simple character/word-count heuristics achieve only ~25% accuracy on this task (barely above the 20% random baseline), while this model achieves **72.1% accuracy**.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "dejanseo/ecommerce-query-volume-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

labels = ["very_high", "high", "medium", "low", "very_low"]

queries = [
    "airpods",
    "wireless mouse",
    "organic flurb capsules",
    "replacement gasket for instant pot duo 8 quart",
]

inputs = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=32)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    preds = torch.argmax(probs, dim=-1)

for query, pred, prob in zip(queries, preds, probs):
    label = labels[pred.item()]
    confidence = prob[pred.item()].item() * 100
    print(f"{query:50s} → {label:>10s} ({confidence:.1f}%)")
```

## Performance

### Evaluation (25K balanced sample, 5K per class)

| Method | Accuracy | Spearman ρ |
|--------|----------|------------|
| **This model** | **72.1%** | **0.896** |
| Word count heuristic | 25.4% | -0.345 |
| Char count heuristic | 24.9% | -0.336 |

### Per-Class F1 Scores (best validation checkpoint)

| Class | Precision | Recall | F1 |
|-------|-----------|--------|----|
| very_high | 0.892 | 0.980 | 0.934 |
| high | 0.727 | 0.921 | 0.813 |
| medium | 0.625 | 0.790 | 0.698 |
| low | 0.496 | 0.335 | 0.400 |
| very_low | 0.610 | 0.579 | 0.594 |

The model performs best on the extremes (very high and very low volume) and struggles most with the `low` class, which sits in an ambiguous zone between `medium` and `very_low`.
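For intuition on why the length heuristics in the table score so poorly, here is one way such a baseline can be set up: assign shorter queries to higher-volume classes and score against gold labels. Both the word-count mapping and the tiny labelled sample below are illustrative assumptions, not the exact baseline or data from the blog post.

```python
def word_count_heuristic(query: str) -> int:
    """Naive baseline: fewer words -> higher assumed volume (label ids 0-4)."""
    n = len(query.split())
    return min(n - 1, 4) if n >= 1 else 4

def accuracy(preds, golds):
    """Fraction of exact label matches."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Toy labelled sample (gold labels are made up for illustration)
sample = [
    ("airpods", 0),
    ("wireless mouse", 1),
    ("usb c hub", 2),
    ("organic flurb capsules", 4),
    ("replacement gasket for instant pot duo 8 quart", 4),
]
preds = [word_count_heuristic(q) for q, _ in sample]
golds = [g for _, g in sample]
print(f"heuristic accuracy on toy sample: {accuracy(preds, golds):.2f}")
```

On real data the correlation between length and volume is weak and partly inverted, which is why the heuristic lands near the random baseline while the model does not.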
## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | `microsoft/deberta-v3-base` |
| Epochs | 20 |
| Batch size | 128 |
| Learning rate | 3e-5 |
| Max sequence length | 32 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Label smoothing | 0.1 |
| Scheduler | Linear with warmup |

### Sampling Strategy

Balanced sampling per epoch with different random seeds:

| Class | Samples per epoch |
|-------|-------------------|
| very_low | 100,000 |
| low | 100,000 |
| medium | 100,000 |
| high | 30,000 |
| very_high | 30,000 |

**Total per epoch:** 324,000 train / 36,000 validation

### Training Curves

![Training Loss](https://huggingface.co/dejanseo/ecommerce-query-volume-classifier/resolve/main/assets/train1.png)
![Training Loss per Class](https://huggingface.co/dejanseo/ecommerce-query-volume-classifier/resolve/main/assets/train2.png)

### Validation Curves

![Validation Loss](https://huggingface.co/dejanseo/ecommerce-query-volume-classifier/resolve/main/assets/val1.png)
![Validation F1](https://huggingface.co/dejanseo/ecommerce-query-volume-classifier/resolve/main/assets/val2.png)

### Hardware

- **GPU:** NVIDIA GeForce RTX 4090 (24 GB)
- **RAM:** 128 GB
- **OS:** Windows 11
- **Training time:** ~2 hours 16 minutes
- **Framework:** PyTorch + Transformers 4.57.1

### Dataset

[Amazon Shopping Queries (AmazonQAC)](https://huggingface.co/datasets/amazon/AmazonQAC) — 395.5 million sessions, 39.6 million unique queries. Volume classes derived from raw occurrence counts across sessions.
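The per-epoch balanced sampling described above can be sketched as follows. This is an outline under stated assumptions: `pools` holds per-class lists of training examples, and the epoch-dependent seed reseeds the draw so each epoch sees a different balanced subset; the helper names are illustrative, not from the released training code.

```python
import random

# Per-epoch sample quotas from the Sampling Strategy table
EPOCH_QUOTAS = {
    "very_low": 100_000,
    "low": 100_000,
    "medium": 100_000,
    "high": 30_000,
    "very_high": 30_000,
}

def sample_epoch(pools, epoch, seed=42):
    """Draw a fresh class-balanced sample for one epoch.

    Uses an epoch-dependent seed so every epoch sees a different
    random subset while remaining reproducible for a fixed seed.
    """
    rng = random.Random(seed + epoch)
    batch = []
    for cls, quota in EPOCH_QUOTAS.items():
        pool = pools[cls]
        batch.extend(rng.sample(pool, min(quota, len(pool))))
    rng.shuffle(batch)
    return batch
```

Capping the head classes at 30,000 while drawing 100,000 from the tail classes keeps the classifier from simply memorizing the (hugely imbalanced) raw class distribution.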
| Class | Unique Queries |
|-------|---------------|
| very_high | ~18K |
| high | ~30K |
| medium | ~321K |
| low | ~4.6M |
| very_low | ~34.7M |

## What the Model Learns

The model captures semantic patterns rather than surface-level features like query length:

- **Brand recognition:** "airpods" → very high, regardless of character count
- **Category head terms:** "laptop", "headphones", "dog food" → recognized as high-volume entry points
- **Specificity markers:** Size specs, compatibility constraints, and material callouts signal niche demand
- **Nonsense detection:** Gibberish queries like "blorf" and "wireless blorf adapter" are correctly classified as very low volume, confirming the model isn't just counting characters

## Limitations

- Trained exclusively on Amazon product search queries — may not generalize well to Google web search, informational queries, or non-English markets
- The `low` volume class is the weakest (F1 ≈ 0.39), reflecting genuine ambiguity in the boundary between medium and very low volume queries
- Volume thresholds are based on the Amazon QAC dataset's session counts, which may not map directly to other volume scales (e.g. Google Keyword Planner)
- Product trends shift over time; queries that were high volume in the training data may not remain so

## Citation

```bibtex
@article{petrovic2026querylength,
  title={Is Query Length a Reliable Predictor of Search Volume?},
  author={Petrovic, Dan},
  year={2026},
  month={March},
  url={https://dejan.ai/blog/query-length-vs-volume/}
}
```

## Author

**Dan Petrovic** — [DEJAN AI](https://dejan.ai/)