---
language:
- en
- zh
license: mit
tags:
- url-classification
- binary-classification
- autoresearch
- multi-domain
metrics:
- accuracy
model_index:
- name: url-classifier-v2
  results:
  - task:
      type: text-classification
      name: URL Binary Classification (Multi-Domain)
    dataset:
      type: "synthetic-diverse (26 domains)"
      name: URL Classification Diverse Dataset
    metrics:
    - type: accuracy
      value: 1.0000
---
# URL Classifier v2 — Autoresearch (Multi-Domain)
Binary classifier that predicts whether a URL is a **list page (A)** or a **detail page (B)**.
Trained on **26 diverse domains** across e-commerce, recruitment, news, social, video, travel, education, and tech documentation, which improves generalization over the single-domain v1 model.
## Model Details
- **Architecture**: Custom transformer (Autoresearch framework)
- **Parameters**: ~161M
- **Depth**: 4 layers
- **Model dim**: 384
- **Vocab**: cl100k_base (100,277 tokens)
- **Max seq len**: 64
- **Training**: 30 min on RTX 4060 Laptop
- **Training samples**: 2,600 (A=1,300, B=1,300)
- **Training accuracy**: 100%
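
Because the maximum sequence length is 64, every tokenized URL has to be cut or padded to that length before it reaches the model. The following is a minimal sketch of that step, not the code in `src/infer.py`: it assumes right-truncation and right-padding with a hypothetical `PAD_ID` of 0, and `fit_to_length` and `encode_url` are illustrative names that do not appear in this repository.

```python
MAX_SEQ_LEN = 64  # from the model details above
PAD_ID = 0        # assumption: the real padding token id is not documented


def fit_to_length(token_ids, max_len=MAX_SEQ_LEN, pad_id=PAD_ID):
    """Truncate or right-pad a list of token ids to exactly max_len."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


def encode_url(url):
    """Tokenize a URL with cl100k_base, then fit it to the model's length."""
    import tiktoken  # imported lazily so fit_to_length works without it

    enc = tiktoken.get_encoding("cl100k_base")
    return fit_to_length(enc.encode(url))
```

Long URLs lose their tail under this scheme, which may matter when the distinguishing part of the path or query string comes late.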
## Supported Domains
| Category | Domains |
|----------|---------|
| E-commerce | Amazon, JD, Taobao, Tmall, Pinduoduo |
| Recruitment | Zhilian, BOSS, Lagou |
| News | Sina, NetEase, Tencent News, 36kr |
| Social | Zhihu, Douban, Xiaohongshu, Reddit |
| Video | YouTube, Bilibili |
| Travel | Ctrip, Qunar, Mafengwo |
| Education | icourse163, imooc |
| Tech Docs | GitHub, ReadTheDocs, MDN |
## Usage
```bash
pip install torch tiktoken
python src/infer.py "https://example.com/product/123" # detail page
python src/infer.py "https://example.com/search?q=foo" # list page
```
## Class Labels
| Label | Meaning |
|-------|---------|
| 0 (A) | List page — search results, category pages, rankings |
| 1 (B) | Detail page — product page, article, profile, video |
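
The table above can be turned into a small lookup for post-processing. This sketch assumes the model produces two scores per URL, one per class, which this card does not confirm; `LABELS` and `decode_prediction` are hypothetical names used only for illustration.

```python
# Class index -> (letter, meaning), copied from the table above.
LABELS = {
    0: ("A", "list page"),
    1: ("B", "detail page"),
}


def decode_prediction(scores):
    """Map a pair of class scores to (class_index, letter, meaning)."""
    if len(scores) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(scores)}")
    # Pick the index of the largest score (argmax).
    index = max(range(len(scores)), key=lambda i: scores[i])
    letter, meaning = LABELS[index]
    return index, letter, meaning


print(decode_prediction([0.2, 1.7]))  # (1, 'B', 'detail page')
```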
## Limitations
- Bilibili ranking pages may be misclassified as detail pages
- Very short URLs or URL shorteners may have lower accuracy
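- Third-party evaluation accuracy is only ~55%, barely above the 50% expected by chance on a balanced binary task, so the 100% training accuracy reflects the synthetic training set rather than real-world performance; real-world labeled data is needed to close the gap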
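
One way to locate weaknesses like these is to break accuracy down by hostname on a labeled set of URLs. The sketch below is a hedged illustration rather than part of this repository: `accuracy_by_host` is a hypothetical helper, `toy_predict` is a stand-in heuristic and not the real model, and the example URLs are placeholders.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def accuracy_by_host(examples, predict):
    """Return {hostname: accuracy} for (url, true_label) pairs.

    `predict` is any callable mapping a URL string to 0 (list) or 1 (detail).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for url, label in examples:
        host = urlsplit(url).hostname or "(no host)"
        total[host] += 1
        correct[host] += int(predict(url) == label)
    return {host: correct[host] / total[host] for host in total}


def toy_predict(url):
    # Stand-in for the real model: any URL with a query string is a list page.
    return 0 if urlsplit(url).query else 1


examples = [
    ("https://example.com/search?q=foo", 0),
    ("https://example.com/product/123", 1),
    ("https://example.org/ranking/all", 0),  # list page without a query string
]
print(accuracy_by_host(examples, toy_predict))
# {'example.com': 1.0, 'example.org': 0.0}
```

Swapping `toy_predict` for a wrapper around the actual classifier would show which domains pull overall accuracy down.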