---
language:
- en
- zh
license: mit
tags:
- url-classification
- binary-classification
- autoresearch
- multi-domain
metrics:
- accuracy
model_index:
- name: url-classifier-v2
  results:
  - task:
      type: text-classification
      name: URL Binary Classification (Multi-Domain)
    dataset:
      type: "synthetic-diverse (26 domains)"
      name: URL Classification Diverse Dataset
    metrics:
    - type: accuracy
      value: 1.0000
---
# URL Classifier v2 — Autoresearch (Multi-Domain)
Binary classifier that predicts whether a URL is a **list page (A)** or a **detail page (B)**.
Trained on **26 diverse domains** across e-commerce, recruitment, news, social, video, travel, education, and tech documentation, with the goal of generalizing better than the single-domain v1 model.
## Model Details
- **Architecture**: Custom transformer (Autoresearch framework)
- **Parameters**: ~161M
- **Depth**: 4 layers
- **Model dim**: 384
- **Vocab**: cl100k_base (100,277 tokens)
- **Max seq len**: 64
- **Training**: 30 minutes on an RTX 4060 Laptop GPU
- **Training samples**: 2,600 (A=1,300, B=1,300)
- **Training accuracy**: 100%
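
The vocabulary and maximum sequence length above imply that each URL is turned into at most 64 `cl100k_base` token IDs before it reaches the model. The following is a minimal sketch of that preparation step, not the repository's actual preprocessing: the function names, the right-truncation, the right-padding, and the `PAD_ID` value are all assumptions made for illustration.

```python
from typing import Callable, List

MAX_SEQ_LEN = 64  # taken from the model details above
PAD_ID = 0        # assumption: the real padding ID is not documented in this card


def to_fixed_length(
    url: str,
    encode: Callable[[str], List[int]],
    max_len: int = MAX_SEQ_LEN,
    pad_id: int = PAD_ID,
) -> List[int]:
    """Encode a URL, keep the first max_len tokens, and right-pad with pad_id."""
    ids = encode(url)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


def cl100k_encode(text: str) -> List[int]:
    """Encode text with tiktoken's cl100k_base (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the helper above has no hard dependency

    return tiktoken.get_encoding("cl100k_base").encode(text)
```

Calling `to_fixed_length(url, cl100k_encode)` would yield a list of exactly 64 IDs. Passing the encoder as an argument keeps the truncation and padding logic easy to test without downloading the tokenizer.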
## Supported Domains
| Category | Domains |
|----------|---------|
| E-commerce | Amazon, JD, Taobao, Tmall, Pinduoduo |
| Recruitment | Zhilian, BOSS, Lagou |
| News | Sina, NetEase, Tencent News, 36kr |
| Social | Zhihu, Douban, Xiaohongshu, Reddit |
| Video | YouTube, Bilibili |
| Travel | Ctrip, Qunar, Mafengwo |
| Education | icourse163, imooc |
| Tech Docs | GitHub, ReadTheDocs, MDN |
## Usage
```bash
pip install torch tiktoken
python src/infer.py "https://example.com/product/123" # expected: detail page (B)
python src/infer.py "https://example.com/search?q=foo" # expected: list page (A)
```
## Class Labels
| Label | Meaning |
|-------|---------|
| 0 (A) | List page — search results, category pages, rankings |
| 1 (B) | Detail page — product page, article, profile, video |
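
When consuming predictions in code, the table above can be kept as a small lookup. This is a sketch only: `LABELS` and `describe` are hypothetical names, and it assumes the model's output has already been reduced to a class index of 0 or 1.

```python
# Class index -> (letter, meaning), mirroring the Class Labels table.
LABELS = {
    0: ("A", "list page"),
    1: ("B", "detail page"),
}


def describe(class_index: int) -> str:
    """Return a human-readable label such as 'A (list page)'."""
    letter, meaning = LABELS[class_index]
    return f"{letter} ({meaning})"
```

An index outside the table raises `KeyError`, which surfaces a mismatched model output early instead of silently mislabeling it.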
## Limitations
- Bilibili ranking pages may be misclassified as detail pages
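- Accuracy may be lower for very short URLs and shortened URLs, which expose little or no path structure
- Third-party evaluation reports only ~55% accuracy, far below the 100% training accuracy; the model likely needs real-world labeled data to generalize beyond its synthetic training set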
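
Given the gap between training accuracy and third-party results, it can help to measure accuracy per host on your own labeled URLs before relying on the model. The sketch below assumes a hypothetical `predict` callable that returns 0 (list) or 1 (detail) for a URL; how you wrap the model to get one is up to you and is not specified by this card.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple
from urllib.parse import urlsplit


def accuracy_by_host(
    samples: Iterable[Tuple[str, int]],
    predict: Callable[[str], int],
) -> Dict[str, float]:
    """Group (url, label) pairs by hostname and return accuracy per host."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for url, label in samples:
        # URLs without a scheme have no parsed hostname; bucket them together.
        host = urlsplit(url).hostname or "(no host)"
        total[host] += 1
        correct[host] += int(predict(url) == label)
    return {host: correct[host] / total[host] for host in total}
```

Hosts with low accuracy, such as ranking pages on a site where list pages look like detail pages, are good candidates for collecting additional labeled examples.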