---
language:
- en
- zh
license: mit
tags:
- url-classification
- binary-classification
- autoresearch
- multi-domain
metrics:
- accuracy
model-index:
- name: url-classifier-v2
  results:
  - task:
      type: text-classification
      name: URL Binary Classification (Multi-Domain)
    dataset:
      type: "synthetic-diverse (26 domains)"
      name: URL Classification Diverse Dataset
    metrics:
    - type: accuracy
      value: 1.0000
---

# URL Classifier v2: Autoresearch (Multi-Domain)

Binary classifier that predicts whether a URL is a **list page (A)** or a **detail page (B)**.

Trained on **26 diverse domains** across e-commerce, recruitment, news, social, video, travel, education, and tech documentation, for significantly better generalization than the v1 single-domain model.

## Model Details

- **Architecture**: Custom transformer (Autoresearch framework)
- **Parameters**: ~161M
- **Depth**: 4 layers
- **Model dim**: 384
- **Vocab**: cl100k_base (100,277 tokens)
- **Max seq len**: 64 tokens
- **Training**: 30 min on an RTX 4060 Laptop GPU
- **Training samples**: 2,600 (A=1,300, B=1,300)
- **Training accuracy**: 100%

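A rough back-of-the-envelope sketch can help explain how a 4-layer model reaches a parameter count in this range. The vocabulary size, model dim, and depth come from the list above. The `12 * dim^2` per-layer estimate (standard attention plus a 4x MLP, ignoring biases and norms) and the number of vocabulary-sized tables are assumptions for illustration, not details confirmed by this card.

```python
# Rough parameter arithmetic from the figures listed in Model Details.
VOCAB = 100_277  # cl100k_base vocabulary size, from the card
DIM = 384        # model dim, from the card
DEPTH = 4        # number of layers, from the card


def embedding_table_params(vocab: int = VOCAB, dim: int = DIM) -> int:
    """Parameters in one vocab-by-dim table (for example a token embedding)."""
    return vocab * dim


def block_params(dim: int = DIM, depth: int = DEPTH) -> int:
    """ASSUMPTION: about 12 * dim^2 per layer (4*dim^2 attention + 8*dim^2 MLP)."""
    return 12 * dim * dim * depth


def total_params(num_vocab_tables: int) -> int:
    """Total under the ASSUMPTION that the model holds this many vocab-sized tables."""
    return num_vocab_tables * embedding_table_params() + block_params()


if __name__ == "__main__":
    print(f"one vocab-sized table: {embedding_table_params():,}")
    print(f"transformer layers   : {block_params():,}")
    for n in (1, 2, 4):
        print(f"{n} vocab-sized table(s) + layers: {total_params(n):,}")
```

Under these assumptions a single vocabulary-sized table (about 38.5M parameters) is several times larger than the four transformer layers combined (about 7.1M), so most of the parameter count probably sits in vocabulary-sized tables. Four such tables would land near 161M, though the card does not say how the model is actually laid out.
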
## Supported Domains

| Category | Domains |
|----------|---------|
| E-commerce | Amazon, JD, Taobao, Tmall, Pinduoduo |
| Recruitment | Zhilian, BOSS, Lagou |
| News | Sina, NetEase, Tencent News, 36kr |
| Social | Zhihu, Douban, Xiaohongshu, Reddit |
| Video | YouTube, Bilibili |
| Travel | Ctrip, Qunar, Mafengwo |
| Education | icourse163, imooc |
| Tech Docs | GitHub, ReadTheDocs, MDN |

## Usage

```bash
pip install torch tiktoken
python src/infer.py "https://example.com/product/123"   # detail page
python src/infer.py "https://example.com/search?q=foo"  # list page
```

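For callers who would rather drive the script from Python than from a shell, a minimal wrapper might look like the sketch below. `build_command` and `classify_raw` are hypothetical helper names, and the sketch assumes the script prints its prediction to stdout. The output format is not documented here, so the text is returned unparsed.

```python
import subprocess
import sys
from typing import List

INFER_SCRIPT = "src/infer.py"  # script path taken from the usage example


def build_command(url: str, script: str = INFER_SCRIPT) -> List[str]:
    """Build the argv list for one URL.

    Passing a list (no shell) avoids quoting problems with '?' and '&' in URLs.
    """
    return [sys.executable, script, url]


def classify_raw(url: str, script: str = INFER_SCRIPT) -> str:
    """Run the script once and return whatever it printed.

    ASSUMPTION: the prediction is written to stdout; its format is not parsed here.
    Raises subprocess.CalledProcessError if the script exits with a non-zero status.
    """
    result = subprocess.run(
        build_command(url, script),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Example (requires the repository's src/infer.py and its dependencies):
#   print(classify_raw("https://example.com/product/123"))
```

Each call starts a new Python process and presumably reloads the model, so this approach only suits small batches.
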
## Class Labels

| Label | Meaning |
|-------|---------|
| 0 (A) | List page: search results, category pages, rankings |
| 1 (B) | Detail page: product page, article, profile, video |

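The label table can be mirrored in code when turning raw model scores into a readable result. This is a minimal sketch, assuming the model yields two scores ordered as label 0 then label 1. `Prediction` and `decode` are illustrative names, not part of the repository.

```python
from typing import NamedTuple, Sequence


class Prediction(NamedTuple):
    label_id: int
    letter: str
    meaning: str


# Mapping taken from the Class Labels table.
LABELS = {0: ("A", "list page"), 1: ("B", "detail page")}


def decode(scores: Sequence[float]) -> Prediction:
    """Pick the higher-scoring class; a tie resolves to the lower label id.

    ASSUMPTION: `scores` holds exactly two values ordered as [label 0, label 1].
    """
    if len(scores) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(scores)}")
    label_id = max(range(len(scores)), key=lambda i: scores[i])
    letter, meaning = LABELS[label_id]
    return Prediction(label_id, letter, meaning)
```

For example, `decode([2.0, -1.0])` selects label 0 (A, list page) because the first score is larger.
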
## Limitations

- Bilibili ranking pages may be misclassified as detail pages.
- Accuracy may be lower on very short URLs and on links from URL shorteners.
- Third-party evaluation accuracy is ~55%, far below the 100% training accuracy and close to the 50% expected from random guessing on a balanced two-class set. Real-world labeled data is needed to improve it.

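When collecting real-world labeled URLs to check these limitations, breaking accuracy down by hostname shows which sites drag the overall number down. The sketch below is one possible approach. The `(url, true_label, predicted_label)` row layout and both function names are assumptions for illustration, not the method behind the third-party figure above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit

Row = Tuple[str, int, int]  # (url, true_label, predicted_label); hypothetical layout


def per_domain_accuracy(rows: Iterable[Row]) -> Dict[str, float]:
    """Group rows by hostname and return the fraction predicted correctly per host."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for url, truth, pred in rows:
        host = urlsplit(url).hostname or "(no host)"
        total[host] += 1
        correct[host] += int(truth == pred)
    return {host: correct[host] / total[host] for host in total}


def overall_accuracy(rows: Iterable[Row]) -> float:
    """Fraction of all rows where the prediction matches the true label."""
    materialized: List[Row] = list(rows)
    if not materialized:
        raise ValueError("no rows to evaluate")
    hits = sum(truth == pred for _, truth, pred in materialized)
    return hits / len(materialized)
```

A host that scores well below the others, such as a site whose ranking pages are consistently predicted as detail pages, is a natural candidate for extra labeled examples.
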