windlx/url-classifier

---
language:
- en
- zh
license: mit
tags:
- url-classification
- binary-classification
- autoresearch
- multi-domain
metrics:
- accuracy
model-index:
- name: url-classifier-v2
  results:
  - task:
      type: text-classification
      name: URL Binary Classification (Multi-Domain)
    dataset:
      type: "synthetic-diverse (26 domains)"
      name: URL Classification Diverse Dataset
    metrics:
    - type: accuracy
      value: 1.0000
---

# URL Classifier v2: Autoresearch (Multi-Domain)

Binary classifier that predicts whether a URL points to a list page (class A) or a detail page (class B).

Trained on 26 domains spanning e-commerce, recruitment, news, social, video, travel, education, and tech documentation: a broader training set than the single-domain v1 model, aimed at better generalization.

## Model Details

- Architecture: Custom transformer (Autoresearch framework)
- Parameters: ~161M
- Depth: 4 layers
- Model dim: 384
- Tokenizer vocabulary: cl100k_base (100,277 tokens)
- Max sequence length: 64 tokens
- Training time: 30 min on an RTX 4060 Laptop GPU
- Training samples: 2,600 (A=1,300, B=1,300)
- Training accuracy: 100%
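
The parameter count looks large for a 4-layer, 384-dimensional model, and a little arithmetic hints at why: with a 100,277-token vocabulary, every vocabulary-sized embedding table holds about 38.5M weights on its own. The sketch below does only that arithmetic. How many such tables the Autoresearch architecture actually uses is not stated in this card, and the `12 * dim^2` per-layer figure is the common rule of thumb for a standard transformer block, used here as an assumption rather than a description of this model.

```python
# Back-of-the-envelope parameter arithmetic from the figures in this card.
VOCAB_SIZE = 100_277           # cl100k_base size as listed above
MODEL_DIM = 384
DEPTH = 4
REPORTED_TOTAL = 161_000_000   # "~161M" as listed above


def embedding_table_params(vocab_size: int, dim: int) -> int:
    """Weights in one vocabulary-sized table (one row of `dim` per token)."""
    return vocab_size * dim


def approx_block_params(dim: int) -> int:
    """Rule-of-thumb size of one standard transformer block.

    Assumption: about 4*dim^2 for attention projections plus 8*dim^2 for an
    MLP with 4x expansion. The real block layout is not given in the card.
    """
    return 12 * dim * dim


table = embedding_table_params(VOCAB_SIZE, MODEL_DIM)
blocks = DEPTH * approx_block_params(MODEL_DIM)

print(f"one vocabulary-sized table: {table:,} parameters")
print(f"{DEPTH} transformer blocks (estimate): {blocks:,} parameters")
print(f"share of reported total in one table: {table / REPORTED_TOTAL:.1%}")
```

Under these assumptions the transformer blocks account for only a few million parameters, so most of the reported total would sit in vocabulary-sized tables rather than in the layers themselves.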

## Supported Domains

| Category | Domains |
|----------|---------|
| E-commerce | Amazon, JD, Taobao, Tmall, Pinduoduo |
| Recruitment | Zhilian, BOSS, Lagou |
| News | Sina, NetEase, Tencent News, 36kr |
| Social | Zhihu, Douban, Xiaohongshu, Reddit |
| Video | YouTube, Bilibili |
| Travel | Ctrip, Qunar, Mafengwo |
| Education | icourse163, imooc |
| Tech Docs | GitHub, ReadTheDocs, MDN |

## Usage

```bash
pip install torch tiktoken
python src/infer.py "https://example.com/product/123"   # detail page
python src/infer.py "https://example.com/search?q=foo"  # list page
```
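
`src/infer.py` is not reproduced in this card, so its input handling is not documented here. Given the 64-token sequence limit, one plausible preprocessing step is to clip or pad the token IDs of a URL to a fixed length. The helper below is a minimal sketch of that step only: `to_fixed_length`, the right-side truncation, and `PAD_ID = 0` are all hypothetical choices, not taken from the repository. The token IDs themselves would come from something like `tiktoken.get_encoding("cl100k_base").encode(url)`.

```python
from typing import List

MAX_SEQ_LEN = 64  # sequence limit listed in this card
PAD_ID = 0        # hypothetical padding ID; the real script may use another value


def to_fixed_length(
    token_ids: List[int],
    max_len: int = MAX_SEQ_LEN,
    pad_id: int = PAD_ID,
) -> List[int]:
    """Clip a token ID list to max_len, then right-pad it with pad_id.

    Assumption: the tail of an over-long URL is dropped and padding goes on
    the right. The actual policy in src/infer.py is not documented here.
    """
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


# Short input is padded, long input is clipped.
short_ids = to_fixed_length([5, 6, 7], max_len=5)
long_ids = to_fixed_length(list(range(100)))
print(short_ids)
print(len(long_ids))
```

Under this truncation policy, anything past the 64th token (for example a long trailing query string) would never reach the model.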

## Class Labels

| Label | Meaning |
|-------|---------|
| 0 (A) | List page: search results, category pages, rankings |
| 1 (B) | Detail page: product page, article, profile, video |
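
To make the label mapping concrete, here is a small sketch of how a pair of raw scores could be turned into one of the two classes above. It assumes the model emits exactly two logits ordered as `[A, B]`; the card does not describe the output head, so `decode` and `LABELS` are illustrative names rather than part of the repository.

```python
import math
from typing import List, Sequence, Tuple

# Index-to-name mapping taken from the table above.
LABELS = {0: "A (list page)", 1: "B (detail page)"}


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short sequence of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits: Sequence[float]) -> Tuple[int, str, float]:
    """Return (class index, class name, probability) for two logits.

    Assumption: logits are ordered [A, B]. Ties go to class 0.
    """
    if len(logits) != 2:
        raise ValueError("expected exactly two logits")
    probs = softmax(logits)
    index = 0 if probs[0] >= probs[1] else 1
    return index, LABELS[index], probs[index]


index, name, prob = decode([-1.0, 2.0])
print(index, name, round(prob, 3))
```

Returning the probability alongside the label makes it possible to flag low-confidence predictions instead of treating every output as equally reliable.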

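## Limitations

- Bilibili ranking pages may be misclassified as detail pages.
- Very short URLs and links from URL shorteners carry little path information, so accuracy on them may be lower.
- Third-party evaluation accuracy is about 55%, not far above the 50% chance level for a balanced two-class task. This suggests the 100% training accuracy reflects fit to the synthetic training data rather than real-world performance, and that training on real-world labeled data is the main route to improvement.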
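
Because the gap between training and third-party accuracy is so large, it may help to measure accuracy per site on held-out, hand-labeled URLs. The following is a minimal sketch of such a check. `accuracy_by_host` is a hypothetical helper, the sample URLs are placeholders, and `toy_predict` is a keyword stand-in used only to make the example runnable; it is not the model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple
from urllib.parse import urlparse


def accuracy_by_host(
    samples: Iterable[Tuple[str, int]],
    predict: Callable[[str], int],
) -> Dict[str, float]:
    """Group labeled URLs by host and return accuracy for each host.

    samples: (url, label) pairs with label 0 = list page, 1 = detail page.
    predict: any callable mapping a URL to 0 or 1.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for url, label in samples:
        host = urlparse(url).netloc.lower()
        total[host] += 1
        correct[host] += int(predict(url) == label)
    return {host: correct[host] / total[host] for host in total}


def toy_predict(url: str) -> int:
    """Placeholder classifier (NOT the model): keyword match on the URL."""
    return 0 if ("search" in url or "list" in url) else 1


# Placeholder data; real use would need hand-labeled URLs from live sites.
held_out = [
    ("https://example.com/search?q=foo", 0),
    ("https://example.com/product/123", 1),
    ("https://example.org/rank/weekly", 0),
    ("https://example.org/item/9", 1),
]

report = accuracy_by_host(held_out, toy_predict)
for host, acc in sorted(report.items()):
    print(f"{host}: {acc:.0%}")
```

Breaking results down by host would show whether errors cluster on particular sites, such as the Bilibili ranking pages mentioned above, or are spread evenly.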