windlx
/

url-classifier-model

Text Generation

url-classification

list-page-detection

detail-page-detection


windlx committed 19 days ago

Commit

c9bf374

verified

1 Parent(s): 34cf0b2

Upload README.md with huggingface_hub

Files changed (1)

README.md +70 -0

README.md ADDED

	@@ -0,0 +1,70 @@

+---
+license: mit
+tags:
+- url-classification
+- list-page-detection
+- detail-page-detection
+- qwen
+- fine-tuning
+- lora
+widget:
+- text: "https://example.com/product/12345"
+---
+# URL Page Type Classifier
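+A URL page-type classifier built on Qwen2.5-1.5B with LoRA fine-tuning. It predicts whether a URL points to a list page or a detail page.
+## Model Details
+- **Base model**: Qwen/Qwen2.5-1.5B
+- **Fine-tuning method**: LoRA
+- **Training data**: IowaCat/page_type_inference_dataset (10,000 URLs)
+- **Training hardware**: NVIDIA RTX 4060 Laptop GPU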
+## Features
+Classifies a URL as one of:
+- **List page** - e.g. `/products`, `/category`, `/search`
+- **Detail page** - e.g. `/product/12345`, `/item/abc`
+## Usage
+```python
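+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "windlx/url-classifier-model"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
+
+url = "https://example.com/product/12345"
+# The prompt is kept in Chinese, exactly as in the original example. It reads:
+# "Please determine whether the following URL is a list page or a detail page. URL: {url} Type: "
+prompt = f"请判断以下URL是列表页还是详情页。\n\nURL: {url}\n类型: "
+inputs = tokenizer(prompt, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=10)
+# Decode only the newly generated tokens so the prompt is not echoed back.
+new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+response = tokenizer.decode(new_tokens, skip_special_tokens=True)
+print(response)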
+```
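The usage example prints free-form generated text, so downstream code usually needs to turn it into a fixed label. The following is a minimal sketch of that step, not part of the published model card. The helper names `build_prompt` and `parse_label` are hypothetical, and the sketch assumes the model answers with the Chinese terms used in the prompt (`列表页` for list page, `详情页` for detail page) or their English equivalents; the actual output format is not documented in the README.

```python
def build_prompt(url: str) -> str:
    """Format a URL with the prompt template from the usage example."""
    return f"请判断以下URL是列表页还是详情页。\n\nURL: {url}\n类型: "


def parse_label(response: str) -> str:
    """Map generated text to 'list', 'detail', or 'unknown'.

    Works whether or not the decoded text still contains the prompt.
    Assumption: the answer mentions 列表/list or 详情/detail.
    """
    # Drop everything up to the first "类型:" marker, if the prompt was echoed.
    completion = response.split("类型:", 1)[-1]
    # Only look at the first non-empty line of the answer.
    lines = completion.strip().splitlines()
    first = lines[0].lower() if lines else ""
    has_list = "列表" in first or "list" in first
    has_detail = "详情" in first or "detail" in first
    if has_list == has_detail:
        # Neither or both labels present: refuse to guess.
        return "unknown"
    return "list" if has_list else "detail"
```

With the objects from the usage example, `parse_label(tokenizer.decode(outputs[0], skip_special_tokens=True))` would yield a label string; returning `"unknown"` for ambiguous output keeps malformed generations from being silently counted as one class.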
+## Test Results
+| Test set | Accuracy |
+|--------|--------|
+| Validation set (100 samples) | 99% |
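An accuracy of 99% on 100 samples corresponds to 99 correct predictions, and with a validation set this small the estimate carries noticeable uncertainty. As an illustration only (the README reports no interval), the sketch below computes a 95% Wilson score interval for that result; the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# 99% accuracy on the 100-sample validation set means 99 correct predictions.
low, high = wilson_interval(correct=99, total=100)
print(f"accuracy = {99 / 100:.0%}, 95% interval = [{low:.3f}, {high:.3f}]")
```

The interval spans roughly 94.6% to 99.8%, so the reported figure is best read as "high, but measured on few samples" rather than as a precise estimate.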
+## Training Configuration
+- Model: Qwen2.5-1.5B
+- LoRA rank: 16
+- Epochs: 3
+- Batch size: 2
+- Learning rate: 2e-4
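To make the configuration above concrete, here is a rough sketch that writes the listed values as keyword arguments and estimates the number of optimizer steps. The key names match parameters of `peft.LoraConfig` (`r`) and `transformers.TrainingArguments` (`num_train_epochs`, `per_device_train_batch_size`, `learning_rate`), but the README does not say which training script was used, so this mapping is an assumption. The step count further assumes that all 10,000 URLs were used for training on a single GPU with no gradient accumulation; the helper `optimizer_steps` is hypothetical.

```python
import math

# Values taken from the training configuration list.
lora_kwargs = {"r": 16}  # LoRA rank; alpha, dropout, target modules are not stated
training_kwargs = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "learning_rate": 2e-4,
}


def optimizer_steps(num_examples: int, batch_size: int, epochs: int, grad_accum: int = 1) -> int:
    """Estimate optimizer steps for a single-device run."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
    return steps_per_epoch * epochs


# Assumption: the full 10,000-URL dataset is the training split.
total_steps = optimizer_steps(
    num_examples=10_000,
    batch_size=training_kwargs["per_device_train_batch_size"],
    epochs=training_kwargs["num_train_epochs"],
)
print(total_steps)
```

Under those assumptions the run would take 15,000 optimizer steps; a held-out validation split or gradient accumulation would lower that number.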
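+## Limitations
+- Classifies the URL string only; it does not fetch or inspect the actual page content.
+- Accuracy may be lower for sites whose URL paths do not follow common conventions.
+## License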
+MIT License