windlx/url-classifier-model

@@ -1,34 +1,107 @@
 ---
 license: mit
 tags:
 - url-classification
 - list-page-detection
 - detail-page-detection
 - qwen
 - fine-tuning
-- lor
 widget:
 - text: "https://example.com/product/12345"
 ---
 # URL Page Type Classifier
-A URL type classifier based on Qwen2.5-1.5B + LoRA that determines whether a URL is a list page or a detail page.
-## Model Details
-- **Base model**: Qwen/Qwen2.5-1.5B
-- **Fine-tuning method**: LoRA
-- **Training data**: IowaCat/page_type_inference_dataset (10,000 URLs)
-- **Training environment**: NVIDIA RTX 4060 Laptop GPU
-## Features
-Classifies a URL as one of:
-- **List Page** - e.g. `/products`, `/category`, `/search`
-- **Detail Page** - e.g. `/product/12345`, `/item/abc`
-## Usage
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -37,34 +110,79 @@ model_name = "windlx/url-classifier-model"
 tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
 url = "https://example.com/product/12345"
-prompt = f"请判断以下URL是列表页还是详情页。\n\nURL: {url}\n类型: "
 inputs = tokenizer(prompt, return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=10)
 response = tokenizer.decode(outputs[0], skip_special_tokens=True)
-print(response)
 ```
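-## Test Results
-| Test set | Accuracy |
-|--------|--------|
-| 100-sample validation set | 99% |
-## Training Configuration
-- Model: Qwen2.5-1.5B
-- LoRA rank: 16
-- Epochs: 3
-- Batch size: 2
-- Learning rate: 2e-4
-## Limitations
-- Classifies URL strings only; does not fetch the actual page content
-- Accuracy may be lower for sites with irregular URL paths
-## License
-[LICENSE](LICENSE)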

 ---
 license: mit
+language:
+- zh
+- en
+datasets:
+- IowaCat/page_type_inference_dataset
+metrics:
+- accuracy: 0.99
+pipeline_tag: text-generation
 tags:
 - url-classification
 - list-page-detection
 - detail-page-detection
 - qwen
 - fine-tuning
+- lora
+- url-parser
 widget:
 - text: "https://example.com/product/12345"
+- text: "https://example.com/category/electronics"
 ---
 # URL Page Type Classifier
+<div align="center">
+![Model Size](https://img.shields.io/badge/Model%20Size-1.5B-blue)
+![License](https://img.shields.io/badge/License-MIT-green)
+![Accuracy](https://img.shields.io/badge/Accuracy-99%25-green)
+</div>
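+## 📋 Overview
+A URL type classifier fine-tuned from Qwen2.5-1.5B with LoRA that determines whether a URL is a list page or a detail page.
+## 🏗️ Model Architecture
+| Item | Details |
+|------|------|
+| **Base model** | Qwen/Qwen2.5-1.5B |
+| **Fine-tuning method** | LoRA (r=16, alpha=32) |
+| **Parameters** | 1.5B |
+| **Trainable parameters** | ~18M (1.18%) |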
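The trainable-parameter figure in the table can be sanity-checked with a little arithmetic. The sketch below assumes LoRA adapters of rank 16 on all seven linear projections of every decoder layer (the README does not say which modules were targeted) and uses the Qwen2.5-1.5B layer dimensions as I understand them from its published config; both are assumptions, not details confirmed by this card.

```python
# Rough count of LoRA parameters, assuming adapters on every linear
# projection of every decoder layer (target modules are an assumption).
HIDDEN = 1536          # hidden_size (assumed from the Qwen2.5-1.5B config)
INTERMEDIATE = 8960    # intermediate_size (assumed)
KV_DIM = 256           # 2 key/value heads x head_dim 128 (assumed)
NUM_LAYERS = 28        # num_hidden_layers (assumed)
RANK = 16              # LoRA rank stated in the model card

# (in_features, out_features) of each projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds A (rank x in) and B (out x rank) matrices."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in PROJECTIONS.values())
total_lora_params = per_layer * NUM_LAYERS
print(f"LoRA parameters: {total_lora_params:,}")
```

Under these assumptions the count comes to roughly 18.5M, in line with the "~18M" reported above; a different choice of target modules would give a different number.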
+## 📊 Training Data
+- **Dataset**: IowaCat/page_type_inference_dataset
+- **Training samples**: 10,000 URLs (5,000 list pages + 5,000 detail pages)
+- **Source**: HuggingFace Datasets
+### Data Distribution
+| Type | Count | Share |
+|------|------|------|
+| List Page | 5,000 | 50% |
+| Detail Page | 5,000 | 50% |
+## ⚙️ Training Configuration
+```python
+{
+    "base_model": "Qwen/Qwen2.5-1.5B",
+    "lora_rank": 16,
+    "lora_alpha": 32,
+    "lora_dropout": 0.05,
+    "num_train_epochs": 3,
+    "per_device_train_batch_size": 2,
+    "gradient_accumulation_steps": 8,
+    "learning_rate": 2e-4,
+    "fp16": True,
+    "optimizer": "adamw_torch",
+    "lr_scheduler_type": "cosine"
+}
+```
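To make the configuration above more concrete, here is a small worked calculation of the effective batch size and the number of optimizer steps it implies. It assumes a single GPU and that all 10,000 URLs are used for training; neither is stated explicitly in the config, so treat the step counts as an estimate.

```python
import math

# Values taken from the training configuration
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_train_epochs = 3

# Assumptions (not stated in the config)
num_devices = 1            # single-GPU training
num_train_samples = 10_000  # the whole dataset is used for training

# One optimizer step consumes this many samples
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = math.ceil(num_train_samples / effective_batch_size)
total_optimizer_steps = steps_per_epoch * num_train_epochs

print(f"effective batch size:  {effective_batch_size}")
print(f"steps per epoch:       {steps_per_epoch}")
print(f"total optimizer steps: {total_optimizer_steps}")
```

With these assumptions, the cosine schedule decays the learning rate from 2e-4 over about 1,875 optimizer steps.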
+## 📈 Evaluation
+### Test Results
+| Test set | Samples | Accuracy |
+|--------|--------|--------|
+| Validation set | 100 | **99%** |
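Because the validation set has only 100 samples, the 99% figure carries noticeable sampling uncertainty. As a rough illustration, the sketch below computes a 95% Wilson score interval for 99 correct predictions out of 100; the choice of interval is an editorial assumption and not part of the original evaluation.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(correct=99, total=100)
print(f"95% interval for accuracy: {low:.3f} to {high:.3f}")
```

The interval is several percentage points wide, so a larger held-out set would be needed to pin the accuracy down more tightly.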
+### Example Predictions
+| URL | Prediction |
+|-----|----------|
+| `https://example.com/products/category` | List Page |
+| `https://example.com/product/12345` | Detail Page |
+| `https://example.com/search?q=test` | List Page |
+| `https://example.com/item/abc123` | Detail Page |
+| `https://example.com/list/all` | List Page |
+## 🚀 Quick Start
+### Install Dependencies
+```bash
+pip install transformers peft torch
+```
+### Inference Code
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
+# URL to classify
 url = "https://example.com/product/12345"
+# Build the prompt (kept in Chinese, as in the original prompt)
+prompt = f"""请判断以下URL是列表页还是详情页。
+URL: {url}
+类型: """
+# Run inference
 inputs = tokenizer(prompt, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
 response = tokenizer.decode(outputs[0], skip_special_tokens=True)
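+# Extract the result from the generated text only: the decoded output
+# echoes the prompt, which itself contains both "列表页" and "详情页"
+completion = response.split("类型:", 1)[-1]
+if "详情页" in completion or "Detail Page" in completion:
+    result = "Detail Page"
+else:
+    result = "List Page"
+print(f"URL: {url}")
+print(f"Type: {result}")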
+```
+### Using a GPU
+```python
+# Place the model on a GPU automatically when one is available
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    trust_remote_code=True,
+    device_map="auto",
+    torch_dtype="auto"
+)
+```
+### Using the CPU
+```python
+# Force CPU execution
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    trust_remote_code=True,
+    device_map="cpu",
+    torch_dtype="float32"
+)
 ```
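+## ⚠️ Limitations
+1. **URL string only** - does not fetch the actual page content
+2. **Depends on regular URL paths** - accuracy may be lower for sites with irregular URL paths
+3. **Chinese and English only** - optimized mainly for Chinese URLs
+## 📝 Use Cases
+- 🔍 **Search engine optimization (SEO)** - identify a site's page structure
+- 🕷️ **Web crawling** - determine link types to optimize the crawl strategy
+- 📊 **Site analytics** - measure the ratio of list pages to detail pages
+- 🔗 **Link classification** - classify URLs at scale
+## 📁 Related Links
+- **GitHub repository**: https://github.com/xiuxiu/url-classifier
+- **HuggingFace model**: https://huggingface.co/windlx/url-classifier-model
+- **Training dataset**: https://huggingface.co/datasets/IowaCat/page_type_inference_dataset
+## 🙏 Acknowledgements
+- [Qwen](https://github.com/QwenLM/Qwen2) - base model
+- [LoRA](https://github.com/microsoft/LoRA) - efficient fine-tuning method
+- [HuggingFace](https://huggingface.co/) - model hosting platform
+## 📄 License
+[LICENSE](LICENSE) - MIT License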