---
license: mit
language:
  - zh
  - en
datasets:
  - IowaCat/page_type_inference_dataset
metrics:
  - accuracy: 0.99
pipeline_tag: text-generation
tags:
  - url-classification
  - list-page-detection
  - detail-page-detection
  - qwen
  - fine-tuning
  - lora
  - url-parser
widget:
  - text: "https://example.com/product/12345"
  - text: "https://example.com/category/electronics"
---

# URL Page Type Classifier
![Model Size](https://img.shields.io/badge/Model%20Size-1.5B-blue) ![License](https://img.shields.io/badge/License-MIT-green) ![Accuracy](https://img.shields.io/badge/Accuracy-99%25-green)
## 📋 Overview

A URL page-type classifier fine-tuned from Qwen2.5-1.5B with LoRA. Given only a URL string, it predicts whether the URL points to a list page or a detail page.

## 🏗️ Model Architecture

| Item | Details |
|------|---------|
| **Base model** | Qwen/Qwen2.5-1.5B |
| **Fine-tuning method** | LoRA (r=16, alpha=32) |
| **Parameters** | 1.5B |
| **Trainable parameters** | ~18M (1.18%) |

## 📊 Training Data

- **Dataset**: IowaCat/page_type_inference_dataset
- **Training samples**: 10,000 URLs (5,000 list pages + 5,000 detail pages)
- **Source**: HuggingFace Datasets

### Data Distribution

| Type | Count | Share |
|------|-------|-------|
| List Page | 5,000 | 50% |
| Detail Page | 5,000 | 50% |

## ⚙️ Training Configuration

```json
{
    "base_model": "Qwen/Qwen2.5-1.5B",
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "fp16": true,
    "optimizer": "adamw_torch",
    "lr_scheduler_type": "cosine"
}
```

## 📈 Evaluation

### Test Results

| Test Set | Samples | Accuracy |
|----------|---------|----------|
| Validation set | 100 | **99%** |

### Example Predictions

| URL | Prediction |
|-----|------------|
| `https://example.com/products/category` | List Page |
| `https://example.com/product/12345` | Detail Page |
| `https://example.com/search?q=test` | List Page |
| `https://example.com/item/abc123` | Detail Page |
| `https://example.com/list/all` | List Page |

## 🚀 Quick Start

### Install Dependencies

```bash
pip install transformers peft torch
```

### Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "windlx/url-classifier-model"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# URL to classify
url = "https://example.com/product/12345"

# Build the prompt (keep the Chinese prompt format shown here;
# it matches the examples this card was written for)
prompt = f"""请判断以下URL是列表页还是详情页。

URL: {url}

类型: """

# Generate greedily; the answer is only a few tokens long
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Extract the label from the generated text
if "详情页" in response or "Detail Page" in response:
    result = "Detail Page"
else:
    result = "List Page"

print(f"URL: {url}")
print(f"Type: {result}")
```

### GPU Usage

```python
# Place the model on GPU automatically
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto"
)
```

### CPU Usage

```python
# Force CPU execution
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="cpu",
    torch_dtype="float32"
)
```

## ⚠️ Limitations

1. **URL string only** - the model never fetches the actual page content.
2. **Relies on conventional URL paths** - accuracy may drop on sites with irregular URL structures.
3. **Chinese and English only** - primarily optimized for URLs from Chinese-language sites.

## 📝 Use Cases

- 🔍 **SEO** - map out a site's page structure
- 🕷️ **Web crawling** - classify outgoing links to guide the crawl strategy
- 📊 **Site analytics** - measure the ratio of list pages to detail pages
- 🔗 **Link classification** - batch-classify URLs at scale

## 📁 Links

- **GitHub repository**: https://github.com/xiuxiu/url-classifier
- **HuggingFace model**: https://huggingface.co/windlx/url-classifier-model
- **Training dataset**: https://huggingface.co/datasets/IowaCat/page_type_inference_dataset

## 🙏 Acknowledgements

- [Qwen](https://github.com/QwenLM/Qwen2) - base model
- [LoRA](https://github.com/microsoft/LoRA) - efficient fine-tuning method
- [HuggingFace](https://huggingface.co/) - model hosting

## 📄 License

[LICENSE](LICENSE) - MIT License
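## 🧩 Appendix: Batch Classification Sketch

The prompt-building and label-extraction steps from the inference snippet above can be factored into small reusable helpers, which also makes it easy to classify many URLs at once. This is a minimal sketch, not part of the repository: `build_prompt`, `parse_label`, `classify_urls`, and the injected `generate_fn` are illustrative names.

```python
def build_prompt(url: str) -> str:
    """Build the same Chinese prompt format used in the inference example."""
    return f"请判断以下URL是列表页还是详情页。\n\nURL: {url}\n\n类型: "


def parse_label(response: str) -> str:
    """Map the model's free-text output to one of the two labels."""
    if "详情页" in response or "Detail Page" in response:
        return "Detail Page"
    return "List Page"


def classify_urls(urls, generate_fn):
    """Classify many URLs; generate_fn maps a prompt string to model output text."""
    return {url: parse_label(generate_fn(build_prompt(url))) for url in urls}


# Demo with a stand-in generate_fn (replace with a real model.generate call):
fake_generate = lambda prompt: "类型: 详情页" if "/product/" in prompt else "类型: 列表页"
print(classify_urls(
    ["https://example.com/product/12345", "https://example.com/category/electronics"],
    fake_generate,
))
```

In real use, `generate_fn` would wrap tokenization, `model.generate`, and decoding exactly as in the inference snippet; injecting it as a parameter keeps the helpers testable without loading the 1.5B model.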