---
license: mit
language:
- zh
- en
datasets:
- IowaCat/page_type_inference_dataset
metrics:
- accuracy: 0.99
pipeline_tag: text-generation
tags:
- url-classification
- list-page-detection
- detail-page-detection
- qwen
- fine-tuning
- lora
- url-parser
widget:
- text: "https://example.com/product/12345"
- text: "https://example.com/category/electronics"
---

# URL Page Type Classifier
## 📋 Overview

A URL type classifier fine-tuned from Qwen2.5-1.5B with LoRA. Given only a URL string, it predicts whether the URL points to a list page or a detail page.

## 🏗️ Model Architecture

| Item | Details |
|------|---------|
| **Base model** | Qwen/Qwen2.5-1.5B |
| **Fine-tuning method** | LoRA (r=16, alpha=32) |
| **Parameters** | 1.5B |
| **Trainable parameters** | ~18M (1.18%) |

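The ~18M trainable-parameter figure can be reproduced with a short calculation. The sketch below assumes LoRA was applied to all seven linear projections in every transformer block (this card does not state the target modules) and uses layer dimensions from the public Qwen2.5-1.5B configuration: hidden size 1536, intermediate size 8960, 28 layers, and 2 key/value heads of dimension 128. The base parameter count of roughly 1.54B is also an approximation, and `lora_param_count` is a hypothetical helper name.

```python
# LoRA adds two low-rank matrices per adapted linear layer:
# A (r x in_features) and B (out_features x r), i.e. r * (in + out) parameters.
R = 16
HIDDEN, INTERMEDIATE, LAYERS = 1536, 8960, 28
KV_DIM = 2 * 128  # 2 key/value heads x head_dim 128 (grouped-query attention)

# (in_features, out_features) per projection. ASSUMPTION: all seven are adapted.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, projections, layers):
    """Total LoRA parameters across all layers for the given projections."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in projections.values())
    return per_layer * layers


lora_params = lora_param_count(R, PROJECTIONS, LAYERS)
BASE_PARAMS = 1.54e9  # approximate size of the base model (assumption)
share = lora_params / (BASE_PARAMS + lora_params)

print(f"LoRA parameters: {lora_params:,}")
print(f"Share of all parameters: {share:.2%}")
```

Under these assumptions the count comes to 18,464,768, about 1.18% of all parameters, which is consistent with the table above.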
## 📊 Training Data

- **Dataset**: IowaCat/page_type_inference_dataset
- **Training samples**: 10,000 URLs (5,000 list pages + 5,000 detail pages)
- **Source**: HuggingFace Datasets

### Data Distribution

| Type | Count | Share |
|------|-------|-------|
| List Page (列表页) | 5,000 | 50% |
| Detail Page (详情页) | 5,000 | 50% |

## ⚙️ Training Configuration

```python
{
    "base_model": "Qwen/Qwen2.5-1.5B",
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "fp16": True,
    "optimizer": "adamw_torch",
    "lr_scheduler_type": "cosine"
}
```

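To make the configuration concrete, the following sketch derives the effective batch size, the number of optimizer steps, and the learning rate at a given step. It assumes a single training device, that all 10,000 samples were used for training, and a half-cosine decay to zero with no warmup; none of these is stated in this card, and `cosine_lr` is a hypothetical helper rather than the training script's code.

```python
import math

NUM_SAMPLES = 10_000
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
EPOCHS = 3
BASE_LR = 2e-4
NUM_DEVICES = 1  # ASSUMPTION: single GPU

# Samples consumed per optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = math.ceil(NUM_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def cosine_lr(step, total, base_lr=BASE_LR):
    """Half-cosine decay from base_lr to 0 (ASSUMPTION: no warmup)."""
    progress = min(step / total, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
print(f"learning rate after epoch 1: {cosine_lr(steps_per_epoch, total_steps):.2e}")
```

With these assumptions each update sees 16 samples, giving 625 steps per epoch and 1,875 steps overall.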
## 📈 Evaluation

### Test Results

| Split | Samples | Accuracy |
|-------|---------|----------|
| Validation set | 100 | **99%** |

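Because the validation set has only 100 samples, the 99% figure carries noticeable statistical uncertainty. As a rough illustration, the sketch below computes a 95% Wilson score interval, assuming 99 of the 100 predictions were correct; `wilson_interval` is a hypothetical helper, not part of the evaluation code.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: 99% accuracy on 100 samples means 99 correct predictions.
low, high = wilson_interval(99, 100)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

The interval spans roughly 0.946 to 0.998, so a larger held-out set would be needed to pin the accuracy down more tightly.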
### Example Predictions

| URL | Prediction |
|-----|------------|
| `https://example.com/products/category` | List Page (列表页) |
| `https://example.com/product/12345` | Detail Page (详情页) |
| `https://example.com/search?q=test` | List Page (列表页) |
| `https://example.com/item/abc123` | Detail Page (详情页) |
| `https://example.com/list/all` | List Page (列表页) |

## 🚀 Quick Start

### Install Dependencies

```bash
pip install transformers peft torch accelerate
```

### Inference Code

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "windlx/url-classifier-model"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# URL to classify
url = "https://example.com/product/12345"

# Build the prompt (kept in Chinese, exactly as given in this card)
prompt = f"""请判断以下URL是列表页还是详情页。

URL: {url}
类型: """

# Run greedy decoding
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)

# Decode only the newly generated tokens. The prompt itself contains both
# "列表页" and "详情页", so matching on the full text would always hit.
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)

# Extract the label
if "详情页" in response or "Detail Page" in response:
    result = "Detail Page (详情页)"
else:
    result = "List Page (列表页)"

print(f"URL: {url}")
print(f"Type: {result}")
```

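When classifying many URLs, it can help to separate prompt construction and label parsing from the model call so that both can be tested without loading the model. The sketch below is one possible arrangement: `build_prompt` and `parse_label` are hypothetical helper names, and the short label strings `"list"` and `"detail"` are an illustrative choice rather than part of the model's output format.

```python
from typing import Optional

# Prompt format copied from the inference example in this card.
PROMPT_TEMPLATE = "请判断以下URL是列表页还是详情页。\n\nURL: {url}\n类型: "

# Substrings that identify each label in the generated text.
MARKERS = (
    ("detail", ("详情页", "Detail Page")),
    ("list", ("列表页", "List Page")),
)


def build_prompt(url: str) -> str:
    """Fill the URL into the classification prompt."""
    return PROMPT_TEMPLATE.format(url=url)


def parse_label(completion: str) -> Optional[str]:
    """Return the label whose marker appears first in the completion.

    Expects only the newly generated text (not the prompt) and returns
    None when no marker is found, instead of silently defaulting.
    """
    hits = []
    for label, markers in MARKERS:
        positions = [completion.find(m) for m in markers if m in completion]
        if positions:
            hits.append((min(positions), label))
    return min(hits)[1] if hits else None


print(parse_label("详情页 (Detail Page)"))
print(parse_label("unexpected output"))
```

Returning `None` for unrecognized output lets the caller decide how to handle it, whereas the `else` branch in the inference example treats anything that is not a detail page as a list page.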
### Using a GPU

```python
# Place the model on the GPU automatically.
# Reuses model_name from the inference example; device_map requires the
# accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto"
)
```

### Using the CPU

```python
# Force CPU execution in full precision.
# Reuses model_name from the inference example.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="cpu",
    torch_dtype="float32"
)
```

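## ⚠️ Limitations

1. **URL string only** - The model does not fetch or inspect the actual page content.
2. **Depends on URL path conventions** - Accuracy may be lower for sites whose URL paths do not follow common conventions.
3. **Chinese and English only** - The model is optimized mainly for Chinese URLs.
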
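## 📝 Use Cases

- 🔍 **Search engine optimization (SEO)** - Identify a site's page structure
- 🕷️ **Web crawling** - Determine link types to optimize the crawl strategy
- 📊 **Site analysis** - Measure the ratio of list pages to detail pages
- 🔗 **Link classification** - Classify URLs at scale
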
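As a rough illustration of the crawling and site-analysis use cases, the sketch below routes URLs into per-type queues and reports the share of each type. All names here are hypothetical, and `stub_classify` is a trivial stand-in heuristic used only so the example runs without the model; in practice it would be replaced by a function that calls the classifier.

```python
from typing import Callable, Dict, Iterable, List


def route_urls(urls: Iterable[str], classify: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group URLs into 'list' and 'detail' queues using the given classifier."""
    queues: Dict[str, List[str]] = {"list": [], "detail": []}
    for url in urls:
        queues[classify(url)].append(url)
    return queues


def page_type_ratio(queues: Dict[str, List[str]]) -> Dict[str, float]:
    """Share of URLs in each queue; empty dict when there are no URLs."""
    total = sum(len(items) for items in queues.values())
    if total == 0:
        return {}
    return {label: len(items) / total for label, items in queues.items()}


def stub_classify(url: str) -> str:
    """STAND-IN for the model: 'detail' if the last path segment has a digit."""
    last_segment = url.rstrip("/").rsplit("/", 1)[-1]
    return "detail" if any(ch.isdigit() for ch in last_segment) else "list"


urls = [
    "https://example.com/product/12345",
    "https://example.com/category/electronics",
    "https://example.com/item/abc123",
    "https://example.com/list/all",
]
queues = route_urls(urls, stub_classify)
ratio = page_type_ratio(queues)
print(queues)
print(ratio)
```

A crawler could, for example, expand the `list` queue to discover more links and schedule the `detail` queue for content extraction.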
## 📁 Related Links

- **GitHub repository**: https://github.com/xiuxiu/url-classifier
- **HuggingFace model**: https://huggingface.co/windlx/url-classifier-model
- **Training dataset**: https://huggingface.co/datasets/IowaCat/page_type_inference_dataset

## 🙏 Acknowledgements

- [Qwen](https://github.com/QwenLM/Qwen2) - Base model
- [LoRA](https://github.com/microsoft/LoRA) - Efficient fine-tuning method
- [HuggingFace](https://huggingface.co/) - Model hosting platform

## 📄 License

[LICENSE](LICENSE) - MIT License