---
license: mit
language:
  - zh
  - en
datasets:
  - IowaCat/page_type_inference_dataset
metrics:
  - accuracy: 0.99
pipeline_tag: text-generation
tags:
  - url-classification
  - list-page-detection
  - detail-page-detection
  - qwen
  - fine-tuning
  - lora
  - url-parser
widget:
  - text: "https://example.com/product/12345"
  - text: "https://example.com/category/electronics"
---

# URL Page Type Classifier
![Model Size](https://img.shields.io/badge/Model%20Size-1.5B-blue) ![License](https://img.shields.io/badge/License-MIT-green) ![Accuracy](https://img.shields.io/badge/Accuracy-99%25-green)
## 📋 Overview

A URL page-type classifier fine-tuned from Qwen2.5-1.5B with LoRA. Given only a URL string, it predicts whether the URL points to a list page or a detail page.

## 🏗️ Model Architecture

| Item | Details |
|------|---------|
| **Base model** | Qwen/Qwen2.5-1.5B |
| **Fine-tuning method** | LoRA (r=16, alpha=32) |
| **Parameters** | 1.5B |
| **Trainable parameters** | ~18M (1.18%) |

## 📊 Training Data

- **Dataset**: IowaCat/page_type_inference_dataset
- **Training samples**: 10,000 URLs (5,000 list pages + 5,000 detail pages)
- **Source**: HuggingFace Datasets

### Data Distribution

| Type | Count | Share |
|------|-------|-------|
| List Page | 5,000 | 50% |
| Detail Page | 5,000 | 50% |

## ⚙️ Training Configuration

```json
{
    "base_model": "Qwen/Qwen2.5-1.5B",
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "fp16": true,
    "optimizer": "adamw_torch",
    "lr_scheduler_type": "cosine"
}
```

## 📈 Evaluation

### Test Results

| Test Set | Samples | Accuracy |
|----------|---------|----------|
| Validation set | 100 | **99%** |

### Example Predictions

| URL | Prediction |
|-----|------------|
| `https://example.com/products/category` | List Page |
| `https://example.com/product/12345` | Detail Page |
| `https://example.com/search?q=test` | List Page |
| `https://example.com/item/abc123` | Detail Page |
| `https://example.com/list/all` | List Page |

## 🚀 Quick Start

### Install Dependencies

```bash
pip install transformers peft torch
```

### Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "windlx/url-classifier-model"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# URL to classify
url = "https://example.com/product/12345"

# Build the prompt (keep the Chinese prompt format shown here;
# it matches the examples this card was written for)
prompt = f"""请判断以下URL是列表页还是详情页。

URL: {url}

类型: """

# Generate greedily; the answer is only a few tokens long
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Extract the label from the generated text
if "详情页" in response or "Detail Page" in response:
    result = "Detail Page"
else:
    result = "List Page"

print(f"URL: {url}")
print(f"Type: {result}")
```

### GPU Usage

```python
# Place the model on GPU automatically
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto"
)
```

### CPU Usage

```python
# Force CPU execution
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="cpu",
    torch_dtype="float32"
)
```

## ⚠️ Limitations

1. **URL string only** - the model never fetches the actual page content.
2. **Relies on conventional URL paths** - accuracy may drop on sites with irregular URL structures.
3. **Chinese and English only** - primarily optimized for URLs from Chinese-language sites.

## 📝 Use Cases

- 🔍 **SEO** - map out a site's page structure
- 🕷️ **Web crawling** - classify outgoing links to guide the crawl strategy
- 📊 **Site analytics** - measure the ratio of list pages to detail pages
- 🔗 **Link classification** - batch-classify URLs at scale

## 📁 Links

- **GitHub repository**: https://github.com/xiuxiu/url-classifier
- **HuggingFace model**: https://huggingface.co/windlx/url-classifier-model
- **Training dataset**: https://huggingface.co/datasets/IowaCat/page_type_inference_dataset

## 🙏 Acknowledgements

- [Qwen](https://github.com/QwenLM/Qwen2) - base model
- [LoRA](https://github.com/microsoft/LoRA) - efficient fine-tuning method
- [HuggingFace](https://huggingface.co/) - model hosting

## 📄 License

[LICENSE](LICENSE) - MIT License
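## 🧩 Appendix: Batch Classification Sketch

The prompt-building and label-extraction steps from the inference snippet above can be factored into small reusable helpers, which also makes it easy to classify many URLs at once. This is a minimal sketch, not part of the repository: `build_prompt`, `parse_label`, `classify_urls`, and the injected `generate_fn` are illustrative names.

```python
def build_prompt(url: str) -> str:
    """Build the same Chinese prompt format used in the inference example."""
    return f"请判断以下URL是列表页还是详情页。\n\nURL: {url}\n\n类型: "


def parse_label(response: str) -> str:
    """Map the model's free-text output to one of the two labels."""
    if "详情页" in response or "Detail Page" in response:
        return "Detail Page"
    return "List Page"


def classify_urls(urls, generate_fn):
    """Classify many URLs; generate_fn maps a prompt string to model output text."""
    return {url: parse_label(generate_fn(build_prompt(url))) for url in urls}


# Demo with a stand-in generate_fn (replace with a real model.generate call):
fake_generate = lambda prompt: "类型: 详情页" if "/product/" in prompt else "类型: 列表页"
print(classify_urls(
    ["https://example.com/product/12345", "https://example.com/category/electronics"],
    fake_generate,
))
```

In real use, `generate_fn` would wrap tokenization, `model.generate`, and decoding exactly as in the inference snippet; injecting it as a parameter keeps the helpers testable without loading the 1.5B model.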