windlx committed · Commit c9bf374 (verified) · 1 Parent(s): 34cf0b2

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +70 -0
README.md ADDED
@@ -0,0 +1,70 @@
+ ---
+ license: mit
+ tags:
+ - url-classification
+ - list-page-detection
+ - detail-page-detection
+ - qwen
+ - fine-tuning
+ - lora
+ widget:
+ - text: "https://example.com/product/12345"
+ ---
+
+ # URL Page Type Classifier
+
+ A URL type classifier based on Qwen2.5-1.5B + LoRA that predicts whether a URL points to a list page or a detail page.
+
+ ## Model Details
+
+ - **Base model**: Qwen/Qwen2.5-1.5B
+ - **Fine-tuning method**: LoRA
+ - **Training data**: IowaCat/page_type_inference_dataset (10,000 URLs)
+ - **Training environment**: NVIDIA RTX 4060 Laptop GPU
+
+ ## Functionality
+
+ Classifies a URL as one of:
+ - **List Page** - e.g. `/products`, `/category`, `/search`
+ - **Detail Page** - e.g. `/product/12345`, `/item/abc`
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load the fine-tuned model and its tokenizer from the Hub
+ model_name = "windlx/url-classifier-model"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
+
+ url = "https://example.com/product/12345"
+ # The prompt stays in Chinese, as in the original README. In English:
+ # "Determine whether the following URL is a list page or a detail page. URL: ... Type: "
+ prompt = f"请判断以下URL是列表页还是详情页。\n\nURL: {url}\n类型: "
+
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=10)
+ # The decoded text contains the prompt followed by the generated answer
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
+
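The usage snippet prints the full decoded text, which for a causal language model includes the prompt as well as the answer. The sketch below shows one way to reduce that text to a label. It assumes the model answers with the same wording the prompt uses (`列表页` for list page, `详情页` for detail page); the README does not state the exact output format, so `LABELS` and the helper `extract_label` are hypothetical names introduced for illustration.

```python
from typing import Optional

# Assumed answer strings, taken from the wording of the prompt.
# The README does not document the model's exact output format.
LABELS = {"列表页": "list", "详情页": "detail"}


def extract_label(response: str, marker: str = "类型:") -> Optional[str]:
    """Return "list", "detail", or None for a decoded generation.

    Only the text after the last occurrence of `marker` is inspected,
    because the prompt itself mentions both label strings.
    """
    completion = response.rsplit(marker, 1)[-1]
    # Collect (position, label) pairs for every label found in the completion
    hits = [(completion.find(zh), en) for zh, en in LABELS.items() if zh in completion]
    # The earliest label wins; None means no known label was generated
    return min(hits)[1] if hits else None


prompt = "请判断以下URL是列表页还是详情页。\n\nURL: https://example.com/product/12345\n类型: "
print(extract_label(prompt + "详情页"))
```

Returning `None` for unrecognized output lets the caller decide how to handle generations that drift from the expected wording.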
+ ## Test Results
+
+ | Test set | Accuracy |
+ |--------|--------|
+ | 100-sample validation set | 99% |
+
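With only 100 validation samples, the reported accuracy carries noticeable sampling uncertainty. As a rough illustration, the sketch below computes a 95% Wilson score interval, assuming the 99% figure means exactly 99 of 100 URLs were classified correctly and that the samples are independent. The function name `wilson_interval` is chosen here for illustration and is not part of the original project.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumption: 99% on a 100-sample set means 99 correct predictions
low, high = wilson_interval(correct=99, total=100)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the lower bound lands in the mid-90s, so the result is consistent with a true accuracy somewhat below 99%.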
+ ## Training Configuration
+
+ - Model: Qwen2.5-1.5B
+ - LoRA rank: 16
+ - Epochs: 3
+ - Batch size: 2
+ - Learning rate: 2e-4
+
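To give a sense of scale for the configuration above, the following sketch estimates the number of optimizer steps. It assumes all 10,000 URLs are used for training, a single GPU, and no gradient accumulation; none of these details are stated in the README, and `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int,
                          grad_accum: int = 1) -> int:
    """Estimate optimizer steps for a simple single-device training loop."""
    # Number of batches per epoch; the last batch may be smaller
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer step per `grad_accum` batches
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# Values from the README: 10,000 URLs, batch size 2, 3 epochs
print(total_optimizer_steps(num_examples=10_000, batch_size=2, epochs=3))  # 15000
```

If the validation samples were held out of the 10,000 URLs or gradient accumulation was used, the actual step count would be lower.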
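+ ## Limitations
+
+ - Classifies the URL string only; it does not fetch the actual page content
+ - Accuracy may be lower for sites whose URL paths do not follow common conventions
+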
+ ## License
+
+ MIT License