SayedShaun committed on
Commit
c8c8e84
·
verified ·
1 Parent(s): bbb02ee

Update README.md

Files changed (1)
  1. README.md +22 -29
README.md CHANGED
@@ -1,43 +1,36 @@
- ---
- language: en
- license: apache-2.0
- tags:
- - distilbert
- - fine-tuned
- - url-classification
- - text-classification
- - nlp
- - web-mining
- - seo
- datasets:
- - ruggsea/infini-news-corpus
- pipeline_tag: text-classification
- ---
 
- # 🌐 URL Content vs Section Classifier (Fine-tuned DistilBERT)

- This model is a **fine-tuned version of DistilBERT** designed to classify web URLs into two categories:

- - **content** → A specific page such as an article, blog post, or news story
- - **section** → A category page, listing page, or homepage/navigation page

- It is optimized for **web crawling, URL filtering, and content extraction pipelines**.
 
  ---

- # 🧠 Model Description
-
- This model is fine-tuned from:

- 👉 **:contentReference[oaicite:0]{index=0}**

- DistilBERT is a compressed version of BERT that retains strong language understanding capabilities while being lightweight and fast.

- In this project, it is adapted to learn **structural patterns in URLs** rather than natural language sentences.
 
  ---

- # 🚀 Task Definition

- ## Input
- A raw URL string:
 
+ ---

+ ## 📊 Training Data

+ The model was trained on a large-scale news URL dataset:

+ - Source: Infini News Corpus
+ - Dataset: ruggsea/infini-news-corpus
+ - URLs extracted from real-world news websites
 
  ---

+ ## 🏷️ Labeling Strategy

+ Since manual labeling is expensive, weak supervision rules were used:

+ ### Content pages:
+ - Deep URL paths
+ - Article-like slugs
+ - Presence of IDs or long titles
+ - News/story patterns

+ ### Section pages:
+ - Short paths
+ - Category URLs
+ - Homepage or listing pages
+ - Trailing slash URLs
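The README lists the weak supervision rules only in prose, so the following is a minimal sketch of how such rules could be turned into a labeling function. Everything in it is an assumption made for illustration: the function name `weak_label`, the vote-counting scheme, the thresholds, the regular expressions, and the `example.com` URLs are hypothetical and are not taken from the author's labeling code.

```python
import re
from urllib.parse import urlparse

# All thresholds and patterns below are illustrative assumptions,
# not the rules actually used to label the training data.
MIN_CONTENT_DEPTH = 3   # "deep URL paths"
MIN_SLUG_WORDS = 4      # "article-like slugs" / "long titles"
ID_PATTERN = re.compile(r"\d{5,}")  # "presence of IDs"
STORY_PATTERN = re.compile(r"/(news|story|article)s?/", re.IGNORECASE)


def weak_label(url: str) -> str:
    """Return 'content' or 'section' by counting simple URL heuristics."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]

    # Homepage (empty path): treat as a section page.
    if not segments:
        return "section"

    last = segments[-1]
    slug_words = [w for w in re.split(r"[-_]", last) if w]

    content_votes = 0
    section_votes = 0

    # Content-page signals.
    if len(segments) >= MIN_CONTENT_DEPTH:
        content_votes += 1
    if len(slug_words) >= MIN_SLUG_WORDS:
        content_votes += 1
    if ID_PATTERN.search(last):
        content_votes += 1
    if STORY_PATTERN.search(path) and len(segments) > 1:
        content_votes += 1

    # Section-page signals.
    if len(segments) <= 2:
        section_votes += 1
    if path.endswith("/"):
        section_votes += 1

    # Ties fall back to 'section' (an arbitrary choice in this sketch).
    return "content" if content_votes > section_votes else "section"


if __name__ == "__main__":
    for example in (
        "https://example.com/sports/",
        "https://example.com/news/local-team-wins-championship-final",
    ):
        print(example, weak_label(example))
```

Counting votes rather than applying the rules in a fixed order keeps each heuristic independent, which makes it easier to adjust or drop one rule when its labels turn out to be noisy.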
 
  ---

+ ## ⚙️ Usage

+ ### Install dependencies
+ ```bash
+ pip install transformers torch