
---

## 📊 Training Data

The model was trained on a large-scale news URL dataset:

- Source: Infini News Corpus
- Dataset: ruggsea/infini-news-corpus
- URLs extracted from real-world news websites

---

## 🏷️ Labeling Strategy

Because manual labeling is expensive, labels were assigned with weak supervision rules based on URL structure:

### Content pages:
- Deep URL paths
- Article-like slugs
- Presence of IDs or long titles
- News/story patterns

### Section pages:
- Short paths
- Category URLs
- Homepage or listing pages
- Trailing slash URLs
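
The rules above could be sketched as a small labeling function. This is only an illustration, not the code used to build the training labels: the function name `weak_label`, the thresholds, and the `STORY_MARKERS` set are assumptions, since the exact rules and cut-offs are not given here.

```python
import re
from urllib.parse import urlparse

# Hypothetical values chosen for illustration; the real thresholds are not documented.
MIN_CONTENT_DEPTH = 3      # "deep URL paths"
MIN_SLUG_WORDS = 4         # "article-like slugs" / "long titles"
STORY_MARKERS = {"news", "story", "article", "articles"}  # "news/story patterns"


def weak_label(url: str) -> str:
    """Return "content" or "section" using simple URL heuristics."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]

    # Homepage: no path segments at all.
    if not segments:
        return "section"

    last = segments[-1]
    slug_words = [w for w in re.split(r"[-_]", last) if w]

    # Presence of a numeric ID (5+ digits, so a bare year does not count).
    has_id = re.search(r"\d{5,}", last) is not None
    # Long, hyphenated slug that looks like an article title.
    has_long_slug = len(slug_words) >= MIN_SLUG_WORDS
    # A news/story marker somewhere before the final segment.
    has_story_pattern = any(s.lower() in STORY_MARKERS for s in segments[:-1])

    if has_id or has_long_slug:
        return "content"
    if has_story_pattern and len(segments) >= MIN_CONTENT_DEPTH:
        return "content"
    # Deep path without a trailing slash.
    if len(segments) >= MIN_CONTENT_DEPTH and not path.endswith("/"):
        return "content"

    # Short paths, category URLs, listings, and trailing-slash URLs.
    return "section"
```

Labels produced this way are noisy by design, so they are best treated as training signal rather than ground truth.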

---

## ⚙️ Usage

### Install dependencies
```bash
pip install transformers torch