---
## 📊 Training Data
The model was trained on a large-scale news URL dataset:
- Source: Infini News Corpus
- Dataset: ruggsea/infini-news-corpus
- URLs extracted from real-world news websites
---
## 🏷️ Labeling Strategy
Because manual labeling is expensive, labels were assigned with weak supervision rules based on URL structure:
### Content pages:
- Deep URL paths
- Article-like slugs
- Presence of IDs or long titles
- News/story patterns
### Section pages:
- Short paths
- Category URLs
- Homepage or listing pages
- URLs ending in a trailing slash
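
The rules above could be sketched as a simple voting function. This is only an illustration of the idea, not the labeling code used for this model: the function name `weak_label`, the thresholds, the regular expressions, and the category keywords are all assumptions chosen for the example.

```python
import re
from urllib.parse import urlparse

# All thresholds and patterns below are illustrative assumptions,
# not the actual rules used to label the training data.
MIN_CONTENT_DEPTH = 3          # "deep URL paths"
MIN_SLUG_WORDS = 4             # "article-like slugs" / "long titles"
ID_PATTERN = re.compile(r"\d{5,}")  # "presence of IDs"
STORY_PATTERN = re.compile(r"/(news|story|article)s?/", re.IGNORECASE)
CATEGORY_SEGMENTS = {"category", "section", "topics", "tag", "tags"}


def weak_label(url: str) -> str:
    """Return "content" or "section" for a URL using heuristic votes."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]

    # Homepage: no path segments at all.
    if not segments:
        return "section"

    # Split the last segment into words on hyphens and underscores.
    slug_words = [w for w in re.split(r"[-_]", segments[-1]) if w]

    content_votes = 0
    section_votes = 0

    # Content-page signals.
    if len(segments) >= MIN_CONTENT_DEPTH:
        content_votes += 1
    if len(slug_words) >= MIN_SLUG_WORDS:
        content_votes += 1
    if ID_PATTERN.search(path):
        content_votes += 1
    if STORY_PATTERN.search(path):
        content_votes += 1

    # Section-page signals.
    if len(segments) <= 2:
        section_votes += 1
    if any(s.lower() in CATEGORY_SEGMENTS for s in segments):
        section_votes += 1
    if path.endswith("/"):
        section_votes += 1

    # Ties fall back to "section" in this sketch.
    return "content" if content_votes > section_votes else "section"


if __name__ == "__main__":
    for example in (
        "https://example.com/",
        "https://example.com/category/politics/",
        "https://example.com/news/2023/05/city-council-approves-new-budget-1234567",
    ):
        print(example, weak_label(example))
```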
---
## ⚙️ Usage
### Install dependencies
```bash
pip install transformers torch