---
## 📊 Training Data
The model was trained on a large-scale news URL dataset:
- Source: Infini News Corpus
- Dataset: ruggsea/infini-news-corpus
- URLs extracted from real-world news websites
---
## 🏷️ Labeling Strategy
Because manual labeling is expensive, labels were assigned with weak supervision rules based on URL structure:
### Content pages:
- Deep URL paths
- Article-like slugs
- Presence of IDs or long titles
- News/story patterns
### Section pages:
- Short paths
- Category URLs
- Homepage or listing pages
- Trailing slash URLs
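
The rules above can be sketched as a small scoring function. This is only an illustration of the idea, not the labeling code actually used for this model: the function name `weak_label`, the thresholds, and the regular expressions are all assumptions, since the README does not state them.

```python
import re
from urllib.parse import urlparse

# Hypothetical thresholds and patterns; the README does not specify them.
MIN_CONTENT_DEPTH = 3    # "deep URL paths"
LONG_SLUG_WORDS = 4      # "article-like slugs" / "long titles"
ID_PATTERN = re.compile(r"\d{4,}")  # "presence of IDs"
STORY_PATTERN = re.compile(r"/(news|story|stories|article|articles)/", re.IGNORECASE)


def weak_label(url: str) -> str:
    """Return 'content' or 'section' from simple URL heuristics."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]

    # Homepage: no path segments at all.
    if not segments:
        return "section"

    # Split the last segment into slug words on '-' and '_'.
    slug_words = [w for w in re.split(r"[-_]", segments[-1]) if w]

    # Count signals that suggest an article page.
    score = 0
    if len(segments) >= MIN_CONTENT_DEPTH:
        score += 1
    if len(slug_words) >= LONG_SLUG_WORDS:
        score += 1
    if ID_PATTERN.search(path):
        score += 1
    if STORY_PATTERN.search(path):
        score += 1

    # A trailing slash is treated as a section signal.
    if path.endswith("/"):
        score -= 1

    # Assumed decision rule: at least two net content signals.
    return "content" if score >= 2 else "section"


if __name__ == "__main__":
    for u in [
        "https://example.com/",
        "https://example.com/world/",
        "https://example.com/news/2023/05/12/city-council-approves-new-budget-12345",
    ]:
        print(u, "->", weak_label(u))
```

Counting signals instead of matching a single rule keeps one weak cue (for example a `/news/` prefix on a listing page) from deciding the label on its own. Labels produced this way are noisy by design and serve as training targets, not ground truth.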
---
## ⚙️ Usage
### Install dependencies
```bash
pip install transformers torch