The model was trained on a large-scale news URL dataset:
- Source: Infini News Corpus
- Dataset: ruggsea/infini-news-corpus
- URLs extracted from real-world news websites
🏷️ Labeling Strategy
Because manual labeling is expensive, labels were assigned with weak supervision rules based on URL structure:
Content pages:
- Deep URL paths
- Article-like slugs
- Presence of IDs or long titles
- News/story patterns
Section pages:
- Short paths
- Category URLs
- Homepage or listing pages
- Trailing slash URLs
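
The rules above could be combined into a simple voting labeler. The following is only a minimal sketch of that idea, not the actual labeling code used for this model: the function name `weak_label`, the thresholds, the regular expressions, and the tie-breaking choice are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical thresholds; the real values used for labeling are not documented.
MIN_CONTENT_DEPTH = 3   # "deep URL path" = at least this many path segments
MIN_SLUG_WORDS = 4      # "long title" = at least this many words in the last segment

# Assumed patterns: a run of 5+ digits counts as an ID, and a news/story/article
# segment followed by more path counts as a news/story pattern.
ID_PATTERN = re.compile(r"\d{5,}")
NEWS_PATTERN = re.compile(r"/(news|story|article)s?/[^/]+", re.IGNORECASE)


def weak_label(url: str) -> str:
    """Return 'content' or 'section' by letting each heuristic cast a vote."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]

    # Homepage: no path segments at all.
    if not segments:
        return "section"

    slug_words = [w for w in re.split(r"[-_]", segments[-1]) if w]
    content_votes = 0
    section_votes = 0

    if len(segments) >= MIN_CONTENT_DEPTH:
        content_votes += 1   # deep URL path
    else:
        section_votes += 1   # short path
    if len(slug_words) >= MIN_SLUG_WORDS:
        content_votes += 1   # article-like slug / long title
    if ID_PATTERN.search(path):
        content_votes += 1   # presence of an ID
    if NEWS_PATTERN.search(path):
        content_votes += 1   # news/story pattern
    if path.endswith("/"):
        section_votes += 1   # trailing slash

    # Ties fall back to 'section' (an arbitrary choice for this sketch).
    return "content" if content_votes > section_votes else "section"


if __name__ == "__main__":
    for example in (
        "https://example.com/",
        "https://example.com/politics/",
        "https://example.com/news/2024/05/senate-passes-new-budget-bill-1234567",
    ):
        print(example, "->", weak_label(example))
```

Voting keeps any single rule from deciding the label on its own, which matters because signals such as a `/news/` segment appear in both article and category URLs.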
⚙️ Usage
Install the dependencies:
pip install transformers torch