---

## 📊 Training Data

The model was trained on a large-scale news URL dataset:

- Source: Infini News Corpus
- Dataset: ruggsea/infini-news-corpus
- URLs extracted from real-world news websites

---

## 🏷️ Labeling Strategy

Since manual labeling is expensive, weak supervision rules were used:

### Content pages:

- Deep URL paths
- Article-like slugs
- Presence of IDs or long titles
- News/story patterns

### Section pages:

- Short paths
- Category URLs
- Homepage or listing pages
- Trailing slash URLs

---

## ⚙️ Usage

### Install dependencies

```bash
pip install transformers torch