The model was trained on a large-scale news URL dataset:
  - Source: Infini News Corpus
  - Dataset: ruggsea/infini-news-corpus
  - URLs extracted from real-world news websites

🏷️ Labeling Strategy

Because manual labeling is expensive, labels were assigned with weak supervision rules based on URL structure:

Content pages:

  • Deep URL paths
  • Article-like slugs
  • Presence of IDs or long titles
  • News/story patterns

Section pages:

  • Short paths
  • Category URLs
  • Homepage or listing pages
  • Trailing slash URLs
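
The rules above can be sketched as a small scoring function. This is only an illustration of the idea, not the labeling code actually used for this model: the function name `weak_label`, the thresholds, and the marker words are all assumptions, since the exact rules and cutoffs are not stated here.

```python
import re
from urllib.parse import urlparse

# Hypothetical thresholds and markers; the real values are not documented.
MIN_CONTENT_DEPTH = 3          # "deep" path = at least this many segments
MIN_SLUG_WORDS = 4             # "article-like" slug = at least this many words
STORY_MARKERS = {"news", "story", "article"}


def weak_label(url: str) -> str:
    """Return "content" or "section" from simple URL-structure signals."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]

    # Homepage: no path segments at all.
    if not segments:
        return "section"

    # Strip a file extension such as ".html" before splitting the slug.
    slug = segments[-1].rsplit(".", 1)[0]
    slug_words = [w for w in re.split(r"[-_]", slug) if w]

    # Signals that suggest a content (article) page.
    content_score = sum([
        len(segments) >= MIN_CONTENT_DEPTH,                     # deep path
        len(slug_words) >= MIN_SLUG_WORDS,                      # long title slug
        re.search(r"\d{4,}", path) is not None,                 # numeric ID or year
        any(s.lower() in STORY_MARKERS for s in segments[:-1]), # news/story pattern
    ])

    # Signals that suggest a section (listing) page.
    section_score = sum([
        len(segments) <= 2,      # short path
        path.endswith("/"),      # trailing slash
    ])

    return "content" if content_score > section_score else "section"


if __name__ == "__main__":
    for example in (
        "https://example.com/",
        "https://example.com/sports/",
        "https://example.com/news/2023/05/city-council-approves-new-budget-12345.html",
    ):
        print(example, "->", weak_label(example))
```

Labels produced this way are noisy by design, so ties are resolved in favor of "section" here; that tie-break is another assumption of the sketch.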

⚙️ Usage

Install dependencies:

pip install transformers torch