|
|
| --- |
| |
| |
|
|
| The model was trained on a large-scale news URL dataset: |
|
|
| - Source: Infini News Corpus |
| - Dataset: ruggsea/infini-news-corpus |
| - URLs extracted from real-world news websites |
|
|
| --- |
| |
| ## 🏷️ Labeling Strategy |
|
|
| Because manual labeling is expensive, labels were assigned with weak-supervision rules: |
|
|
| ### Content pages: |
| - Deep URL paths |
| - Article-like slugs |
| - Presence of IDs or long titles |
| - News/story patterns |
|
|
| ### Section pages: |
| - Short paths |
| - Category URLs |
| - Homepage or listing pages |
| - URLs ending in a trailing slash |
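
The rules above could be approximated by a small heuristic labeler. The sketch below is an illustration only, not the project's actual labeling code: the function name `weak_label`, the thresholds, the regular expressions, and the category list are all assumptions chosen to mirror the bullet points.

```python
import re
from urllib.parse import urlparse

# Hypothetical thresholds and patterns (assumptions, not taken from the project).
MIN_CONTENT_DEPTH = 3   # "deep URL paths"
MAX_SECTION_DEPTH = 2   # "short paths"
MIN_SLUG_WORDS = 4      # "article-like slugs" / "long titles"
ID_PATTERN = re.compile(r"\d{5,}")  # "presence of IDs"
STORY_PATTERN = re.compile(r"/(news|story|stories|article|articles)/")
CATEGORY_NAMES = {"world", "politics", "sports", "business", "tech", "opinion"}


def weak_label(url: str) -> str:
    """Return 'content' or 'section' by counting heuristic votes."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]

    # A URL with no path segments is treated as a homepage.
    if not segments:
        return "section"

    last = segments[-1]
    slug_words = [w for w in re.split(r"[-_]", last) if w]

    content_votes = sum([
        len(segments) >= MIN_CONTENT_DEPTH,
        len(slug_words) >= MIN_SLUG_WORDS,
        bool(ID_PATTERN.search(last)),
        bool(STORY_PATTERN.search(path)),
    ])
    section_votes = sum([
        len(segments) <= MAX_SECTION_DEPTH,
        path.endswith("/"),
        last.lower() in CATEGORY_NAMES,
    ])

    # Ties go to 'content'; this tie-break is an arbitrary choice.
    return "content" if content_votes >= section_votes else "section"


if __name__ == "__main__":
    for example in (
        "https://example.com/",
        "https://example.com/world/",
        "https://example.com/news/2023/05/12/city-council-approves-new-budget-plan",
    ):
        print(example, "->", weak_label(example))
```

Labels produced this way are noisy by design, so a vote count keeps each rule easy to inspect and adjust when spot checks show a rule misfiring.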
|
|
| --- |
|
|
| ## ⚙️ Usage |
|
|
| ### Install dependencies |
| ```bash |
| pip install transformers torch |