Update README.md
README.md
CHANGED
@@ -10,4 +10,44 @@ tags:
---
# Sports Text Classifier
## Overview
This Sports Text Classifier is a component of the OnlySports Dataset creation pipeline. It identifies and extracts sports-related documents from a large corpus of web content.
## Model Architecture
- Base model: [Snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs)
- Additional layer: Binary classification layer
- Training: 10 epochs with a learning rate of 3e-4
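
The architecture listed above can be sketched roughly as follows. This is a minimal sketch, not the authors' training code: it assumes the binary layer is the stock sequence-classification head from the `transformers` library, and the names `TrainingConfig` and `build_classifier` are hypothetical. Only the base model name, the epoch count, and the learning rate come from this README.

```python
from dataclasses import dataclass

BASE_MODEL = "Snowflake/snowflake-arctic-embed-xs"


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters stated in this README."""

    epochs: int = 10
    learning_rate: float = 3e-4
    num_labels: int = 2  # binary: sports vs. non-sports


def build_classifier(config: TrainingConfig = TrainingConfig()):
    """Load the base encoder with a freshly initialized classification head.

    Assumption: the head is the default one added by
    AutoModelForSequenceClassification; the authors' layer may differ.
    """
    # Imported lazily so the config can be inspected without installing
    # transformers or downloading model weights.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(
        BASE_MODEL, num_labels=config.num_labels
    )
    return tokenizer, model
```

Calling `build_classifier()` downloads the base model and returns a tokenizer and an untrained two-label model that would still need fine-tuning on the data described below.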
## Performance
The classifier distinguishes sports from non-sports documents with high accuracy, as the figure below shows:

## Training Data
The classifier was trained on roughly 100k labeled documents covering sports and non-sports content:
- 64k samples from seven prestigious sports websites
- 36k non-sports text documents labeled using GPT-3.5
## Usage
This classifier is primarily used in the creation of the OnlySports Dataset. It can be applied to filter large text corpora for sports-related content with high accuracy.
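
As an illustration of the filtering use case above, the sketch below keeps only documents whose sports score meets a threshold. The `filter_sports` helper, the `0.5` threshold, and the keyword-based `toy_score` function are all hypothetical; in practice `score_fn` would wrap the trained classifier and return its sports probability.

```python
from typing import Callable, Iterable, Iterator


def filter_sports(
    documents: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield documents whose sports score meets the threshold."""
    for doc in documents:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer, NOT the real classifier."""
    sports_words = {"goal", "match", "league", "playoffs"}
    words = text.lower().split()
    # Count words that match the toy vocabulary, ignoring punctuation.
    hits = sum(word.strip(".,!?") in sports_words for word in words)
    return min(1.0, hits / 2)


corpus = [
    "The league match ended with a late goal.",
    "Quarterly earnings rose again.",
]
sports_docs = list(filter_sports(corpus, toy_score))
```

Because `filter_sports` is a generator, it can stream over a corpus that does not fit in memory.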
## Integration
The classifier is integrated into a MapReduce architecture for efficient processing of large-scale datasets. It's used in conjunction with URL keyword filtering to create a comprehensive sports text dataset.
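
A rough sketch of how a map and reduce pass with URL keyword filtering might look is given below. Everything in it is an assumption for illustration: the keyword list, the record layout (`url` and `text` keys), the function names, and the choice to require both the URL filter and the classifier to agree. The actual pipeline may combine the two filters differently.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlparse

Record = Dict[str, str]

# Hypothetical keyword list; the pipeline's real keywords are not given here.
URL_KEYWORDS = ("sports", "nba", "nfl", "soccer")


def url_matches(url: str) -> bool:
    """Return True if the URL's host or path contains a sports keyword."""
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    return any(keyword in haystack for keyword in URL_KEYWORDS)


def map_shard(shard: List[Record], is_sports: Callable[[str], bool]) -> List[Record]:
    """Map step: keep records that pass the URL filter and the classifier."""
    return [
        rec for rec in shard
        if url_matches(rec["url"]) and is_sports(rec["text"])
    ]


def reduce_shards(mapped: Iterable[List[Record]]) -> List[Record]:
    """Reduce step: concatenate the per-shard results into one list."""
    return reduce(lambda acc, part: acc + part, mapped, [])
```

Each shard can be mapped independently, so the map step parallelizes across workers while the reduce step only merges the already filtered results.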
## Related Projects
This classifier is part of the larger OnlySports collection, which includes:
- [OnlySports Dataset](https://huggingface.co/collections/Chrisneverdie/onlysports-66b3e5cf595eb81220cc27a6)
- [OnlySportsLM](https://huggingface.co/Chrisneverdie/OnlySportsLM_196M)
For more information, visit our [GitHub repository](https://github.com/chrischenhub/OnlySportsLM).