Update README.md
README.md
CHANGED
@@ -156,39 +156,42 @@ else:
  ## Training Data

  - **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
- - **Labeled samples:** 138,523 (
  - **Train/Val/Test split:** 110,819 / 13,852 / 13,852
- - **Class distribution:** 85% non-articles, 10% full articles, 4% errors, 1% partial

  ## Model Details

  - **Algorithm:** XGBoost (GPU-trained)
- - **Features:** 26 hand-crafted features (HTML structure,
  - **Training:** 500 boosting rounds with early stopping
  - **Hardware:** Single GPU (CUDA)
  - **Training time:** ~6 minutes

  ### Features Used

- - Content: length, whitespace ratio, digit
- -
- -
- -

  ## Limitations

- - Only analyzes first 64KB of HTML (
- -
- - `other_failure`
- - Optimized for English-language web pages

  ## Intended Use

  **Primary use cases:**
  - Quality control for article extraction pipelines
- - Monitoring extraction API health
- -
- - Analytics on extraction success rates

  **Not suitable for:**
  - Language detection
@@ -198,4 +201,4 @@ else:

  ## Model Card Authors

- Allanatrix

@@ -156,39 +156,42 @@ else:
  ## Training Data

  - **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
+ - **Labeled samples:** 138,523 (LLM-labeled)
+ - **Labeling method:** Distillation from large language models
+ - **Primary teacher:** GPT-5
+ - **Secondary / adjudicator:** Qwen
  - **Train/Val/Test split:** 110,819 / 13,852 / 13,852
+ - **Class distribution:** ~85% non-articles, ~10% full articles, ~4% errors, ~1% partial articles
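
The split sizes listed above are consistent with an 80/10/10 partition of the 138,523 labeled samples. The sketch below is only an arithmetic check under that assumption: the README does not state the ratios or how the split was drawn, and `split_sizes` is a hypothetical helper, not part of the project.

```python
def split_sizes(n_samples: int, holdout_fraction: float = 0.10) -> tuple[int, int, int]:
    """Return (train, val, test) sizes with equal-sized val and test holdouts.

    The 10% holdout fraction is an assumption inferred from the published counts.
    """
    holdout = int(n_samples * holdout_fraction)  # truncates toward zero
    train = n_samples - 2 * holdout              # remainder goes to training
    return train, holdout, holdout


train, val, test = split_sizes(138_523)
print(train, val, test)  # 110819 13852 13852
```

The three sizes sum back to 138,523, so no labeled sample is left outside the split.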

  ## Model Details

  - **Algorithm:** XGBoost (GPU-trained)
+ - **Features:** 26 hand-crafted features (HTML structure, keyword counts, density metrics)
  - **Training:** 500 boosting rounds with early stopping
  - **Hardware:** Single GPU (CUDA)
  - **Training time:** ~6 minutes
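
As a rough illustration of the training setup listed above, the sketch below uses the XGBoost native Python API with 500 boosting rounds and early stopping on a validation set. Everything beyond those two facts is an assumption: the objective, the evaluation metric, the patience of 50 rounds, the `num_class` argument, and the `device="cuda"` setting (which needs XGBoost 2.0 or later) are illustrative choices, not the author's actual configuration.

```python
NUM_BOOST_ROUNDS = 500       # from the README
EARLY_STOPPING_ROUNDS = 50   # assumed patience; not stated in the README


def build_params(num_class: int) -> dict:
    """Assumed multi-class parameters for GPU training."""
    return {
        "objective": "multi:softprob",  # assumption: per-class probabilities
        "num_class": num_class,         # class count is not stated in the README
        "tree_method": "hist",
        "device": "cuda",               # assumption; requires XGBoost >= 2.0
        "eval_metric": "mlogloss",      # assumption
    }


def train(X_train, y_train, X_val, y_val, num_class: int):
    """Train a booster, stopping early when validation loss stops improving."""
    import xgboost as xgb  # imported lazily so the config can be inspected without it

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    return xgb.train(
        build_params(num_class),
        dtrain,
        num_boost_round=NUM_BOOST_ROUNDS,
        evals=[(dval, "val")],
        early_stopping_rounds=EARLY_STOPPING_ROUNDS,
    )
```

With early stopping enabled, 500 acts as an upper bound on the number of rounds rather than a fixed count.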

  ### Features Used

+ - **Content statistics:** length, whitespace ratio, digit and punctuation ratios
+ - **Keyword signals:** error messages, article indicators, navigation text
+ - **HTML structure:** paragraph, link, heading, script, style, and nav tag counts
+ - **Density metrics:** links/KB, paragraphs/KB, scripts/KB, heading score
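
To make the four feature groups above concrete, here is a minimal sketch of how such features could be computed with the Python standard library. It is an illustrative subset, not the actual 26 features: the keyword lists, the feature names, and the regex-based tag counting are all assumptions, and the heading score is left out because the README does not define it.

```python
import re
import string

# Hypothetical keyword lists; the real vocabulary is not given in the README.
ERROR_KEYWORDS = ("404", "not found", "access denied")
ARTICLE_KEYWORDS = ("published", "author", "min read")
NAV_KEYWORDS = ("home", "login", "subscribe")

TAGS = ("p", "a", "h1", "h2", "h3", "script", "style", "nav")


def count_tag(html_lower: str, tag: str) -> int:
    """Count opening tags such as <p> or <p class="x"> (regex-based, approximate)."""
    return len(re.findall(rf"<{tag}(?=[\s>/])", html_lower))


def extract_features(html: str) -> dict[str, float]:
    n = max(len(html), 1)  # guard against division by zero on empty input
    kb = n / 1024
    lower = html.lower()

    feats: dict[str, float] = {
        # Content statistics
        "length": float(len(html)),
        "whitespace_ratio": sum(c.isspace() for c in html) / n,
        "digit_ratio": sum(c.isdigit() for c in html) / n,
        "punct_ratio": sum(c in string.punctuation for c in html) / n,
        # Keyword signals
        "error_kw": float(sum(lower.count(k) for k in ERROR_KEYWORDS)),
        "article_kw": float(sum(lower.count(k) for k in ARTICLE_KEYWORDS)),
        "nav_kw": float(sum(lower.count(k) for k in NAV_KEYWORDS)),
    }
    # HTML structure: one count per tag of interest
    for tag in TAGS:
        feats[f"{tag}_count"] = float(count_tag(lower, tag))
    # Density metrics: counts normalized by document size in KB
    feats["links_per_kb"] = feats["a_count"] / kb
    feats["paragraphs_per_kb"] = feats["p_count"] / kb
    feats["scripts_per_kb"] = feats["script_count"] / kb
    return feats
```

Normalizing counts by size keeps a long page and a short page comparable, which is presumably why the density metrics sit alongside the raw tag counts.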

  ## Limitations

+ - Only analyzes the first 64KB of HTML (important metadata must appear early)
+ - Labels are generated by LLMs rather than direct human annotation
+ - Some classes (e.g. `other_failure`) have limited representation
+ - Optimized primarily for English-language web pages
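
The 64KB limit in the first item can be pictured with the short sketch below. It assumes "64KB" means 65,536 bytes and that the cut is made on raw bytes before decoding; the README specifies neither, and `head_of_page` is a hypothetical name.

```python
MAX_BYTES = 64 * 1024  # 65,536 bytes; assumes "64KB" means 64 KiB


def head_of_page(raw: bytes, encoding: str = "utf-8") -> str:
    """Return the decoded first 64 KB of a page."""
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    return raw[:MAX_BYTES].decode(encoding, errors="ignore")


# A page whose metadata only appears after 64 KB of padding.
page = b"<html><head>" + b" " * MAX_BYTES + b"<meta name='author' content='x'>"
visible = head_of_page(page)
print("<meta" in visible)  # False: the tag starts after the 64 KB cutoff
```

Anything past the cutoff never reaches the feature computation, which is why the README notes that important metadata must appear early.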

  ## Intended Use

  **Primary use cases:**
  - Quality control for article extraction pipelines
+ - Monitoring extraction API health and failure modes
+ - Fast filtering of non-article pages before downstream processing
+ - Analytics on extraction success and failure rates

  **Not suitable for:**
  - Language detection
@@ -198,4 +201,4 @@ else:

  ## Model Card Authors

+ Allanatrix