Allanatrix committed on
Commit 70cfc92 · verified · 1 Parent(s): b358acd

Update README.md

Files changed (1)
  1. README.md +18 -15
README.md CHANGED
@@ -156,39 +156,42 @@ else:
  ## Training Data

  - **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
- - **Labeled samples:** 138,523 (weak labels from heuristics)
  - **Train/Val/Test split:** 110,819 / 13,852 / 13,852
- - **Class distribution:** 85% non-articles, 10% full articles, 4% errors, 1% partial
 
  ## Model Details

  - **Algorithm:** XGBoost (GPU-trained)
- - **Features:** 26 hand-crafted features (HTML structure, keywords, densities)
  - **Training:** 500 boosting rounds with early stopping
  - **Hardware:** Single GPU (CUDA)
  - **Training time:** ~6 minutes
 
  ### Features Used

- - Content: length, whitespace ratio, digit/punct ratios
- - Keywords: error messages, article indicators, navigation text
- - Structure: paragraph, link, heading, script tag counts
- - Densities: links/KB, paragraphs/KB, scripts/KB
 
  ## Limitations

- - Only analyzes first 64KB of HTML (meta tags must appear early)
- - Trained on weak labels (heuristic-based, not human-annotated)
- - `other_failure` class has minimal representation in training data
- - Optimized for English-language web pages
 
  ## Intended Use

  **Primary use cases:**
  - Quality control for article extraction pipelines
- - Monitoring extraction API health (error detection)
- - Filtering non-article pages before processing
- - Analytics on extraction success rates

  **Not suitable for:**
  - Language detection
@@ -198,4 +201,4 @@ else:

  ## Model Card Authors

- Allanatrix
 
  ## Training Data

  - **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
+ - **Labeled samples:** 138,523 (LLM-labeled)
+ - **Labeling method:** Distillation from large language models
+ - **Primary teacher:** GPT-5
+ - **Secondary / adjudicator:** Qwen
  - **Train/Val/Test split:** 110,819 / 13,852 / 13,852
+ - **Class distribution:** ~85% non-articles, ~10% full articles, ~4% errors, ~1% partial articles
 
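The split sizes listed under Training Data are consistent with an 80/10/10 partition of the 138,523 labeled samples. The README does not state the ratio, so the percentages below are an inference from the counts, and `split_sizes` is a hypothetical helper used only for this check.

```python
def split_sizes(n_samples: int, val_pct: int = 10, test_pct: int = 10) -> tuple:
    """Return (train, val, test) sizes; train absorbs the rounding remainder."""
    # Integer arithmetic avoids floating-point surprises on large counts.
    n_val = n_samples * val_pct // 100
    n_test = n_samples * test_pct // 100
    n_train = n_samples - n_val - n_test
    return n_train, n_val, n_test


# Assumed 80/10/10 ratio, applied to the labeled-sample count from the README.
print(split_sizes(138_523))  # (110819, 13852, 13852)
```

The three sizes sum back to 138,523, which matches the labeled-sample count.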
  ## Model Details

  - **Algorithm:** XGBoost (GPU-trained)
+ - **Features:** 26 hand-crafted features (HTML structure, keyword counts, density metrics)
  - **Training:** 500 boosting rounds with early stopping
  - **Hardware:** Single GPU (CUDA)
  - **Training time:** ~6 minutes
 
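A minimal sketch of what a training setup matching the Model Details above might look like with the XGBoost Python API. Only the 500 boosting rounds, early stopping, and CUDA come from the README; the objective, metric, early-stopping patience, and the helper names `make_params` and `train` are assumptions made for illustration, and the number of classes is left as a parameter because the README does not list the full label set.

```python
NUM_BOOST_ROUND = 500        # from the README
EARLY_STOPPING_ROUNDS = 50   # assumption: the patience value is not stated


def make_params(num_class: int) -> dict:
    """Hypothetical multi-class parameters for a GPU run."""
    return {
        "objective": "multi:softprob",  # assumption: per-class probabilities
        "num_class": num_class,
        "eval_metric": "mlogloss",      # assumption
        "tree_method": "hist",
        "device": "cuda",               # XGBoost >= 2.0; older versions used tree_method="gpu_hist"
    }


def train(X_train, y_train, X_val, y_val, num_class: int):
    """Train with early stopping monitored on the validation split."""
    import xgboost as xgb  # imported lazily so the parameters can be inspected without it

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    return xgb.train(
        make_params(num_class),
        dtrain,
        num_boost_round=NUM_BOOST_ROUND,
        evals=[(dval, "val")],
        early_stopping_rounds=EARLY_STOPPING_ROUNDS,
    )
```

With early stopping, training can end before round 500 if the validation metric stops improving, so 500 acts as an upper bound on the number of rounds.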
  ### Features Used

+ - **Content statistics:** length, whitespace ratio, digit and punctuation ratios
+ - **Keyword signals:** error messages, article indicators, navigation text
+ - **HTML structure:** paragraph, link, heading, script, style, and nav tag counts
+ - **Density metrics:** links/KB, paragraphs/KB, scripts/KB, heading score
 
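To make the feature groups above concrete, here is a small sketch that computes a handful of them with the standard library only. It is not the model's actual extractor: the feature names, the regex-based tag counting, and measuring "KB" in characters rather than bytes are all assumptions, and only a subset of the 26 features is shown.

```python
import re
import string


def count_tag(html: str, tag: str) -> int:
    """Count opening tags such as <p> or <p class="x">, case-insensitively."""
    return len(re.findall(rf"<{tag}\b", html, flags=re.IGNORECASE))


def extract_features(html: str) -> dict:
    """Compute a few illustrative content, structure, and density features."""
    n = max(len(html), 1)   # guard against division by zero on empty input
    kb = n / 1024           # assumption: size measured in characters
    p_count = count_tag(html, "p")
    link_count = count_tag(html, "a")
    script_count = count_tag(html, "script")
    heading_count = sum(count_tag(html, f"h{i}") for i in range(1, 7))
    return {
        "length": len(html),
        "whitespace_ratio": sum(c.isspace() for c in html) / n,
        "digit_ratio": sum(c.isdigit() for c in html) / n,
        "punct_ratio": sum(c in string.punctuation for c in html) / n,
        "p_count": p_count,
        "link_count": link_count,
        "heading_count": heading_count,
        "script_count": script_count,
        "links_per_kb": link_count / kb,
        "paragraphs_per_kb": p_count / kb,
        "scripts_per_kb": script_count / kb,
    }


sample = "<html><body><h1>T</h1><p>a 1</p><p>b</p><a href='#'>x</a><script></script></body></html>"
features = extract_features(sample)
print(features["p_count"], features["link_count"], features["heading_count"])  # 2 1 1
```

A regex count is a cheap approximation that fits a fast pre-filter; a real HTML parser would be more robust to malformed markup at the cost of speed.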
  ## Limitations

+ - Only analyzes the first 64KB of HTML (important metadata must appear early)
+ - Labels are generated by LLMs rather than direct human annotation
+ - Some classes (e.g. `other_failure`) have limited representation
+ - Optimized primarily for English-language web pages
 
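The 64KB limitation above can be illustrated with a short sketch. It assumes that "64KB" means 65,536 bytes and that truncation happens on raw bytes before decoding; `head_of_page` and the `og:type` meta tag are hypothetical stand-ins, not the model's actual preprocessing.

```python
MAX_BYTES = 64 * 1024  # assumption: "64KB" means 65,536 bytes


def head_of_page(raw: bytes, limit: int = MAX_BYTES) -> str:
    """Keep only the first `limit` bytes of a page, then decode."""
    # errors="ignore" drops a multi-byte character that the cut may have split.
    return raw[:limit].decode("utf-8", errors="ignore")


padding = b" " * MAX_BYTES
meta = b'<meta property="og:type" content="article">'
early = b"<html><head>" + meta + padding   # metadata near the top of the page
late = b"<html><head>" + padding + meta    # metadata pushed past the cutoff

print("og:type" in head_of_page(early))  # True
print("og:type" in head_of_page(late))   # False
```

Pages that place large inline scripts or styles ahead of their metadata behave like the `late` case, so any signal derived from that metadata is invisible to the classifier.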
  ## Intended Use

  **Primary use cases:**
  - Quality control for article extraction pipelines
+ - Monitoring extraction API health and failure modes
+ - Fast filtering of non-article pages before downstream processing
+ - Analytics on extraction success and failure rates

  **Not suitable for:**
  - Language detection
 

  ## Model Card Authors

+ Allanatrix