Allanatrix/Summary_model

Text Classification

article-extraction


Allanatrix committed 29 days ago

Commit 70cfc92 · verified · 1 Parent(s): b358acd

Update README.md

Files changed (1)

README.md +18 -15

README.md CHANGED

@@ -156,39 +156,42 @@ else:
 ## Training Data
 - **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
-- **Labeled samples:** 138,523 (weak labels from heuristics)
 - **Train/Val/Test split:** 110,819 / 13,852 / 13,852
-- **Class distribution:** 85% non-articles, 10% full articles, 4% errors, 1% partial
 ## Model Details
 - **Algorithm:** XGBoost (GPU-trained)
-- **Features:** 26 hand-crafted features (HTML structure, keywords, densities)
 - **Training:** 500 boosting rounds with early stopping
 - **Hardware:** Single GPU (CUDA)
 - **Training time:** ~6 minutes
 ### Features Used
-- Content: length, whitespace ratio, digit/punct ratios
-- Keywords: error messages, article indicators, navigation text
-- Structure: paragraph, link, heading, script tag counts
-- Densities: links/KB, paragraphs/KB, scripts/KB
 ## Limitations
-- Only analyzes first 64KB of HTML (meta tags must appear early)
-- Trained on weak labels (heuristic-based, not human-annotated)
-- `other_failure` class has minimal representation in training data
-- Optimized for English-language web pages
 ## Intended Use
 **Primary use cases:**
 - Quality control for article extraction pipelines
-- Monitoring extraction API health (error detection)
-- Filtering non-article pages before processing
-- Analytics on extraction success rates
 **Not suitable for:**
 - Language detection
@@ -198,4 +201,4 @@ else:
 ## Model Card Authors
-Allanatrix

 ## Training Data
 - **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
+- **Labeled samples:** 138,523 (LLM-labeled)
+- **Labeling method:** Distillation from large language models
+  - **Primary teacher:** GPT-5
+  - **Secondary / adjudicator:** Qwen
 - **Train/Val/Test split:** 110,819 / 13,852 / 13,852
+- **Class distribution:** ~85% non-articles, ~10% full articles, ~4% errors, ~1% partial articles
 ## Model Details
 - **Algorithm:** XGBoost (GPU-trained)
+- **Features:** 26 hand-crafted features (HTML structure, keyword counts, density metrics)
 - **Training:** 500 boosting rounds with early stopping
 - **Hardware:** Single GPU (CUDA)
 - **Training time:** ~6 minutes
 ### Features Used
+- **Content statistics:** length, whitespace ratio, digit and punctuation ratios
+- **Keyword signals:** error messages, article indicators, navigation text
+- **HTML structure:** paragraph, link, heading, script, style, and nav tag counts
+- **Density metrics:** links/KB, paragraphs/KB, scripts/KB, heading score
 ## Limitations
+- Only analyzes the first 64KB of HTML (important metadata must appear early)
+- Labels are generated by LLMs rather than direct human annotation
+- Some classes (e.g. `other_failure`) have limited representation
+- Optimized primarily for English-language web pages
 ## Intended Use
 **Primary use cases:**
 - Quality control for article extraction pipelines
+- Monitoring extraction API health and failure modes
+- Fast filtering of non-article pages before downstream processing
+- Analytics on extraction success and failure rates
 **Not suitable for:**
 - Language detection
 ## Model Card Authors
+Allanatrix
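
The feature groups named in the updated README (content statistics, HTML tag counts, per-KB densities, and the 64KB cutoff) could be computed roughly as in the sketch below. This is an illustration only, not the model's actual 26-feature extractor: the function name `extract_features`, the tag list, and the feature names are all assumptions made for the example.

```python
import re

# The README states that only the first 64KB of HTML is analyzed.
MAX_BYTES = 64 * 1024

# Hypothetical tag list: the README names these tag types,
# but not the exact features derived from them.
TAGS = ("p", "a", "h1", "h2", "h3", "script", "style", "nav")


def extract_features(html: str) -> dict:
    """Compute a few illustrative features (not the real 26-feature set)."""
    # Truncate on bytes, then decode leniently so that a multi-byte
    # character split by the cutoff is dropped instead of raising.
    raw = html.encode("utf-8")[:MAX_BYTES]
    head = raw.decode("utf-8", errors="ignore")
    n_chars = len(head)
    kb = max(len(raw) / 1024, 1e-9)  # avoid division by zero on empty input

    feats = {
        "length": n_chars,
        "whitespace_ratio": sum(c.isspace() for c in head) / n_chars if n_chars else 0.0,
        "digit_ratio": sum(c.isdigit() for c in head) / n_chars if n_chars else 0.0,
    }

    lowered = head.lower()
    for tag in TAGS:
        # Count opening tags only, e.g. "<p>" or "<p class=...>";
        # the lookahead keeps "<a" from matching "<article".
        feats[f"{tag}_count"] = len(re.findall(rf"<{tag}(?=[\s>/])", lowered))

    # Density metrics, normalized by the size of the analyzed prefix.
    feats["links_per_kb"] = feats["a_count"] / kb
    feats["paragraphs_per_kb"] = feats["p_count"] / kb
    feats["scripts_per_kb"] = feats["script_count"] / kb
    return feats
```

A feature dictionary like this would then be turned into a fixed-order numeric vector before being passed to the XGBoost classifier; the keyword signals and heading score mentioned in the README are left out here because the diff does not define them.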