---
language:
- en
license: mit
tags:
- text-classification
- html-analysis
- article-extraction
- xgboost
- web-scraping
datasets:
- Allanatrix/articles
metrics:
- accuracy
- f1
library_name: xgboost
---
# Article Extraction Outcome Classifier
A fast, lightweight classifier that categorizes web article extraction outcomes with ~90% accuracy.
## Model Description
This model predicts whether HTML extraction succeeded, failed, or returned a non-article page. It combines rule-based heuristics for speed with XGBoost for accuracy on ambiguous cases.
## Classes
| Class | Description |
|-------|-------------|
| `full_article_extracted` | Complete article successfully extracted |
| `partial_article_extracted` | Article partially extracted (incomplete) |
| `api_provider_error` | External API/service failure |
| `other_failure` | Low-confidence failure (catch-all) |
| `full_page_not_article` | Page is not an article (nav, homepage, etc.) |
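The classifier emits integer class ids; the authoritative id-to-label mapping is stored in `artifacts.pt` as `id_to_label`. The exact id order is not documented here, so the assignment below is purely illustrative:

```python
# Hypothetical id assignment -- the real mapping is loaded from artifacts.pt.
id_to_label = {
    0: "full_article_extracted",
    1: "partial_article_extracted",
    2: "api_provider_error",
    3: "other_failure",
    4: "full_page_not_article",
}
label_to_id = {label: i for i, label in id_to_label.items()}
```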
## Performance
~90% accuracy on a 13,852-sample real-world test set, with strong performance on the dominant classes.
| Class | Precision | Recall | F1-score | Support |
| ------------------------- | --------- | ------ | -------- | ------- |
| full_article_extracted | 0.91 | 0.84 | 0.87 | 1,312 |
| partial_article_extracted | 0.76 | 0.63 | 0.69 | 92 |
| api_provider_error | 0.95 | 0.93 | 0.94 | 627 |
| other_failure | 0.41 | 0.28 | 0.33 | 44 |
| full_page_not_article | 0.92 | 0.97 | 0.94 | 11,821 |
| **Accuracy** | — | — | **0.90** | 13,852 |
| **Macro Avg** | 0.79 | 0.73 | 0.72 | 13,852 |
| **Weighted Avg** | 0.90 | 0.90 | 0.90 | 13,852 |
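The per-class precision and recall figures above follow the standard definitions. As a sanity check, they can be recomputed from raw predictions in a few lines; the toy labels below are hypothetical and not drawn from the real test set:

```python
# Per-class precision/recall from scratch; y_true/y_pred are a toy example.
y_true = ["full_page_not_article", "api_provider_error", "full_article_extracted",
          "full_page_not_article", "full_page_not_article"]
y_pred = ["full_page_not_article", "api_provider_error", "full_article_extracted",
          "full_article_extracted", "full_page_not_article"]

def precision_recall(label: str, y_true: list, y_pred: list) -> tuple:
    # tp: predicted the label and the label was correct
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # all positive predictions
    actual = sum(t == label for t in y_true)      # all true positives in data
    return (tp / predicted if predicted else 0.0,
            tp / actual if actual else 0.0)

precision, recall = precision_recall("full_page_not_article", y_true, y_pred)
```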
## Usage
```python
import numpy as np
import torch
from sklearn.preprocessing import StandardScaler
# Load the saved artifacts. The scaler, booster, and label map are pickled
# Python objects, so weights_only=False is required on PyTorch >= 2.6.
artifacts = torch.load("artifacts.pt", weights_only=False)
scaler = artifacts["scaler"]
model = artifacts["xgb_model"]
id_to_label = artifacts["id_to_label"]
# Extract 27 hand-crafted features from the HTML prefix
def extract_features(html: str, max_chars: int = 64000) -> dict:
prefix = html[:max_chars].lower()
features = {
"length_chars": len(html),
"prefix_len": len(prefix),
"ws_ratio": sum(c.isspace() for c in prefix) / len(prefix) if prefix else 0,
"digit_ratio": sum(c.isdigit() for c in prefix) / len(prefix) if prefix else 0,
"punct_ratio": sum(c in ".,;:!?" for c in prefix) / len(prefix) if prefix else 0,
# Keyword counts
"cookie": prefix.count("cookie") + prefix.count("consent"),
"subscribe": prefix.count("subscribe") + prefix.count("newsletter"),
"legal": prefix.count("privacy policy") + prefix.count("terms of service"),
"error": prefix.count("error") + prefix.count("timeout") + prefix.count("rate limit"),
"nav": prefix.count("home") + prefix.count("menu") + prefix.count("navigation"),
"article_kw": prefix.count("published") + prefix.count("reading time"),
"meta_article_kw": prefix.count("og:article") + prefix.count("article:published"),
# Tag counts
"n_p": prefix.count("<p"),
"n_a": prefix.count("<a"),
"n_h1": prefix.count("<h1"),
"n_h2": prefix.count("<h2"),
"n_h3": prefix.count("<h3"),
"n_article": prefix.count("<article"),
"n_main": prefix.count("<main"),
"n_time": prefix.count("<time"),
"n_script": prefix.count("<script"),
"n_style": prefix.count("<style"),
"n_nav": prefix.count("<nav"),
}
# Density features
kb = len(prefix) / 1000.0
features["link_density"] = features["n_a"] / kb if kb > 0 else 0
features["para_density"] = features["n_p"] / kb if kb > 0 else 0
features["script_density"] = features["n_script"] / kb if kb > 0 else 0
features["heading_score"] = features["n_h1"] * 3 + features["n_h2"] * 2 + features["n_h3"]
return features
# Predict (html_string holds the raw HTML of the page to classify)
features = extract_features(html_string)
NUM_COLS = ["length_chars", "prefix_len", "ws_ratio", "digit_ratio", "punct_ratio",
"cookie", "subscribe", "legal", "error", "nav", "article_kw", "meta_article_kw",
"n_p", "n_a", "n_h1", "n_h2", "n_h3", "n_article", "n_main", "n_time",
"n_script", "n_style", "n_nav", "link_density", "para_density",
"script_density", "heading_score"]
X = np.array([features[col] for col in NUM_COLS]).reshape(1, -1).astype(np.float32)
X_scaled = scaler.transform(X)
prediction = model.predict(X_scaled)[0]
print(f"Outcome: {id_to_label[prediction]}")
```
### Optional: Rule-Based Fast Path
For 80%+ of cases, you can skip the model entirely:
```python
def apply_rules(features: dict) -> str | None:
"""Returns class label or None if ambiguous."""
if features["error"] >= 3:
return "api_provider_error"
if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
return "full_article_extracted"
if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
return "full_page_not_article"
return None # Use ML model
# Try rules first
rule_result = apply_rules(features)
if rule_result:
print(f"Outcome (rule-based): {rule_result}")
else:
# Fall back to model
prediction = model.predict(X_scaled)[0]
print(f"Outcome (model): {id_to_label[prediction]}")
```
## Training Data
- **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
- **Labeled samples:** 138,523 (LLM-labeled)
- **Labeling method:** Distillation from large language models
- **Primary teacher:** GPT-5
- **Secondary / adjudicator:** Qwen
- **Train/Val/Test split:** 110,819 / 13,852 / 13,852
- **Class distribution:** ~85% non-articles, ~10% full articles, ~4% errors, ~1% partial articles
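The split above is roughly 80/10/10 over the 138,523 labeled samples. A minimal sketch of how such a deterministic split might be produced (the actual seed and any stratification are not documented):

```python
import random

# Sketch of an 80/10/10 split; seed 42 is an assumption.
N_LABELED = 138_523
n_val = n_test = 13_852

indices = list(range(N_LABELED))
random.Random(42).shuffle(indices)

n_train = N_LABELED - n_val - n_test  # 110,819
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]
```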
## Model Details
- **Algorithm:** XGBoost (GPU-trained)
- **Features:** 27 hand-crafted features (HTML structure, keyword counts, density metrics)
- **Training:** 500 boosting rounds with early stopping
- **Hardware:** Single GPU (CUDA)
- **Training time:** ~6 minutes
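A training configuration matching the details above might look like the following dict; everything except the round count, early stopping, and CUDA device is an assumed placeholder, not a documented hyperparameter:

```python
# Hypothetical training configuration; only 500 rounds, early stopping,
# and GPU (CUDA) training are stated in this card.
params = {
    "objective": "multi:softmax",
    "num_class": 5,            # five outcome classes
    "tree_method": "hist",
    "device": "cuda",
    "max_depth": 8,            # assumed
    "learning_rate": 0.1,      # assumed
}
num_boost_round = 500
early_stopping_rounds = 25     # assumed patience
```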
### Features Used
- **Content statistics:** length, whitespace ratio, digit and punctuation ratios
- **Keyword signals:** error messages, article indicators, navigation text
- **HTML structure:** paragraph, link, heading, script, style, and nav tag counts
- **Density metrics:** links/KB, paragraphs/KB, scripts/KB, heading score
## Limitations
- Only analyzes the first 64KB of HTML (important metadata must appear early)
- Labels are generated by LLMs rather than direct human annotation
- Some classes (e.g. `other_failure`) have limited representation
- Optimized primarily for English-language web pages
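The first limitation is easy to demonstrate: any signal past the 64,000-character prefix is invisible to the feature extractor.

```python
MAX_CHARS = 64_000  # matches the max_chars default in the usage snippet

# An <article> tag buried past the prefix boundary is never counted.
html = "<html>" + "x" * 70_000 + "<article>late body</article></html>"
prefix = html[:MAX_CHARS].lower()

article_seen = prefix.count("<article")
```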
## Intended Use
**Primary use cases:**
- Quality control for article extraction pipelines
- Monitoring extraction API health and failure modes
- Fast filtering of non-article pages before downstream processing
- Analytics on extraction success and failure rates
**Not suitable for:**
- Language detection
- Content quality assessment
- Paywall detection
- Full content extraction
## Model Card Authors
Allanatrix