---
language:
- en
license: mit
tags:
- text-classification
- html-analysis
- article-extraction
- xgboost
- web-scraping
datasets:
- Allanatrix/articles
metrics:
- accuracy
- f1
library_name: xgboost
---
# Article Extraction Outcome Classifier
A fast, lightweight classifier that categorizes web article extraction outcomes with ~90% accuracy.
## Model Description
This model predicts whether HTML extraction succeeded, failed, or returned a non-article page. It combines rule-based heuristics for speed with XGBoost for accuracy on ambiguous cases.
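The rules-then-model flow described above can be sketched as a two-stage dispatcher. This is an illustrative stub, not the shipped implementation: the function names and the toy rule/model callables here are hypothetical.

```python
from typing import Callable, Optional

def classify(features: dict,
             rules: Callable[[dict], Optional[str]],
             model_predict: Callable[[dict], str]) -> str:
    """Hypothetical two-stage dispatch: cheap rules first, ML fallback."""
    label = rules(features)
    return label if label is not None else model_predict(features)

# Stub components for illustration only.
rules = lambda f: "api_provider_error" if f.get("error", 0) >= 3 else None
model_predict = lambda f: "full_page_not_article"

print(classify({"error": 5}, rules, model_predict))  # rule fires
print(classify({"error": 0}, rules, model_predict))  # falls back to model
```

Keeping the rule layer separate makes it easy to measure how often the model is actually invoked.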
## Classes
| Class | Description |
|-------|-------------|
| `full_article_extracted` | Complete article successfully extracted |
| `partial_article_extracted` | Article partially extracted (incomplete) |
| `api_provider_error` | External API/service failure |
| `other_failure` | Low-confidence failure (catch-all) |
| `full_page_not_article` | Page is not an article (nav, homepage, etc.) |
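For reference, the five class labels can be kept in a label/id mapping like the one below. The ordering here is hypothetical; the authoritative mapping ships inside `artifacts.pt` as `id_to_label` and may differ.

```python
# Hypothetical id ordering, for illustration only; use the id_to_label
# mapping from artifacts.pt in practice.
ID_TO_LABEL = {
    0: "full_article_extracted",
    1: "partial_article_extracted",
    2: "api_provider_error",
    3: "other_failure",
    4: "full_page_not_article",
}
LABEL_TO_ID = {label: i for i, label in ID_TO_LABEL.items()}

print(LABEL_TO_ID["full_page_not_article"])  # -> 4 under this ordering
```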
## Performance
~90% accuracy on a large, real-world test set, with strong performance on the dominant classes.
| Class | Precision | Recall | F1-score | Support |
| ------------------------- | --------- | ------ | -------- | ------- |
| full_article_extracted | 0.91 | 0.84 | 0.87 | 1,312 |
| partial_article_extracted | 0.76 | 0.63 | 0.69 | 92 |
| api_provider_error | 0.95 | 0.93 | 0.94 | 627 |
| other_failure | 0.41 | 0.28 | 0.33 | 44 |
| full_page_not_article | 0.92 | 0.97 | 0.94 | 11,821 |
| **Accuracy** | — | — | **0.90** | 13,852 |
| **Macro Avg** | 0.79 | 0.73 | 0.72 | 13,852 |
| **Weighted Avg** | 0.90 | 0.90 | 0.90 | 13,852 |
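A table in this shape can be produced with scikit-learn's `classification_report`. The snippet below uses tiny toy labels purely to show the mechanics; the numbers above come from the real 13,852-example test set.

```python
from sklearn.metrics import classification_report

# Toy ground-truth vs. predicted labels, for illustration only.
y_true = ["full_page_not_article"] * 8 + ["full_article_extracted"] * 2
y_pred = ["full_page_not_article"] * 7 + ["full_article_extracted"] * 3

# Prints per-class precision/recall/F1 plus accuracy, macro, and
# weighted averages, matching the layout of the table above.
print(classification_report(y_true, y_pred, digits=2))
```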
```python
import numpy as np
import torch
from sklearn.preprocessing import StandardScaler
# Load model artifacts (scaler, XGBoost model, label map).
# Note: PyTorch >= 2.6 defaults torch.load to weights_only=True, which
# cannot restore non-tensor objects such as the scaler; pass
# weights_only=False for a checkpoint you trust.
artifacts = torch.load("artifacts.pt", weights_only=False)
scaler = artifacts["scaler"]
model = artifacts["xgb_model"]
id_to_label = artifacts["id_to_label"]

# Extract features (27 numeric features from the HTML prefix)
def extract_features(html: str, max_chars: int = 64000) -> dict:
prefix = html[:max_chars].lower()
features = {
"length_chars": len(html),
"prefix_len": len(prefix),
"ws_ratio": sum(c.isspace() for c in prefix) / len(prefix) if prefix else 0,
"digit_ratio": sum(c.isdigit() for c in prefix) / len(prefix) if prefix else 0,
"punct_ratio": sum(c in ".,;:!?" for c in prefix) / len(prefix) if prefix else 0,
# Keyword counts
"cookie": prefix.count("cookie") + prefix.count("consent"),
"subscribe": prefix.count("subscribe") + prefix.count("newsletter"),
"legal": prefix.count("privacy policy") + prefix.count("terms of service"),
"error": prefix.count("error") + prefix.count("timeout") + prefix.count("rate limit"),
"nav": prefix.count("home") + prefix.count("menu") + prefix.count("navigation"),
"article_kw": prefix.count("published") + prefix.count("reading time"),
"meta_article_kw": prefix.count("og:article") + prefix.count("article:published"),
# Tag counts
"n_p": prefix.count("<p"),
"n_a": prefix.count("<a"),
"n_h1": prefix.count("<h1"),
"n_h2": prefix.count("<h2"),
"n_h3": prefix.count("<h3"),
"n_article": prefix.count("<article"),
"n_main": prefix.count("<main"),
"n_time": prefix.count("<time"),
"n_script": prefix.count("<script"),
"n_style": prefix.count("<style"),
"n_nav": prefix.count("<nav"),
}
# Density features
kb = len(prefix) / 1000.0
features["link_density"] = features["n_a"] / kb if kb > 0 else 0
features["para_density"] = features["n_p"] / kb if kb > 0 else 0
features["script_density"] = features["n_script"] / kb if kb > 0 else 0
features["heading_score"] = features["n_h1"] * 3 + features["n_h2"] * 2 + features["n_h3"]
return features
# Predict on a single page (html_string holds the raw HTML source)
features = extract_features(html_string)
NUM_COLS = ["length_chars", "prefix_len", "ws_ratio", "digit_ratio", "punct_ratio",
"cookie", "subscribe", "legal", "error", "nav", "article_kw", "meta_article_kw",
"n_p", "n_a", "n_h1", "n_h2", "n_h3", "n_article", "n_main", "n_time",
"n_script", "n_style", "n_nav", "link_density", "para_density",
"script_density", "heading_score"]
X = np.array([features[col] for col in NUM_COLS]).reshape(1, -1).astype(np.float32)
X_scaled = scaler.transform(X)
prediction = model.predict(X_scaled)[0]
print(f"Outcome: {id_to_label[prediction]}")
```
### Optional: Rule-Based Fast Path
For 80%+ of cases, you can skip the model entirely:
```python
def apply_rules(features: dict) -> str | None:
"""Returns class label or None if ambiguous."""
if features["error"] >= 3:
return "api_provider_error"
if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
return "full_article_extracted"
if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
return "full_page_not_article"
    return None  # Ambiguous: fall back to the ML model

# Try rules first
rule_result = apply_rules(features)
if rule_result:
print(f"Outcome (rule-based): {rule_result}")
else:
# Fall back to model
prediction = model.predict(X_scaled)[0]
print(f"Outcome (model): {id_to_label[prediction]}")
```
## Training Data
- **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
- **Labeled samples:** 138,523 (LLM-labeled)
- **Labeling method:** Distillation from large language models
- **Primary teacher:** GPT-5
- **Secondary / adjudicator:** Qwen
- **Train/Val/Test split:** 110,819 / 13,852 / 13,852
- **Class distribution:** ~85% non-articles, ~10% full articles, ~4% errors, ~1% partial articles
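With an imbalance this steep (~85% of samples in one class), boosted-tree training often benefits from per-sample weighting. The card does not state whether weighting was used; the snippet below is only a minimal inverse-frequency sketch based on the approximate shares listed above.

```python
import numpy as np

# Approximate class shares from the card: non-articles, full articles,
# errors, partial articles.
shares = np.array([0.85, 0.10, 0.04, 0.01])

# Inverse-frequency weights, normalized so they average to 1.0;
# the rarest class receives the largest weight.
weights = 1.0 / shares
weights /= weights.mean()

print(dict(zip(["non_article", "full", "error", "partial"],
               weights.round(2))))
```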
## Model Details
- **Algorithm:** XGBoost (GPU-trained)
- **Features:** 27 hand-crafted features (HTML structure, keyword counts, density metrics)
- **Training:** 500 boosting rounds with early stopping
- **Hardware:** Single GPU (CUDA)
- **Training time:** ~6 minutes
### Features Used
- **Content statistics:** length, whitespace ratio, digit and punctuation ratios
- **Keyword signals:** error messages, article indicators, navigation text
- **HTML structure:** paragraph, link, heading, script, style, and nav tag counts
- **Density metrics:** links/KB, paragraphs/KB, scripts/KB, heading score
## Limitations
- Only analyzes the first 64KB of HTML (important metadata must appear early)
- Labels are generated by LLMs rather than direct human annotation
- Some classes (e.g. `other_failure`) have limited representation
- Optimized primarily for English-language web pages
## Intended Use
**Primary use cases:**
- Quality control for article extraction pipelines
- Monitoring extraction API health and failure modes
- Fast filtering of non-article pages before downstream processing
- Analytics on extraction success and failure rates
**Not suitable for:**
- Language detection
- Content quality assessment
- Paywall detection
- Full content extraction
## Model Card Authors
Allanatrix