---
language:
- en
license: mit
tags:
- text-classification
- html-analysis
- article-extraction
- xgboost
- web-scraping
datasets:
- Allanatrix/articles
metrics:
- accuracy
- f1
library_name: xgboost
---

# Article Extraction Outcome Classifier

A fast, lightweight classifier that categorizes web article extraction outcomes with ~90% accuracy.

## Model Description

This model predicts whether HTML extraction succeeded, failed, or returned a non-article page. It combines rule-based heuristics for speed with XGBoost for accuracy on ambiguous cases.

## Classes

| Class | Description |
|-------|-------------|
| `full_article_extracted` | Complete article successfully extracted |
| `partial_article_extracted` | Article partially extracted (incomplete) |
| `api_provider_error` | External API/service failure |
| `other_failure` | Low-confidence failure (catch-all) |
| `full_page_not_article` | Page is not an article (nav, homepage, etc.) |

## Performance

~90% accuracy on a 13,852-sample real-world test set, with strong performance on the dominant classes.

| Class | Precision | Recall | F1-score | Support |
| ------------------------- | --------- | ------ | -------- | ------- |
| full_article_extracted | 0.91 | 0.84 | 0.87 | 1,312 |
| partial_article_extracted | 0.76 | 0.63 | 0.69 | 92 |
| api_provider_error | 0.95 | 0.93 | 0.94 | 627 |
| other_failure | 0.41 | 0.28 | 0.33 | 44 |
| full_page_not_article | 0.92 | 0.97 | 0.94 | 11,821 |
| **Accuracy** | — | — | **0.90** | 13,852 |
| **Macro Avg** | 0.79 | 0.73 | 0.72 | 13,852 |
| **Weighted Avg** | 0.90 | 0.90 | 0.90 | 13,852 |

## Usage

Load the artifacts, featurize the HTML, and predict:

```python
import numpy as np
import torch

# Load model artifacts. weights_only=False is required on recent PyTorch
# (>=2.6), since the file stores sklearn/XGBoost objects, not just tensors.
artifacts = torch.load("artifacts.pt", weights_only=False)
scaler = artifacts["scaler"]
model = artifacts["xgb_model"]
id_to_label = artifacts["id_to_label"]

# Extract features (27 features from the HTML prefix)
def extract_features(html: str, max_chars: int = 64000) -> dict:
    prefix = html[:max_chars].lower()

    features = {
        "length_chars": len(html),
        "prefix_len": len(prefix),
        "ws_ratio": sum(c.isspace() for c in prefix) / len(prefix) if prefix else 0,
        "digit_ratio": sum(c.isdigit() for c in prefix) / len(prefix) if prefix else 0,
        "punct_ratio": sum(c in ".,;:!?" for c in prefix) / len(prefix) if prefix else 0,
        # Keyword counts
        "cookie": prefix.count("cookie") + prefix.count("consent"),
        "subscribe": prefix.count("subscribe") + prefix.count("newsletter"),
        "legal": prefix.count("privacy policy") + prefix.count("terms of service"),
        "error": prefix.count("error") + prefix.count("timeout") + prefix.count("rate limit"),
        "nav": prefix.count("home") + prefix.count("menu") + prefix.count("navigation"),
        "article_kw": prefix.count("published") + prefix.count("reading time"),
        "meta_article_kw": prefix.count("og:article") + prefix.count("article:published"),
        # Tag counts
        "n_p": prefix.count("<p"),
        "n_a": prefix.count("<a"),
        "n_h1": prefix.count("<h1"),
        "n_h2": prefix.count("<h2"),
        "n_h3": prefix.count("<h3"),
        "n_article": prefix.count("<article"),
        "n_main": prefix.count("<main"),
        "n_time": prefix.count("<time"),
        "n_script": prefix.count("<script"),
        "n_style": prefix.count("<style"),
        "n_nav": prefix.count("<nav"),
    }

    # Density features (per KB of prefix)
    kb = len(prefix) / 1000.0
    features["link_density"] = features["n_a"] / kb if kb > 0 else 0
    features["para_density"] = features["n_p"] / kb if kb > 0 else 0
    features["script_density"] = features["n_script"] / kb if kb > 0 else 0
    features["heading_score"] = features["n_h1"] * 3 + features["n_h2"] * 2 + features["n_h3"]

    return features

# Predict
features = extract_features(html_string)
NUM_COLS = ["length_chars", "prefix_len", "ws_ratio", "digit_ratio", "punct_ratio",
            "cookie", "subscribe", "legal", "error", "nav", "article_kw", "meta_article_kw",
            "n_p", "n_a", "n_h1", "n_h2", "n_h3", "n_article", "n_main", "n_time",
            "n_script", "n_style", "n_nav", "link_density", "para_density",
            "script_density", "heading_score"]

X = np.array([features[col] for col in NUM_COLS]).reshape(1, -1).astype(np.float32)
X_scaled = scaler.transform(X)
prediction = model.predict(X_scaled)[0]

print(f"Outcome: {id_to_label[prediction]}")
```

### Optional: Rule-Based Fast Path

For 80%+ of cases, you can skip the model entirely:

```python
def apply_rules(features: dict) -> str | None:
    """Return a class label, or None if the case is ambiguous."""
    if features["error"] >= 3:
        return "api_provider_error"

    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
        return "full_article_extracted"

    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
        return "full_page_not_article"

    return None  # Fall through to the ML model

# Try rules first
rule_result = apply_rules(features)
if rule_result:
    print(f"Outcome (rule-based): {rule_result}")
else:
    # Fall back to model
    prediction = model.predict(X_scaled)[0]
    print(f"Outcome (model): {id_to_label[prediction]}")
```
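
Both paths can be wired into a single reusable entry point. A minimal sketch: the model call is injected as a callable so the fast path stays dependency-free (`classify` and `model_predict` are illustrative names, not part of the released artifacts; the rules are the same thresholds as above):

```python
from typing import Callable, Optional

def apply_rules(features: dict) -> Optional[str]:
    """Rule-based fast path: return a label, or None if ambiguous."""
    if features["error"] >= 3:
        return "api_provider_error"
    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
        return "full_article_extracted"
    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
        return "full_page_not_article"
    return None

def classify(features: dict, model_predict: Callable[[dict], str]) -> tuple[str, str]:
    """Try the rules first; call the injected model only on ambiguous cases."""
    label = apply_rules(features)
    if label is not None:
        return label, "rules"
    return model_predict(features), "model"

# Example: an obvious API error never touches the model
features = {"error": 5, "meta_article_kw": 0, "n_p": 0, "nav": 0, "link_density": 0.0}
label, source = classify(features, model_predict=lambda f: "other_failure")
print(label, source)  # api_provider_error rules
```

Returning the decision source alongside the label makes it easy to monitor what fraction of traffic the fast path actually handles.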

## Training Data

- **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
- **Labeled samples:** 138,523 (LLM-labeled)
- **Labeling method:** Distillation from large language models
  - **Primary teacher:** GPT-5
  - **Secondary / adjudicator:** Qwen
- **Train/Val/Test split:** 110,819 / 13,852 / 13,852 (80/10/10)
- **Class distribution:** ~85% non-articles, ~10% full articles, ~4% errors, ~1% partial articles
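
With class skew this strong (~1% partial articles), an 80/10/10 split should be stratified so rare classes keep the same proportions in every split. A sketch with scikit-learn, using synthetic labels that mimic the reported distribution (the actual split code was not released, so this is an assumption about method, not a reproduction):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Illustrative labels matching the reported class skew
labels = rng.choice(
    ["full_page_not_article", "full_article_extracted",
     "api_provider_error", "partial_article_extracted"],
    size=10_000,
    p=[0.85, 0.10, 0.04, 0.01],
)
indices = np.arange(len(labels))

# 80% train, then split the remaining 20% evenly into val/test
train_idx, hold_idx = train_test_split(indices, test_size=0.2,
                                       stratify=labels, random_state=42)
val_idx, test_idx = train_test_split(hold_idx, test_size=0.5,
                                     stratify=labels[hold_idx], random_state=42)
print(len(train_idx), len(val_idx), len(test_idx))  # 8000 1000 1000
```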

## Model Details

- **Algorithm:** XGBoost (GPU-trained)
- **Features:** 27 hand-crafted features (HTML structure, keyword counts, density metrics)
- **Training:** 500 boosting rounds with early stopping
- **Hardware:** Single GPU (CUDA)
- **Training time:** ~6 minutes

### Features Used

- **Content statistics:** length, whitespace ratio, digit and punctuation ratios
- **Keyword signals:** error messages, article indicators, navigation text
- **HTML structure:** paragraph, link, heading, script, style, and nav tag counts
- **Density metrics:** links/KB, paragraphs/KB, scripts/KB, heading score

## Limitations

- Only analyzes the first 64 KB of HTML (important metadata must appear early)
- Labels are generated by LLMs rather than direct human annotation
- Some classes (e.g. `other_failure`) have limited representation
- Optimized primarily for English-language web pages

## Intended Use

**Primary use cases:**

- Quality control for article extraction pipelines
- Monitoring extraction API health and failure modes
- Fast filtering of non-article pages before downstream processing
- Analytics on extraction success and failure rates

**Not suitable for:**

- Language detection
- Content quality assessment
- Paywall detection
- Full content extraction

## Model Card Authors

Allanatrix