---
language:
- en
license: mit
tags:
- text-classification
- html-analysis
- article-extraction
- xgboost
- web-scraping
datasets:
- Allanatrix/articles
metrics:
- accuracy
- f1
library_name: xgboost
---

# Article Extraction Outcome Classifier

A fast, lightweight classifier that categorizes web article extraction outcomes with ~90% accuracy.

## Model Description

This model predicts whether HTML extraction succeeded, failed, or returned a non-article page. It combines rule-based heuristics for speed with XGBoost for accuracy on ambiguous cases.

## Classes

| Class | Description |
|-------|-------------|
| `full_article_extracted` | Complete article successfully extracted |
| `partial_article_extracted` | Article partially extracted (incomplete) |
| `api_provider_error` | External API/service failure |
| `other_failure` | Low-confidence failure (catch-all) |
| `full_page_not_article` | Page is not an article (nav, homepage, etc.) |

## Performance

~90% accuracy on a large, real-world test set, with strong performance on the dominant classes.

| Class | Precision | Recall | F1-score | Support |
| ------------------------- | --------- | ------ | -------- | ------- |
| full_article_extracted | 0.91 | 0.84 | 0.87 | 1,312 |
| partial_article_extracted | 0.76 | 0.63 | 0.69 | 92 |
| api_provider_error | 0.95 | 0.93 | 0.94 | 627 |
| other_failure | 0.41 | 0.28 | 0.33 | 44 |
| full_page_not_article | 0.92 | 0.97 | 0.94 | 11,821 |
| **Accuracy** | — | — | **0.90** | 13,852 |
| **Macro Avg** | 0.79 | 0.73 | 0.72 | 13,852 |
| **Weighted Avg** | 0.90 | 0.90 | 0.90 | 13,852 |

## Usage

```python
import numpy as np
import torch

# Load model artifacts (pickled scaler, XGBoost model, and label mapping)
artifacts = torch.load("artifacts.pt")
scaler = artifacts["scaler"]
model = artifacts["xgb_model"]
id_to_label = artifacts["id_to_label"]

# Extract 27 numeric features from the HTML prefix
def extract_features(html: str, max_chars: int = 64000) -> dict:
    prefix = html[:max_chars].lower()
    features = {
        "length_chars":
            len(html),
        "prefix_len": len(prefix),
        "ws_ratio": sum(c.isspace() for c in prefix) / len(prefix) if prefix else 0,
        "digit_ratio": sum(c.isdigit() for c in prefix) / len(prefix) if prefix else 0,
        "punct_ratio": sum(c in ".,;:!?" for c in prefix) / len(prefix) if prefix else 0,
        # Keyword counts
        "cookie": prefix.count("cookie") + prefix.count("consent"),
        "subscribe": prefix.count("subscribe") + prefix.count("newsletter"),
        "legal": prefix.count("privacy policy") + prefix.count("terms of service"),
        "error": prefix.count("error") + prefix.count("timeout") + prefix.count("rate limit"),
        "nav": prefix.count("home") + prefix.count("menu") + prefix.count("navigation"),
        "article_kw": prefix.count("published") + prefix.count("reading time"),
        "meta_article_kw": prefix.count("og:article") + prefix.count("article:published"),
        # Tag counts
        "n_p": prefix.count("<p"),
        "n_a": prefix.count("<a "),
        "n_h1": prefix.count("<h1"),
        "n_h2": prefix.count("<h2"),
        "n_h3": prefix.count("<h3"),
        "n_article": prefix.count("<article"),
        "n_main": prefix.count("<main"),
        "n_time": prefix.count("<time"),
        "n_script": prefix.count("<script"),
        "n_style": prefix.count("<style"),
        "n_nav": prefix.count("<nav"),
    }
    # Density metrics (per KB of HTML)
    kb = features["length_chars"] / 1024
    features["link_density"] = features["n_a"] / kb if kb > \
        0 else 0
    features["para_density"] = features["n_p"] / kb if kb > 0 else 0
    features["script_density"] = features["n_script"] / kb if kb > 0 else 0
    features["heading_score"] = features["n_h1"] * 3 + features["n_h2"] * 2 + features["n_h3"]
    return features

# Predict
features = extract_features(html_string)
NUM_COLS = ["length_chars", "prefix_len", "ws_ratio", "digit_ratio", "punct_ratio",
            "cookie", "subscribe", "legal", "error", "nav",
            "article_kw", "meta_article_kw",
            "n_p", "n_a", "n_h1", "n_h2", "n_h3", "n_article", "n_main",
            "n_time", "n_script", "n_style", "n_nav",
            "link_density", "para_density", "script_density", "heading_score"]
X = np.array([features[col] for col in NUM_COLS]).reshape(1, -1).astype(np.float32)
X_scaled = scaler.transform(X)
prediction = model.predict(X_scaled)[0]
print(f"Outcome: {id_to_label[prediction]}")
```

### Optional: Rule-Based Fast Path

For 80%+ of cases, you can skip the model entirely:

```python
def apply_rules(features: dict) -> str | None:
    """Return a class label, or None if ambiguous."""
    if features["error"] >= 3:
        return "api_provider_error"
    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
        return "full_article_extracted"
    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
        return "full_page_not_article"
    return None  # Ambiguous: use the ML model

# Try rules first
rule_result = apply_rules(features)
if rule_result:
    print(f"Outcome (rule-based): {rule_result}")
else:
    # Fall back to the model
    prediction = model.predict(X_scaled)[0]
    print(f"Outcome (model): {id_to_label[prediction]}")
```

## Training Data

- **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
- **Labeled samples:** 138,523 (LLM-labeled)
- **Labeling method:** Distillation from large language models
- **Primary teacher:** GPT-5
- **Secondary / adjudicator:** Qwen
- **Train/Val/Test split:** 110,819 / 13,852 / 13,852
- **Class distribution:** ~85% non-articles, ~10% full articles, ~4% errors, ~1% partial articles

## Model Details

- **Algorithm:**
XGBoost (GPU-trained)
- **Features:** 27 hand-crafted features (HTML structure, keyword counts, density metrics)
- **Training:** 500 boosting rounds with early stopping
- **Hardware:** Single GPU (CUDA)
- **Training time:** ~6 minutes

### Features Used

- **Content statistics:** length, whitespace ratio, digit and punctuation ratios
- **Keyword signals:** error messages, article indicators, navigation text
- **HTML structure:** paragraph, link, heading, script, style, and nav tag counts
- **Density metrics:** links/KB, paragraphs/KB, scripts/KB, heading score

## Limitations

- Only analyzes the first 64KB of HTML (important metadata must appear early)
- Labels are generated by LLMs rather than direct human annotation
- Some classes (e.g. `other_failure`) have limited representation
- Optimized primarily for English-language web pages

## Intended Use

**Primary use cases:**

- Quality control for article extraction pipelines
- Monitoring extraction API health and failure modes
- Fast filtering of non-article pages before downstream processing
- Analytics on extraction success and failure rates

**Not suitable for:**

- Language detection
- Content quality assessment
- Paywall detection
- Full content extraction

## Model Card Authors

Allanatrix
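## Appendix: Density Metrics on a Toy Page

As an illustration of the density metrics described above (links/KB, paragraphs/KB, scripts/KB, heading score), here is a minimal standalone sketch that recomputes them on an invented HTML snippet. The snippet, variable names, and tag-matching strings are assumptions for illustration only; they mirror the feature-extraction code in the usage example but are not the model's exact implementation.

```python
# Toy HTML page: 1 h1, 1 h2, 10 paragraphs, 4 links, 1 script tag
html = (
    "<html><body>"
    "<h1>Title</h1><h2>Sub</h2>"
    + "<p>Some paragraph text here.</p>" * 10
    + '<a href="/x">link</a>' * 4
    + "<script>var x = 1;</script>"
    + "</body></html>"
)

# Count opening tags in the lowercased prefix (substring matching,
# as in the usage example; closing tags like </p> do not match "<p")
prefix = html[:64000].lower()
n_p = prefix.count("<p")
n_a = prefix.count("<a ")
n_h1 = prefix.count("<h1")
n_h2 = prefix.count("<h2")
n_h3 = prefix.count("<h3")
n_script = prefix.count("<script")

# Densities are normalized per KB of raw HTML
kb = len(html) / 1024
link_density = n_a / kb if kb > 0 else 0
para_density = n_p / kb if kb > 0 else 0
script_density = n_script / kb if kb > 0 else 0

# Headings are weighted by prominence: h1 counts 3x, h2 counts 2x
heading_score = n_h1 * 3 + n_h2 * 2 + n_h3

print(f"paragraphs={n_p}, links={n_a}, heading_score={heading_score}")
# → paragraphs=10, links=4, heading_score=5
```

With many paragraphs relative to links, `para_density` exceeds `link_density`, which is the kind of signal that pushes a page toward `full_article_extracted` rather than `full_page_not_article`.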