Allanatrix/Summary_model

+---
+language:
+- en
+license: mit
+tags:
+- text-classification
+- html-analysis
+- article-extraction
+- xgboost
+- web-scraping
+datasets:
+- Allanatrix/articles
+metrics:
+- accuracy
+- f1
+library_name: xgboost
+---
+# Article Extraction Outcome Classifier
+A fast, lightweight classifier that categorizes web article extraction outcomes, reaching 99.99% accuracy on its held-out test set.
+## Model Description
+This model predicts whether HTML extraction succeeded, failed, or returned a non-article page. It combines rule-based heuristics for speed with XGBoost for accuracy on ambiguous cases.
+**Key Features:**
+- Processes only the first 64,000 characters (about 64 KB) of HTML for speed
+- 99.99% accuracy on the test set
+- Rule-based fast path handles 80%+ of cases without running the model
+- Only 26 hand-crafted features (no large embeddings)
+## Classes
+| Class | Description |
+|-------|-------------|
+| `full_article_extracted` | Complete article successfully extracted |
+| `partial_article_extracted` | Article partially extracted (incomplete) |
+| `api_provider_error` | External API/service failure |
+| `other_failure` | Low-confidence failure (catch-all) |
+| `full_page_not_article` | Page is not an article (nav, homepage, etc.) |
+## Performance
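+**Test Set Results (13,852 samples):** The macro F1 averages all five classes, including `other_failure`, which has no test samples and so contributes an F1 of 0; every class with support scores 0.989 or higher.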
+```
+Overall Accuracy: 99.99%
+Macro F1: 0.7976
+                           precision    recall  f1-score   support
+   full_article_extracted     0.9985    1.0000    0.9992      1312
+partial_article_extracted     1.0000    0.9783    0.9890        92
+       api_provider_error     1.0000    1.0000    1.0000       627
+            other_failure     0.0000    0.0000    0.0000         0
+    full_page_not_article     1.0000    1.0000    1.0000     11821
+```
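The gap between the 99.99% accuracy and the 0.7976 macro F1 comes from how macro averaging treats the empty `other_failure` class. A minimal sketch of that arithmetic, using only the per-class F1 scores and supports from the report above (the variable names are illustrative):

```python
# Per-class (F1, support) pairs copied from the report above.
report = {
    "full_article_extracted": (0.9992, 1312),
    "partial_article_extracted": (0.9890, 92),
    "api_provider_error": (1.0000, 627),
    "other_failure": (0.0000, 0),
    "full_page_not_article": (1.0000, 11821),
}

f1_all = [f1 for f1, _ in report.values()]
f1_supported = [f1 for f1, n in report.values() if n > 0]

total = sum(n for _, n in report.values())
macro_all = sum(f1_all) / len(f1_all)                    # unweighted mean over 5 classes
macro_supported = sum(f1_supported) / len(f1_supported)  # mean over the 4 classes with samples

print(f"test samples: {total}")
print(f"macro F1 over all classes: {macro_all:.4f}")
print(f"macro F1 over classes with support: {macro_supported:.3f}")
```

Averaged over only the four classes that appear in the test set, the macro F1 is about 0.997, which is more in line with the reported accuracy.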
+## Usage
+```python
+import numpy as np
+import torch
+from sklearn.preprocessing import StandardScaler  # scikit-learn must be installed to unpickle the scaler
+
+# Load model artifacts. The file holds pickled Python objects, so PyTorch >= 2.6
+# needs weights_only=False; only load artifacts from a source you trust.
+artifacts = torch.load("artifacts.pt", weights_only=False)
+scaler = artifacts["scaler"]
+model = artifacts["xgb_model"]
+id_to_label = artifacts["id_to_label"]
+
+
+# Extract features from the HTML prefix (first max_chars characters)
+def extract_features(html: str, max_chars: int = 64000) -> dict:
+    prefix = html[:max_chars].lower()
+    features = {
+        "length_chars": len(html),
+        "prefix_len": len(prefix),
+        "ws_ratio": sum(c.isspace() for c in prefix) / len(prefix) if prefix else 0,
+        "digit_ratio": sum(c.isdigit() for c in prefix) / len(prefix) if prefix else 0,
+        "punct_ratio": sum(c in ".,;:!?" for c in prefix) / len(prefix) if prefix else 0,
+        # Keyword counts
+        "cookie": prefix.count("cookie") + prefix.count("consent"),
+        "subscribe": prefix.count("subscribe") + prefix.count("newsletter"),
+        "legal": prefix.count("privacy policy") + prefix.count("terms of service"),
+        "error": prefix.count("error") + prefix.count("timeout") + prefix.count("rate limit"),
+        "nav": prefix.count("home") + prefix.count("menu") + prefix.count("navigation"),
+        "article_kw": prefix.count("published") + prefix.count("reading time"),
+        "meta_article_kw": prefix.count("og:article") + prefix.count("article:published"),
+        # Tag counts (plain substring matches, so "<a" also counts "<article" and "<p" also counts "<pre")
+        "n_p": prefix.count("<p"),
+        "n_a": prefix.count("<a"),
+        "n_h1": prefix.count("<h1"),
+        "n_h2": prefix.count("<h2"),
+        "n_h3": prefix.count("<h3"),
+        "n_article": prefix.count("<article"),
+        "n_main": prefix.count("<main"),
+        "n_time": prefix.count("<time"),
+        "n_script": prefix.count("<script"),
+        "n_style": prefix.count("<style"),
+        "n_nav": prefix.count("<nav"),
+    }
+    # Density features (counts per 1,000 characters of prefix)
+    kb = len(prefix) / 1000.0
+    features["link_density"] = features["n_a"] / kb if kb > 0 else 0
+    features["para_density"] = features["n_p"] / kb if kb > 0 else 0
+    features["script_density"] = features["n_script"] / kb if kb > 0 else 0
+    features["heading_score"] = features["n_h1"] * 3 + features["n_h2"] * 2 + features["n_h3"]
+    return features
+
+
+# Predict
+html_string = "<html>...</html>"  # placeholder: replace with the page HTML to classify
+features = extract_features(html_string)
+NUM_COLS = ["length_chars", "prefix_len", "ws_ratio", "digit_ratio", "punct_ratio",
+            "cookie", "subscribe", "legal", "error", "nav", "article_kw", "meta_article_kw",
+            "n_p", "n_a", "n_h1", "n_h2", "n_h3", "n_article", "n_main", "n_time",
+            "n_script", "n_style", "n_nav", "link_density", "para_density",
+            "script_density", "heading_score"]
+X = np.array([features[col] for col in NUM_COLS]).reshape(1, -1).astype(np.float32)
+X_scaled = scaler.transform(X)
+prediction = int(model.predict(X_scaled)[0])  # plain int for the label lookup
+print(f"Outcome: {id_to_label[prediction]}")
+```
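The card describes 26 features, while the `NUM_COLS` list above has 27 entries, and it is not stated which count the shipped scaler was fitted on. A guard along these lines can surface a mismatch before `scaler.transform` is called. This is a sketch only: `check_feature_count` is a hypothetical helper, and the `SimpleNamespace` object stands in for the loaded scaler so the example runs without `artifacts.pt`. Fitted scikit-learn estimators expose the fitted width as `n_features_in_`.

```python
from types import SimpleNamespace

# Column order copied from NUM_COLS in the usage example above.
NUM_COLS = ["length_chars", "prefix_len", "ws_ratio", "digit_ratio", "punct_ratio",
            "cookie", "subscribe", "legal", "error", "nav", "article_kw", "meta_article_kw",
            "n_p", "n_a", "n_h1", "n_h2", "n_h3", "n_article", "n_main", "n_time",
            "n_script", "n_style", "n_nav", "link_density", "para_density",
            "script_density", "heading_score"]


def check_feature_count(columns, scaler) -> int:
    """Raise ValueError if the scaler was fitted on a different number of features."""
    expected = getattr(scaler, "n_features_in_", None)  # absent on unfitted or very old scalers
    if expected is not None and expected != len(columns):
        raise ValueError(f"scaler expects {expected} features, got {len(columns)} columns")
    return len(columns)


# Stand-in for artifacts["scaler"]; the value 27 is an assumption for illustration.
fake_scaler = SimpleNamespace(n_features_in_=27)
print(check_feature_count(NUM_COLS, fake_scaler))
```

With the real artifacts, passing `artifacts["scaler"]` instead of the stand-in shows which feature count the model actually expects.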
+### Optional: Rule-Based Fast Path
+For 80%+ of cases, you can skip the model entirely:
+```python
+from typing import Optional
+
+
+def apply_rules(features: dict) -> Optional[str]:
+    """Return a class label, or None if the page is ambiguous."""
+    # Rules are checked in order; the first match wins.
+    if features["error"] >= 3:
+        return "api_provider_error"
+    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
+        return "full_article_extracted"
+    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
+        return "full_page_not_article"
+    return None  # Ambiguous: use the ML model
+
+
+# Try rules first
+rule_result = apply_rules(features)
+if rule_result is not None:
+    print(f"Outcome (rule-based): {rule_result}")
+else:
+    # Fall back to the model; scaling and prediction only run on this path
+    X_scaled = scaler.transform(X)
+    prediction = int(model.predict(X_scaled)[0])
+    print(f"Outcome (model): {id_to_label[prediction]}")
+```
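To see how the three rules interact, the sketch below runs them on a few hand-written feature dictionaries. The feature values are hypothetical and include only the keys the rules read; the rule logic is copied from the block above.

```python
from typing import Optional


def apply_rules(features: dict) -> Optional[str]:
    """Same three rules as above, checked in the same order."""
    if features["error"] >= 3:
        return "api_provider_error"
    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
        return "full_article_extracted"
    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
        return "full_page_not_article"
    return None


# Hypothetical feature values for illustration.
article = {"error": 0, "meta_article_kw": 2, "n_p": 14, "nav": 1, "link_density": 3.0}
article_about_errors = {**article, "error": 4}  # same page, but "error" appears 4 times
link_hub = {"error": 0, "meta_article_kw": 0, "n_p": 2, "nav": 6, "link_density": 25.0}
ambiguous = {"error": 1, "meta_article_kw": 0, "n_p": 7, "nav": 2, "link_density": 8.0}

cases = [
    ("article", article),
    ("article_about_errors", article_about_errors),
    ("link_hub", link_hub),
    ("ambiguous", ambiguous),
]
for name, feats in cases:
    print(name, "->", apply_rules(feats))
```

Because the error rule is checked first, a page with article metadata whose prefix mentions "error", "timeout", or "rate limit" three or more times is labeled `api_provider_error` by these rules. The `ambiguous` case returns `None` and would go to the model.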
+## Training Data
+- **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
+- **Labeled samples:** 138,523 (weak labels from heuristics)
+- **Train/Val/Test split:** 110,819 / 13,852 / 13,852
+- **Class distribution:** 85% non-articles, 10% full articles, 4% errors, 1% partial
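The split sizes and class shares above can be cross-checked with a few lines of arithmetic. This sketch uses only numbers stated in this card; the per-class counts are the test-set supports from the Performance report, so the shares describe the test split rather than the full labeled set.

```python
labeled = 138_523
splits = {"train": 110_819, "val": 13_852, "test": 13_852}

# The three splits should account for every labeled sample.
assert sum(splits.values()) == labeled
for name, n in splits.items():
    print(f"{name}: {n} samples ({n / labeled:.1%} of labeled data)")

# Test-set support per class, copied from the Performance report.
test_support = {
    "full_page_not_article": 11_821,
    "full_article_extracted": 1_312,
    "api_provider_error": 627,
    "partial_article_extracted": 92,
    "other_failure": 0,
}
assert sum(test_support.values()) == splits["test"]
for label, n in test_support.items():
    print(f"{label}: {n / splits['test']:.1%} of test set")
```

The splits work out to roughly 80/10/10, and the test-set shares are close to the stated class distribution.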
+## Model Details
+- **Algorithm:** XGBoost (GPU-trained)
+- **Features:** 26 hand-crafted features (HTML structure, keywords, densities)
+- **Training:** Up to 500 boosting rounds with early stopping
+- **Hardware:** Single GPU (CUDA)
+- **Training time:** ~6 minutes
+### Features Used
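+- Content: total length, prefix length, and whitespace, digit, and punctuation ratios
+- Keywords: cookie/consent, subscribe/newsletter, legal text, error messages, navigation text, article indicators (body text and meta tags)
+- Structure: counts of paragraph, link, heading (h1-h3), article, main, time, script, style, and nav tags
+- Densities: links, paragraphs, and scripts per 1,000 characters of prefix, plus a weighted heading score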
+## Limitations
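+- Only analyzes the first 64,000 characters of HTML (meta tags must appear early)
+- Trained on weak labels (heuristic-based, not human-annotated); the test set uses the same weak labels, so the reported accuracy measures agreement with those heuristics
+- `other_failure` class has minimal representation in training data and no samples in the test set, so its performance is unmeasured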
+- Optimized for English-language web pages
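The prefix limitation is easy to demonstrate. In this sketch, `count_meta_article_kw` is a hypothetical helper that repeats the `meta_article_kw` computation from the usage example, and the two HTML strings are synthetic: one places an article meta tag near the start, the other places it after 70,000 filler characters.

```python
MAX_CHARS = 64_000  # same cutoff as extract_features in the usage example


def count_meta_article_kw(html: str, max_chars: int = MAX_CHARS) -> int:
    """Count article meta keywords in the lowercased prefix, as the usage code does."""
    prefix = html[:max_chars].lower()
    return prefix.count("og:article") + prefix.count("article:published")


# Synthetic pages; the meta tag content is made up for illustration.
meta = '<meta property="article:published_time" content="2024-01-01">'
early = "<html><head>" + meta + "</head>" + "x" * 70_000
late = "<html><head>" + "x" * 70_000 + meta + "</head>"

print(count_meta_article_kw(early))  # 1: the tag falls inside the prefix
print(count_meta_article_kw(late))   # 0: the tag lies beyond the cutoff
```

Pages with large inline scripts or styles ahead of their metadata can therefore lose the article signals that the features and rules depend on.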
+## Intended Use
+**Primary use cases:**
+- Quality control for article extraction pipelines
+- Monitoring extraction API health (error detection)
+- Filtering non-article pages before processing
+- Analytics on extraction success rates
+**Not suitable for:**
+- Language detection
+- Content quality assessment
+- Paywall detection
+- Full content extraction
+## Model Card Authors
+Allanatrix