---
language:
- en
license: mit
tags:
- text-classification
- html-analysis
- article-extraction
- xgboost
- web-scraping
datasets:
- Allanatrix/articles
metrics:
- accuracy
- f1
library_name: xgboost
---

# Article Extraction Outcome Classifier

A fast, lightweight classifier that categorizes web article extraction outcomes with ~90% accuracy.

## Model Description

This model predicts whether HTML extraction succeeded, failed, or returned a non-article page. It combines rule-based heuristics for speed with XGBoost for accuracy on ambiguous cases.

## Classes

| Class | Description |
|-------|-------------|
| `full_article_extracted` | Complete article successfully extracted |
| `partial_article_extracted` | Article partially extracted (incomplete) |
| `api_provider_error` | External API/service failure |
| `other_failure` | Low-confidence failure (catch-all) |
| `full_page_not_article` | Page is not an article (nav, homepage, etc.) |

## Performance

~90% accuracy on a large, real-world test set, with strong performance on the dominant classes.

| Class                     | Precision | Recall | F1-score | Support |
| ------------------------- | --------- | ------ | -------- | ------- |
| full_article_extracted    | 0.91      | 0.84   | 0.87     | 1,312   |
| partial_article_extracted | 0.76      | 0.63   | 0.69     | 92      |
| api_provider_error        | 0.95      | 0.93   | 0.94     | 627     |
| other_failure             | 0.41      | 0.28   | 0.33     | 44      |
| full_page_not_article     | 0.92      | 0.97   | 0.94     | 11,821  |
| **Accuracy**              | —         | —      | **0.90** | 13,852  |
| **Macro Avg**             | 0.79      | 0.73   | 0.72     | 13,852  |
| **Weighted Avg**          | 0.90      | 0.90   | 0.90     | 13,852  |

## Usage

Example inference with the saved artifacts:

```python
import numpy as np
import torch  # unpickling the artifacts also requires scikit-learn and xgboost installed

# Load model artifacts; weights_only=False because the checkpoint pickles
# sklearn/XGBoost objects rather than plain tensors (required on torch >= 2.6)
artifacts = torch.load("artifacts.pt", weights_only=False)
scaler = artifacts["scaler"]
model = artifacts["xgb_model"]
id_to_label = artifacts["id_to_label"]

# Extract features (27 numeric features from the HTML prefix)
def extract_features(html: str, max_chars: int = 64000) -> dict:
    prefix = html[:max_chars].lower()
    
    features = {
        "length_chars": len(html),
        "prefix_len": len(prefix),
        "ws_ratio": sum(c.isspace() for c in prefix) / len(prefix) if prefix else 0,
        "digit_ratio": sum(c.isdigit() for c in prefix) / len(prefix) if prefix else 0,
        "punct_ratio": sum(c in ".,;:!?" for c in prefix) / len(prefix) if prefix else 0,
        # Keyword counts
        "cookie": prefix.count("cookie") + prefix.count("consent"),
        "subscribe": prefix.count("subscribe") + prefix.count("newsletter"),
        "legal": prefix.count("privacy policy") + prefix.count("terms of service"),
        "error": prefix.count("error") + prefix.count("timeout") + prefix.count("rate limit"),
        "nav": prefix.count("home") + prefix.count("menu") + prefix.count("navigation"),
        "article_kw": prefix.count("published") + prefix.count("reading time"),
        "meta_article_kw": prefix.count("og:article") + prefix.count("article:published"),
        # Tag counts
        "n_p": prefix.count("<p"),
        "n_a": prefix.count("<a"),
        "n_h1": prefix.count("<h1"),
        "n_h2": prefix.count("<h2"),
        "n_h3": prefix.count("<h3"),
        "n_article": prefix.count("<article"),
        "n_main": prefix.count("<main"),
        "n_time": prefix.count("<time"),
        "n_script": prefix.count("<script"),
        "n_style": prefix.count("<style"),
        "n_nav": prefix.count("<nav"),
    }
    
    # Density features
    kb = len(prefix) / 1000.0
    features["link_density"] = features["n_a"] / kb if kb > 0 else 0
    features["para_density"] = features["n_p"] / kb if kb > 0 else 0
    features["script_density"] = features["n_script"] / kb if kb > 0 else 0
    features["heading_score"] = features["n_h1"] * 3 + features["n_h2"] * 2 + features["n_h3"]
    
    return features

# Predict (html_string is the raw HTML of the page to classify)
features = extract_features(html_string)
NUM_COLS = ["length_chars", "prefix_len", "ws_ratio", "digit_ratio", "punct_ratio",
            "cookie", "subscribe", "legal", "error", "nav", "article_kw", "meta_article_kw",
            "n_p", "n_a", "n_h1", "n_h2", "n_h3", "n_article", "n_main", "n_time",
            "n_script", "n_style", "n_nav", "link_density", "para_density", 
            "script_density", "heading_score"]

X = np.array([features[col] for col in NUM_COLS]).reshape(1, -1).astype(np.float32)
X_scaled = scaler.transform(X)
prediction = model.predict(X_scaled)[0]

print(f"Outcome: {id_to_label[prediction]}")
```

### Optional: Rule-Based Fast Path

For 80%+ of cases, you can skip the model entirely:

```python
def apply_rules(features: dict) -> str | None:
    """Returns class label or None if ambiguous."""
    if features["error"] >= 3:
        return "api_provider_error"
    
    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
        return "full_article_extracted"
    
    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
        return "full_page_not_article"
    
    return None  # Use ML model

# Try rules first
rule_result = apply_rules(features)
if rule_result:
    print(f"Outcome (rule-based): {rule_result}")
else:
    # Fall back to model
    prediction = model.predict(X_scaled)[0]
    print(f"Outcome (model): {id_to_label[prediction]}")
```
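To illustrate how the thresholds behave, here is a self-contained run of the same rule logic on two hand-made feature dictionaries (the feature values are invented for the example):

```python
from __future__ import annotations

def apply_rules(features: dict) -> str | None:
    # Same thresholds as the snippet above, repeated so this runs standalone.
    if features["error"] >= 3:
        return "api_provider_error"
    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
        return "full_article_extracted"
    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
        return "full_page_not_article"
    return None

# Many error keywords in the prefix: an API failure, no model call needed.
noisy_error = {"error": 5, "meta_article_kw": 0, "n_p": 0, "nav": 0, "link_density": 0}
# Weak signals on every rule: ambiguous, so fall through to the XGBoost model.
ambiguous = {"error": 0, "meta_article_kw": 1, "n_p": 8, "nav": 2, "link_density": 5}

print(apply_rules(noisy_error))  # api_provider_error
print(apply_rules(ambiguous))    # None
```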

## Training Data

- **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
- **Labeled samples:** 138,523 (LLM-labeled)
- **Labeling method:** Distillation from large language models
  - **Primary teacher:** GPT-5
  - **Secondary / adjudicator:** Qwen
- **Train/Val/Test split:** 110,819 / 13,852 / 13,852
- **Class distribution:** ~85% non-articles, ~10% full articles, ~4% errors, ~1% partial articles

## Model Details

- **Algorithm:** XGBoost (GPU-trained)
- **Features:** 27 hand-crafted features (HTML structure, keyword counts, density metrics)
- **Training:** 500 boosting rounds with early stopping
- **Hardware:** Single GPU (CUDA)
- **Training time:** ~6 minutes

### Features Used

- **Content statistics:** length, whitespace ratio, digit and punctuation ratios
- **Keyword signals:** error messages, article indicators, navigation text
- **HTML structure:** paragraph, link, heading, script, style, and nav tag counts
- **Density metrics:** links/KB, paragraphs/KB, scripts/KB, heading score
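
The tag counts and density metrics come from plain substring counting on the lowercased HTML prefix; a toy illustration:

```python
html = "<html><p>Hello</p><p class='lead'>World</p><a href='/'>home</a></html>"
prefix = html[:64000].lower()

n_p = prefix.count("<p")   # matches both <p> and <p class=...>
n_a = prefix.count("<a")

kb = len(prefix) / 1000.0                # prefix size in kilobytes
link_density = n_a / kb if kb > 0 else 0  # links per KB of analyzed HTML

print(n_p, n_a)  # 2 1
```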

## Limitations

- Only analyzes the first 64KB of HTML (important metadata must appear early)
- Labels are generated by LLMs rather than direct human annotation
- Some classes (e.g. `other_failure`) have limited representation
- Optimized primarily for English-language web pages

## Intended Use

**Primary use cases:**
- Quality control for article extraction pipelines
- Monitoring extraction API health and failure modes
- Fast filtering of non-article pages before downstream processing
- Analytics on extraction success and failure rates

**Not suitable for:**
- Language detection
- Content quality assessment
- Paywall detection
- Full content extraction

## Model Card Authors

Allanatrix