---
language:
- en
license: mit
tags:
- text-classification
- html-analysis
- article-extraction
- xgboost
- web-scraping
datasets:
- Allanatrix/articles
metrics:
- accuracy
- f1
library_name: xgboost
---

# Article Extraction Outcome Classifier

A fast, lightweight classifier that categorizes web article extraction outcomes with 99.99% accuracy.

## Model Description

This model predicts whether HTML extraction succeeded, failed, or returned a non-article page. It combines rule-based heuristics for speed with XGBoost for accuracy on ambiguous cases.

**Key Features:**
- Processes only the first 64 KB of HTML for speed
- 99.99% accuracy on the test set
- Rule-based fast path handles 80%+ of cases instantly
- Only 27 hand-crafted features (no large embeddings)

## Classes

| Class | Description |
|-------|-------------|
| `full_article_extracted` | Complete article successfully extracted |
| `partial_article_extracted` | Article partially extracted (incomplete) |
| `api_provider_error` | External API/service failure |
| `other_failure` | Low-confidence failure (catch-all) |
| `full_page_not_article` | Page is not an article (nav, homepage, etc.) |

## Performance

**Test Set Results (13,852 samples):**

```
Overall Accuracy: 99.99%
Macro F1:         0.7976

                           precision  recall  f1-score  support
full_article_extracted        0.9985  1.0000    0.9992     1312
partial_article_extracted     1.0000  0.9783    0.9890       92
api_provider_error            1.0000  1.0000    1.0000      627
other_failure                 0.0000  0.0000    0.0000        0
full_page_not_article         1.0000  1.0000    1.0000    11821
```

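The macro F1 is the unweighted mean of the five per-class F1 scores, which is why the zero-support `other_failure` row pulls it down to 0.7976 despite the near-perfect overall accuracy:

```python
# Per-class F1 scores from the classification report above
f1_scores = {
    "full_article_extracted": 0.9992,
    "partial_article_extracted": 0.9890,
    "api_provider_error": 1.0000,
    "other_failure": 0.0000,  # zero test support, so F1 is reported as 0
    "full_page_not_article": 1.0000,
}

# Macro F1: unweighted mean across all classes
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 4))  # 0.7976
```
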
## Usage

```python
import numpy as np
import torch

# Load model artifacts (StandardScaler, XGBoost model, and label mapping).
# Note: on PyTorch >= 2.6, torch.load may require weights_only=False to
# deserialize the pickled scikit-learn scaler.
artifacts = torch.load("artifacts.pt")
scaler = artifacts["scaler"]
model = artifacts["xgb_model"]
id_to_label = artifacts["id_to_label"]

# Extract 27 hand-crafted features from the first 64 KB of HTML
def extract_features(html: str, max_chars: int = 64000) -> dict:
    prefix = html[:max_chars].lower()

    features = {
        "length_chars": len(html),
        "prefix_len": len(prefix),
        "ws_ratio": sum(c.isspace() for c in prefix) / len(prefix) if prefix else 0,
        "digit_ratio": sum(c.isdigit() for c in prefix) / len(prefix) if prefix else 0,
        "punct_ratio": sum(c in ".,;:!?" for c in prefix) / len(prefix) if prefix else 0,
        # Keyword counts
        "cookie": prefix.count("cookie") + prefix.count("consent"),
        "subscribe": prefix.count("subscribe") + prefix.count("newsletter"),
        "legal": prefix.count("privacy policy") + prefix.count("terms of service"),
        "error": prefix.count("error") + prefix.count("timeout") + prefix.count("rate limit"),
        "nav": prefix.count("home") + prefix.count("menu") + prefix.count("navigation"),
        "article_kw": prefix.count("published") + prefix.count("reading time"),
        "meta_article_kw": prefix.count("og:article") + prefix.count("article:published"),
        # Tag counts
        "n_p": prefix.count("<p"),
        "n_a": prefix.count("<a"),
        "n_h1": prefix.count("<h1"),
        "n_h2": prefix.count("<h2"),
        "n_h3": prefix.count("<h3"),
        "n_article": prefix.count("<article"),
        "n_main": prefix.count("<main"),
        "n_time": prefix.count("<time"),
        "n_script": prefix.count("<script"),
        "n_style": prefix.count("<style"),
        "n_nav": prefix.count("<nav"),
    }

    # Density features (per kilobyte of prefix)
    kb = len(prefix) / 1000.0
    features["link_density"] = features["n_a"] / kb if kb > 0 else 0
    features["para_density"] = features["n_p"] / kb if kb > 0 else 0
    features["script_density"] = features["n_script"] / kb if kb > 0 else 0
    features["heading_score"] = features["n_h1"] * 3 + features["n_h2"] * 2 + features["n_h3"]

    return features

# Predict (html_string is the raw HTML of the page you want to classify)
features = extract_features(html_string)
NUM_COLS = ["length_chars", "prefix_len", "ws_ratio", "digit_ratio", "punct_ratio",
            "cookie", "subscribe", "legal", "error", "nav", "article_kw", "meta_article_kw",
            "n_p", "n_a", "n_h1", "n_h2", "n_h3", "n_article", "n_main", "n_time",
            "n_script", "n_style", "n_nav", "link_density", "para_density",
            "script_density", "heading_score"]

X = np.array([features[col] for col in NUM_COLS]).reshape(1, -1).astype(np.float32)
X_scaled = scaler.transform(X)
prediction = model.predict(X_scaled)[0]

print(f"Outcome: {id_to_label[prediction]}")
```

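For intuition, the density features normalize raw tag counts by prefix size, so a link-heavy navigation page scores very differently from a paragraph-heavy article. A minimal, self-contained illustration (the HTML fragment and its padding are made up for demonstration):

```python
# Hypothetical link-heavy HTML fragment (e.g. a nav bar), padded to 1 KB
html = ('<nav>' + '<a href="#">link</a>' * 40 + '</nav>').ljust(1000)
prefix = html[:64000].lower()

n_a = prefix.count("<a")   # anchor tags
n_p = prefix.count("<p")   # paragraph tags
kb = len(prefix) / 1000.0  # prefix size in kilobytes

link_density = n_a / kb if kb > 0 else 0
para_density = n_p / kb if kb > 0 else 0

print(link_density, para_density)  # 40.0 0.0
```

A page like this lands well above the `link_density > 20` threshold that the rule-based fast path below uses to flag non-article pages.
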
### Optional: Rule-Based Fast Path

For 80%+ of cases, you can skip the model entirely:

```python
def apply_rules(features: dict) -> str | None:
    """Return a class label, or None if the case is ambiguous."""
    if features["error"] >= 3:
        return "api_provider_error"

    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
        return "full_article_extracted"

    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
        return "full_page_not_article"

    return None  # Ambiguous: use the ML model

# Try rules first
rule_result = apply_rules(features)
if rule_result is not None:
    print(f"Outcome (rule-based): {rule_result}")
else:
    # Fall back to the model
    prediction = model.predict(X_scaled)[0]
    print(f"Outcome (model): {id_to_label[prediction]}")
```

## Training Data

- **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
- **Labeled samples:** 138,523 (weak labels from heuristics)
- **Train/Val/Test split:** 110,819 / 13,852 / 13,852
- **Class distribution:** 85% non-articles, 10% full articles, 4% errors, 1% partial

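The split sizes above add up to the labeled-sample count and correspond to a standard 80/10/10 split:

```python
# Split sizes from the training-data summary above
train, val, test = 110_819, 13_852, 13_852
total = train + val + test

print(total)                    # 138523
print(round(train / total, 2))  # 0.8
```
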
## Model Details

- **Algorithm:** XGBoost (GPU-trained)
- **Features:** 27 hand-crafted features (HTML structure, keywords, densities)
- **Training:** 500 boosting rounds with early stopping
- **Hardware:** Single GPU (CUDA)
- **Training time:** ~6 minutes

### Features Used

- Content: length, whitespace ratio, digit/punctuation ratios
- Keywords: error messages, article indicators, navigation text
- Structure: paragraph, link, heading, and script tag counts
- Densities: links/KB, paragraphs/KB, scripts/KB

## Limitations

- Only analyzes the first 64 KB of HTML (meta tags must appear early)
- Trained on weak labels (heuristic-based, not human-annotated)
- `other_failure` class has minimal representation in the training data
- Optimized for English-language web pages

## Intended Use

**Primary use cases:**
- Quality control for article extraction pipelines
- Monitoring extraction API health (error detection)
- Filtering non-article pages before processing
- Analytics on extraction success rates

**Not suitable for:**
- Language detection
- Content quality assessment
- Paywall detection
- Full content extraction

## Model Card Authors

Allanatrix