---
language:
- en
license: mit
tags:
- text-classification
- html-analysis
- article-extraction
- xgboost
- web-scraping
datasets:
- Allanatrix/articles
metrics:
- accuracy
- f1
library_name: xgboost
---

# Article Extraction Outcome Classifier

A fast, lightweight classifier that categorizes web article extraction outcomes with ~90% accuracy.

## Model Description

This model predicts whether HTML extraction succeeded, failed, or returned a non-article page. It combines rule-based heuristics for speed with XGBoost for accuracy on ambiguous cases.

## Classes

| Class | Description |
|-------|-------------|
| `full_article_extracted` | Complete article successfully extracted |
| `partial_article_extracted` | Article partially extracted (incomplete) |
| `api_provider_error` | External API/service failure |
| `other_failure` | Low-confidence failure (catch-all) |
| `full_page_not_article` | Page is not an article (nav, homepage, etc.) |

## Performance

~90% accuracy on a 13,852-sample real-world test set, with strong performance on the dominant classes.

| Class | Precision | Recall | F1-score | Support |
| ------------------------- | --------- | ------ | -------- | ------- |
| full_article_extracted | 0.91 | 0.84 | 0.87 | 1,312 |
| partial_article_extracted | 0.76 | 0.63 | 0.69 | 92 |
| api_provider_error | 0.95 | 0.93 | 0.94 | 627 |
| other_failure | 0.41 | 0.28 | 0.33 | 44 |
| full_page_not_article | 0.92 | 0.97 | 0.94 | 11,821 |
| **Accuracy** | — | — | **0.90** | 13,852 |
| **Macro Avg** | 0.79 | 0.73 | 0.72 | 13,852 |
| **Weighted Avg** | 0.90 | 0.90 | 0.90 | 13,852 |

## Usage

Load the artifacts, featurize the HTML, and predict:

```python
import numpy as np
import torch

# Load model artifacts. weights_only=False is required on recent PyTorch
# (>=2.6), since the file stores sklearn/XGBoost objects, not just tensors.
artifacts = torch.load("artifacts.pt", weights_only=False)
scaler = artifacts["scaler"]
model = artifacts["xgb_model"]
id_to_label = artifacts["id_to_label"]

# Extract features (27 features from the HTML prefix)
def extract_features(html: str, max_chars: int = 64000) -> dict:
    prefix = html[:max_chars].lower()

    features = {
        "length_chars": len(html),
        "prefix_len": len(prefix),
        "ws_ratio": sum(c.isspace() for c in prefix) / len(prefix) if prefix else 0,
        "digit_ratio": sum(c.isdigit() for c in prefix) / len(prefix) if prefix else 0,
        "punct_ratio": sum(c in ".,;:!?" for c in prefix) / len(prefix) if prefix else 0,
        # Keyword counts
        "cookie": prefix.count("cookie") + prefix.count("consent"),
        "subscribe": prefix.count("subscribe") + prefix.count("newsletter"),
        "legal": prefix.count("privacy policy") + prefix.count("terms of service"),
        "error": prefix.count("error") + prefix.count("timeout") + prefix.count("rate limit"),
        "nav": prefix.count("home") + prefix.count("menu") + prefix.count("navigation"),
        "article_kw": prefix.count("published") + prefix.count("reading time"),
        "meta_article_kw": prefix.count("og:article") + prefix.count("article:published"),
        # Tag counts
        "n_p": prefix.count("<p"),
        "n_a": prefix.count("<a"),
        "n_h1": prefix.count("<h1"),
        "n_h2": prefix.count("<h2"),
        "n_h3": prefix.count("<h3"),
        "n_article": prefix.count("<article"),
        "n_main": prefix.count("<main"),
        "n_time": prefix.count("<time"),
        "n_script": prefix.count("<script"),
        "n_style": prefix.count("<style"),
        "n_nav": prefix.count("<nav"),
    }

    # Density features (per KB of prefix)
    kb = len(prefix) / 1000.0
    features["link_density"] = features["n_a"] / kb if kb > 0 else 0
    features["para_density"] = features["n_p"] / kb if kb > 0 else 0
    features["script_density"] = features["n_script"] / kb if kb > 0 else 0
    features["heading_score"] = features["n_h1"] * 3 + features["n_h2"] * 2 + features["n_h3"]

    return features

# Predict
features = extract_features(html_string)
NUM_COLS = ["length_chars", "prefix_len", "ws_ratio", "digit_ratio", "punct_ratio",
            "cookie", "subscribe", "legal", "error", "nav", "article_kw", "meta_article_kw",
            "n_p", "n_a", "n_h1", "n_h2", "n_h3", "n_article", "n_main", "n_time",
            "n_script", "n_style", "n_nav", "link_density", "para_density",
            "script_density", "heading_score"]

X = np.array([features[col] for col in NUM_COLS]).reshape(1, -1).astype(np.float32)
X_scaled = scaler.transform(X)
prediction = model.predict(X_scaled)[0]

print(f"Outcome: {id_to_label[prediction]}")
```

### Optional: Rule-Based Fast Path

For 80%+ of cases, you can skip the model entirely:

```python
def apply_rules(features: dict) -> str | None:
    """Return a class label, or None if the case is ambiguous."""
    if features["error"] >= 3:
        return "api_provider_error"

    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
        return "full_article_extracted"

    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
        return "full_page_not_article"

    return None  # Fall through to the ML model

# Try rules first
rule_result = apply_rules(features)
if rule_result:
    print(f"Outcome (rule-based): {rule_result}")
else:
    # Fall back to model
    prediction = model.predict(X_scaled)[0]
    print(f"Outcome (model): {id_to_label[prediction]}")
```
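
Both paths can be wired into a single reusable entry point. A minimal sketch: the model call is injected as a callable so the fast path stays dependency-free (`classify` and `model_predict` are illustrative names, not part of the released artifacts; the rules are the same thresholds as above):

```python
from typing import Callable, Optional

def apply_rules(features: dict) -> Optional[str]:
    """Rule-based fast path: return a label, or None if ambiguous."""
    if features["error"] >= 3:
        return "api_provider_error"
    if features["meta_article_kw"] >= 2 and features["n_p"] >= 10:
        return "full_article_extracted"
    if features["nav"] >= 5 and features["n_p"] < 5 and features["link_density"] > 20:
        return "full_page_not_article"
    return None

def classify(features: dict, model_predict: Callable[[dict], str]) -> tuple[str, str]:
    """Try the rules first; call the injected model only on ambiguous cases."""
    label = apply_rules(features)
    if label is not None:
        return label, "rules"
    return model_predict(features), "model"

# Example: an obvious API error never touches the model
features = {"error": 5, "meta_article_kw": 0, "n_p": 0, "nav": 0, "link_density": 0.0}
label, source = classify(features, model_predict=lambda f: "other_failure")
print(label, source)  # api_provider_error rules
```

Returning the decision source alongside the label makes it easy to monitor what fraction of traffic the fast path actually handles.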

## Training Data

- **Dataset:** `Allanatrix/articles` (194,183 HTML pages)
- **Labeled samples:** 138,523 (LLM-labeled)
- **Labeling method:** Distillation from large language models
  - **Primary teacher:** GPT-5
  - **Secondary / adjudicator:** Qwen
- **Train/Val/Test split:** 110,819 / 13,852 / 13,852 (80/10/10)
- **Class distribution:** ~85% non-articles, ~10% full articles, ~4% errors, ~1% partial articles
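
With class skew this strong (~1% partial articles), an 80/10/10 split should be stratified so rare classes keep the same proportions in every split. A sketch with scikit-learn, using synthetic labels that mimic the reported distribution (the actual split code was not released, so this is an assumption about method, not a reproduction):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Illustrative labels matching the reported class skew
labels = rng.choice(
    ["full_page_not_article", "full_article_extracted",
     "api_provider_error", "partial_article_extracted"],
    size=10_000,
    p=[0.85, 0.10, 0.04, 0.01],
)
indices = np.arange(len(labels))

# 80% train, then split the remaining 20% evenly into val/test
train_idx, hold_idx = train_test_split(indices, test_size=0.2,
                                       stratify=labels, random_state=42)
val_idx, test_idx = train_test_split(hold_idx, test_size=0.5,
                                     stratify=labels[hold_idx], random_state=42)
print(len(train_idx), len(val_idx), len(test_idx))  # 8000 1000 1000
```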

## Model Details

- **Algorithm:** XGBoost (GPU-trained)
- **Features:** 27 hand-crafted features (HTML structure, keyword counts, density metrics)
- **Training:** 500 boosting rounds with early stopping
- **Hardware:** Single GPU (CUDA)
- **Training time:** ~6 minutes

### Features Used

- **Content statistics:** length, whitespace ratio, digit and punctuation ratios
- **Keyword signals:** error messages, article indicators, navigation text
- **HTML structure:** paragraph, link, heading, script, style, and nav tag counts
- **Density metrics:** links/KB, paragraphs/KB, scripts/KB, heading score

## Limitations

- Only analyzes the first 64 KB of HTML (important metadata must appear early)
- Labels are generated by LLMs rather than direct human annotation
- Some classes (e.g. `other_failure`) have limited representation
- Optimized primarily for English-language web pages

## Intended Use

**Primary use cases:**

- Quality control for article extraction pipelines
- Monitoring extraction API health and failure modes
- Fast filtering of non-article pages before downstream processing
- Analytics on extraction success and failure rates

**Not suitable for:**

- Language detection
- Content quality assessment
- Paywall detection
- Full content extraction

## Model Card Authors

Allanatrix