justext-classifier

The learned paragraph classifier for the improved jusText fork — a boilerplate-removal tool that extracts the main content from an HTML page and drops navigation, sidebars, footers, and other chrome.

This repo hosts the highest-quality tier: a scikit-learn paragraph classifier whose features are stacked with a fastText keep-probability model. The fork auto-downloads it on first use; you normally don't fetch it by hand.

Files

| File | Size | What it is |
| --- | --- | --- |
| `general-ftstack.joblib` | ~9 MB | scikit-learn classifier (RandomForest over structural + text features) |
| `general_ft.bin` | ~770 MB | fastText char/word-ngram model providing the stacked keep-probability feature |

Quality (general dev set: token ROUGE-L F1 / char Levenshtein similarity)

| Tier | F1 | Lev | Footprint |
| --- | --- | --- | --- |
| heuristic (no model) | 0.821 | 0.741 | none |
| bundled 3 MB sklearn | 0.866 | 0.795 | ships in the wheel |
| this model (fastText stack) | 0.886 | 0.823 | this repo (~780 MB) |

For reference, stock upstream jusText scores ~0.76 on the same set; the fork's structural fixes (forum/FAQ/comment role-transforms, code formatting, URL/mojibake repair, …) lift every tier well above that even before the learned classifier.
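As a rough illustration of the first metric in the table above, the sketch below computes a token-level ROUGE-L F1 from the longest common subsequence (LCS) of two token lists. It assumes plain whitespace tokenization; the benchmark's actual tokenizer and normalization are not described here, so this sketch may not reproduce the table's numbers. The function names are illustrative and not part of the fork.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted, reference):
    """Token ROUGE-L F1 between an extracted text and a reference text.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("a b c d", "a c d e")  # LCS is "a c d", so P = R = 0.75
```

An extraction that keeps leftover navigation text lowers precision, while one that drops real content lowers recall; F1 penalizes both.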

Usage

Install the fork with the fastText extra, then just call justext — the model is fetched and cached (~/.cache/justext) on first use:

pip install "jusText[fasttext] @ git+https://github.com/XenonMolecule/jusText"

import justext

# `html` is the HTML source of the page you want to clean, fetched beforehand.
stoplist = justext.get_stoplist("English")
paragraphs = justext.justext(html, stoplist)          # auto-uses this model
content = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)

Fetch explicitly (e.g. to pre-warm the cache):

import justext
joblib_path, fasttext_path = justext.download_fasttext()   # downloads both files from this repo

Tier / behaviour knobs (environment variables)

| Variable | Effect |
| --- | --- |
| `JUSTEXT_MODEL` | `fasttext` \| `sklearn` \| `heuristic` \| `auto` (default) |
| `JUSTEXT_NO_DOWNLOAD` | set to skip the download and use the bundled 3 MB model |
| `JUSTEXT_HF_REPO` | point at a different repo (default `MichaelR207/justext-classifier`) |
| `JUSTEXT_CACHE` | override the download cache directory |
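The variables above can also be set from Python. The helper below is a minimal sketch, not part of the fork: `configure_justext` is a hypothetical name, the value `"1"` for `JUSTEXT_NO_DOWNLOAD` is an assumption (the table only says the variable must be set), and it is assumed that the variables should be in place before justext is imported or first called.

```python
import os

ALLOWED_MODELS = {"fasttext", "sklearn", "heuristic", "auto"}


def configure_justext(model="auto", no_download=False, cache_dir=None,
                      hf_repo=None, env=None):
    """Write the jusText tier settings into an environment mapping.

    Hypothetical helper: by default it mutates os.environ, but any dict
    can be passed via `env`, which makes the behaviour easy to inspect.
    """
    env = os.environ if env is None else env
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown JUSTEXT_MODEL value: {model!r}")
    env["JUSTEXT_MODEL"] = model
    if no_download:
        # Assumption: any non-empty value counts as "set".
        env["JUSTEXT_NO_DOWNLOAD"] = "1"
    else:
        env.pop("JUSTEXT_NO_DOWNLOAD", None)
    if cache_dir is not None:
        env["JUSTEXT_CACHE"] = str(cache_dir)
    if hf_repo is not None:
        env["JUSTEXT_HF_REPO"] = hf_repo
    return env


# Example: force the bundled sklearn tier and never touch the network.
settings = configure_justext(model="sklearn", no_download=True,
                             cache_dir="/tmp/jt-cache", env={})
```

Calling `configure_justext(...)` without `env` applies the same settings to the current process, after which `import justext` would presumably pick them up.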

If fasttext isn't installed, or the download fails, the fork degrades gracefully to the bundled 3 MB model and then to the heuristic classifier — it always works offline.
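The fallback order described above follows a common "first loader that succeeds" pattern. The sketch below illustrates that pattern only; it is not the fork's code, and the three loader functions are hypothetical stand-ins that simulate a missing fastText install.

```python
def first_available(loaders):
    """Return (tier_name, model) from the first loader that does not raise."""
    failures = []
    for name, load in loaders:
        try:
            return name, load()
        except Exception as exc:  # any failure just moves on to the next tier
            failures.append(f"{name}: {exc!r}")
    raise RuntimeError("no tier could be loaded: " + "; ".join(failures))


# Hypothetical loaders, ordered from highest to lowest quality.
def load_fasttext_stack():
    raise ImportError("fasttext is not installed")  # simulated failure


def load_bundled_sklearn():
    return "bundled-sklearn-model"


def load_heuristic():
    return "heuristic-rules"


tier, model = first_available([
    ("fasttext", load_fasttext_stack),
    ("sklearn", load_bundled_sklearn),
    ("heuristic", load_heuristic),
])
print(tier)  # prints "sklearn" because the fastText loader failed
```

Keeping a loader that needs no external files at the end of the list is what makes this kind of chain safe to use offline.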

How it was trained

Structural classifier: RandomForest over jusText's per-paragraph heuristic features (link density, stopword density, length, tag context) plus neighbour signals.
Stacked text model: a fastText classifier trained on ~100k labelled paragraphs; its keep-probability is appended as a feature to the structural model.
Tuned on an LLM-distilled main-content extraction benchmark (general split).
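The stacking step described above amounts to appending one extra column, the text model's keep-probability, to each paragraph's structural feature vector. The sketch below shows only that concatenation in plain Python; the feature values are made up for illustration, `stack_features` is a hypothetical name, and the real pipeline's feature order and training code are not shown in this README.

```python
def stack_features(structural_rows, keep_probs):
    """Append each paragraph's keep-probability to its structural features.

    structural_rows: one list of numeric features per paragraph
    keep_probs:      one probability per paragraph from the text model
    """
    if len(structural_rows) != len(keep_probs):
        raise ValueError("need exactly one keep-probability per paragraph")
    return [list(row) + [float(p)] for row, p in zip(structural_rows, keep_probs)]


# Made-up example: [link_density, stopword_density, length] per paragraph.
structural = [
    [0.10, 0.50, 40],  # looks like body text
    [0.90, 0.10, 3],   # looks like a navigation link
]
keep_probabilities = [0.97, 0.04]  # illustrative text-model outputs

stacked = stack_features(structural, keep_probabilities)
```

The stacked rows would then be passed, together with keep/drop labels, to a classifier such as scikit-learn's `RandomForestClassifier` via `fit(X, y)`.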

License

BSD 2-Clause, same as jusText. See the fork repository.
