justext-classifier

The learned paragraph classifier for the improved jusText fork, a boilerplate-removal tool that extracts the main content from an HTML page and drops navigation, sidebars, footers, and other chrome.

This repo hosts the highest-quality tier: a scikit-learn paragraph classifier that takes the keep-probability from a fastText model as an extra, stacked feature. The fork auto-downloads both files on first use; you normally don't fetch them by hand.

Files

| File | Size | What it is |
|---|---|---|
| general-ftstack.joblib | ~9 MB | scikit-learn classifier (RandomForest over structural + text features) |
| general_ft.bin | ~770 MB | fastText char/word-ngram model providing the stacked keep-probability feature |

Quality (general dev set: token ROUGE-L F1 / char Levenshtein similarity)

| Tier | F1 | Lev | Footprint |
|---|---|---|---|
| heuristic (no model) | 0.821 | 0.741 | none |
| bundled 3 MB sklearn | 0.866 | 0.795 | ships in the wheel |
| this model (fastText stack) | 0.886 | 0.823 | this repo (~780 MB) |

For reference, stock upstream jusText scores ~0.76 on the same set; the fork's structural fixes (forum/FAQ/comment role-transforms, code formatting, URL/mojibake repair, …) lift every tier well above that even before the learned classifier.
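The two metrics in the table heading can be sketched in plain Python. This is an illustration only, not the benchmark's scoring code: whitespace tokenisation and normalising the edit distance by the longer string are assumptions, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted, reference):
    """Token-level ROUGE-L F1; tokens are whitespace-separated (assumption)."""
    pred_tokens, ref_tokens = predicted.split(), reference.split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def levenshtein_distance(a, b):
    """Character edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(predicted, reference):
    """1 - distance / length of the longer string (normalisation is an assumption)."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(predicted, reference) / longest
```

Under these definitions, an extraction that keeps one stray navigation word in front of a three-word reference keeps full recall but loses precision, which is the kind of error the F1 column penalises.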

Usage

Install the fork with the fastText extra, then just call justext; the model is fetched and cached (~/.cache/justext) on first use:

pip install "jusText[fasttext] @ git+https://github.com/XenonMolecule/jusText"

import justext

# html holds the raw HTML of the page you want to clean
stoplist = justext.get_stoplist("English")
paragraphs = justext.justext(html, stoplist)          # auto-uses this model
content = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)

Fetch explicitly (e.g. to pre-warm the cache):

import justext
joblib_path, fasttext_path = justext.download_fasttext()   # pulls both files here

Tier / behaviour knobs (environment variables)

| Variable | Effect |
|---|---|
| JUSTEXT_MODEL | fasttext, sklearn, heuristic, or auto (default) |
| JUSTEXT_NO_DOWNLOAD | set to skip the download and use the bundled 3 MB model |
| JUSTEXT_HF_REPO | point at a different repo (default MichaelR207/justext-classifier) |
| JUSTEXT_CACHE | override the download cache directory |
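The variables can also be set from Python before justext is imported. The helper below is a hypothetical convenience, not part of the fork; it assumes that any non-empty value (here "1") counts as "set" for JUSTEXT_NO_DOWNLOAD, and it sets the variables before import because this card does not say when the fork reads them.

```python
import os

# Tier names taken from the JUSTEXT_MODEL row above.
VALID_TIERS = {"fasttext", "sklearn", "heuristic", "auto"}


def configure_justext(model="auto", no_download=False, cache_dir=None, env=os.environ):
    """Write the jusText knobs into `env` (hypothetical helper)."""
    if model not in VALID_TIERS:
        raise ValueError(f"unknown tier {model!r}; expected one of {sorted(VALID_TIERS)}")
    env["JUSTEXT_MODEL"] = model
    if no_download:
        env["JUSTEXT_NO_DOWNLOAD"] = "1"  # assumption: any non-empty value works
    else:
        env.pop("JUSTEXT_NO_DOWNLOAD", None)
    if cache_dir is not None:
        env["JUSTEXT_CACHE"] = str(cache_dir)
    return env
```

Call `configure_justext(model="sklearn", no_download=True)` first and `import justext` afterwards, so the settings are in place whenever the fork looks them up.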

If fasttext isn't installed, or the download fails, the fork degrades gracefully to the bundled 3 MB model and then to the heuristic classifier; it always works offline.
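The degradation order described above can be pictured as a simple chain. This is a sketch of the described behaviour, not the fork's actual code: `resolve_tier` and its flags are hypothetical, and the condition for dropping from the bundled model to the heuristic (scikit-learn being unusable) is an assumption.

```python
# Best tier first; each later entry is the fallback for the one before it.
FALLBACK_ORDER = ["fasttext", "sklearn", "heuristic"]


def resolve_tier(requested="auto", fasttext_installed=True, download_ok=True,
                 sklearn_usable=True):
    """Return the tier that would end up being used (hypothetical helper)."""
    available = {
        # The stacked model needs the fasttext package, both downloaded
        # files, and scikit-learn to load the .joblib classifier.
        "fasttext": fasttext_installed and download_ok and sklearn_usable,
        "sklearn": sklearn_usable,   # bundled 3 MB model, no download needed
        "heuristic": True,           # no model at all, always available
    }
    start = 0 if requested == "auto" else FALLBACK_ORDER.index(requested)
    for tier in FALLBACK_ORDER[start:]:
        if available[tier]:
            return tier
```

Because the heuristic tier is always available, the loop always returns a tier, which matches the claim that extraction keeps working offline.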

How it was trained

  • Structural classifier: RandomForest over jusText's per-paragraph heuristic features (link density, stopword density, length, tag context) plus neighbour signals.
  • Stacked text model: a fastText classifier trained on ~100k labelled paragraphs; its keep-probability is appended as a feature to the structural model.
  • Tuned on an LLM-distilled main-content extraction benchmark (general split).
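The stacking step in the list above amounts to appending one extra column to each paragraph's feature vector. The sketch below shows only that assembly, under loud assumptions: `Paragraph`, the three-feature set, the tiny stoplist, and the `keep_probability` callable are all hypothetical stand-ins, and the real model uses a RandomForest and a fastText classifier rather than these stubs.

```python
from dataclasses import dataclass
from typing import Callable, List

# Tiny illustrative stoplist; jusText ships full per-language stoplists.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


@dataclass
class Paragraph:
    text: str
    link_chars: int  # characters of the text that sit inside <a> tags


def structural_features(p: Paragraph) -> List[float]:
    """Link density, stopword density, and length for one paragraph."""
    words = p.text.lower().split()
    n_chars = len(p.text)
    link_density = p.link_chars / n_chars if n_chars else 0.0
    stopword_density = sum(w in STOPWORDS for w in words) / len(words) if words else 0.0
    return [link_density, stopword_density, float(n_chars)]


def stacked_features(p: Paragraph, keep_probability: Callable[[str], float]) -> List[float]:
    """Structural features with the text model's keep-probability appended."""
    return structural_features(p) + [keep_probability(p.text)]
```

In use, `keep_probability` would wrap the text model's prediction for the paragraph, and the resulting vectors would be the rows passed to the structural classifier.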

License

BSD 2-Clause, same as jusText. See the fork repository.
