justext-classifier
The learned paragraph classifier for the improved jusText fork, a boilerplate-removal tool that extracts the main content from an HTML page and drops navigation, sidebars, footers, and other chrome.
This repo hosts the highest-quality tier: a scikit-learn paragraph classifier whose features are stacked with a fastText keep-probability model. The fork auto-downloads it on first use; you normally don't fetch it by hand.
Files
| File | Size | What it is |
|---|---|---|
| `general-ftstack.joblib` | ~9 MB | scikit-learn classifier (RandomForest over structural + text features) |
| `general_ft.bin` | ~770 MB | fastText char/word-ngram model providing the stacked keep-probability feature |
Quality (general dev set: token ROUGE-L F1 / char Levenshtein similarity)
| Tier | F1 | Lev | Footprint |
|---|---|---|---|
| heuristic (no model) | 0.821 | 0.741 | none |
| bundled 3 MB sklearn | 0.866 | 0.795 | ships in the wheel |
| this model (fastText stack) | 0.886 | 0.823 | this repo (~780 MB) |
For reference, stock upstream jusText scores ~0.76 on the same set; the fork's structural fixes (forum/FAQ/comment role-transforms, code formatting, URL/mojibake repair, ...) lift every tier well above that even before the learned classifier.
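The two metrics in the table header can be sketched in plain Python. This is a minimal illustration, not the benchmark's scoring code: it assumes whitespace tokenization for ROUGE-L and assumes Levenshtein similarity is defined as one minus the edit distance divided by the longer string's length. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Token-level ROUGE-L F1 (assumption: tokens are whitespace-separated)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def levenshtein_similarity(reference, candidate):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(reference, candidate) / longest
```

Both scores lie in [0, 1], with 1 meaning the extracted text matches the reference exactly.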
Usage
Install the fork with the fastText extra, then call justext as usual. The model is fetched and
cached (~/.cache/justext) on first use:
pip install "jusText[fasttext] @ git+https://github.com/XenonMolecule/jusText"
import justext
stoplist = justext.get_stoplist("English")
paragraphs = justext.justext(html, stoplist) # auto-uses this model
content = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)
Fetch explicitly (e.g. to pre-warm the cache):
import justext
joblib_path, fasttext_path = justext.download_fasttext() # downloads both files from this repo into the cache
Tier / behaviour knobs (environment variables)
| Variable | Effect |
|---|---|
| `JUSTEXT_MODEL` | `fasttext`, `sklearn`, `heuristic`, or `auto` (default) |
| `JUSTEXT_NO_DOWNLOAD` | set to skip the download and use the bundled 3 MB model |
| `JUSTEXT_HF_REPO` | point at a different repo (default `MichaelR207/justext-classifier`) |
| `JUSTEXT_CACHE` | override the download cache directory |
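These variables can also be set from Python before justext is used. The sketch below is illustrative only: the helper `justext_env` is hypothetical, the value `"1"` for `JUSTEXT_NO_DOWNLOAD` assumes any non-empty value counts as "set", and setting the variables before importing justext is a precaution in case the fork reads them at import time.

```python
import os

# Tier names taken from the JUSTEXT_MODEL row of the table above.
VALID_TIERS = {"fasttext", "sklearn", "heuristic", "auto"}


def justext_env(model="auto", no_download=False, hf_repo=None, cache_dir=None):
    """Build the environment-variable mapping (hypothetical helper, not part of the fork)."""
    if model not in VALID_TIERS:
        raise ValueError(f"unknown tier: {model!r}")
    env = {"JUSTEXT_MODEL": model}
    if no_download:
        env["JUSTEXT_NO_DOWNLOAD"] = "1"  # assumption: any non-empty value works
    if hf_repo:
        env["JUSTEXT_HF_REPO"] = hf_repo
    if cache_dir:
        env["JUSTEXT_CACHE"] = cache_dir
    return env


# Example: pin the bundled 3 MB model and never touch the network.
os.environ.update(justext_env(model="sklearn", no_download=True))

# import justext  # import afterwards so the settings are already in place
```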
If fasttext isn't installed, or the download fails, the fork degrades gracefully to the
bundled 3 MB model and then to the heuristic classifier, so it always works offline.
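The fallback order described above can be pictured as a chain of loaders tried in turn. This is a sketch of the general pattern, not the fork's internals: `first_working_tier` and the three loader functions are hypothetical stand-ins, and the simulated failures only mimic a missing package or a failed download.

```python
def first_working_tier(loaders):
    """Return (name, model) from the first loader that does not raise."""
    for name, load in loaders:
        try:
            return name, load()
        except Exception:
            continue  # this tier is unavailable, try the next one
    raise RuntimeError("no tier could be loaded")


# Hypothetical loaders standing in for the real ones.
def load_fasttext_stack():
    raise ImportError("fasttext is not installed")  # simulated failure


def load_bundled_sklearn():
    return "bundled-3MB-model"  # placeholder object


def load_heuristic():
    return "heuristic-rules"  # needs no files, so it should not fail


tier, model = first_working_tier([
    ("fasttext", load_fasttext_stack),
    ("sklearn", load_bundled_sklearn),
    ("heuristic", load_heuristic),
])
```

Keeping the heuristic tier last gives the chain a tier that needs neither a download nor an optional dependency.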
How it was trained
- Structural classifier: RandomForest over jusText's per-paragraph heuristic features (link density, stopword density, length, tag context) plus neighbour signals.
- Stacked text model: a fastText classifier trained on ~100k labelled paragraphs; its keep-probability is appended as a feature to the structural model.
- Tuned on an LLM-distilled main-content extraction benchmark (`general` split).
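The stacking step in the list above amounts to appending one extra column to each paragraph's feature vector. The sketch below illustrates that idea under simplified assumptions: the feature definitions (link density as linked characters over total characters, stopword density over whitespace tokens, length in words) are approximations rather than the fork's actual features, and the keep-probability is passed in as a plain number instead of coming from a fastText model.

```python
def structural_features(text, link_chars, stopwords):
    """Simplified per-paragraph features: link density, stopword density, length."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    n_words = len(words)
    link_density = link_chars / len(text) if text else 0.0
    stop_density = sum(w in stopwords for w in words) / n_words if n_words else 0.0
    return [link_density, stop_density, float(n_words)]


def stack_keep_probability(features, keep_prob):
    """Append the text model's keep-probability as one more feature."""
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must be a probability")
    return features + [keep_prob]


stopwords = {"the", "is", "on", "a", "of"}
base = structural_features("The cat is on the mat.", link_chars=0, stopwords=stopwords)
row = stack_keep_probability(base, keep_prob=0.93)  # 0.93 is a made-up score
```

A row like this, one per paragraph, is the kind of input a RandomForest classifier would then be trained on.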
License
BSD 2-Clause, same as jusText. See the fork repository.