---
license: bsd-2-clause
library_name: justext
tags:
- boilerplate-removal
- web-extraction
- text-extraction
- justext
- fasttext
language:
- en
---
# justext-classifier

The learned paragraph classifier for the improved [jusText fork](https://github.com/XenonMolecule/jusText),
a boilerplate-removal tool that extracts the main content from an HTML page and drops
navigation, sidebars, footers, and other chrome.

This repo hosts the **highest-quality tier**: a scikit-learn paragraph classifier whose
feature vector is extended with the keep-probability predicted by a
[fastText](https://fasttext.cc/) model. The fork auto-downloads it on first use; you normally
don't fetch it by hand.
## Files

| File | Size | What it is |
|---|--:|---|
| `general-ftstack.joblib` | ~9 MB | scikit-learn classifier (RandomForest over structural + text features) |
| `general_ft.bin` | ~770 MB | fastText char/word-ngram model providing the stacked keep-probability feature |
## Quality (general dev set: token ROUGE-L F1 / char Levenshtein similarity)

| Tier | F1 | Lev | Footprint |
|---|--:|--:|---|
| heuristic (no model) | 0.821 | 0.741 | none |
| bundled 3 MB sklearn | 0.866 | 0.795 | ships in the wheel |
| **this model (fastText stack)** | **0.886** | **0.823** | this repo (~780 MB) |

For reference, stock upstream jusText scores ~0.76 on the same set; the fork's structural
fixes (forum/FAQ/comment role-transforms, code formatting, URL/mojibake repair, …) lift every
tier well above that even before the learned classifier.
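
The two metrics named in the heading can be sketched in plain Python. This is a minimal illustration, not the evaluation code behind the table: it assumes whitespace tokenization for ROUGE-L, an unweighted F1 (harmonic mean of LCS precision and recall), and a Levenshtein similarity normalized by the longer string. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Token-level ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def levenshtein_similarity(reference, candidate):
    """1 - distance / max length (this normalization is an assumption)."""
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(reference, candidate) / longest
```

Both scores lie in [0, 1], with 1 meaning the extracted text matches the reference exactly.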
## Usage

Install the fork with the fastText extra, then call `justext` as usual. The model is fetched and
cached (`~/.cache/justext`) on first use:

```bash
pip install "jusText[fasttext] @ git+https://github.com/XenonMolecule/jusText"
```

```python
import justext

# `html` holds the raw HTML of the page you want to clean.
stoplist = justext.get_stoplist("English")
paragraphs = justext.justext(html, stoplist)  # auto-uses this model
content = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```
Fetch explicitly (e.g. to pre-warm the cache):

```python
import justext

# Downloads both files hosted in this repo and returns their local paths.
joblib_path, fasttext_path = justext.download_fasttext()
```
### Tier / behaviour knobs (environment variables)

| Variable | Effect |
|---|---|
| `JUSTEXT_MODEL` | `fasttext` \| `sklearn` \| `heuristic` \| `auto` (default) |
| `JUSTEXT_NO_DOWNLOAD` | set to skip the download and use the bundled 3 MB model |
| `JUSTEXT_HF_REPO` | point at a different repo (default `MichaelR207/justext-classifier`) |
| `JUSTEXT_CACHE` | override the download cache directory |

If `fasttext` isn't installed, or the download fails, the fork degrades gracefully to the
bundled 3 MB model and then to the heuristic classifier, so it always works offline.
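
The fallback order described above can be pictured with a small sketch. It is not the fork's actual code: `resolve_tier` and its boolean parameters are hypothetical, and the handling of an explicit `JUSTEXT_MODEL=fasttext` request when fastText is unavailable is an assumption (treated here the same as `auto`).

```python
import os


def resolve_tier(env=None, fasttext_installed=True, download_ok=True, sklearn_ok=True):
    """Hypothetical helper: choose a tier following the documented fallback order."""
    env = os.environ if env is None else env
    requested = env.get("JUSTEXT_MODEL", "auto")

    # An explicit request for the heuristic tier needs no model at all.
    if requested == "heuristic":
        return "heuristic"

    # Try the fastText stack only when it was requested (or "auto"),
    # downloads are allowed, and both prerequisites are satisfied.
    wants_fasttext = requested in ("auto", "fasttext")
    if wants_fasttext and not env.get("JUSTEXT_NO_DOWNLOAD"):
        if fasttext_installed and download_ok:
            return "fasttext"

    # Otherwise degrade to the bundled model, then to the heuristic classifier.
    return "sklearn" if sklearn_ok else "heuristic"
```

For example, `resolve_tier({"JUSTEXT_NO_DOWNLOAD": "1"})` selects the bundled model in this sketch, mirroring the table entry for that variable.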
## How it was trained

- **Structural classifier**: RandomForest over jusText's per-paragraph heuristic features
  (link density, stopword density, length, tag context) plus neighbour signals.
- **Stacked text model**: a fastText classifier trained on ~100k labelled paragraphs; its
  keep-probability is appended as a feature to the structural model.
- Tuned on an LLM-distilled main-content extraction benchmark (`general` split).
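
To make the stacking step concrete, here is a simplified sketch of how a text model's keep-probability could be appended to a paragraph's structural features. The feature definitions, the toy stopword list, and the function names are assumptions for illustration. The real feature set is richer (tag context, neighbour signals), and the `keep_probability` callable stands in for the fastText model.

```python
import re

# Toy stopword list for illustration; jusText ships full per-language stoplists.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def structural_features(text, link_chars):
    """Return [link density, stopword density, word count] for one paragraph."""
    words = re.findall(r"\w+", text.lower())
    n_chars = len(text)
    link_density = link_chars / n_chars if n_chars else 0.0
    stopword_density = sum(w in STOPWORDS for w in words) / len(words) if words else 0.0
    return [link_density, stopword_density, float(len(words))]


def stacked_features(text, link_chars, keep_probability):
    """Append the text model's keep-probability to the structural features."""
    return structural_features(text, link_chars) + [keep_probability(text)]
```

A downstream classifier such as a RandomForest would then be trained on these stacked vectors, so it can weigh the text model's opinion against the structural evidence.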
## License

BSD 2-Clause, same as jusText. See the [fork repository](https://github.com/XenonMolecule/jusText).