---
license: bsd-2-clause
library_name: justext
tags:
- boilerplate-removal
- web-extraction
- text-extraction
- justext
- fasttext
language:
- en
---

# justext-classifier

The learned paragraph classifier for the improved [jusText fork](https://github.com/XenonMolecule/jusText), a boilerplate-removal tool that extracts the main content from an HTML page and drops navigation, sidebars, footers, and other chrome.

This repo hosts the **highest-quality tier**: a scikit-learn paragraph classifier whose features are stacked with a [fastText](https://fasttext.cc/) keep-probability model. The fork auto-downloads it on first use; you normally don't fetch it by hand.

## Files

| File | Size | What it is |
|---|--:|---|
| `general-ftstack.joblib` | ~9 MB | scikit-learn classifier (RandomForest over structural + text features) |
| `general_ft.bin` | ~770 MB | fastText char/word-ngram model providing the stacked keep-probability feature |

## Quality (general dev set: token ROUGE-L F1 / char Levenshtein similarity)

| Tier | F1 | Lev | Footprint |
|---|--:|--:|---|
| heuristic (no model) | 0.821 | 0.741 | none |
| bundled 3 MB sklearn | 0.866 | 0.795 | ships in the wheel |
| **this model (fastText stack)** | **0.886** | **0.823** | this repo (~780 MB) |

For reference, stock upstream jusText scores ~0.76 on the same set; the fork's structural fixes (forum/FAQ/comment role-transforms, code formatting, URL/mojibake repair, …) lift every tier well above that even before the learned classifier.

## Usage

Install the fork with the fastText extra, then just call `justext`; the model is fetched and cached (`~/.cache/justext`) on first use:

```bash
pip install "jusText[fasttext] @ git+https://github.com/XenonMolecule/jusText"
```

```python
import justext

stoplist = justext.get_stoplist("English")
paragraphs = justext.justext(html, stoplist)  # auto-uses this model
content = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```

Fetch explicitly (e.g. to pre-warm the cache):

```python
import justext

joblib_path, fasttext_path = justext.download_fasttext()  # pulls both files here
```

### Tier / behaviour knobs (environment variables)

| Variable | Effect |
|---|---|
| `JUSTEXT_MODEL` | `fasttext` \| `sklearn` \| `heuristic` \| `auto` (default) |
| `JUSTEXT_NO_DOWNLOAD` | set to skip the download and use the bundled 3 MB model |
| `JUSTEXT_HF_REPO` | point at a different repo (default `MichaelR207/justext-classifier`) |
| `JUSTEXT_CACHE` | override the download cache directory |

If `fasttext` isn't installed, or the download fails, the fork degrades gracefully to the bundled 3 MB model and then to the heuristic classifier, so it always works offline.

## How it was trained

- **Structural classifier**: RandomForest over jusText's per-paragraph heuristic features (link density, stopword density, length, tag context) plus neighbour signals.
- **Stacked text model**: a fastText classifier trained on ~100k labelled paragraphs; its keep-probability is appended as a feature to the structural model.
- Tuned on an LLM-distilled main-content extraction benchmark (`general` split).

## License

BSD 2-Clause, same as jusText. See the [fork repository](https://github.com/XenonMolecule/jusText).
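
## Example: setting the knobs from Python

The environment variables listed under "Tier / behaviour knobs" can also be set from Python rather than the shell. The following is a minimal sketch, not part of the fork: the helper name `configure_justext` is hypothetical, and it assumes that any non-empty value counts as "set" for `JUSTEXT_NO_DOWNLOAD`. This README does not say whether the fork reads the variables at import time or on first call, so the sketch assumes the safe order is to set them before `import justext`.

```python
import os

# Tier names taken from the JUSTEXT_MODEL row of the table in this README.
VALID_TIERS = {"fasttext", "sklearn", "heuristic", "auto"}


def configure_justext(tier="auto", cache_dir=None, allow_download=True):
    """Hypothetical helper: set the fork's environment variables in-process.

    Call it before importing justext (assumption: the fork may read the
    variables at import time).
    """
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(VALID_TIERS)}")

    os.environ["JUSTEXT_MODEL"] = tier

    if cache_dir is not None:
        # Override the download cache directory.
        os.environ["JUSTEXT_CACHE"] = str(cache_dir)

    if allow_download:
        # Leave the variable unset so the download is allowed.
        os.environ.pop("JUSTEXT_NO_DOWNLOAD", None)
    else:
        # Assumption: any non-empty value is treated as "set".
        os.environ["JUSTEXT_NO_DOWNLOAD"] = "1"


# Example: force the heuristic tier and forbid network access.
configure_justext("heuristic", allow_download=False)
```

After this call, a subsequent `import justext` in the same process should see the configured values, provided the assumptions above hold.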