---
license: bsd-2-clause
library_name: justext
tags:
- boilerplate-removal
- web-extraction
- text-extraction
- justext
- fasttext
language:
- en
---
# justext-classifier
The learned paragraph classifier for the improved [jusText fork](https://github.com/XenonMolecule/jusText)
— a boilerplate-removal tool that extracts the main content from an HTML page and drops
navigation, sidebars, footers, and other chrome.
This repo hosts the **highest-quality tier**: a scikit-learn paragraph classifier whose
features are stacked with a [fastText](https://fasttext.cc/) keep-probability model. The fork
auto-downloads it on first use; you normally don't fetch it by hand.
## Files
| File | Size | What it is |
|---|--:|---|
| `general-ftstack.joblib` | ~9 MB | scikit-learn classifier (RandomForest over structural + text features) |
| `general_ft.bin` | ~770 MB | fastText char/word-ngram model providing the stacked keep-probability feature |
## Quality (general dev set: token ROUGE-L F1 / char Levenshtein similarity)
| Tier | ROUGE-L F1 | Levenshtein sim. | Footprint |
|---|--:|--:|---|
| heuristic (no model) | 0.821 | 0.741 | none |
| bundled 3 MB sklearn | 0.866 | 0.795 | ships in the wheel |
| **this model (fastText stack)** | **0.886** | **0.823** | this repo (~780 MB) |
For reference, stock upstream jusText scores ~0.76 on the same set; the fork's structural
fixes (forum/FAQ/comment role-transforms, code formatting, URL/mojibake repair, …) lift every
tier well above that even before the learned classifier.
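The two metrics in the table can be sketched in plain Python as follows. This is a minimal illustration, not the benchmark's scoring code: whitespace tokenisation, normalising the edit distance by the longer string, and the function names are all assumptions made for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted, reference):
    """Token-level ROUGE-L F1; tokens are whitespace-separated (assumption)."""
    pred, ref = predicted.split(), reference.split()
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def levenshtein_similarity(predicted, reference):
    """1 - edit_distance / max(len), over characters (assumed normalisation)."""
    if not predicted and not reference:
        return 1.0
    prev = list(range(len(reference) + 1))
    for i, x in enumerate(predicted, start=1):
        curr = [i]
        for j, y in enumerate(reference, start=1):
            # deletion, insertion, substitution (free when characters match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return 1 - prev[-1] / max(len(predicted), len(reference))
```

Under these definitions, an extraction that keeps all the reference text but also lets navigation through loses precision: `rouge_l_f1("Home About main text", "main text")` has precision 0.5 and recall 1.0.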
## Usage
Install the fork with the fastText extra, then call `justext.justext` as usual. The model is
fetched and cached (in `~/.cache/justext`) on first use:
```bash
pip install "jusText[fasttext] @ git+https://github.com/XenonMolecule/jusText"
```
```python
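import justext

# Placeholder page; pass your own HTML source instead.
html = "<html><body><nav>Home | About</nav><p>The main article text goes here.</p></body></html>"
stoplist = justext.get_stoplist("English")
paragraphs = justext.justext(html, stoplist)  # auto-uses this model
content = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)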
```
Fetch explicitly (e.g. to pre-warm the cache):
```python
import justext
joblib_path, fasttext_path = justext.download_fasttext()  # fetches both files from this repo into the cache
```
### Tier / behaviour knobs (environment variables)
| Variable | Effect |
|---|---|
| `JUSTEXT_MODEL` | `fasttext` \| `sklearn` \| `heuristic` \| `auto` (default) |
| `JUSTEXT_NO_DOWNLOAD` | set to skip the download and use the bundled 3 MB model |
| `JUSTEXT_HF_REPO` | point at a different repo (default `MichaelR207/justext-classifier`) |
| `JUSTEXT_CACHE` | override the download cache directory |
If `fasttext` isn't installed or the download fails, the fork falls back to the bundled 3 MB
model and then to the heuristic classifier, so extraction still works offline.
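The fallback order described above can be modelled roughly as follows. This is a sketch of the documented behaviour, not the fork's code: `resolve_tier` and its keyword arguments are hypothetical, and the conditions under which the bundled model itself is unavailable (`sklearn_ok`) are an assumption.

```python
import os


def resolve_tier(env=None, fasttext_installed=True, download_ok=True, sklearn_ok=True):
    """Return the tier that would be used under the documented fallback order.

    Hypothetical helper for illustration; not part of the justext package.
    """
    env = os.environ if env is None else env
    requested = env.get("JUSTEXT_MODEL", "auto")
    no_download = bool(env.get("JUSTEXT_NO_DOWNLOAD"))

    if requested not in ("auto", "fasttext", "sklearn", "heuristic"):
        # Sketch choice: reject values the table does not document.
        raise ValueError(f"unknown JUSTEXT_MODEL value: {requested!r}")
    if requested == "heuristic":
        return "heuristic"
    if requested in ("auto", "fasttext"):
        # The top tier needs the fasttext package and a successful download.
        if fasttext_installed and download_ok and not no_download:
            return "fasttext"
    # Either sklearn was requested or the fastText tier was unavailable.
    return "sklearn" if sklearn_ok else "heuristic"
```

For example, `resolve_tier({"JUSTEXT_NO_DOWNLOAD": "1"})` returns `"sklearn"`, matching the table's description of that variable.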
## How it was trained
- **Structural classifier**: RandomForest over jusText's per-paragraph heuristic features
(link density, stopword density, length, tag context) plus neighbour signals.
- **Stacked text model**: a fastText classifier trained on ~100k labelled paragraphs; its
keep-probability is appended as a feature to the structural model.
- Tuned on an LLM-distilled main-content extraction benchmark (`general` split).
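
The stacking step in the list above amounts to appending one extra column to each paragraph's feature row. The sketch below shows only that shape. The feature definitions, the function names, and `toy_keep_probability` (a stand-in for the fastText model) are illustrative assumptions and do not reproduce the fork's actual features.

```python
def structural_features(text, link_chars, stopwords):
    """Toy per-paragraph features: link density, stopword density, length."""
    words = text.lower().split()
    link_density = link_chars / len(text) if text else 0.0
    stopword_density = sum(w in stopwords for w in words) / len(words) if words else 0.0
    return [link_density, stopword_density, float(len(text))]


def stacked_features(text, link_chars, stopwords, keep_probability):
    """Append the text model's keep-probability to the structural features."""
    return structural_features(text, link_chars, stopwords) + [keep_probability(text)]


def toy_keep_probability(text):
    """Stand-in for the fastText model: longer paragraphs look more like content."""
    return min(1.0, len(text.split()) / 20)


STOPWORDS = {"the", "a", "of", "and", "is", "to"}

# A navigation-like paragraph whose characters are all inside links.
row = stacked_features(
    "Home | About | Contact",
    link_chars=22,
    stopwords=STOPWORDS,
    keep_probability=toy_keep_probability,
)
```

Rows like `row` would then be passed to a tree ensemble such as scikit-learn's `RandomForestClassifier`; that part is left out to keep the example dependency-free.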
## License
BSD 2-Clause, same as jusText. See the [fork repository](https://github.com/XenonMolecule/jusText).