Add model card
README.md
ADDED
@@ -0,0 +1,89 @@
---
license: bsd-2-clause
library_name: justext
tags:
- boilerplate-removal
- web-extraction
- text-extraction
- justext
- fasttext
language:
- en
---

# justext-classifier

This is the learned paragraph classifier for the improved [jusText fork](https://github.com/XenonMolecule/jusText),
a boilerplate-removal tool that extracts the main content from an HTML page and drops
navigation, sidebars, footers, and other chrome.

This repo hosts the **highest-quality tier**: a scikit-learn paragraph classifier whose
features are stacked with a [fastText](https://fasttext.cc/) keep-probability model. The fork
auto-downloads it on first use; you normally don't fetch it by hand.

## Files

| File | Size | What it is |
|---|--:|---|
| `general-ftstack.joblib` | ~9 MB | scikit-learn classifier (RandomForest over structural + text features) |
| `general_ft.bin` | ~770 MB | fastText char/word-ngram model providing the stacked keep-probability feature |

## Quality (general dev set: token ROUGE-L F1 / char Levenshtein similarity)

| Tier | F1 | Lev | Footprint |
|---|--:|--:|---|
| heuristic (no model) | 0.821 | 0.741 | none |
| bundled 3 MB sklearn | 0.866 | 0.795 | ships in the wheel |
| **this model (fastText stack)** | **0.886** | **0.823** | this repo (~780 MB) |

For reference, stock upstream jusText scores ~0.76 on the same set; the fork's structural
fixes (forum/FAQ/comment role-transforms, code formatting, URL/mojibake repair, …) lift every
tier well above that even before the learned classifier.

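
For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how they can be computed. The benchmark's exact tokenization and normalization are not stated in this card, so the whitespace tokenizer and the `1 - distance / max(len)` normalization below are assumptions, and the function names are illustrative rather than part of the fork.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted, reference):
    """Token-level ROUGE-L F1 (assumes whitespace tokenization)."""
    pred, ref = predicted.split(), reference.split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def levenshtein_similarity(a, b):
    """Character-level similarity: 1 - edit_distance / max(len) (assumed normalization)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            # Deletion, insertion, or substitution (free when characters match).
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))
```

Both functions take the extracted text and the reference text as plain strings and return a score between 0 and 1, where 1 means an exact match.
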
## Usage

Install the fork with the fastText extra, then just call `justext`; the model is fetched and
cached in `~/.cache/justext` on first use:

```bash
pip install "jusText[fasttext] @ git+https://github.com/XenonMolecule/jusText"
```

```python
import justext

# `html` holds the page's HTML source, loaded beforehand.
stoplist = justext.get_stoplist("English")
paragraphs = justext.justext(html, stoplist)  # auto-uses this model
content = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```

Fetch explicitly (e.g. to pre-warm the cache):

```python
import justext
joblib_path, fasttext_path = justext.download_fasttext()  # pulls both files here
```

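
If you only want the raw files (for inspection, say, without installing the fork), they can probably be pulled straight from the Hub with `huggingface_hub`. This is a sketch under assumptions: `fetch_model_files` is a hypothetical helper, the file names are the ones listed in the Files table, and files downloaded this way land in the Hugging Face cache layout, which the fork may not look in.

```python
REPO_ID = "MichaelR207/justext-classifier"
# File names taken from the "Files" table of this model card.
MODEL_FILES = ("general-ftstack.joblib", "general_ft.bin")


def fetch_model_files(repo_id=REPO_ID, cache_dir=None):
    """Download both model files and return their local paths (hypothetical helper)."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=repo_id, filename=name, cache_dir=cache_dir)
        for name in MODEL_FILES
    ]
```

Calling `fetch_model_files()` downloads roughly 780 MB on the first run and returns cached paths afterwards.
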
### Tier / behaviour knobs (environment variables)

| Variable | Effect |
|---|---|
| `JUSTEXT_MODEL` | `fasttext` \| `sklearn` \| `heuristic` \| `auto` (default) |
| `JUSTEXT_NO_DOWNLOAD` | set to skip the download and use the bundled 3 MB model |
| `JUSTEXT_HF_REPO` | point at a different repo (default `MichaelR207/justext-classifier`) |
| `JUSTEXT_CACHE` | override the download cache directory |

If `fasttext` isn't installed, or the download fails, the fork degrades gracefully to the
bundled 3 MB model and then to the heuristic classifier, so it always works offline.

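
The interaction between the variables and the fallback order can be sketched as a small decision function. This is not the fork's code: `resolve_tier` and its parameters are hypothetical, and the sketch assumes that any non-empty `JUSTEXT_NO_DOWNLOAD` value counts as "set" and that an explicit `JUSTEXT_MODEL=fasttext` falls back the same way `auto` does.

```python
VALID_TIERS = ("fasttext", "sklearn", "heuristic", "auto")


def resolve_tier(env, fasttext_installed, download_ok, bundled_model_ok=True):
    """Return the tier a run would end up on, following the documented fallback order."""
    requested = env.get("JUSTEXT_MODEL", "auto")
    if requested not in VALID_TIERS:
        raise ValueError(f"unknown JUSTEXT_MODEL: {requested!r}")
    if requested == "heuristic":
        return "heuristic"
    # Assumption: any non-empty value means "skip the download".
    no_download = bool(env.get("JUSTEXT_NO_DOWNLOAD"))
    wants_fasttext = requested in ("fasttext", "auto")
    if wants_fasttext and fasttext_installed and download_ok and not no_download:
        return "fasttext"
    # Fall back to the bundled 3 MB model, then to the heuristic classifier.
    if bundled_model_ok:
        return "sklearn"
    return "heuristic"
```

Passing `os.environ` as `env` shows which tier the current shell settings would select under these assumptions.
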
## How it was trained

- **Structural classifier**: RandomForest over jusText's per-paragraph heuristic features
  (link density, stopword density, length, tag context) plus neighbour signals.
- **Stacked text model**: a fastText classifier trained on ~100k labelled paragraphs; its
  keep-probability is appended as a feature to the structural model.
- Tuned on an LLM-distilled main-content extraction benchmark (`general` split).

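
The stacking described in the list above can be illustrated with a toy feature builder. Everything here is an assumption made for illustration: the three features are simplified stand-ins for jusText's real ones (tag context and neighbour signals are omitted), the `__label__keep` label name is hypothetical, and the `(labels, probs)` pair mimics the shape returned by a fastText model's `predict` call.

```python
def structural_features(text, link_chars, stopwords):
    """Toy per-paragraph features: link density, stopword density, word count."""
    words = text.lower().split()
    n_chars = len(text)
    link_density = link_chars / n_chars if n_chars else 0.0
    stop_density = sum(w in stopwords for w in words) / len(words) if words else 0.0
    return [link_density, stop_density, float(len(words))]


def keep_probability(labels, probs, keep_label="__label__keep"):
    """Pick the keep-class probability out of a fastText-style (labels, probs) pair."""
    for label, prob in zip(labels, probs):
        if label == keep_label:
            return float(prob)
    return 0.0  # keep label absent from the returned top-k


def stacked_row(text, link_chars, stopwords, labels, probs):
    """Structural features with the text model's keep-probability appended last."""
    return structural_features(text, link_chars, stopwords) + [keep_probability(labels, probs)]
```

Rows built this way would then be fed to the RandomForest, which learns how much weight to give the text model's opinion relative to the structural signals.
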
## License

BSD 2-Clause, same as jusText. See the [fork repository](https://github.com/XenonMolecule/jusText).