---
license: bsd-2-clause
library_name: justext
tags:
  - boilerplate-removal
  - web-extraction
  - text-extraction
  - justext
  - fasttext
language:
  - en
---

# justext-classifier

This is the learned paragraph classifier for the improved [jusText fork](https://github.com/XenonMolecule/jusText),
a boilerplate-removal tool that extracts the main content from an HTML page and drops
navigation, sidebars, footers, and other page chrome.

This repo hosts the **highest-quality tier**: a scikit-learn paragraph classifier whose
features are stacked with a [fastText](https://fasttext.cc/) keep-probability model. The fork
auto-downloads it on first use; you normally don't fetch it by hand.

## Files

| File | Size | What it is |
|---|--:|---|
| `general-ftstack.joblib` | ~9 MB | scikit-learn classifier (RandomForest over structural + text features) |
| `general_ft.bin` | ~770 MB | fastText char/word-ngram model providing the stacked keep-probability feature |

## Quality (general dev set: token ROUGE-L F1 / char Levenshtein similarity)

| Tier | F1 | Lev | Footprint |
|---|--:|--:|---|
| heuristic (no model) | 0.821 | 0.741 | none |
| bundled 3 MB sklearn | 0.866 | 0.795 | ships in the wheel |
| **this model (fastText stack)** | **0.886** | **0.823** | this repo (~780 MB) |

For reference, stock upstream jusText scores ~0.76 on the same set; the fork's structural
fixes (forum/FAQ/comment role-transforms, code formatting, URL/mojibake repair, …) lift every
tier well above that even before the learned classifier.
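
The two scores in the table can be approximated with a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenization for ROUGE-L and a similarity defined as one minus the edit distance divided by the longer string's length; the tokenization and normalization used for the reported numbers are not stated here, and the function names are illustrative, not part of the fork.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted, reference):
    """Token-level ROUGE-L F1 (assumption: tokens are whitespace-separated)."""
    pred_tokens, ref_tokens = predicted.split(), reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def levenshtein_similarity(a, b):
    """Character-level similarity: 1 - edit_distance / max(len(a), len(b))."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            # deletion, insertion, substitution (free when characters match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```

Both functions return values in the range 0 to 1, with 1 meaning the extracted text matches the reference exactly.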

## Usage

Install the fork with the fastText extra, then call `justext.justext`; the model is fetched and
cached (in `~/.cache/justext` by default) on first use:

```bash
pip install "jusText[fasttext] @ git+https://github.com/XenonMolecule/jusText"
```

```python
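import justext

# `html` holds the page source you fetched yourself; a tiny inline page stands in here.
html = "<html><body><p>Example paragraph text.</p></body></html>"

stoplist = justext.get_stoplist("English")
paragraphs = justext.justext(html, stoplist)          # auto-uses this model
content = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)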
```

Fetch explicitly (e.g. to pre-warm the cache):

```python
import justext
joblib_path, fasttext_path = justext.download_fasttext()   # downloads both files from this repo
```

### Tier / behaviour knobs (environment variables)

| Variable | Effect |
|---|---|
| `JUSTEXT_MODEL` | `fasttext` \| `sklearn` \| `heuristic` \| `auto` (default) |
| `JUSTEXT_NO_DOWNLOAD` | set to skip the download and use the bundled 3 MB model |
| `JUSTEXT_HF_REPO` | point at a different repo (default `MichaelR207/justext-classifier`) |
| `JUSTEXT_CACHE` | override the download cache directory |

If `fasttext` isn't installed, or the download fails, the fork degrades gracefully to the
bundled 3 MB model and then to the heuristic classifier, so it always works offline.
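
The variables above can also be set from Python. The sketch below assumes they should be in place before `justext` is imported (when the fork actually reads them is not stated here) and that any non-empty value such as `"1"` counts as "set" for `JUSTEXT_NO_DOWNLOAD`. `configure_justext` is a hypothetical helper, not part of the fork.

```python
import os

# The four values listed for JUSTEXT_MODEL in the table above.
VALID_MODELS = {"fasttext", "sklearn", "heuristic", "auto"}


def configure_justext(model="auto", allow_download=True, cache_dir=None):
    """Hypothetical helper: write the documented environment variables."""
    if model not in VALID_MODELS:
        raise ValueError(f"unknown JUSTEXT_MODEL value: {model!r}")
    os.environ["JUSTEXT_MODEL"] = model
    if allow_download:
        os.environ.pop("JUSTEXT_NO_DOWNLOAD", None)
    else:
        # Assumption: any non-empty value disables the download.
        os.environ["JUSTEXT_NO_DOWNLOAD"] = "1"
    if cache_dir is not None:
        os.environ["JUSTEXT_CACHE"] = cache_dir


# Example: select the `sklearn` tier and never touch the network.
configure_justext(model="sklearn", allow_download=False)
# Import justext only after this point so the settings are already in place.
```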

## How it was trained

- **Structural classifier**: RandomForest over jusText's per-paragraph heuristic features
  (link density, stopword density, length, tag context) plus neighbour signals.
- **Stacked text model**: a fastText classifier trained on ~100k labelled paragraphs; its
  keep-probability is appended as a feature to the structural model.
- Tuned on an LLM-distilled main-content extraction benchmark (`general` split).
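
The stacking step described above amounts to adding one extra feature column. The following is a minimal sketch of that data flow only, not the training code: `stack_features` is a hypothetical name, the feature values are made up, and `toy_keep_probability` is a stand-in for the fastText model's keep-probability.

```python
def stack_features(structural_rows, paragraphs, keep_probability):
    """Append the text model's keep-probability to each structural feature row."""
    if len(structural_rows) != len(paragraphs):
        raise ValueError("need one feature row per paragraph")
    return [
        list(row) + [keep_probability(text)]
        for row, text in zip(structural_rows, paragraphs)
    ]


def toy_keep_probability(text):
    """Stand-in for the fastText model: longer paragraphs score higher."""
    return min(1.0, len(text.split()) / 20)


# Illustrative structural features: [link density, stopword density].
rows = [[0.9, 0.1], [0.0, 0.5]]
texts = [
    "Home | About",
    "This paragraph is part of the article body and reads like ordinary prose.",
]
stacked = stack_features(rows, texts, toy_keep_probability)
```

The stacked rows, together with keep/drop labels, are what a classifier such as scikit-learn's `RandomForestClassifier` would then be fitted on.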

## License

BSD 2-Clause, same as jusText. See the [fork repository](https://github.com/XenonMolecule/jusText).