MichaelR207 committed
Commit f9cb3d7 (verified)
1 Parent(s): d055a14

Add model card

Files changed (1)
  1. README.md +89 -0
README.md ADDED
@@ -0,0 +1,89 @@
+ ---
+ license: bsd-2-clause
+ library_name: justext
+ tags:
+ - boilerplate-removal
+ - web-extraction
+ - text-extraction
+ - justext
+ - fasttext
+ language:
+ - en
+ ---
+
+ # justext-classifier
+
+ The learned paragraph classifier for the improved [jusText fork](https://github.com/XenonMolecule/jusText),
+ a boilerplate-removal tool that extracts the main content from an HTML page and drops
+ navigation, sidebars, footers, and other page chrome.
+
+ This repo hosts the **highest-quality tier**: a scikit-learn paragraph classifier whose
+ features are stacked with a [fastText](https://fasttext.cc/) keep-probability model. The fork
+ auto-downloads it on first use; you normally don't fetch it by hand.
+
+ ## Files
+
+ | File | Size | What it is |
+ |---|--:|---|
+ | `general-ftstack.joblib` | ~9 MB | scikit-learn classifier (RandomForest over structural + text features) |
+ | `general_ft.bin` | ~770 MB | fastText char/word-ngram model providing the stacked keep-probability feature |
+
+ ## Quality (general dev set: token ROUGE-L F1 / char Levenshtein similarity)
+
+ | Tier | F1 | Lev | Footprint |
+ |---|--:|--:|---|
+ | heuristic (no model) | 0.821 | 0.741 | none |
+ | bundled 3 MB sklearn | 0.866 | 0.795 | ships in the wheel |
+ | **this model (fastText stack)** | **0.886** | **0.823** | this repo (~780 MB) |
+
+ For reference, stock upstream jusText scores ~0.76 on the same set; the fork's structural
+ fixes (forum/FAQ/comment role-transforms, code formatting, URL/mojibake repair, …) lift every
+ tier well above that even before the learned classifier.
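
The two metrics named in the table heading can be sketched in plain Python. This is an illustrative version only: it assumes whitespace tokenization for ROUGE-L and defines Levenshtein similarity as `1 - distance / max(len)`. The benchmark's actual tokenization and normalization are not stated in this card, and all function names below are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted, reference):
    """Token-level ROUGE-L F1 (assumption: tokens are whitespace-separated)."""
    pred_tokens, ref_tokens = predicted.split(), reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def levenshtein_distance(a, b):
    """Character-level edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(predicted, reference):
    """Similarity in [0, 1] (assumption: normalized by the longer string)."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(predicted, reference) / longest
```

Under these assumptions, an extraction that matches the reference exactly scores 1.0 on both metrics, while dropped or extra paragraphs lower recall or precision respectively.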
+
+ ## Usage
+
+ Install the fork with the fastText extra, then just call `justext`; the model is fetched and
+ cached (`~/.cache/justext`) on first use:
+
+ ```bash
+ pip install "jusText[fasttext] @ git+https://github.com/XenonMolecule/jusText"
+ ```
+
+ ```python
+ import justext
+
+ html = "<html><body><p>Example page.</p></body></html>"  # any HTML document
+ stoplist = justext.get_stoplist("English")
+ paragraphs = justext.justext(html, stoplist) # auto-uses this model
+ content = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)
+ ```
+
+ Fetch explicitly (e.g. to pre-warm the cache):
+
+ ```python
+ import justext
+ joblib_path, fasttext_path = justext.download_fasttext() # pulls both files here
+ ```
+
+ ### Tier / behaviour knobs (environment variables)
+
+ | Variable | Effect |
+ |---|---|
+ | `JUSTEXT_MODEL` | `fasttext` \| `sklearn` \| `heuristic` \| `auto` (default) |
+ | `JUSTEXT_NO_DOWNLOAD` | set to skip the download and use the bundled 3 MB model |
+ | `JUSTEXT_HF_REPO` | point at a different repo (default `MichaelR207/justext-classifier`) |
+ | `JUSTEXT_CACHE` | override the download cache directory |
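
As a sketch of how these knobs might be set from Python instead of the shell: the variable names and allowed `JUSTEXT_MODEL` values come from the table above, but `configure_justext` is a hypothetical helper, and two details are assumptions: that any non-empty value counts as "set" for `JUSTEXT_NO_DOWNLOAD`, and that the variables should be in place before `justext` is imported (this card does not say when they are read).

```python
import os

# Allowed values as listed in the table above.
ALLOWED_MODELS = {"fasttext", "sklearn", "heuristic", "auto"}


def configure_justext(model="auto", no_download=False, cache_dir=None):
    """Hypothetical helper: export the documented environment variables."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown JUSTEXT_MODEL value: {model!r}")
    os.environ["JUSTEXT_MODEL"] = model
    if no_download:
        # Assumption: any non-empty value is treated as "set".
        os.environ["JUSTEXT_NO_DOWNLOAD"] = "1"
    else:
        os.environ.pop("JUSTEXT_NO_DOWNLOAD", None)
    if cache_dir is not None:
        os.environ["JUSTEXT_CACHE"] = str(cache_dir)


# Example: stay offline and use the bundled sklearn tier.
configure_justext(model="sklearn", no_download=True)
# Assumption: import justext only after the variables are set.
```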
+
+ If `fasttext` isn't installed, or the download fails, the fork degrades gracefully to the
+ bundled 3 MB model and then to the heuristic classifier, so it always works offline.
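
The fallback order described above can be pictured as a small decision function. This is a hypothetical sketch of the described behaviour, not the fork's actual code: `resolve_tier` and its parameters are invented for illustration, and treating an explicitly requested tier as the starting point of the same chain is an assumption.

```python
# Tiers from highest to lowest quality, as listed in the Quality table.
TIER_ORDER = ["fasttext", "sklearn", "heuristic"]


def resolve_tier(requested="auto", fasttext_installed=True,
                 download_ok=True, sklearn_usable=True):
    """Return the first usable tier, walking down from the requested one."""
    if requested != "auto" and requested not in TIER_ORDER:
        raise ValueError(f"unknown tier: {requested!r}")
    start = 0 if requested == "auto" else TIER_ORDER.index(requested)
    available = {
        # The fastText tier needs both the package and the downloaded files.
        "fasttext": fasttext_installed and download_ok,
        "sklearn": sklearn_usable,
        # The heuristic classifier needs no model, so it is always available.
        "heuristic": True,
    }
    for tier in TIER_ORDER[start:]:
        if available[tier]:
            return tier
    return "heuristic"


print(resolve_tier(download_ok=False))
```

With these inputs the sketch falls back to the `sklearn` tier, because the simulated download failure rules out the fastText stack.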
+
+ ## How it was trained
+
+ - **Structural classifier**: RandomForest over jusText's per-paragraph heuristic features
+ (link density, stopword density, length, tag context) plus neighbour signals.
+ - **Stacked text model**: a fastText classifier trained on ~100k labelled paragraphs; its
+ keep-probability is appended as a feature to the structural model.
+ - Tuned on an LLM-distilled main-content extraction benchmark (`general` split).
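
To make the stacking step concrete, here is a minimal, dependency-free sketch of how a text model's keep-probability could be appended to a paragraph's structural features. Everything in it is illustrative: `structural_features`, `toy_keep_probability`, and the exact feature definitions are stand-ins rather than the fork's code. In the pipeline described above, the probability would come from the fastText model and the stacked rows would be fed to the RandomForest.

```python
def structural_features(text, link_chars, stopwords):
    """Toy per-paragraph features: link density, stopword density, length."""
    words = text.lower().split()
    n_chars = len(text)
    link_density = link_chars / n_chars if n_chars else 0.0
    stopword_density = (
        sum(w in stopwords for w in words) / len(words) if words else 0.0
    )
    return [link_density, stopword_density, float(len(words))]


def toy_keep_probability(text):
    """Stand-in for the text model: longer paragraphs score higher."""
    return min(len(text.split()) / 20.0, 1.0)


def stacked_features(text, link_chars, stopwords,
                     keep_probability=toy_keep_probability):
    """Append the text model's keep-probability as one extra feature."""
    return structural_features(text, link_chars, stopwords) + [keep_probability(text)]


row = stacked_features("the cat sat on the mat", link_chars=0,
                       stopwords={"the", "on"})
```

The point of the design is that the downstream classifier sees the text model's judgement as just another column, so it can learn when to trust it relative to the structural signals.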
+
+ ## License
+
+ BSD 2-Clause, same as jusText. See the [fork repository](https://github.com/XenonMolecule/jusText).