	---
	license: bsd-2-clause
	library_name: justext
	tags:
	- boilerplate-removal
	- web-extraction
	- text-extraction
	- justext
	- fasttext
	language:
	- en
	---

	# justext-classifier

	This is the learned paragraph classifier for the improved [jusText fork](https://github.com/XenonMolecule/jusText),
	a boilerplate-removal tool that extracts the main content from an HTML page and drops
	navigation, sidebars, footers, and other page chrome.

	This repo hosts the highest-quality tier: a scikit-learn paragraph classifier whose
	features are stacked with a [fastText](https://fasttext.cc/) keep-probability model. The fork
	auto-downloads it on first use; you normally don't fetch it by hand.

	## Files

	| File | Size | What it is |
	|---|--:|---|
	| `general-ftstack.joblib` | ~9 MB | scikit-learn classifier (RandomForest over structural + text features) |
	| `general_ft.bin` | ~770 MB | fastText char/word-ngram model providing the stacked keep-probability feature |

	## Quality (general dev set: token ROUGE-L F1 / char Levenshtein similarity)

	| Tier | F1 | Lev | Footprint |
	|---|--:|--:|---|
	| heuristic (no model) | 0.821 | 0.741 | none |
	| bundled 3 MB sklearn | 0.866 | 0.795 | ships in the wheel |
	| this model (fastText stack) | 0.886 | 0.823 | this repo (~780 MB) |

	For reference, stock upstream jusText scores ~0.76 on the same set; the fork's structural
	fixes (forum/FAQ/comment role-transforms, code formatting, URL/mojibake repair, …) lift every
	tier well above that even before the learned classifier.
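The two metrics named in the heading can be sketched in plain Python. This is a minimal illustration, assuming whitespace tokenization for ROUGE-L and normalization by the longer string for Levenshtein similarity. The benchmark's exact tokenization and normalization are not stated in this card, so scores from this sketch may not match the table. All function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted, reference):
    """Token-level ROUGE-L F1 (assumption: tokens are whitespace-separated)."""
    pred_tokens, ref_tokens = predicted.split(), reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def levenshtein_distance(a, b):
    """Character edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def levenshtein_similarity(predicted, reference):
    """1 - distance / longer length (assumption: one common normalization)."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(predicted, reference) / longest
```

Under these definitions, an extraction that keeps all of the reference text but also lets some navigation through loses precision (and therefore F1) while recall stays at 1.0.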

	## Usage

	Install the fork with the fastText extra, then call `justext.justext` as usual. The model is
	fetched and cached (under `~/.cache/justext`) on first use:

	```bash
	pip install "jusText[fasttext] @ git+https://github.com/XenonMolecule/jusText"
	```

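	```python
	import justext

	# `html` is the raw page source; a tiny inline document stands in for a fetched page.
	html = "<html><body><p>Example paragraph of main content.</p></body></html>"

	stoplist = justext.get_stoplist("English")
	paragraphs = justext.justext(html, stoplist)  # auto-uses this model
	# Keep only the paragraphs classified as main content.
	content = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)
	```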

	Fetch explicitly (e.g. to pre-warm the cache):

	```python
	import justext
	joblib_path, fasttext_path = justext.download_fasttext()  # pulls both files hosted in this repo
	```

	### Tier / behaviour knobs (environment variables)

	| Variable | Effect |
	|---|---|
	| `JUSTEXT_MODEL` | `fasttext` \| `sklearn` \| `heuristic` \| `auto` (default) |
	| `JUSTEXT_NO_DOWNLOAD` | set to skip the download and use the bundled 3 MB model |
	| `JUSTEXT_HF_REPO` | point at a different repo (default `MichaelR207/justext-classifier`) |
	| `JUSTEXT_CACHE` | override the download cache directory |
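
The variables in the table above can also be set from Python. The sketch below uses a hypothetical helper, `justext_env`, which is not part of the fork. It assumes that any non-empty value counts as "set" for `JUSTEXT_NO_DOWNLOAD`, and that setting the variables before importing `justext` is early enough for them to take effect.

```python
import os


def justext_env(model="auto", no_download=False, cache_dir=None, hf_repo=None):
    """Build environment-variable overrides for the knobs listed in the table."""
    allowed = {"fasttext", "sklearn", "heuristic", "auto"}
    if model not in allowed:
        raise ValueError(f"model must be one of {sorted(allowed)}")
    env = {"JUSTEXT_MODEL": model}
    if no_download:
        # Assumption: any non-empty value is treated as "set".
        env["JUSTEXT_NO_DOWNLOAD"] = "1"
    if cache_dir is not None:
        env["JUSTEXT_CACHE"] = str(cache_dir)
    if hf_repo is not None:
        env["JUSTEXT_HF_REPO"] = hf_repo
    return env


# Apply the overrides before `import justext`, in case they are read at import time.
os.environ.update(justext_env(model="sklearn", no_download=True))
```

Exporting the same variables in the shell before starting Python should be equivalent.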

	If `fasttext` isn't installed, or the download fails, the fork falls back first to the
	bundled 3 MB model and then to the heuristic classifier, so it always works offline.
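
The fallback order described above can be pictured as a small decision function. This is an illustration only, not the fork's code: `resolve_tier` and its parameters are hypothetical, and it assumes an explicitly requested tier degrades the same way `auto` does.

```python
import importlib.util

TIERS = ("fasttext", "sklearn", "heuristic")  # highest quality first


def resolve_tier(requested="auto", fasttext_installed=None, download_ok=True, bundled_ok=True):
    """Return the tier that would run under the documented fallback order."""
    if fasttext_installed is None:
        # Detect the optional dependency without importing it.
        fasttext_installed = importlib.util.find_spec("fasttext") is not None
    available = {
        "fasttext": fasttext_installed and download_ok,
        "sklearn": bundled_ok,
        "heuristic": True,  # needs no model files, so it is always available
    }
    start = 0 if requested == "auto" else TIERS.index(requested)
    for tier in TIERS[start:]:
        if available[tier]:
            return tier
    return "heuristic"
```

For example, `resolve_tier("auto", fasttext_installed=False)` lands on the bundled `sklearn` tier, which matches the behaviour described for a machine without `fasttext`.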

	## How it was trained

	- Structural classifier: RandomForest over jusText's per-paragraph heuristic features
	(link density, stopword density, length, tag context) plus neighbour signals.
	- Stacked text model: a fastText classifier trained on ~100k labelled paragraphs; its
	keep-probability is appended as a feature to the structural model.
	- Tuned on an LLM-distilled main-content extraction benchmark (`general` split).
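
The stacking step in the list above amounts to appending one extra column to each paragraph's structural feature row. The sketch below is a toy version under stated assumptions: the `Paragraph` fields, the stoplist, the tag vocabulary, and the feature order are all invented for illustration, neighbour signals are omitted, and the keep-probability is passed in as a number rather than computed by a fastText model.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    chars_in_links: int  # number of characters that sit inside <a> tags
    tag: str  # enclosing tag, e.g. "p", "li", "div"


STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}  # toy stoplist
TAGS = ("p", "li", "div")  # toy one-hot vocabulary for tag context


def structural_features(par):
    """Link density, stopword density, length, and a one-hot tag context."""
    words = par.text.lower().split()
    n_chars = len(par.text)
    link_density = par.chars_in_links / n_chars if n_chars else 0.0
    stopword_density = sum(w in STOPWORDS for w in words) / len(words) if words else 0.0
    tag_one_hot = [1.0 if par.tag == t else 0.0 for t in TAGS]
    return [link_density, stopword_density, float(n_chars)] + tag_one_hot


def stacked_row(par, keep_probability):
    """Append the text model's keep-probability to the structural features."""
    if not 0.0 <= keep_probability <= 1.0:
        raise ValueError("keep_probability must be in [0, 1]")
    return structural_features(par) + [keep_probability]
```

Rows built this way, one per paragraph, would form the input matrix for the RandomForest. A paragraph such as `Paragraph("Home About", 10, "li")` gets a link density of 1.0, the kind of signal that marks navigation.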

	## License

	BSD 2-Clause, same as jusText. See the [fork repository](https://github.com/XenonMolecule/jusText).