---
license: mit
language:
- en
tags:
- document-classification
- scientific-papers
- ai-detection
- toxicity-detection
- model2vec
- pubverse
- publication-screening
- quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PubGuard.png
---

<div align="center">
  <img src="PubGuard.png" alt="PubGuard Logo" width="400"/>
</div>

# PubGuard: Multi-Head Scientific Publication Gatekeeper

## Model Description

PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It rejects non-publications (posters, abstracts, review articles, flyers, invoices) before expensive downstream processing.

Three classification heads provide a multi-dimensional screening verdict:

1. **Document type**: Is this a paper, review, poster, abstract, or junk?
2. **AI detection**: Was this written by a human or generated by an LLM?
3. **Toxicity**: Does this contain toxic or offensive content?

Developed by Jamey O'Neill at the California Medical Innovations Institute (CalMI²).

## Architecture

Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec) (potion-base-32M) embeddings:

```
┌─────────────┐
│  PDF text   │
└──────┬──────┘
       │
┌──────▼──────┐     ┌───────────────────┐
│ clean_text  │────►│ model2vec encode  │──► emb ∈ R^512
└─────────────┘     └───────────────────┘
                              │
          ┌───────────────────┼────────────────┐
          ▼                   ▼                ▼
┌───────────────────┐  ┌──────────────┐  ┌──────────────┐
│   doc_type head   │  │  ai_detect   │  │   toxicity   │
│ [emb + 14 feats]  │  │     head     │  │     head     │
│   → softmax(5)    │  │ → softmax(2) │  │ → softmax(2) │
└───────────────────┘  └──────────────┘  └──────────────┘
```

Each head is a single linear layer stored as a numpy `.npz` file (8–12 KB). Inference is pure numpy; no torch is needed at prediction time.
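Prediction with a stored head can be sketched in a few lines of numpy. This is an illustrative reconstruction rather than the package's actual code; the `.npz` key names (`W`, `b`) and the label ordering are assumptions:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_head(head_file, emb, labels):
    """Apply one linear head stored in an .npz file to an embedding vector.
    Assumed layout: W has shape (n_classes, dim), b has shape (n_classes,)."""
    head = np.load(head_file)
    logits = emb @ head["W"].T + head["b"]
    probs = softmax(logits)
    i = int(np.argmax(probs))
    return labels[i], float(probs[i])
```

At prediction time each head is a single matrix-vector product plus a softmax, which is why screening stays in the low-millisecond range on CPU.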

The `doc_type` head additionally receives 14 structural features (section headings present, citation density, sentence length, etc.) concatenated with the embedding; these structural cues sharply separate document types that the embedding alone can confuse.
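A minimal sketch of what such a feature extractor could look like. The exact 14 features are not published in this card, so the names and formulas below are illustrative assumptions, not the model's actual feature set:

```python
import re
import numpy as np

SECTION_HEADINGS = ["abstract", "introduction", "methods", "results",
                    "discussion", "conclusion", "references"]

def structural_features(text):
    """Illustrative 14-dim structural feature vector: 7 section-heading
    flags plus 7 simple text statistics."""
    lower = text.lower()
    feats = [float(h in lower) for h in SECTION_HEADINGS]       # 7 heading flags
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    n_words = max(len(words), 1)
    feats.append(len(words) / max(len(sentences), 1))           # avg sentence length
    feats.append(lower.count("et al") / n_words)                # citation density proxy
    feats.append(len(re.findall(r"\[\d+\]", text)) / n_words)   # numeric citations
    feats.append(float(bool(re.search(r"doi\.org|10\.\d{4}", lower))))  # DOI present
    feats.append(sum(w.isupper() for w in words) / n_words)     # all-caps ratio
    feats.append(text.count("\n") / n_words)                    # line-break density
    feats.append(min(len(words) / 5000.0, 1.0))                 # capped length
    return np.array(feats, dtype=np.float32)
```

The doc_type head would then consume `np.concatenate([emb, structural_features(text)])`, giving an input of 512 + 14 = 526 dimensions.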

## Performance

| Head | Classes | Accuracy | F1 |
|------|---------|----------|-----|
| **doc_type** | 5 | **94.4%** | 0.944 |
| **ai_detect** | 2 | 84.2% | 0.842 |
| **toxicity** | 2 | 83.9% | 0.839 |

### doc_type Breakdown

| Class | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| scientific_paper | 0.891 | 0.932 | 0.911 |
| literature_review | 0.914 | 0.884 | 0.899 |
| poster | 0.938 | 0.917 | 0.928 |
| abstract_only | 0.985 | 0.991 | 0.988 |
| junk | 0.992 | 0.996 | 0.994 |

### Throughput

- **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
- **3.3 ms** per PDF screening, adding negligible pipeline overhead
- No GPU required
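Throughput figures like these can be reproduced with a simple wall-clock harness. The sketch below is generic (any callable screening function), not the benchmark script actually used for the numbers above:

```python
import time

def throughput(screen_fn, docs, n_iter=20):
    """Rough docs/sec measurement for a batch screening function."""
    t0 = time.perf_counter()
    for _ in range(n_iter):
        screen_fn(docs)
    elapsed = time.perf_counter() - t0
    return n_iter * len(docs) / elapsed
```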

## Gate Logic

Only `scientific_paper` passes the gate. Everything else (literature reviews, posters, standalone abstracts, junk) is blocked. The PubVerse pipeline processes **original research publications only**.

```
scientific_paper   → ✅ PASS
literature_review  → ❌ BLOCKED (narrative/scoping reviews)
poster             → ❌ BLOCKED (classified, but not a publication)
abstract_only      → ❌ BLOCKED
junk               → ❌ BLOCKED
```

Note: Meta-analyses and systematic reviews are classified as `scientific_paper` (they are primary research). Only narrative and scoping reviews are classified as `literature_review`.

AI detection and toxicity are **informational by default**: they are reported but not blocking.
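Put together, the gate reduces to a small decision function over the verdict dictionary shown under Usage. This is a sketch of the documented behavior; the optional strict-mode flags are hypothetical extensions, not current defaults:

```python
def passes_gate(verdict, block_ai=False, block_toxic=False):
    """Only 'scientific_paper' passes the gate; AI and toxicity flags are
    informational unless the caller opts into blocking on them."""
    if verdict["doc_type"]["label"] != "scientific_paper":
        return False
    if block_ai and verdict["ai_generated"]["label"] != "human":
        return False
    if block_toxic and verdict["toxicity"]["label"] != "clean":
        return False
    return True
```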

## Usage

### Python API

```python
from pubguard import PubGuard

guard = PubGuard()
guard.initialize()

verdict = guard.screen("Introduction: We present a novel deep learning approach...")
print(verdict)
# {
#   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
#   'ai_generated': {'label': 'human', 'score': 0.875},
#   'toxicity': {'label': 'clean', 'score': 0.999},
#   'pass': True
# }
```

### Pipeline Integration (bash)

```bash
# Step 0 in run_pubverse_pipeline.sh:
PDF_TEXT=$(python3 -c "import fitz; d=fitz.open('$pdf'); print(' '.join(p.get_text() for p in d)[:8000])")
PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
# exit 0 = pass, exit 1 = reject
```
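A stdin-to-exit-code wrapper in the spirit of `pubguard_gate.py` could look like the following. This is a hedged sketch of the contract above (exit 0 = pass, exit 1 = reject) with the actual screening call stubbed out; the real script's internals may differ:

```python
import sys

def gate_exit_code(text, screen):
    """Map a screening verdict to the pipeline contract: 0 = pass, 1 = reject."""
    verdict = screen(text[:8000])  # same 8000-char cap as the extraction step
    return 0 if verdict.get("pass") else 1

if __name__ == "__main__":
    # In the real script, `screen` would be PubGuard().screen after initialize().
    stub = lambda t: {"pass": bool(t.strip())}
    sys.exit(gate_exit_code(sys.stdin.read(), stub))
```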

### Installation

```bash
pip install git+https://github.com/jimnoneill/pubguard.git
```

With training dependencies:

```bash
pip install "pubguard[train] @ git+https://github.com/jimnoneill/pubguard.git"
```

## Training Data

Trained on real datasets with **zero synthetic data**:

| Head | Sources | Samples |
|------|---------|---------|
| **doc_type** | PDF corpus (microbiome/metagenomics), armanc/scientific_papers, OpenAlex OA review PDFs + abstracts, [poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data), gfissore/arxiv-abstracts-2021, ag_news | 75K (15K per class) |
| **ai_detect** | liamdugan/raid (abstracts), NicolaiSivesind/ChatGPT-Research-Abstracts | 30K |
| **toxicity** | google/civil_comments, skg/toxigen-data | 30K |

The poster class uses real scientific poster text from the [posters.science](https://posters.science) corpus (28K+ verified posters from Zenodo & Figshare), extracted by [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry).

The literature_review class uses a mix of open-access review article PDFs downloaded from OpenAlex and review abstracts as a fallback. See the [training data repo](https://huggingface.co/datasets/jimnoneill/pubguard-training-data) for full details.

### Training

```bash
python scripts/train_pubguard.py --pdf-corpus /path/to/pdfs --n-per-class 15000
```

Training completes in ~1 minute on CPU. No GPU needed.
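Conceptually, each head's training run boils down to fitting a scikit-learn LogisticRegression on the embeddings and exporting the coefficients to the numpy format used at inference. The sketch below is illustrative only; the `.npz` key names are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_and_export_head(X, y, out_path):
    """Fit one linear head and export it for pure-numpy inference.
    X: (n_samples, dim) embeddings (plus structural features for doc_type)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    np.savez(out_path,
             W=clf.coef_.astype(np.float32),
             b=clf.intercept_.astype(np.float32),
             classes=clf.classes_)
    return clf
```

Note that for binary heads sklearn stores a single coefficient row (sigmoid form) rather than one row per class, so an inference loader has to handle both layouts.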

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Structural features | 14 (doc_type head only) |
| Classifier | LogisticRegression (sklearn) per head |
| Head file sizes | 5–12 KB each (.npz) |
| Total model size | ~125 MB (embedding) + 25 KB (heads) |
| Precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |

## Citation

```bibtex
@software{pubguard_2026,
  title = {PubGuard: Multi-Head Scientific Publication Gatekeeper},
  author = {O'Neill, James},
  year = {2026},
  url = {https://huggingface.co/jimnoneill/pubguard-classifier},
  note = {Part of the PubVerse + 42DeepThought pipeline}
}
```

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

- California Medical Innovations Institute (CalMI²)
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- [FAIR Data Innovations Hub](https://fairdataihub.org/) for the [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry) training data
- HuggingFace for model hosting infrastructure