---
license: mit
language:
- en
tags:
- document-classification
- scientific-papers
- ai-detection
- toxicity-detection
- model2vec
- pubverse
- publication-screening
- quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PubGuard.png
---

<div align="center">
  <img src="PubGuard.png" alt="PubGuard Logo" width="400"/>
</div>

# PubGuard: Multi-Head Scientific Publication Gatekeeper

## Model Description

PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It rejects non-publications (posters, abstracts, review articles, flyers, invoices) before expensive downstream processing.

Three classification heads provide a multi-dimensional screening verdict:

1. **Document type**: Is this a paper, review, poster, abstract, or junk?
2. **AI detection**: Was this written by a human or generated by an LLM?
3. **Toxicity**: Does this contain toxic or offensive content?

Developed by Jamey O'Neill at the California Medical Innovations Institute (CalMI²).

## Architecture

Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec) (potion-base-32M) embeddings:

```
┌─────────────┐
│  PDF text   │
└──────┬──────┘
       │
┌──────▼──────┐     ┌───────────────────┐
│ clean_text  │────►│ model2vec encode  │──► emb ∈ R^512
└─────────────┘     └───────────────────┘
                              │
          ┌───────────────────┼────────────────┐
          ▼                   ▼                ▼
┌───────────────────┐  ┌──────────────┐  ┌──────────────┐
│   doc_type head   │  │  ai_detect   │  │   toxicity   │
│ [emb + 14 feats]  │  │     head     │  │     head     │
│   → softmax(5)    │  │ → softmax(2) │  │ → softmax(2) │
└───────────────────┘  └──────────────┘  └──────────────┘
```

Each head is a single linear layer stored as a numpy `.npz` file (8–12 KB). Inference is pure numpy; no torch is needed at prediction time.
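Prediction with a stored head can be sketched in a few lines of numpy. This is an illustrative reconstruction rather than the package's actual code; the `.npz` key names (`W`, `b`) and the label ordering are assumptions:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_head(head_file, emb, labels):
    """Apply one linear head stored in an .npz file to an embedding vector.
    Assumed layout: W has shape (n_classes, dim), b has shape (n_classes,)."""
    head = np.load(head_file)
    logits = emb @ head["W"].T + head["b"]
    probs = softmax(logits)
    i = int(np.argmax(probs))
    return labels[i], float(probs[i])
```

At prediction time each head is a single matrix-vector product plus a softmax, which is why screening stays in the low-millisecond range on CPU.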

The `doc_type` head additionally receives 14 structural features (section headings present, citation density, sentence length, etc.) concatenated with the embedding; these structural cues sharply separate document types that the embedding alone can confuse.
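A minimal sketch of what such a feature extractor could look like. The exact 14 features are not published in this card, so the names and formulas below are illustrative assumptions, not the model's actual feature set:

```python
import re
import numpy as np

SECTION_HEADINGS = ["abstract", "introduction", "methods", "results",
                    "discussion", "conclusion", "references"]

def structural_features(text):
    """Illustrative 14-dim structural feature vector: 7 section-heading
    flags plus 7 simple text statistics."""
    lower = text.lower()
    feats = [float(h in lower) for h in SECTION_HEADINGS]       # 7 heading flags
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    n_words = max(len(words), 1)
    feats.append(len(words) / max(len(sentences), 1))           # avg sentence length
    feats.append(lower.count("et al") / n_words)                # citation density proxy
    feats.append(len(re.findall(r"\[\d+\]", text)) / n_words)   # numeric citations
    feats.append(float(bool(re.search(r"doi\.org|10\.\d{4}", lower))))  # DOI present
    feats.append(sum(w.isupper() for w in words) / n_words)     # all-caps ratio
    feats.append(text.count("\n") / n_words)                    # line-break density
    feats.append(min(len(words) / 5000.0, 1.0))                 # capped length
    return np.array(feats, dtype=np.float32)
```

The doc_type head would then consume `np.concatenate([emb, structural_features(text)])`, giving an input of 512 + 14 = 526 dimensions.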

## Performance

| Head | Classes | Accuracy | F1 |
|------|---------|----------|-----|
| **doc_type** | 5 | **94.4%** | 0.944 |
| **ai_detect** | 2 | 84.2% | 0.842 |
| **toxicity** | 2 | 83.9% | 0.839 |

### doc_type Breakdown

| Class | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| scientific_paper | 0.891 | 0.932 | 0.911 |
| literature_review | 0.914 | 0.884 | 0.899 |
| poster | 0.938 | 0.917 | 0.928 |
| abstract_only | 0.985 | 0.991 | 0.988 |
| junk | 0.992 | 0.996 | 0.994 |

### Throughput

- **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
- **3.3 ms** per PDF screening, adding negligible pipeline overhead
- No GPU required
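Throughput figures like these can be reproduced with a simple wall-clock harness. The sketch below is generic (any callable screening function), not the benchmark script actually used for the numbers above:

```python
import time

def throughput(screen_fn, docs, n_iter=20):
    """Rough docs/sec measurement for a batch screening function."""
    t0 = time.perf_counter()
    for _ in range(n_iter):
        screen_fn(docs)
    elapsed = time.perf_counter() - t0
    return n_iter * len(docs) / elapsed
```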

## Gate Logic

Only `scientific_paper` passes the gate. Everything else (literature reviews, posters, standalone abstracts, junk) is blocked. The PubVerse pipeline processes **original research publications only**.

```
scientific_paper   → ✅ PASS
literature_review  → ❌ BLOCKED (narrative/scoping reviews)
poster             → ❌ BLOCKED (classified, but not a publication)
abstract_only      → ❌ BLOCKED
junk               → ❌ BLOCKED
```

Note: Meta-analyses and systematic reviews are classified as `scientific_paper` (they are primary research). Only narrative and scoping reviews are classified as `literature_review`.

AI detection and toxicity are **informational by default**: they are reported but not blocking.
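Put together, the gate reduces to a small decision function over the verdict dictionary shown under Usage. This is a sketch of the documented behavior; the optional strict-mode flags are hypothetical extensions, not current defaults:

```python
def passes_gate(verdict, block_ai=False, block_toxic=False):
    """Only 'scientific_paper' passes the gate; AI and toxicity flags are
    informational unless the caller opts into blocking on them."""
    if verdict["doc_type"]["label"] != "scientific_paper":
        return False
    if block_ai and verdict["ai_generated"]["label"] != "human":
        return False
    if block_toxic and verdict["toxicity"]["label"] != "clean":
        return False
    return True
```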

## Usage

### Python API

```python
from pubguard import PubGuard

guard = PubGuard()
guard.initialize()

verdict = guard.screen("Introduction: We present a novel deep learning approach...")
print(verdict)
# {
#   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
#   'ai_generated': {'label': 'human', 'score': 0.875},
#   'toxicity': {'label': 'clean', 'score': 0.999},
#   'pass': True
# }
```

### Pipeline Integration (bash)

```bash
# Step 0 in run_pubverse_pipeline.sh:
PDF_TEXT=$(python3 -c "import fitz; d=fitz.open('$pdf'); print(' '.join(p.get_text() for p in d)[:8000])")
PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
# exit 0 = pass, exit 1 = reject
```
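A stdin-to-exit-code wrapper in the spirit of `pubguard_gate.py` could look like the following. This is a hedged sketch of the contract above (exit 0 = pass, exit 1 = reject) with the actual screening call stubbed out; the real script's internals may differ:

```python
import sys

def gate_exit_code(text, screen):
    """Map a screening verdict to the pipeline contract: 0 = pass, 1 = reject."""
    verdict = screen(text[:8000])  # same 8000-char cap as the extraction step
    return 0 if verdict.get("pass") else 1

if __name__ == "__main__":
    # In the real script, `screen` would be PubGuard().screen after initialize().
    stub = lambda t: {"pass": bool(t.strip())}
    sys.exit(gate_exit_code(sys.stdin.read(), stub))
```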

### Installation

```bash
pip install git+https://github.com/jimnoneill/pubguard.git
```

With training dependencies:

```bash
pip install "pubguard[train] @ git+https://github.com/jimnoneill/pubguard.git"
```

## Training Data

Trained on real datasets with **zero synthetic data**:

| Head | Sources | Samples |
|------|---------|---------|
| **doc_type** | PDF corpus (microbiome/metagenomics), armanc/scientific_papers, OpenAlex OA review PDFs + abstracts, [poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data), gfissore/arxiv-abstracts-2021, ag_news | 75K (15K per class) |
| **ai_detect** | liamdugan/raid (abstracts), NicolaiSivesind/ChatGPT-Research-Abstracts | 30K |
| **toxicity** | google/civil_comments, skg/toxigen-data | 30K |

The poster class uses real scientific poster text from the [posters.science](https://posters.science) corpus (28K+ verified posters from Zenodo & Figshare), extracted by [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry).

The literature_review class uses a mix of open-access review article PDFs downloaded from OpenAlex and review abstracts as a fallback. See the [training data repo](https://huggingface.co/datasets/jimnoneill/pubguard-training-data) for full details.

### Training

```bash
python scripts/train_pubguard.py --pdf-corpus /path/to/pdfs --n-per-class 15000
```

Training completes in ~1 minute on CPU. No GPU needed.
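Conceptually, each head's training run boils down to fitting a scikit-learn LogisticRegression on the embeddings and exporting the coefficients to the numpy format used at inference. The sketch below is illustrative only; the `.npz` key names are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_and_export_head(X, y, out_path):
    """Fit one linear head and export it for pure-numpy inference.
    X: (n_samples, dim) embeddings (plus structural features for doc_type)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    np.savez(out_path,
             W=clf.coef_.astype(np.float32),
             b=clf.intercept_.astype(np.float32),
             classes=clf.classes_)
    return clf
```

Note that for binary heads sklearn stores a single coefficient row (sigmoid form) rather than one row per class, so an inference loader has to handle both layouts.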

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Structural features | 14 (doc_type head only) |
| Classifier | LogisticRegression (sklearn) per head |
| Head file sizes | 5–12 KB each (.npz) |
| Total model size | ~125 MB (embedding) + 25 KB (heads) |
| Precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |

## Citation

```bibtex
@software{pubguard_2026,
  title = {PubGuard: Multi-Head Scientific Publication Gatekeeper},
  author = {O'Neill, James},
  year = {2026},
  url = {https://huggingface.co/jimnoneill/pubguard-classifier},
  note = {Part of the PubVerse + 42DeepThought pipeline}
}
```

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

- California Medical Innovations Institute (CalMI²)
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- [FAIR Data Innovations Hub](https://fairdataihub.org/) for the [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry) training data
- HuggingFace for model hosting infrastructure