---
license: mit
language:
- en
tags:
- document-classification
- scientific-papers
- ai-detection
- toxicity-detection
- model2vec
- pubverse
- publication-screening
- quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PubGuard.png
---
<div align="center">
<img src="PubGuard.png" alt="PubGuard Logo" width="400"/>
</div>
# PubGuard: Multi-Head Scientific Publication Gatekeeper
## Model Description
PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It runs as **Step 0** in the PubVerse + 42DeepThought pipeline, rejecting non-publications (posters, abstracts, flyers, invoices) before expensive downstream processing (VLM feature extraction, graph construction, GNN scoring).
Three classification heads provide a multi-dimensional screening verdict:
1. **Document type** – Is this a paper, poster, abstract, or junk?
2. **AI detection** – Was this written by a human or generated by an LLM?
3. **Toxicity** – Does this contain toxic or offensive content?
Developed by Jamey O'Neill at the California Medical Innovations Institute (CalMI²).
## Architecture
Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec) (potion-base-32M) embeddings:
```
┌─────────────┐
│  PDF text   │
└──────┬──────┘
       │
┌──────▼──────┐     ┌───────────────────┐
│ clean_text  │────►│ model2vec encode  │──► emb ∈ R^512
└─────────────┘     └───────────────────┘
       │
   ┌───────────────────┼──────────────────┐
   ▼                   ▼                  ▼
┌──────────────────┐ ┌──────────────┐ ┌──────────────┐
│  doc_type head   │ │  ai_detect   │ │  toxicity    │
│ [emb + 14 feats] │ │    head      │ │    head      │
│  → softmax(4)    │ │ → softmax(2) │ │ → softmax(2) │
└──────────────────┘ └──────────────┘ └──────────────┘
```
Each head is a single linear layer stored as a numpy `.npz` file (8–12 KB). Inference is pure numpy – no torch is needed at prediction time.
The `doc_type` head additionally receives 14 structural features (section headings present, citation density, sentence length, etc.) concatenated with the embedding; these act as strong priors on document type.
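As a rough sketch of this pure-numpy inference path (the `.npz` key names and shapes here are assumptions for illustration, not the actual head file schema):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_head(emb, weights, bias, extra_feats=None):
    """Apply one linear head: softmax(x @ W.T + b).

    For the doc_type head, the 14 structural features are
    concatenated onto the 512-d embedding before the linear layer;
    the other heads see the embedding alone.
    """
    x = emb if extra_feats is None else np.concatenate([emb, extra_feats])
    return softmax(x @ weights.T + bias)

# Hypothetical head file with weights (n_classes, n_features) and
# bias (n_classes,):
# head = np.load("doc_type_head.npz")
# probs = predict_head(emb, head["coef"], head["intercept"], feats)
```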
## Performance
| Head | Classes | Accuracy | F1 |
|------|---------|----------|-----|
| **doc_type** | 4 | **99.7%** | 0.997 |
| **ai_detect** | 2 | 83.4% | 0.834 |
| **toxicity** | 2 | 84.7% | 0.847 |
### doc_type Breakdown
| Class | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| scientific_paper | 1.000 | 1.000 | 1.000 |
| poster | 0.989 | 0.974 | 0.981 |
| abstract_only | 0.997 | 0.997 | 0.997 |
| junk | 0.993 | 0.998 | 0.996 |
### Throughput
- **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
- **3.3 ms** per PDF screening – negligible pipeline overhead
- No GPU required
## Gate Logic
Only `scientific_paper` passes the gate. Everything else (posters, standalone abstracts, junk) is blocked. The PubVerse pipeline processes **publications only**.
```
scientific_paper  →  ✅ PASS
poster            →  ❌ BLOCKED (classified, but not a publication)
abstract_only     →  ❌ BLOCKED
junk              →  ❌ BLOCKED
```
AI detection and toxicity are **informational by default**: reported but not blocking.
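The gate policy above can be sketched as a small predicate (the verdict dict shape follows the Python API example in the Usage section; the function name and flags are illustrative, not part of the library):

```python
# Only scientific_paper passes; AI detection and toxicity are
# informational by default and only block when explicitly enabled.
PASSING_TYPES = {"scientific_paper"}

def gate(verdict, block_on_ai=False, block_on_toxicity=False):
    """Return True if the document may enter the pipeline."""
    if verdict["doc_type"]["label"] not in PASSING_TYPES:
        return False
    if block_on_ai and verdict["ai_generated"]["label"] != "human":
        return False
    if block_on_toxicity and verdict["toxicity"]["label"] != "clean":
        return False
    return True
```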
## Usage
### Python API
```python
from pubguard import PubGuard
guard = PubGuard()
guard.initialize()
verdict = guard.screen("Introduction: We present a novel deep learning approach...")
print(verdict)
# {
# 'doc_type': {'label': 'scientific_paper', 'score': 0.994},
# 'ai_generated': {'label': 'human', 'score': 0.875},
# 'toxicity': {'label': 'clean', 'score': 0.999},
# 'pass': True
# }
```
### Pipeline Integration (bash)
```bash
# Step 0 in run_pubverse_pipeline.sh:
PDF_TEXT=$(python3 -c "import fitz; d=fitz.open('$pdf'); print(' '.join(p.get_text() for p in d)[:8000])")
PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
# exit 0 = pass, exit 1 = reject
```
### Installation
```bash
pip install git+https://github.com/jimnoneill/pubguard.git
```
With training dependencies:
```bash
pip install "pubguard[train] @ git+https://github.com/jimnoneill/pubguard.git"
```
## Training Data
Trained on real datasets from HuggingFace – **zero synthetic junk data**:
| Head | Sources | Samples |
|------|---------|---------|
| **doc_type** | armanc/scientific_papers, gfissore/arxiv-abstracts-2021, ag_news, [poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data) | ~55K |
| **ai_detect** | liamdugan/raid (abstracts), NicolaiSivesind/ChatGPT-Research-Abstracts | ~30K |
| **toxicity** | google/civil_comments, skg/toxigen-data | ~30K |
The poster class uses real scientific poster text from the [posters.science](https://posters.science) corpus (28K+ verified posters from Zenodo & Figshare), extracted by [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry).
### Training
```bash
python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
```
Training completes in ~1 minute on CPU. No GPU needed.
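The per-head training step can be approximated as follows (a sketch under assumptions: the `.npz` key names are illustrative, and dataset loading is elided; the actual logic lives in `scripts/train_pubguard.py`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_head(X, y, out_path):
    """Fit one linear head on precomputed model2vec embeddings and
    export it as a tiny .npz for torch-free numpy inference.

    X: (n_samples, n_features) float array; y: integer class labels.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    # Key names here are illustrative, not the actual file schema.
    np.savez(out_path, coef=clf.coef_, intercept=clf.intercept_,
             classes=clf.classes_)
    return clf

# e.g. X = model2vec embeddings (n_samples, 512), optionally with the
# 14 structural features appended for the doc_type head.
```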
## Model Specifications
| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Structural features | 14 (doc_type head only) |
| Classifier | LogisticRegression (sklearn) per head |
| Head file sizes | 5β9 KB each (.npz) |
| Total model size | ~125 MB (embedding) + 20 KB (heads) |
| Precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |
## Citation
```bibtex
@software{pubguard_2026,
title = {PubGuard: Multi-Head Scientific Publication Gatekeeper},
author = {O'Neill, James},
year = {2026},
url = {https://huggingface.co/jimnoneill/pubguard-classifier},
note = {Part of the PubVerse + 42DeepThought pipeline}
}
```
## License
This model is released under the [MIT License](https://opensource.org/licenses/MIT).
## Acknowledgments
- California Medical Innovations Institute (CalMI²)
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- [FAIR Data Innovations Hub](https://fairdataihub.org/) for the [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry) training data
- HuggingFace for model hosting infrastructure