PubGuard: Multi-Head Scientific Publication Gatekeeper
PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It runs as a gate in the PubVerse pipeline, rejecting junk (flyers, invoices, posters) before expensive downstream processing.
Architecture
Three linear classification heads on frozen model2vec (potion-base-32M) embeddings:
| Head | Classes | Accuracy | Description |
|---|---|---|---|
| doc_type | 4 | 99.9% | scientific_paper, poster, abstract_only, junk |
| ai_detect | 2 | 83.4% | human, ai_generated |
| toxicity | 2 | 84.7% | clean, toxic |
Each head is a single linear layer stored as a numpy .npz file (8-12 KB each).
Inference is pure numpy; no torch is needed at prediction time.
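As a rough illustration of what pure-numpy inference over a single linear head can look like, here is a minimal sketch. The `.npz` key names (`W`, `b`), the 512-dimensional embedding size, and the helper names are assumptions made for this example, not PubGuard's actual file layout or API; random weights stand in for a trained head.

```python
import io
import numpy as np

LABELS = ["scientific_paper", "poster", "abstract_only", "junk"]

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predict(embedding, weights, bias, labels):
    # One linear layer followed by softmax: probs = softmax(x @ W + b).
    probs = softmax(embedding @ weights + bias)
    best = int(np.argmax(probs))
    return {"label": labels[best], "score": float(probs[best])}

# Stand-in for a trained head: random weights round-tripped through .npz.
# Key names "W" and "b" and dim=512 are assumptions for this sketch.
rng = np.random.default_rng(0)
dim = 512
buffer = io.BytesIO()
np.savez(
    buffer,
    W=rng.normal(size=(dim, len(LABELS))).astype(np.float32),
    b=np.zeros(len(LABELS), dtype=np.float32),
)
buffer.seek(0)
head = np.load(buffer)

# A random vector stands in for a frozen model2vec embedding.
embedding = rng.normal(size=dim).astype(np.float32)
result = predict(embedding, head["W"], head["b"], LABELS)
print(result)
```

Because each head is just a matrix and a bias vector, the only runtime dependency at this step is numpy.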
Performance
- 302 docs/sec single-document, 568 docs/sec batched (CPU only)
- 3.3 ms per PDF screening, a negligible pipeline overhead
- No GPU required
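The throughput and latency figures above are two views of the same number: 302 docs/sec corresponds to 1000 / 302 ≈ 3.3 ms per document. A minimal sketch of how such a measurement could be reproduced is shown below; `measure_throughput` and `fake_screen` are hypothetical helpers for illustration, and in practice `fake_screen` would be replaced by a call such as `guard.screen`.

```python
import time

def measure_throughput(screen_fn, texts, repeats=3):
    # Time several passes and keep the fastest to reduce scheduling noise.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            screen_fn(text)
        best = min(best, time.perf_counter() - start)
    docs_per_sec = len(texts) / best
    ms_per_doc = 1000.0 / docs_per_sec
    return docs_per_sec, ms_per_doc

def fake_screen(text):
    # Hypothetical stand-in for a real screening call.
    return {"pass": len(text) > 0}

docs_per_sec, ms_per_doc = measure_throughput(fake_screen, ["sample text"] * 1000)
print(f"{docs_per_sec:.0f} docs/sec, {ms_per_doc:.4f} ms/doc")
```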
Usage
from pubguard import PubGuard
guard = PubGuard()
guard.initialize()
verdict = guard.screen("Introduction: We present a novel deep learning approach...")
# {
# 'doc_type': {'label': 'scientific_paper', 'score': 0.994},
# 'ai_generated': {'label': 'human', 'score': 0.875},
# 'toxicity': {'label': 'clean', 'score': 0.999},
# 'pass': True
# }
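For logging or debugging, the verdict dictionary shown above can be flattened into a single line. The sketch below assumes only the dictionary shape from the example output; `summarize_verdict` is a hypothetical helper, not part of the PubGuard API.

```python
def summarize_verdict(verdict):
    # Every key except "pass" is treated as a head result with a label and score.
    status = "PASS" if verdict["pass"] else "REJECT"
    parts = [
        f"{head}={result['label']}({result['score']:.3f})"
        for head, result in verdict.items()
        if head != "pass"
    ]
    return f"{status} " + " ".join(parts)

# Verdict copied from the usage example above.
example = {
    "doc_type": {"label": "scientific_paper", "score": 0.994},
    "ai_generated": {"label": "human", "score": 0.875},
    "toxicity": {"label": "clean", "score": 0.999},
    "pass": True,
}
summary = summarize_verdict(example)
print(summary)
# PASS doc_type=scientific_paper(0.994) ai_generated=human(0.875) toxicity=clean(0.999)
```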
Pipeline Integration
# In run_pubverse_pipeline.sh:
PUBGUARD_OUTPUT=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
PUBGUARD_CODE=$?  # exit status of the gate: 0 = pass, 1 = reject
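The gate contract (text on stdin, exit 0 for pass, exit 1 for reject) could be implemented along the following lines. This is a sketch of the pattern only, not the contents of `pubguard_gate.py`; `gate`, `main`, and `stub_screen` are hypothetical names, and rejecting empty input is an assumption.

```python
import io
import sys

def gate(text, screen_fn):
    """Map a screening verdict to a process exit code: 0 = pass, 1 = reject."""
    if not text.strip():
        return 1  # assumption: empty input is treated as a rejection
    verdict = screen_fn(text)
    return 0 if verdict["pass"] else 1

def main(screen_fn, stream=None):
    # Read the whole document from stdin unless another stream is supplied.
    stream = sys.stdin if stream is None else stream
    return gate(stream.read(), screen_fn)

def stub_screen(text):
    # Hypothetical stand-in for guard.screen, used only to exercise the gate.
    return {"pass": "abstract" in text.lower()}

exit_code = main(stub_screen, io.StringIO("Abstract: we study protein folding."))
print(exit_code)

# In a real script the last line would be something like:
#     sys.exit(main(guard.screen))
```

Keeping the verdict-to-exit-code mapping in a small function like `gate` makes it testable without spawning a subprocess.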
Training Data
Trained on datasets from Hugging Face (15K samples per class):
- doc_type: armanc/scientific_papers + gfissore/arxiv-abstracts-2021 + ag_news + synthetic
- ai_detect: liamdugan/raid (abstracts) + NicolaiSivesind/ChatGPT-Research-Abstracts
- toxicity: google/civil_comments + skg/toxigen-data
Training
cd pub_check
pip install -e ".[train]"
python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
Training completes in ~1 minute on CPU. No GPU needed.
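Training is fast because only a single linear layer is fitted on top of frozen embeddings. The sketch below shows one way to fit such a head with plain gradient descent on a softmax cross-entropy loss; it is an illustration under stated assumptions, not the procedure used by `train_pubguard.py`, and the synthetic two-class data stands in for real embeddings.

```python
import numpy as np

def train_linear_head(X, y, n_classes, lr=0.1, epochs=200):
    # Softmax regression: minimise cross-entropy with full-batch gradient descent.
    n, dim = X.shape
    W = np.zeros((dim, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of the mean loss w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

# Synthetic, well-separated two-class data standing in for frozen embeddings.
rng = np.random.default_rng(0)
direction = rng.normal(size=16)
y = rng.integers(0, 2, size=400)
centers = np.where(y[:, None] == 1, direction, -direction)
X = centers + 0.5 * rng.normal(size=(400, 16))

W, b = train_linear_head(X, y, n_classes=2)
accuracy = float(((X @ W + b).argmax(axis=1) == y).mean())
print(f"training accuracy: {accuracy:.3f}")
```

The resulting `W` and `b` arrays are small enough to be stored with `np.savez`, which is consistent with the few-kilobyte head files described in the Architecture section.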
Citation
Part of the PubVerse + 42DeepThought pipeline by Jamey O'Neill (CALMI2).