PubGuard — Multi-Head Scientific Publication Gatekeeper

PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It runs as a gate in the PubVerse pipeline, rejecting junk (flyers, invoices, posters) before expensive downstream processing.

Architecture

Three linear classification heads on frozen model2vec (potion-base-32M) embeddings:

Head	Classes	Accuracy	Description
doc_type	4	99.9%	scientific_paper | poster | abstract_only | junk
ai_detect	2	83.4%	human | ai_generated
toxicity	2	84.7%	clean | toxic

Each head is a single linear layer stored as a numpy .npz file (8-12 KB each). Inference is pure numpy — no torch needed at prediction time.
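As a rough illustration of what numpy-only inference for one such head can look like, the sketch below saves and reloads a stand-in head from an in-memory .npz archive and applies a linear layer followed by softmax. The array keys `W` and `b`, the toy embedding dimension, and the helper name `predict_head` are assumptions made for this example; the actual PubGuard file layout may differ.

```python
import io

import numpy as np


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def predict_head(embedding, weights, bias, labels):
    # A linear head is one matrix multiply plus a bias, then softmax.
    probs = softmax(embedding @ weights + bias)
    idx = int(np.argmax(probs))
    return {"label": labels[idx], "score": float(probs[idx])}


# Build a stand-in head with random weights (NOT the trained PubGuard head).
rng = np.random.default_rng(0)
dim = 8  # toy size; the real embedding dimension comes from the encoder
labels = ["clean", "toxic"]

buf = io.BytesIO()
np.savez(buf, W=rng.normal(size=(dim, len(labels))), b=np.zeros(len(labels)))
buf.seek(0)

head = np.load(buf)  # hypothetical keys "W" and "b"
embedding = rng.normal(size=dim)  # stands in for a frozen model2vec embedding
result = predict_head(embedding, head["W"], head["b"], labels)
print(result)
```

Because the head is only a weight matrix and a bias vector, loading it needs nothing beyond `np.load`, which is consistent with the small file sizes quoted above.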

Performance

302 docs/sec single-document, 568 docs/sec batched (CPU only)
3.3 ms per PDF screening in single-document mode; negligible pipeline overhead
No GPU required
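The throughput and latency figures above are related by simple arithmetic. The short sketch below converts the quoted docs/sec numbers into milliseconds per document; the helper name `ms_per_doc` is made up for this example.

```python
def ms_per_doc(docs_per_sec):
    # Latency in milliseconds is the reciprocal of throughput in docs/sec.
    return 1000.0 / docs_per_sec


single_ms = ms_per_doc(302)   # single-document throughput quoted above
batched_ms = ms_per_doc(568)  # batched throughput quoted above

print(f"single: {single_ms:.1f} ms/doc, batched: {batched_ms:.1f} ms/doc")
# prints "single: 3.3 ms/doc, batched: 1.8 ms/doc"
```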

Usage

from pubguard import PubGuard

guard = PubGuard()
guard.initialize()

verdict = guard.screen("Introduction: We present a novel deep learning approach...")
# {
#   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
#   'ai_generated': {'label': 'human', 'score': 0.875},
#   'toxicity': {'label': 'clean', 'score': 0.999},
#   'pass': True
# }

Pipeline Integration

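# In run_pubverse_pipeline.sh:
# $(...) captures the gate script's stdout; its exit status is available in $?.
PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
PUBGUARD_STATUS=$?  # illustrative variable name
# exit 0 = pass, exit 1 = reject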
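The exit-code convention above could be implemented on the Python side roughly as follows. This is a minimal sketch, not the contents of pubguard_gate.py: the helpers `exit_code_for` and `run_gate` are hypothetical, and `fake_screen` is a stand-in for `PubGuard.screen` so the example runs without the package installed. It assumes only that a verdict carries a boolean `'pass'` key, as in the Usage output.

```python
def exit_code_for(verdict):
    # Map a verdict dict to a process exit code: 0 = pass, 1 = reject.
    # A missing 'pass' key is treated as a rejection.
    return 0 if verdict.get("pass") else 1


def run_gate(text, screen):
    # 'screen' is any callable that returns a verdict dict for the text.
    return exit_code_for(screen(text))


def fake_screen(text):
    # Hypothetical stand-in for PubGuard.screen; the keyword rule is NOT
    # how PubGuard classifies documents.
    is_paper = "introduction" in text.lower()
    label = "scientific_paper" if is_paper else "junk"
    return {"doc_type": {"label": label, "score": 1.0}, "pass": is_paper}


paper_code = run_gate("Introduction: We present a novel approach...", fake_screen)
junk_code = run_gate("50% off all flyers this weekend!", fake_screen)
print(paper_code, junk_code)
```

In a real gate script, one would presumably read the text with `sys.stdin.read()`, pass `guard.screen` as the callable, and hand the result to `sys.exit`.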

Training Data

Trained on datasets from HuggingFace (15K samples/class):

doc_type: armanc/scientific_papers + gfissore/arxiv-abstracts-2021 + ag_news + synthetic
ai_detect: liamdugan/raid (abstracts) + NicolaiSivesind/ChatGPT-Research-Abstracts
toxicity: google/civil_comments + skg/toxigen-data

Training

cd pub_check
pip install -e ".[train]"
python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000

Training completes in ~1 minute on CPU. No GPU needed.
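Training is fast because the embeddings are frozen, so only a small linear layer per head has to be fitted. The sketch below fits one such layer with plain full-batch gradient descent on softmax cross-entropy, using synthetic clusters in place of real embeddings. It is an illustration under those assumptions, not the procedure in train_pubguard.py; the function name, learning rate, and epoch count are chosen for the example.

```python
import numpy as np


def train_linear_head(X, y, n_classes, lr=0.01, epochs=300):
    # Fit W and b by gradient descent on the mean cross-entropy loss.
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / n  # gradient of the loss w.r.t. the logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


# Synthetic stand-in for frozen embeddings: one Gaussian cluster per class.
rng = np.random.default_rng(0)
n_per_class, dim, n_classes = 200, 16, 4
centers = rng.normal(scale=3.0, size=(n_classes, dim))
X = np.concatenate([c + rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

W, b = train_linear_head(X, y, n_classes)
accuracy = float(((X @ W + b).argmax(axis=1) == y).mean())
print(f"training accuracy: {accuracy:.3f}")
```

The resulting `W` and `b` are the only parameters a head needs, so they can be written out with `np.savez` and reloaded for numpy-only inference.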

Citation

Part of the PubVerse + 42DeepThought pipeline by Jamey O'Neill (CALMI2).
