PubGuard: Multi-Head Scientific Publication Gatekeeper

PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It runs as a gate in the PubVerse pipeline, rejecting junk (flyers, invoices, posters) before expensive downstream processing.

Architecture

Three linear classification heads on frozen model2vec (potion-base-32M) embeddings:

Head       Classes  Accuracy  Description
doc_type   4        99.9%     scientific_paper | poster | abstract_only | junk
ai_detect  2        83.4%     human | ai_generated
toxicity   2        84.7%     clean | toxic

Each head is a single linear layer stored as a numpy .npz file (8-12 KB each). Inference is pure numpy; no torch is needed at prediction time.
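As a rough illustration of what "a single linear layer in pure numpy" can look like, the sketch below applies one head to a batch of embeddings and takes a softmax. The .npz key names ("W", "b"), the embedding dimension of 8, and the random weights are all assumptions made for the example; the real head files may be laid out differently.

```python
import io
import numpy as np

def softmax(z):
    # Subtract the row maximum before exponentiating for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(embeddings, head, labels):
    """Apply one linear head: probs = softmax(X @ W + b)."""
    probs = softmax(embeddings @ head["W"] + head["b"])
    idx = probs.argmax(axis=-1)
    return [{"label": labels[i], "score": float(p[i])} for i, p in zip(idx, probs)]

# Stand-in head with random weights, written to and read from an in-memory .npz.
# The keys "W" and "b" and the dimension 8 are hypothetical.
rng = np.random.default_rng(0)
buf = io.BytesIO()
np.savez(buf,
         W=rng.normal(size=(8, 4)).astype(np.float32),
         b=np.zeros(4, dtype=np.float32))
buf.seek(0)
head = np.load(buf)

labels = ["scientific_paper", "poster", "abstract_only", "junk"]
fake_embeddings = rng.normal(size=(3, 8)).astype(np.float32)  # stand-in for model2vec output
results = predict(fake_embeddings, head, labels)
print(results)
```

Because the embedding model is frozen, only the small weight matrix and bias need to be stored per head, which is consistent with the kilobyte-scale file sizes quoted above.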

Performance

  • 302 docs/sec single-document, 568 docs/sec batched (CPU only)
  • 3.3 ms per PDF screening, so pipeline overhead is negligible
  • No GPU required
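The per-document latency appears to be the reciprocal of the single-document throughput. The short check below converts both throughput figures to milliseconds per document; the helper name `ms_per_doc` is introduced only for this illustration.

```python
def ms_per_doc(docs_per_sec):
    """Convert a throughput in documents/second to milliseconds per document."""
    return 1000.0 / docs_per_sec

single = ms_per_doc(302)   # single-document mode
batched = ms_per_doc(568)  # batched mode
print(f"single: {single:.1f} ms/doc, batched: {batched:.1f} ms/doc")
# prints "single: 3.3 ms/doc, batched: 1.8 ms/doc"
```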

Usage

from pubguard import PubGuard

guard = PubGuard()
guard.initialize()

verdict = guard.screen("Introduction: We present a novel deep learning approach...")
# {
#   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
#   'ai_generated': {'label': 'human', 'score': 0.875},
#   'toxicity': {'label': 'clean', 'score': 0.999},
#   'pass': True
# }
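When a document is rejected, it can be useful to log which head disagreed. The helper below is a minimal sketch that reads a verdict dict shaped like the example above. The `rejection_reasons` function and its table of "expected" labels are assumptions for illustration, and the verdict's own `pass` field should remain the authoritative decision.

```python
def rejection_reasons(verdict):
    """List heads whose label differs from the assumed 'good' label.

    This only reports; it does not decide. How PubGuard derives 'pass'
    from the individual heads is not specified here.
    """
    expected = {  # assumed "good" label per verdict key
        "doc_type": "scientific_paper",
        "ai_generated": "human",
        "toxicity": "clean",
    }
    return [
        f"{head}={verdict[head]['label']} ({verdict[head]['score']:.3f})"
        for head, good in expected.items()
        if verdict[head]["label"] != good
    ]

# The sample verdict from the usage example above.
sample = {
    "doc_type": {"label": "scientific_paper", "score": 0.994},
    "ai_generated": {"label": "human", "score": 0.875},
    "toxicity": {"label": "clean", "score": 0.999},
    "pass": True,
}
print(rejection_reasons(sample))  # prints []
```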

Pipeline Integration

# In run_pubverse_pipeline.sh:
PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
# PUBGUARD_CODE captures the script's stdout; the gate's exit status is in $? (0 = pass, 1 = reject)
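For readers wiring up their own gate, the sketch below shows one way a stdin-to-exit-code wrapper could be structured. It is not the contents of pubguard_gate.py: the `gate`, `main`, and `stub_screen` names are hypothetical, and the stub stands in for `PubGuard.screen` so the example runs without the package installed.

```python
import sys

def gate(text, screen):
    """Map a verdict to a process exit code: 0 = pass, 1 = reject."""
    verdict = screen(text)
    return 0 if verdict.get("pass") else 1

def stub_screen(text):
    # Stand-in for PubGuard.screen; a real gate would call the classifier here.
    return {"pass": "introduction" in text.lower()}

def main(stdin=sys.stdin, screen=stub_screen):
    """Read the document text from stdin and return the exit code."""
    return gate(stdin.read(), screen)

# A real script would end with: sys.exit(main())
```

Taking the screening function as a parameter keeps the exit-code logic testable without loading the embedding model.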

Training Data

Trained on datasets from HuggingFace (15K samples/class):

  • doc_type: armanc/scientific_papers + gfissore/arxiv-abstracts-2021 + ag_news + synthetic
  • ai_detect: liamdugan/raid (abstracts) + NicolaiSivesind/ChatGPT-Research-Abstracts
  • toxicity: google/civil_comments + skg/toxigen-data
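The "15K samples/class" figure implies that each class is capped at the same size even though the source datasets differ in size. The sketch below shows one plausible way to draw such a balanced subset from labeled examples. The `balanced_sample` function is an assumption for illustration, and the actual training script may sample differently.

```python
import random
from collections import defaultdict

def balanced_sample(examples, n_per_class, seed=0):
    """Return up to n_per_class (text, label) pairs for each label."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    out = []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        out.extend(items[:n_per_class])
    return out

# Toy data: 10 "junk" examples and 3 "poster" examples.
toy = [(f"junk {i}", "junk") for i in range(10)] + [(f"poster {i}", "poster") for i in range(3)]
subset = balanced_sample(toy, n_per_class=5)
```

A class with fewer than `n_per_class` examples simply contributes everything it has, so the result is balanced only up to the size of the smallest class.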

Training

cd pub_check
pip install -e ".[train]"
python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000

Training completes in ~1 minute on CPU. No GPU needed.

Citation

Part of the PubVerse + 42DeepThought pipeline by Jamey O'Neill (CALMI2).
