---
license: mit
language:
  - en
tags:
  - document-classification
  - scientific-papers
  - ai-detection
  - toxicity-detection
  - model2vec
  - pubverse
  - publication-screening
  - quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PubGuard.png
---

<div align="center">
  <img src="PubGuard.png" alt="PubGuard Logo" width="400"/>
</div>

# PubGuard — Multi-Head Scientific Publication Gatekeeper

## Model Description

PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It runs as **Step 0** in the PubVerse + 42DeepThought pipeline, rejecting non-publications (posters, abstracts, flyers, invoices) before expensive downstream processing (VLM feature extraction, graph construction, GNN scoring).

Three classification heads provide a multi-dimensional screening verdict:

1. **Document type** — Is this a paper, poster, abstract, or junk?
2. **AI detection** — Was this written by a human or generated by an LLM?
3. **Toxicity** — Does this contain toxic or offensive content?

Developed by Jamey O'Neill at the California Medical Innovations Institute (CalMI²).

## Architecture

Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec) (potion-base-32M) embeddings:

```
┌─────────────┐
│  PDF text   │
└──────┬──────┘
       │
┌──────▼──────┐     ┌───────────────────┐
│  clean_text │────►│  model2vec encode │──► emb ∈ R^512
└─────────────┘     └───────────────────┘
                            │
          ┌─────────────────┼──────────────────┐
          ▼                 ▼                  ▼
┌──────────────────┐ ┌──────────────┐ ┌──────────────┐
│ doc_type head    │ │ ai_detect    │ │ toxicity     │
│ [emb + 14 feats] │ │ head         │ │ head         │
│ → softmax(4)     │ │ → softmax(2) │ │ → softmax(2) │
└──────────────────┘ └──────────────┘ └──────────────┘
```

Each head is a single linear layer stored as a numpy `.npz` file (5–9 KB). Inference is pure numpy — no torch needed at prediction time.
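The per-head inference path is small enough to sketch in full. The weights below are random stand-ins and the array names are assumptions, not the model's actual `.npz` layout:

```python
import numpy as np

# Stand-in weights for one head; the real model loads arrays from an .npz
# file (the names and exact shapes here are illustrative, not the real layout).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 512))  # 4 doc_type classes x 512-dim embedding
b = np.zeros(4)

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def predict_head(emb, W, b):
    """The entire per-head inference path: one matmul plus softmax."""
    return softmax(W @ emb + b)

probs = predict_head(rng.normal(size=512), W, b)
```

A single matmul per head is why prediction stays in the low-millisecond range on CPU.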

The `doc_type` head additionally receives 14 structural features (section headings present, citation density, sentence length, etc.) concatenated with the embedding; these act as strong priors on the document type.
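A rough sketch of how a few features of this kind could be computed. The real 14-feature set is not documented here, so the names and heuristics below are illustrative guesses, not the model's actual feature extractor:

```python
import re

def structural_features(text):
    """Illustrative subset of structural cues: heading flags, citation
    density, and mean sentence length. Names are hypothetical."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    citations = re.findall(r"\[\d+\]|\(\w+,\s*\d{4}\)", text)  # [1] or (Smith, 2020)
    return {
        "has_abstract": int(bool(re.search(r"\babstract\b", text, re.I))),
        "has_references": int(bool(re.search(r"\breferences\b", text, re.I))),
        "citation_density": len(citations) / max(len(words), 1),
        "mean_sentence_len": sum(len(s.split()) for s in sentences)
                             / max(len(sentences), 1),
    }

feats = structural_features("Abstract. We cite [1] and [2]. References follow.")
```

Features like these are cheap to compute and separate papers from posters and junk far more sharply than embeddings alone.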

## Performance

| Head | Classes | Accuracy | F1 |
|------|---------|----------|-----|
| **doc_type** | 4 | **99.7%** | 0.997 |
| **ai_detect** | 2 | 83.4% | 0.834 |
| **toxicity** | 2 | 84.7% | 0.847 |

### doc_type Breakdown

| Class | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| scientific_paper | 1.000 | 1.000 | 1.000 |
| poster | 0.989 | 0.974 | 0.981 |
| abstract_only | 0.997 | 0.997 | 0.997 |
| junk | 0.993 | 0.998 | 0.996 |

### Throughput

- **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
- **3.3 ms** per PDF screening — negligible pipeline overhead
- No GPU required

## Gate Logic

Only `scientific_paper` passes the gate. Everything else — posters, standalone abstracts, junk — is blocked. The PubVerse pipeline processes **publications only**.

```
scientific_paper  →  ✅ PASS
poster            →  ❌ BLOCKED  (classified, but not a publication)
abstract_only     →  ❌ BLOCKED
junk              →  ❌ BLOCKED
```

AI detection and toxicity are **informational by default** — reported but not blocking.
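In code, the gate reduces to a few checks over the verdict dict. The `block_on_*` switches below are hypothetical knobs for promoting the informational heads to blocking, not part of the published API:

```python
def passes_gate(verdict, block_on_ai=False, block_on_toxicity=False):
    """Only scientific_paper passes; AI and toxicity verdicts block only
    when explicitly enabled (these keyword flags are illustrative)."""
    if verdict["doc_type"]["label"] != "scientific_paper":
        return False
    if block_on_ai and verdict["ai_generated"]["label"] != "human":
        return False
    if block_on_toxicity and verdict["toxicity"]["label"] != "clean":
        return False
    return True

paper = {"doc_type": {"label": "scientific_paper"},
         "ai_generated": {"label": "human"},
         "toxicity": {"label": "clean"}}
poster = {**paper, "doc_type": {"label": "poster"}}
```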

## Usage

### Python API

```python
from pubguard import PubGuard

guard = PubGuard()
guard.initialize()

verdict = guard.screen("Introduction: We present a novel deep learning approach...")
print(verdict)
# {
#   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
#   'ai_generated': {'label': 'human', 'score': 0.875},
#   'toxicity': {'label': 'clean', 'score': 0.999},
#   'pass': True
# }
```

### Pipeline Integration (bash)

```bash
# Step 0 in run_pubverse_pipeline.sh:
PDF_TEXT=$(python3 -c "import fitz; d=fitz.open('$pdf'); print(' '.join(p.get_text() for p in d)[:8000])")
echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null
PUBGUARD_CODE=$?   # exit 0 = pass, exit 1 = reject
```

### Installation

```bash
pip install git+https://github.com/jimnoneill/pubguard.git
```

With training dependencies:

```bash
pip install "pubguard[train] @ git+https://github.com/jimnoneill/pubguard.git"
```

## Training Data

Trained on real datasets from HuggingFace — **zero synthetic junk data**:

| Head | Sources | Samples |
|------|---------|---------|
| **doc_type** | armanc/scientific_papers, gfissore/arxiv-abstracts-2021, ag_news, [poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data) | ~55K |
| **ai_detect** | liamdugan/raid (abstracts), NicolaiSivesind/ChatGPT-Research-Abstracts | ~30K |
| **toxicity** | google/civil_comments, skg/toxigen-data | ~30K |

The poster class uses real scientific poster text from the [posters.science](https://posters.science) corpus (28K+ verified posters from Zenodo & Figshare), extracted by [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry).

### Training

```bash
python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
```

Training completes in ~1 minute on CPU. No GPU needed.

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Structural features | 14 (doc_type head only) |
| Classifier | LogisticRegression (sklearn) per head |
| Head file sizes | 5–9 KB each (.npz) |
| Total model size | ~125 MB (embedding) + 20 KB (heads) |
| Precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |

## Citation

```bibtex
@software{pubguard_2026,
  title = {PubGuard: Multi-Head Scientific Publication Gatekeeper},
  author = {O'Neill, James},
  year = {2026},
  url = {https://huggingface.co/jimnoneill/pubguard-classifier},
  note = {Part of the PubVerse + 42DeepThought pipeline}
}
```

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

- California Medical Innovations Institute (CalMIΒ²)
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- [FAIR Data Innovations Hub](https://fairdataihub.org/) for the [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry) training data
- HuggingFace for model hosting infrastructure