---
license: mit
language:
  - en
tags:
  - document-classification
  - scientific-papers
  - ai-detection
  - toxicity-detection
  - model2vec
  - pubverse
  - publication-screening
  - quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PubGuard.png
---

<div align="center">
  <img src="PubGuard.png" alt="PubGuard Logo" width="400"/>
</div>

# PubGuard — Multi-Head Scientific Publication Gatekeeper

## Model Description

PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It runs as **Step 0** in the PubVerse + 42DeepThought pipeline, rejecting non-publications (posters, abstracts, flyers, invoices) before expensive downstream processing (VLM feature extraction, graph construction, GNN scoring).

Three classification heads provide a multi-dimensional screening verdict:

1. **Document type** — Is this a paper, poster, abstract, or junk?
2. **AI detection** — Was this written by a human or generated by an LLM?
3. **Toxicity** — Does this contain toxic or offensive content?

Developed by Jamey O'Neill at the California Medical Innovations Institute (CalMI²).

## Architecture

Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec) (potion-base-32M) embeddings:

```
┌─────────────┐
│  PDF text   │
└──────┬──────┘
       │
┌──────▼──────┐     ┌───────────────────┐
│  clean_text │────►│  model2vec encode │──► emb ∈ R^512
└─────────────┘     └───────────────────┘
                            │
          ┌─────────────────┼──────────────────┐
          ▼                 ▼                  ▼
┌──────────────────┐ ┌──────────────┐ ┌──────────────┐
│ doc_type head    │ │ ai_detect    │ │ toxicity     │
│ [emb + 14 feats] │ │ head         │ │ head         │
│ → softmax(4)     │ │ → softmax(2) │ │ → softmax(2) │
└──────────────────┘ └──────────────┘ └──────────────┘
```

Each head is a single linear layer stored as a numpy `.npz` file (5–9 KB). Inference is pure numpy — no torch needed at prediction time.
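The per-head inference path is small enough to sketch in full. The weights below are random stand-ins and the array names are assumptions, not the model's actual `.npz` layout:

```python
import numpy as np

# Stand-in weights for one head; the real model loads arrays from an .npz
# file (the names and exact shapes here are illustrative, not the real layout).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 512))  # 4 doc_type classes x 512-dim embedding
b = np.zeros(4)

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def predict_head(emb, W, b):
    """The entire per-head inference path: one matmul plus softmax."""
    return softmax(W @ emb + b)

probs = predict_head(rng.normal(size=512), W, b)
```

A single matmul per head is why prediction stays in the low-millisecond range on CPU.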

The `doc_type` head additionally receives 14 structural features (section headings present, citation density, sentence length, etc.) concatenated with the embedding; these act as strong priors on the document type.
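A rough sketch of how a few features of this kind could be computed. The real 14-feature set is not documented here, so the names and heuristics below are illustrative guesses, not the model's actual feature extractor:

```python
import re

def structural_features(text):
    """Illustrative subset of structural cues: heading flags, citation
    density, and mean sentence length. Names are hypothetical."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    citations = re.findall(r"\[\d+\]|\(\w+,\s*\d{4}\)", text)  # [1] or (Smith, 2020)
    return {
        "has_abstract": int(bool(re.search(r"\babstract\b", text, re.I))),
        "has_references": int(bool(re.search(r"\breferences\b", text, re.I))),
        "citation_density": len(citations) / max(len(words), 1),
        "mean_sentence_len": sum(len(s.split()) for s in sentences)
                             / max(len(sentences), 1),
    }

feats = structural_features("Abstract. We cite [1] and [2]. References follow.")
```

Features like these are cheap to compute and separate papers from posters and junk far more sharply than embeddings alone.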

## Performance

| Head | Classes | Accuracy | F1 |
|------|---------|----------|-----|
| **doc_type** | 4 | **99.7%** | 0.997 |
| **ai_detect** | 2 | 83.4% | 0.834 |
| **toxicity** | 2 | 84.7% | 0.847 |

### doc_type Breakdown

| Class | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| scientific_paper | 1.000 | 1.000 | 1.000 |
| poster | 0.989 | 0.974 | 0.981 |
| abstract_only | 0.997 | 0.997 | 0.997 |
| junk | 0.993 | 0.998 | 0.996 |

### Throughput

- **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
- **3.3 ms** per PDF screening — negligible pipeline overhead
- No GPU required

## Gate Logic

Only `scientific_paper` passes the gate. Everything else — posters, standalone abstracts, junk — is blocked. The PubVerse pipeline processes **publications only**.

```
scientific_paper  →  ✅ PASS
poster            →  ❌ BLOCKED  (classified, but not a publication)
abstract_only     →  ❌ BLOCKED
junk              →  ❌ BLOCKED
```

AI detection and toxicity are **informational by default** — reported but not blocking.
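In code, the gate reduces to a few checks over the verdict dict. The `block_on_*` switches below are hypothetical knobs for promoting the informational heads to blocking, not part of the published API:

```python
def passes_gate(verdict, block_on_ai=False, block_on_toxicity=False):
    """Only scientific_paper passes; AI and toxicity verdicts block only
    when explicitly enabled (these keyword flags are illustrative)."""
    if verdict["doc_type"]["label"] != "scientific_paper":
        return False
    if block_on_ai and verdict["ai_generated"]["label"] != "human":
        return False
    if block_on_toxicity and verdict["toxicity"]["label"] != "clean":
        return False
    return True

paper = {"doc_type": {"label": "scientific_paper"},
         "ai_generated": {"label": "human"},
         "toxicity": {"label": "clean"}}
poster = {**paper, "doc_type": {"label": "poster"}}
```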

## Usage

### Python API

```python
from pubguard import PubGuard

guard = PubGuard()
guard.initialize()

verdict = guard.screen("Introduction: We present a novel deep learning approach...")
print(verdict)
# {
#   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
#   'ai_generated': {'label': 'human', 'score': 0.875},
#   'toxicity': {'label': 'clean', 'score': 0.999},
#   'pass': True
# }
```

### Pipeline Integration (bash)

```bash
# Step 0 in run_pubverse_pipeline.sh:
PDF_TEXT=$(python3 -c "import fitz; d=fitz.open('$pdf'); print(' '.join(p.get_text() for p in d)[:8000])")
echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null
PUBGUARD_CODE=$?   # exit 0 = pass, exit 1 = reject
```

### Installation

```bash
pip install git+https://github.com/jimnoneill/pubguard.git
```

With training dependencies:

```bash
pip install "pubguard[train] @ git+https://github.com/jimnoneill/pubguard.git"
```

## Training Data

Trained on real datasets from HuggingFace — **zero synthetic junk data**:

| Head | Sources | Samples |
|------|---------|---------|
| **doc_type** | armanc/scientific_papers, gfissore/arxiv-abstracts-2021, ag_news, [poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data) | ~55K |
| **ai_detect** | liamdugan/raid (abstracts), NicolaiSivesind/ChatGPT-Research-Abstracts | ~30K |
| **toxicity** | google/civil_comments, skg/toxigen-data | ~30K |

The poster class uses real scientific poster text from the [posters.science](https://posters.science) corpus (28K+ verified posters from Zenodo & Figshare), extracted by [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry).

### Training

```bash
python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
```

Training completes in ~1 minute on CPU. No GPU needed.

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Structural features | 14 (doc_type head only) |
| Classifier | LogisticRegression (sklearn) per head |
| Head file sizes | 5–9 KB each (.npz) |
| Total model size | ~125 MB (embedding) + 20 KB (heads) |
| Precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |

## Citation

```bibtex
@software{pubguard_2026,
  title = {PubGuard: Multi-Head Scientific Publication Gatekeeper},
  author = {O'Neill, James},
  year = {2026},
  url = {https://huggingface.co/jimnoneill/pubguard-classifier},
  note = {Part of the PubVerse + 42DeepThought pipeline}
}
```

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

- California Medical Innovations Institute (CalMIΒ²)
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- [FAIR Data Innovations Hub](https://fairdataihub.org/) for the [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry) training data
- HuggingFace for model hosting infrastructure