Commit f2aa9cc (verified) by jimnoneill · 1 parent: f67948a

Updated README — logo, architecture diagram, real training data, gate logic

  1. README.md +124 -23
README.md CHANGED
@@ -9,39 +9,96 @@ tags:
9
  - toxicity-detection
10
  - model2vec
11
  - pubverse
 
 
12
  library_name: model2vec
13
  pipeline_tag: text-classification
 
14
  ---
15
 
 
 
 
 
16
  # PubGuard — Multi-Head Scientific Publication Gatekeeper
17
 
18
- PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text
19
- to determine whether it represents a genuine scientific publication. It runs as a
20
- gate in the [PubVerse](https://github.com/jimnoneill) pipeline, rejecting junk
21
- (flyers, invoices, posters) before expensive downstream processing.
 
 
 
 
 
 
 
22
 
23
  ## Architecture
24
 
25
- Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec)
26
- (potion-base-32M) embeddings:
27
 
28
- | Head | Classes | Accuracy | Description |
29
- |------|---------|----------|-------------|
30
- | **doc_type** | 4 | 99.9% | scientific_paper \| poster \| abstract_only \| junk |
31
- | **ai_detect** | 2 | 83.4% | human \| ai_generated |
32
- | **toxicity** | 2 | 84.7% | clean \| toxic |
 
 
 
 
 
 
 
 
 
 
 
 
33
 
34
- Each head is a single linear layer stored as a numpy `.npz` file (8-12 KB each).
35
- Inference is pure numpy — no torch needed at prediction time.
 
36
 
37
  ## Performance
38
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
39
  - **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
40
  - **3.3ms** per PDF screening — negligible pipeline overhead
41
  - No GPU required
42
 
 
 
 
 
 
 
 
 
 
 
 
 
43
  ## Usage
44
 
 
 
45
  ```python
46
  from pubguard import PubGuard
47
 
@@ -49,6 +106,7 @@ guard = PubGuard()
49
  guard.initialize()
50
 
51
  verdict = guard.screen("Introduction: We present a novel deep learning approach...")
 
52
  # {
53
  # 'doc_type': {'label': 'scientific_paper', 'score': 0.994},
54
  # 'ai_generated': {'label': 'human', 'score': 0.875},
@@ -57,32 +115,75 @@ verdict = guard.screen("Introduction: We present a novel deep learning approach.
57
  # }
58
  ```
59
 
60
- ## Pipeline Integration
61
 
62
  ```bash
63
- # In run_pubverse_pipeline.sh:
 
64
  PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
65
  # exit 0 = pass, exit 1 = reject
66
  ```
67
 
 
 
 
 
 
 
 
68
  ## Training Data
69
 
70
- Trained on datasets from HuggingFace (15K samples/class):
71
 
72
- - **doc_type**: armanc/scientific_papers + gfissore/arxiv-abstracts-2021 + ag_news + synthetic
73
- - **ai_detect**: liamdugan/raid (abstracts) + NicolaiSivesind/ChatGPT-Research-Abstracts
74
- - **toxicity**: google/civil_comments + skg/toxigen-data
 
 
75
 
76
- ## Training
 
 
77
 
78
  ```bash
79
- cd pub_check
80
- pip install -e ".[train]"
81
  python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
82
  ```
83
 
84
  Training completes in ~1 minute on CPU. No GPU needed.
85
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
86
  ## Citation
87
 
88
- Part of the PubVerse + 42DeepThought pipeline by Jamey O'Neill (CALMI2).
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9
  - toxicity-detection
10
  - model2vec
11
  - pubverse
12
+ - publication-screening
13
+ - quality-control
14
  library_name: model2vec
15
  pipeline_tag: text-classification
16
+ thumbnail: PubGuard.png
17
  ---
18
 
19
+ <div align="center">
20
+ <img src="PubGuard.png" alt="PubGuard Logo" width="400"/>
21
+ </div>
22
+
23
  # PubGuard — Multi-Head Scientific Publication Gatekeeper
24
 
25
+ ## Model Description
26
+
27
+ PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It runs as **Step 0** in the PubVerse + 42DeepThought pipeline, rejecting junk (flyers, invoices, non-scholarly PDFs) before expensive downstream processing (VLM feature extraction, graph construction, GNN scoring).
28
+
29
+ Three classification heads provide a multi-dimensional screening verdict:
30
+
31
+ 1. **Document type** — Is this a paper, poster, abstract, or junk?
32
+ 2. **AI detection** — Was this written by a human or generated by an LLM?
33
+ 3. **Toxicity** — Does this contain toxic or offensive content?
34
+
35
+ Developed by Jamey O'Neill at the California Medical Innovations Institute (CalMI²).
36
 
37
  ## Architecture
38
 
39
+ Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec) (potion-base-32M) embeddings:
 
40
 
41
+ ```
42
+ ┌─────────────┐
+ │  PDF text   │
+ └──────┬──────┘
+        │
+ ┌──────▼──────┐     ┌───────────────────┐
+ │ clean_text  │────►│ model2vec encode  │──► emb ∈ R^512
+ └─────────────┘     └───────────────────┘
+                               │
+             ┌─────────────────┼─────────────────┐
+             ▼                 ▼                 ▼
+   ┌──────────────────┐ ┌──────────────┐ ┌──────────────┐
+   │ doc_type head    │ │ ai_detect    │ │ toxicity     │
+   │ [emb + 14 feats] │ │ head         │ │ head         │
+   │ → softmax(4)     │ │ → softmax(2) │ │ → softmax(2) │
+   └──────────────────┘ └──────────────┘ └──────────────┘
57
+ ```
58
 
59
+ Each head is a single linear layer stored as a numpy `.npz` file (8–12 KB). Inference is pure numpy; no torch is needed at prediction time.
60
+
61
+ The `doc_type` head additionally receives 14 structural features (section headings present, citation density, sentence length, etc.) concatenated with the embedding, giving a 526-dimensional input (512 + 14). These features act as strong prior-like signals about document structure.
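
The head computation described above can be sketched in a few lines of numpy. This is a minimal illustration under assumptions, not PubGuard's actual code: the array keys `W` and `b`, the helper names `softmax` and `predict_head`, and the random weights are all hypothetical. Only the 512-dimensional embedding, the 14 structural features, and the 4-class output come from this README.

```python
import io
import numpy as np

EMB_DIM = 512    # embedding dimension stated in this README
N_STRUCT = 14    # structural features, doc_type head only
N_CLASSES = 4    # scientific_paper, poster, abstract_only, junk

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_head(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Single linear layer plus softmax: x is (n, d), W is (d, k), b is (k,)."""
    return softmax(x @ W + b)

rng = np.random.default_rng(0)

# Hypothetical head weights, round-tripped through an in-memory .npz archive.
# The key names "W" and "b" are assumptions, not the real file layout.
buf = io.BytesIO()
np.savez(
    buf,
    W=rng.normal(size=(EMB_DIM + N_STRUCT, N_CLASSES)).astype(np.float32),
    b=np.zeros(N_CLASSES, dtype=np.float32),
)
buf.seek(0)
head = np.load(buf)

# Random stand-ins for one document's embedding and structural features.
emb = rng.normal(size=(1, EMB_DIM)).astype(np.float32)
feats = rng.random(size=(1, N_STRUCT)).astype(np.float32)

# doc_type input: embedding concatenated with the structural features.
x = np.concatenate([emb, feats], axis=1)
probs = predict_head(x, head["W"], head["b"])
```

The `ai_detect` and `toxicity` heads would follow the same pattern with the embedding alone as input and two output classes.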
62
 
63
  ## Performance
64
 
65
+ | Head | Classes | Accuracy | F1 |
66
+ |------|---------|----------|-----|
67
+ | **doc_type** | 4 | **99.7%** | 0.997 |
68
+ | **ai_detect** | 2 | 83.4% | 0.834 |
69
+ | **toxicity** | 2 | 84.7% | 0.847 |
70
+
71
+ ### doc_type Breakdown
72
+
73
+ | Class | Precision | Recall | F1 |
74
+ |-------|-----------|--------|-----|
75
+ | scientific_paper | 1.000 | 1.000 | 1.000 |
76
+ | poster | 0.989 | 0.974 | 0.981 |
77
+ | abstract_only | 0.997 | 0.997 | 0.997 |
78
+ | junk | 0.993 | 0.998 | 0.996 |
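
As a sanity check on the table above, F1 can be recomputed from the reported precision and recall as their harmonic mean. The helper name `f1` is just for illustration, and because the inputs are already rounded to three decimals, the recomputed value can differ from the reported one in the last digit.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the doc_type breakdown table.
rows = {
    "scientific_paper": (1.000, 1.000),
    "poster": (0.989, 0.974),
    "abstract_only": (0.997, 0.997),
    "junk": (0.993, 0.998),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1(p, r):.3f}")
```

With these rounded inputs, `poster` reproduces the reported 0.981, while `junk` comes out at 0.995 rather than 0.996, which is within the rounding error of the inputs.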
79
+
80
+ ### Throughput
81
+
82
  - **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
83
  - **3.3ms** per PDF screening — negligible pipeline overhead
84
  - No GPU required
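
The per-document latency follows directly from the throughput figures. A short conversion, with the function name `ms_per_doc` chosen here for illustration:

```python
def ms_per_doc(docs_per_sec: float) -> float:
    """Convert a throughput in documents/second to milliseconds/document."""
    return 1000.0 / docs_per_sec

single = ms_per_doc(302)   # single-document mode
batched = ms_per_doc(568)  # batched mode

print(f"single: {single:.1f} ms, batched: {batched:.1f} ms")
# single: 3.3 ms, batched: 1.8 ms
```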
85
 
86
+ ## Gate Logic
87
+
88
+ Both `scientific_paper` and `poster` classifications **pass** the gate (both are valid scientific content). Only `abstract_only` and `junk` are blocked:
89
+
90
+ ```python
91
+ verdict = guard.screen(text)
92
+ # verdict['pass'] = True if doc_type in ('scientific_paper', 'poster')
93
+ # verdict['pass'] = False if doc_type in ('abstract_only', 'junk')
94
+ ```
95
+
96
+ AI detection and toxicity are **informational by default** — reported but not blocking.
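
The gate rule can be written out as a small predicate. This is a sketch that mirrors the rule stated above, not the library's implementation: `gate_passes` and `PASSING_DOC_TYPES` are hypothetical names, and the example verdicts are hand-built in the shape shown in the Usage section.

```python
PASSING_DOC_TYPES = frozenset({"scientific_paper", "poster"})

def gate_passes(verdict: dict) -> bool:
    """Pass only on doc_type; ai_generated and toxicity are informational."""
    return verdict["doc_type"]["label"] in PASSING_DOC_TYPES

# Hand-built verdicts for illustration (scores are made up).
paper = {
    "doc_type": {"label": "scientific_paper", "score": 0.99},
    "ai_generated": {"label": "ai_generated", "score": 0.91},
}
flyer = {
    "doc_type": {"label": "junk", "score": 0.97},
    "ai_generated": {"label": "human", "score": 0.88},
}

print(gate_passes(paper))  # True: an AI-generated label does not block
print(gate_passes(flyer))  # False: junk is rejected
```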
97
+
98
  ## Usage
99
 
100
+ ### Python API
101
+
102
  ```python
103
  from pubguard import PubGuard
104
 
 
106
  guard.initialize()
107
 
108
  verdict = guard.screen("Introduction: We present a novel deep learning approach...")
109
+ print(verdict)
110
  # {
111
  # 'doc_type': {'label': 'scientific_paper', 'score': 0.994},
112
  # 'ai_generated': {'label': 'human', 'score': 0.875},
 
115
  # }
116
  ```
117
 
118
+ ### Pipeline Integration (bash)
119
 
120
  ```bash
121
+ # Step 0 in run_pubverse_pipeline.sh:
122
+ PDF_TEXT=$(python3 -c "import fitz; d=fitz.open('$pdf'); print(' '.join(p.get_text() for p in d)[:8000])")
123
  PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
124
  # exit 0 = pass, exit 1 = reject
125
  ```
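
For readers wiring the gate into their own scripts, the mapping from screened text to an exit code might look like the sketch below. It is not the contents of `pubguard_gate.py`: `gate_exit_code`, `MAX_CHARS`, and `fake_screen` are hypothetical, the stand-in classifier replaces `guard.screen` so the example runs without the package, and treating empty text as a reject is an assumption.

```python
from typing import Callable, Mapping

MAX_CHARS = 8000  # same truncation as the bash snippet above

def gate_exit_code(text: str, screen: Callable[[str], Mapping]) -> int:
    """Return 0 (pass) or 1 (reject) for the extracted PDF text."""
    text = text[:MAX_CHARS].strip()
    if not text:
        return 1  # assumption: an empty extraction is treated as a reject
    verdict = screen(text)
    return 0 if verdict.get("pass") else 1

def fake_screen(text: str) -> dict:
    """Stand-in for guard.screen, for illustration only."""
    label = "scientific_paper" if "Introduction" in text else "junk"
    return {
        "doc_type": {"label": label},
        "pass": label in ("scientific_paper", "poster"),
    }

code_ok = gate_exit_code("Introduction: We present a novel approach...", fake_screen)
code_bad = gate_exit_code("SALE! 50% off all items this weekend", fake_screen)
```

In a real script, the text would come from `sys.stdin.read()` and the result would be passed to `sys.exit`, so the shell can test `$?` after the call.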
126
 
127
+ ### Installation
128
+
129
+ ```bash
130
+ cd pub_check
131
+ pip install -e ".[train]"
132
+ ```
133
+
134
  ## Training Data
135
 
136
+ Trained on real datasets from HuggingFace, with **zero synthetic junk data**:
137
 
138
+ | Head | Sources | Samples |
139
+ |------|---------|---------|
140
+ | **doc_type** | armanc/scientific_papers, gfissore/arxiv-abstracts-2021, ag_news, [poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data) | ~55K |
141
+ | **ai_detect** | liamdugan/raid (abstracts), NicolaiSivesind/ChatGPT-Research-Abstracts | ~30K |
142
+ | **toxicity** | google/civil_comments, skg/toxigen-data | ~30K |
143
 
144
+ The poster class uses real scientific poster text from the [posters.science](https://posters.science) corpus (28K+ verified posters from Zenodo & Figshare), extracted by [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry).
145
+
146
+ ### Training
147
 
148
  ```bash
 
 
149
  python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
150
  ```
151
 
152
  Training completes in ~1 minute on CPU. No GPU needed.
153
 
154
+ ## Model Specifications
155
+
156
+ | Attribute | Value |
157
+ |-----------|-------|
158
+ | Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
159
+ | Embedding dimension | 512 |
160
+ | Structural features | 14 (doc_type head only) |
161
+ | Classifier | LogisticRegression (sklearn) per head |
162
+ | Head file sizes | 5–9 KB each (.npz) |
163
+ | Total model size | ~125 MB (embedding) + 20 KB (heads) |
164
+ | Precision | float32 |
165
+ | GPU required | No (CPU-only) |
166
+ | License | MIT |
167
+
168
  ## Citation
169
 
170
+ ```bibtex
171
+ @software{pubguard_2026,
172
+ title = {PubGuard: Multi-Head Scientific Publication Gatekeeper},
173
+ author = {O'Neill, James},
174
+ year = {2026},
175
+ url = {https://huggingface.co/jimnoneill/pubguard-classifier},
176
+ note = {Part of the PubVerse + 42DeepThought pipeline}
177
+ }
178
+ ```
179
+
180
+ ## License
181
+
182
+ This model is released under the [MIT License](https://opensource.org/licenses/MIT).
183
+
184
+ ## Acknowledgments
185
+
186
+ - California Medical Innovations Institute (CalMI²)
187
+ - [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
188
+ - [FAIR Data Innovations Hub](https://fairdataihub.org/) for the [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry) training data
189
+ - HuggingFace for model hosting infrastructure