@@ -9,39 +9,96 @@ tags:
   - toxicity-detection
   - model2vec
   - pubverse
 library_name: model2vec
 pipeline_tag: text-classification
 ---
 # PubGuard — Multi-Head Scientific Publication Gatekeeper
-PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text
-to determine whether it represents a genuine scientific publication. It runs as a
-gate in the [PubVerse](https://github.com/jimnoneill) pipeline, rejecting junk
-(flyers, invoices, posters) before expensive downstream processing.
 ## Architecture
-Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec)
-(potion-base-32M) embeddings:
-| Head | Classes | Accuracy | Description |
-|------|---------|----------|-------------|
-| **doc_type** | 4 | 99.9% | scientific_paper \| poster \| abstract_only \| junk |
-| **ai_detect** | 2 | 83.4% | human \| ai_generated |
-| **toxicity** | 2 | 84.7% | clean \| toxic |
-Each head is a single linear layer stored as a numpy `.npz` file (8-12 KB each).
-Inference is pure numpy — no torch needed at prediction time.
 ## Performance
 - **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
 - **3.3ms** per PDF screening — negligible pipeline overhead
 - No GPU required
 ## Usage
 ```python
 from pubguard import PubGuard
@@ -49,6 +106,7 @@ guard = PubGuard()
 guard.initialize()
 verdict = guard.screen("Introduction: We present a novel deep learning approach...")
 # {
 #   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
 #   'ai_generated': {'label': 'human', 'score': 0.875},
@@ -57,32 +115,75 @@ verdict = guard.screen("Introduction: We present a novel deep learning approach.
 # }
 ```
-## Pipeline Integration
 ```bash
-# In run_pubverse_pipeline.sh:
 PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
 # exit 0 = pass, exit 1 = reject
 ```
 ## Training Data
-Trained on datasets from HuggingFace (15K samples/class):
-- **doc_type**: armanc/scientific_papers + gfissore/arxiv-abstracts-2021 + ag_news + synthetic
-- **ai_detect**: liamdugan/raid (abstracts) + NicolaiSivesind/ChatGPT-Research-Abstracts
-- **toxicity**: google/civil_comments + skg/toxigen-data
-## Training
 ```bash
-cd pub_check
-pip install -e ".[train]"
 python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
 ```
 Training completes in ~1 minute on CPU. No GPU needed.
 ## Citation
-Part of the PubVerse + 42DeepThought pipeline by Jamey O'Neill (CALMI2).

   - toxicity-detection
   - model2vec
   - pubverse
+  - publication-screening
+  - quality-control
 library_name: model2vec
 pipeline_tag: text-classification
+thumbnail: PubGuard.png
 ---
+<div align="center">
+  <img src="PubGuard.png" alt="PubGuard Logo" width="400"/>
+</div>
 # PubGuard — Multi-Head Scientific Publication Gatekeeper
+## Model Description
+PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It runs as **Step 0** in the PubVerse + 42DeepThought pipeline, rejecting junk (flyers, invoices, non-scholarly PDFs) before expensive downstream processing (VLM feature extraction, graph construction, GNN scoring).
+Three classification heads provide a multi-dimensional screening verdict:
+1. **Document type** — Is this a paper, poster, abstract, or junk?
+2. **AI detection** — Was this written by a human or generated by an LLM?
+3. **Toxicity** — Does this contain toxic or offensive content?
+Developed by Jamey O'Neill at the California Medical Innovations Institute (CalMI²).
 ## Architecture
+Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec) (potion-base-32M) embeddings:
+```
+┌─────────────┐
+│  PDF text    │
+└──────┬──────┘
+       │
+┌──────▼──────┐     ┌───────────────────┐
+│  clean_text │────►│  model2vec encode  │──► emb ∈ R^512
+└─────────────┘     └───────────────────┘
+                            │
+          ┌─────────────────┼─────────────────┐
+          ▼                 ▼                  ▼
+┌─────────────────┐ ┌──────────────┐ ┌──────────────┐
+│ doc_type head    │ │ ai_detect    │ │ toxicity     │
+│ [emb + 14 feats] │ │ head         │ │ head         │
+│ → softmax(4)     │ │ → softmax(2) │ │ → softmax(2) │
+└─────────────────┘ └──────────────┘ └──────────────┘
+```
+Each head is a single linear layer stored as a numpy `.npz` file (8–12 KB). Inference is pure numpy — no torch needed at prediction time.
+The `doc_type` head additionally receives 14 structural features (section headings present, citation density, sentence length, etc.) concatenated with the embedding; these give the linear classifier explicit structural signals alongside the semantic embedding.
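As a rough illustration of the head design described above, the sketch below runs one linear head in pure numpy. The `.npz` key names (`W`, `b`), the random weights, and the helper names are assumptions made for illustration, not PubGuard's actual file layout; only the shapes (512-dim embedding, 14 structural features, 4 classes) come from this card.

```python
import io
import numpy as np

EMB_DIM, N_FEATS, N_CLASSES = 512, 14, 4  # dimensions taken from this model card

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head_predict(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Single linear layer followed by softmax."""
    return softmax(x @ W + b)

# Stand-in head: random float32 weights written to an in-memory .npz.
# The key names "W" and "b" are hypothetical.
rng = np.random.default_rng(0)
buf = io.BytesIO()
np.savez(
    buf,
    W=rng.normal(size=(EMB_DIM + N_FEATS, N_CLASSES)).astype(np.float32),
    b=np.zeros(N_CLASSES, dtype=np.float32),
)
buf.seek(0)
head = np.load(buf)

emb = rng.normal(size=EMB_DIM).astype(np.float32)  # placeholder for the model2vec embedding
feats = rng.random(N_FEATS).astype(np.float32)     # placeholder for the 14 structural features
x = np.concatenate([emb, feats])                   # doc_type input: [emb + 14 feats]

probs = head_predict(x, head["W"], head["b"])
print(probs.shape, float(probs.sum()))
```

The `ai_detect` and `toxicity` heads would follow the same pattern with the embedding alone as input and two output classes.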
 ## Performance
+| Head | Classes | Accuracy | F1 |
+|------|---------|----------|-----|
+| **doc_type** | 4 | **99.7%** | 0.997 |
+| **ai_detect** | 2 | 83.4% | 0.834 |
+| **toxicity** | 2 | 84.7% | 0.847 |
+### doc_type Breakdown
+| Class | Precision | Recall | F1 |
+|-------|-----------|--------|-----|
+| scientific_paper | 1.000 | 1.000 | 1.000 |
+| poster | 0.989 | 0.974 | 0.981 |
+| abstract_only | 0.997 | 0.997 | 0.997 |
+| junk | 0.993 | 0.998 | 0.996 |
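For readers who want to check the breakdown, F1 is the harmonic mean of precision and recall. The short sketch below recomputes the `poster` row from the table; the function name is illustrative and not part of PubGuard.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

poster_f1 = f1_score(0.989, 0.974)  # precision and recall from the poster row
print(round(poster_f1, 3))  # 0.981
```

Because the table reports rounded precision and recall, recomputing from those values can differ from a listed F1 in the last digit.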
+### Throughput
 - **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
 - **3.3ms** per PDF screening — negligible pipeline overhead
 - No GPU required
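The per-document latency follows from the single-document rate (1000 ms / 302 ≈ 3.3 ms). A minimal sketch of how such numbers could be measured is shown below; the helper names and the stand-in workload are assumptions, and this is not the benchmark script used to produce the figures above.

```python
import time
from typing import Callable, Sequence

def docs_per_second(fn: Callable[[str], object], docs: Sequence[str], repeats: int = 3) -> float:
    """Best-of-N throughput of fn over docs, in documents per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            fn(doc)
        best = min(best, time.perf_counter() - start)
    return len(docs) / best

def latency_ms(rate: float) -> float:
    """Convert a docs/sec rate into milliseconds per document."""
    return 1000.0 / rate

print(round(latency_ms(302), 1))  # 3.3

# Stand-in workload; with PubGuard installed, guard.screen could be passed instead.
rate = docs_per_second(lambda text: len(text.split()), ["some extracted PDF text"] * 1000)
print(f"{rate:.0f} docs/sec")
```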
+## Gate Logic
+Both `scientific_paper` and `poster` classifications **pass** the gate (both are valid scientific content). Only `abstract_only` and `junk` are blocked:
+```python
+verdict = guard.screen(text)
+# verdict['pass'] = True  if doc_type in ('scientific_paper', 'poster')
+# verdict['pass'] = False if doc_type in ('abstract_only', 'junk')
+```
+AI detection and toxicity are **informational by default** — reported but not blocking.
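The rule above can be sketched as a small standalone function. This is an illustration of the stated gate logic, not PubGuard's internal code; the function and constant names are hypothetical, and the `junk` score is a made-up value.

```python
PASSING_DOC_TYPES = {"scientific_paper", "poster"}

def gate_passes(verdict: dict) -> bool:
    """True when the predicted doc_type is one the gate lets through.

    AI-detection and toxicity results are deliberately not consulted,
    mirroring the informational-by-default behaviour described above.
    """
    return verdict["doc_type"]["label"] in PASSING_DOC_TYPES

example = {
    "doc_type": {"label": "scientific_paper", "score": 0.994},
    "ai_generated": {"label": "human", "score": 0.875},
}
print(gate_passes(example))  # True
print(gate_passes({"doc_type": {"label": "junk", "score": 0.9}}))  # False
```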
 ## Usage
+### Python API
 ```python
 from pubguard import PubGuard
 guard = PubGuard()
 guard.initialize()
 verdict = guard.screen("Introduction: We present a novel deep learning approach...")
+print(verdict)
 # {
 #   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
 #   'ai_generated': {'label': 'human', 'score': 0.875},
 # }
 ```
+### Pipeline Integration (bash)
 ```bash
+# Step 0 in run_pubverse_pipeline.sh:
+PDF_TEXT=$(python3 -c "import sys, fitz; d = fitz.open(sys.argv[1]); print(' '.join(p.get_text() for p in d)[:8000])" "$pdf")
 PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
 # exit 0 = pass, exit 1 = reject
 ```
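The same exit-code contract (0 = pass, 1 = reject) can also be consumed from Python. The sketch below is a hedged example: `run_gate` is a hypothetical helper, and the two stand-in commands only mimic the contract so the snippet runs without PubGuard installed.

```python
import subprocess
import sys
from typing import Sequence

def run_gate(pdf_text: str, cmd: Sequence[str]) -> bool:
    """Pipe text to the gate command on stdin; True if it exits 0 (pass)."""
    proc = subprocess.run(cmd, input=pdf_text, text=True, capture_output=True)
    return proc.returncode == 0

# Stand-in commands that read stdin and exit with a fixed status.
fake_pass = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(0)"]
fake_reject = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(1)"]

print(run_gate("Introduction: We present ...", fake_pass))
print(run_gate("Buy one, get one free!", fake_reject))

# With the real script (path taken from the bash snippet above):
# run_gate(text, ["python3", "pub_check/scripts/pubguard_gate.py"])
```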
+### Installation
+```bash
+cd pub_check
+pip install -e ".[train]"
+```
 ## Training Data
+Trained on real datasets from HuggingFace — **zero synthetic junk data**:
+| Head | Sources | Samples |
+|------|---------|---------|
+| **doc_type** | armanc/scientific_papers, gfissore/arxiv-abstracts-2021, ag_news, [poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data) | ~55K |
+| **ai_detect** | liamdugan/raid (abstracts), NicolaiSivesind/ChatGPT-Research-Abstracts | ~30K |
+| **toxicity** | google/civil_comments, skg/toxigen-data | ~30K |
+The poster class uses real scientific poster text from the [posters.science](https://posters.science) corpus (28K+ verified posters from Zenodo & Figshare), extracted by [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry).
+### Training
 ```bash
 python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
 ```
 Training completes in ~1 minute on CPU. No GPU needed.
+## Model Specifications
+| Attribute | Value |
+|-----------|-------|
+| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
+| Embedding dimension | 512 |
+| Structural features | 14 (doc_type head only) |
+| Classifier | LogisticRegression (sklearn) per head |
+| Head file sizes | 5–9 KB each (.npz) |
+| Total model size | ~125 MB (embedding) + 20 KB (heads) |
+| Precision | float32 |
+| GPU required | No (CPU-only) |
+| License | MIT |
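As a back-of-the-envelope check on the head sizes, the sketch below counts raw parameter bytes for each head. It assumes one float32 weight column per class plus a bias vector; the actual `.npz` layout is not stated in this card, and the archive container adds some overhead on top.

```python
BYTES_PER_FLOAT32 = 4
EMB_DIM, N_FEATS = 512, 14  # from the table above

def head_bytes(in_dim: int, n_classes: int) -> int:
    """Raw parameter bytes for a dense weight matrix plus bias vector."""
    return (in_dim * n_classes + n_classes) * BYTES_PER_FLOAT32

sizes = {
    "doc_type": head_bytes(EMB_DIM + N_FEATS, 4),  # embedding + structural features
    "ai_detect": head_bytes(EMB_DIM, 2),
    "toxicity": head_bytes(EMB_DIM, 2),
}
total = sum(sizes.values())
print(sizes)  # {'doc_type': 8432, 'ai_detect': 4104, 'toxicity': 4104}
print(total)  # 16640 bytes, about 16 KB
```

Under these assumptions the raw parameters come to roughly 16 KB, the same order of magnitude as the head total listed in the table.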
 ## Citation
+```bibtex
+@software{pubguard_2026,
+  title = {PubGuard: Multi-Head Scientific Publication Gatekeeper},
+  author = {O'Neill, James},
+  year = {2026},
+  url = {https://huggingface.co/jimnoneill/pubguard-classifier},
+  note = {Part of the PubVerse + 42DeepThought pipeline}
+}
+```
+## License
+This model is released under the [MIT License](https://opensource.org/licenses/MIT).
+## Acknowledgments
+- California Medical Innovations Institute (CalMI²)
+- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
+- [FAIR Data Innovations Hub](https://fairdataihub.org/) for the [PosterSentry](https://huggingface.co/fairdataihub/poster-sentry) training data
+- HuggingFace for model hosting infrastructure