jimnoneill/pubguard-classifier

Text Classification

document-classification

scientific-papers

toxicity-detection

publication-screening

quality-control

jimnoneill committed 1 day ago

Commit 9690fd1 · verified · 1 Parent(s): a967d10

Upload README.md with huggingface_hub

Files changed (1)

README.md +101 -0

README.md ADDED

	@@ -0,0 +1,101 @@

+---
+license: mit
+language:
+  - en
+tags:
+  - document-classification
+  - scientific-papers
+  - ai-detection
+  - toxicity-detection
+  - model2vec
+  - pubverse
+library_name: model2vec
+pipeline_tag: text-classification
+---
+# PubGuard — Multi-Head Scientific Publication Gatekeeper
+PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text
+to determine whether it represents a genuine scientific publication. It runs as a
+gate in the [PubVerse](https://github.com/jimnoneill) pipeline, rejecting junk
+(flyers, invoices, posters) before expensive downstream processing.
+## Architecture
+Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec)
+(potion-base-32M) embeddings:
+| Head | Classes | Accuracy | Description |
+|------|---------|----------|-------------|
+| **doc_type** | 4 | 99.9% | scientific_paper \| poster \| abstract_only \| junk |
+| **ai_detect** | 2 | 83.4% | human \| ai_generated |
+| **toxicity** | 2 | 84.7% | clean \| toxic |
+Each head is a single linear layer stored as a NumPy `.npz` file (8-12 KB each).
+Inference uses only NumPy; `torch` is not needed at prediction time.
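A minimal sketch of what single-linear-layer inference from an `.npz` file can look like in NumPy. The array names `W` and `b`, the 4-dimensional toy embedding, and the softmax scoring are assumptions for illustration; the actual file layout and scoring used by PubGuard are not documented in this README.

```python
import io
import numpy as np

DOC_TYPE_LABELS = ["scientific_paper", "poster", "abstract_only", "junk"]

def save_head(file, weights, bias):
    """Store a linear head; the key names W and b are hypothetical."""
    np.savez(file, W=weights, b=bias)

def predict(file, embeddings, labels):
    """Apply a linear head to (n, d) embeddings and return label/score dicts."""
    with np.load(file) as head:
        logits = embeddings @ head["W"] + head["b"]
    # Numerically stable softmax over the class axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    best = probs.argmax(axis=1)
    return [{"label": labels[i], "score": float(p[i])} for i, p in zip(best, probs)]

# Toy head: 4-dimensional embeddings, one weight column per class.
buffer = io.BytesIO()
save_head(buffer, np.eye(4) * 5.0, np.zeros(4))
buffer.seek(0)

# Stand-in for a frozen model2vec embedding of one document.
toy_embedding = np.array([[1.0, 0.0, 0.0, 0.0]])
result = predict(buffer, toy_embedding, DOC_TYPE_LABELS)
print(result)
```

Because the whole head is one matrix multiply plus a bias, prediction cost is dominated by computing the embedding, which is consistent with the CPU-only design described above.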
+## Performance
+- **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
+- **3.3 ms** per PDF screening, a negligible pipeline overhead
+- No GPU required
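The per-document latency follows directly from the throughput figures. A quick check of the arithmetic (the batched latency is derived here, not quoted from the README):

```python
def ms_per_doc(docs_per_sec):
    """Convert throughput in documents per second to milliseconds per document."""
    return 1000.0 / docs_per_sec

single_ms = ms_per_doc(302)   # single-document throughput quoted above
batched_ms = ms_per_doc(568)  # batched throughput quoted above
print(f"single: {single_ms:.1f} ms/doc, batched: {batched_ms:.1f} ms/doc")
```

The single-document figure rounds to the 3.3 ms quoted above; batching brings the derived per-document cost below 2 ms.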
+## Usage
+```python
+from pubguard import PubGuard
+guard = PubGuard()
+guard.initialize()
+verdict = guard.screen("Introduction: We present a novel deep learning approach...")
+# {
+#   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
+#   'ai_generated': {'label': 'human', 'score': 0.875},
+#   'toxicity': {'label': 'clean', 'score': 0.999},
+#   'pass': True
+# }
+```
+## Pipeline Integration
+```bash
+# In run_pubverse_pipeline.sh:
+PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
+# stdout (captured in PUBGUARD_CODE), e.g.: PV-0000 | ALL_CLEAR | Welcome to the lab.
+# exit status of the gate script: 0 = pass, 1 = reject
+```
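If the captured line needs to be consumed from Python rather than shell, it can be split into its fields. The sketch below assumes the output always has the three pipe-separated fields seen in the sample line; `GateResult` and `parse_gate_line` are hypothetical names, not part of PubGuard.

```python
from typing import NamedTuple

class GateResult(NamedTuple):
    code: str
    status: str
    message: str

def parse_gate_line(line: str) -> GateResult:
    """Split a 'CODE | STATUS | MESSAGE' line into its three fields."""
    # maxsplit=2 keeps any further '|' characters inside the message.
    parts = [part.strip() for part in line.strip().split("|", 2)]
    if len(parts) != 3:
        raise ValueError(f"unexpected gate output: {line!r}")
    return GateResult(*parts)

result = parse_gate_line("PV-0000 | ALL_CLEAR | Welcome to the lab.\n")
print(result.code, result.status)
```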
+## Error Codes
+PubGuard error codes encode the classifier predictions directly:
+`PV-0[doc_type][ai_detect][toxicity]`
+- `PV-0000` — PASS: scientific_paper + human + clean
+- `PV-0300` — Junk detected
+- `PV-0100` — Poster presentation
+- `PV-0200` — Abstract only (no full paper)
+See `ERRORS.md` for the complete (and snarky) error code reference.
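The code format can be sketched as a mapping from a verdict dictionary to a code string. The doc_type digits follow the list above; the digit `1` for `ai_generated` and for `toxic` is an assumption (only the `0` cases appear in this README), and `error_code` is a hypothetical helper. The verdict keys mirror the Usage example.

```python
DOC_TYPE_DIGIT = {"scientific_paper": 0, "poster": 1, "abstract_only": 2, "junk": 3}
AI_DIGIT = {"human": 0, "ai_generated": 1}   # digit 1 is an assumption
TOXICITY_DIGIT = {"clean": 0, "toxic": 1}    # digit 1 is an assumption

def error_code(verdict):
    """Build a PV-0[doc_type][ai_detect][toxicity] code from a verdict dict."""
    return "PV-0{}{}{}".format(
        DOC_TYPE_DIGIT[verdict["doc_type"]["label"]],
        AI_DIGIT[verdict["ai_generated"]["label"]],
        TOXICITY_DIGIT[verdict["toxicity"]["label"]],
    )

# Illustrative verdict with made-up scores.
verdict = {
    "doc_type": {"label": "junk", "score": 0.98},
    "ai_generated": {"label": "human", "score": 0.90},
    "toxicity": {"label": "clean", "score": 0.99},
}
print(error_code(verdict))
```

`ERRORS.md` remains the authoritative reference for codes not listed here.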
+## Training Data
+Trained on datasets from Hugging Face (15,000 samples per class):
+- **doc_type**: armanc/scientific_papers + gfissore/arxiv-abstracts-2021 + ag_news + synthetic
+- **ai_detect**: liamdugan/raid (abstracts) + NicolaiSivesind/ChatGPT-Research-Abstracts
+- **toxicity**: google/civil_comments + skg/toxigen-data
+## Training
+```bash
+cd pub_check
+pip install -e ".[train]"
+python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
+```
+Training completes in ~1 minute on CPU. No GPU needed.
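For intuition about why training is this fast: a linear head on fixed embeddings amounts to multinomial logistic regression. The sketch below fits one with plain gradient descent on synthetic stand-in embeddings. The optimizer, hyperparameters, data, and the `W`/`b` key names are all assumptions, not the contents of `train_pubguard.py`.

```python
import io
import numpy as np

def train_linear_head(X, y, n_classes, lr=0.5, epochs=200):
    """Fit softmax regression (one linear layer) with full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # d(cross-entropy)/d(logits)
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

# Synthetic stand-in for frozen embeddings: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (200, 8)), rng.normal(2.0, 1.0, (200, 8))])
y = np.array([0] * 200 + [1] * 200)

W, b = train_linear_head(X, y, n_classes=2)
accuracy = float(((X @ W + b).argmax(axis=1) == y).mean())

# Persist the head as a small .npz archive (in memory here).
buffer = io.BytesIO()
np.savez(buffer, W=W, b=b)
print(f"train accuracy: {accuracy:.3f}, head size: {buffer.getbuffer().nbytes} bytes")
```

Since the embeddings are frozen, they only need to be computed once per document; each training step afterwards is a pair of small matrix products.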
+## Citation
+Part of the PubVerse + 42DeepThought pipeline by Jamey O'Neill (CALMI2).