jimnoneill committed on
Commit 9690fd1 · verified
1 Parent(s): a967d10

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +101 -0
README.md ADDED
@@ -0,0 +1,101 @@
+ ---
+ license: mit
+ language:
+ - en
+ tags:
+ - document-classification
+ - scientific-papers
+ - ai-detection
+ - toxicity-detection
+ - model2vec
+ - pubverse
+ library_name: model2vec
+ pipeline_tag: text-classification
+ ---
+
+ # PubGuard: Multi-Head Scientific Publication Gatekeeper
+
+ PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text
+ to determine whether it represents a genuine scientific publication. It runs as a
+ gate in the [PubVerse](https://github.com/jimnoneill) pipeline, rejecting junk
+ (flyers, invoices, posters) before expensive downstream processing.
+
+ ## Architecture
+
+ Three linear classification heads on frozen [model2vec](https://github.com/MinishLab/model2vec)
+ (potion-base-32M) embeddings:
+
+ | Head | Classes | Accuracy | Labels |
+ |------|---------|----------|--------|
+ | **doc_type** | 4 | 99.9% | scientific_paper \| poster \| abstract_only \| junk |
+ | **ai_detect** | 2 | 83.4% | human \| ai_generated |
+ | **toxicity** | 2 | 84.7% | clean \| toxic |
+
+ Each head is a single linear layer stored as a NumPy `.npz` file (8-12 KB each).
+ Inference is pure NumPy; PyTorch is not needed at prediction time.
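
To illustrate what a single linear head on frozen embeddings amounts to at prediction time, here is a minimal NumPy-only sketch. It is not PubGuard's actual code: the `.npz` key names `W` and `b`, the 512-dimensional embedding size, and the random weights and embeddings are all assumptions made for the example.

```python
import io

import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, subtracting the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def predict(embeddings: np.ndarray, W: np.ndarray, b: np.ndarray, labels: list) -> list:
    """Apply one linear head and return the top label and its probability per row."""
    probs = softmax(embeddings @ W + b)
    top = probs.argmax(axis=1)
    return [{"label": labels[i], "score": float(p[i])} for i, p in zip(top, probs)]


# Hypothetical doc_type head with random weights (assumed dimension: 512).
rng = np.random.default_rng(0)
dim = 512
labels = ["scientific_paper", "poster", "abstract_only", "junk"]

# Round-trip the head through an in-memory .npz, mimicking "stored as .npz".
buf = io.BytesIO()
np.savez(
    buf,
    W=rng.normal(size=(dim, len(labels))).astype(np.float32),
    b=np.zeros(len(labels), dtype=np.float32),
)
buf.seek(0)
head = np.load(buf)

# Stand-in for frozen sentence embeddings of three documents.
fake_embeddings = rng.normal(size=(3, dim)).astype(np.float32)
results = predict(fake_embeddings, head["W"], head["b"], labels)
print(results)
```

Because the embeddings are frozen, only the small `W` and `b` arrays differ between heads, which is why each head can be stored in a file of a few kilobytes.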
+
+ ## Performance
+
+ - **302 docs/sec** single-document, **568 docs/sec** batched (CPU only)
+ - **3.3 ms** per PDF screened in single-document mode (1 / 302 s), a negligible pipeline overhead
+ - No GPU required
+
+ ## Usage
+
+ ```python
+ from pubguard import PubGuard
+
+ guard = PubGuard()
+ guard.initialize()
+
+ verdict = guard.screen("Introduction: We present a novel deep learning approach...")
+ # {
+ #   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
+ #   'ai_generated': {'label': 'human', 'score': 0.875},
+ #   'toxicity': {'label': 'clean', 'score': 0.999},
+ #   'pass': True
+ # }
+ ```
+
+ ## Pipeline Integration
+
+ ```bash
+ # In run_pubverse_pipeline.sh:
+ PUBGUARD_CODE=$(echo "$PDF_TEXT" | python3 pub_check/scripts/pubguard_gate.py 2>/dev/null)
+ # stdout: PV-0000 | ALL_CLEAR | Welcome to the lab.
+ # exit 0 = pass, exit 1 = reject
+ ```
+
+ ## Error Codes
+
+ PubGuard error codes encode the classifier predictions directly:
+ `PV-0[doc_type][ai_detect][toxicity]`
+
+ - `PV-0000`: PASS (scientific_paper + human + clean)
+ - `PV-0100`: Poster presentation
+ - `PV-0200`: Abstract only (no full paper)
+ - `PV-0300`: Junk detected
+
+ See `ERRORS.md` for the complete (and snarky) error code reference.
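
The code layout can be sketched as a small encoder and decoder. This is an illustrative guess, not PubGuard's implementation: it assumes each digit is the class index in the order the labels appear in the Architecture table, which matches the four codes listed here but is unconfirmed for non-zero `ai_detect` and `toxicity` digits. The helper names `make_code` and `parse_code` are hypothetical; `ERRORS.md` remains the authoritative reference.

```python
# Assumed label order (taken from the Architecture table); index = code digit.
DOC_TYPE = ["scientific_paper", "poster", "abstract_only", "junk"]
AI_DETECT = ["human", "ai_generated"]
TOXICITY = ["clean", "toxic"]


def make_code(doc_type: str, ai_detect: str, toxicity: str) -> str:
    """Build a PV-0[doc_type][ai_detect][toxicity] code from predicted labels."""
    return "PV-0{}{}{}".format(
        DOC_TYPE.index(doc_type),
        AI_DETECT.index(ai_detect),
        TOXICITY.index(toxicity),
    )


def parse_code(code: str) -> dict:
    """Decode a code back into labels; raises ValueError on a malformed code."""
    if len(code) != 7 or not code.startswith("PV-0") or not code[4:].isdigit():
        raise ValueError(f"unrecognized code: {code!r}")
    d, a, t = (int(ch) for ch in code[4:])
    try:
        return {
            "doc_type": DOC_TYPE[d],
            "ai_detect": AI_DETECT[a],
            "toxicity": TOXICITY[t],
        }
    except IndexError:
        raise ValueError(f"digit out of range in code: {code!r}") from None


print(make_code("junk", "human", "clean"))
print(parse_code("PV-0200"))
```

Under this assumed mapping, any code other than `PV-0000` identifies which head triggered the rejection from its digit position alone.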
+
+ ## Training Data
+
+ Trained on datasets from Hugging Face (15K samples per class):
+
+ - **doc_type**: armanc/scientific_papers + gfissore/arxiv-abstracts-2021 + ag_news + synthetic
+ - **ai_detect**: liamdugan/raid (abstracts) + NicolaiSivesind/ChatGPT-Research-Abstracts
+ - **toxicity**: google/civil_comments + skg/toxigen-data
+
+ ## Training
+
+ ```bash
+ cd pub_check
+ pip install -e ".[train]"
+ python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
+ ```
+
+ Training completes in ~1 minute on CPU. No GPU needed.
+
+ ## Citation
+
+ Part of the PubVerse + 42DeepThought pipeline by Jamey O'Neill (CALMI2).