jimnoneill committed on
Commit
2ed413f
·
verified ·
1 Parent(s): 4f15b47

Upload ERRORS.md with huggingface_hub

  1. ERRORS.md +219 -0
ERRORS.md ADDED
@@ -0,0 +1,219 @@
# PubVerse Error Codes

> *"Every rejection is a learning opportunity. Mostly for the person who tried to sneak a pool party flyer into a metagenomics pipeline."*

## Error Code Format

```
PV-SXNN
│  ││└┘
│  ││ └─ Detail code (00-99)
│  │└─── Sub-category
│  └──── Step number (0-8)
└─────── PubVerse prefix
```

All error codes are printed to **stdout** as a single line:

```
PV-0300 | JUNK_DETECTED | That's not a paper, that's a cry for help (score=1.000)
```

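As a rough illustration of the format above, the sketch below splits one such stdout line into its parts. The `PubVerseError` tuple and the `parse_error_line` helper are hypothetical names introduced here, not part of PubVerse, and the regular expression assumes the three fields are separated by `|` exactly as in the example line.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "PV-SXNN | NAME | message", as in the example line above.
_LINE_RE = re.compile(r"^(PV-([0-8])(\d)(\d{2}))\s*\|\s*([A-Z0-9_]+)\s*\|\s*(.*)$")


class PubVerseError(NamedTuple):
    code: str          # full code, e.g. "PV-1102"
    step: int          # S: pipeline step (0-8)
    sub_category: int  # X: sub-category digit
    detail: int        # NN: detail code (0-99)
    name: str          # symbolic name, e.g. "VLM_TIMEOUT"
    message: str       # free-text description


def parse_error_line(line: str) -> Optional[PubVerseError]:
    """Return the parsed fields, or None if the line is not an error-code line."""
    match = _LINE_RE.match(line.strip())
    if match is None:
        return None
    code, step, sub, detail, name, message = match.groups()
    return PubVerseError(code, int(step), int(sub), int(detail), name, message)


# Usage: parse a line taken from the Step 1 table.
parsed = parse_error_line("PV-1102 | VLM_TIMEOUT | Feature extraction timed out.")
print(parsed.step, parsed.sub_category, parsed.detail, parsed.name)  # 1 1 2 VLM_TIMEOUT
```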
---

## Step 0: PubGuard Screening

The bouncer at the door. If your PDF can't get past here, it definitely
shouldn't be anywhere near a GNN.

The PubGuard codes embed the classifier indices directly:
`PV-0{doc_type}{ai_detect}{toxicity}` where each digit is the
predicted class index.

| Code | Name | What Happened |
|------|------|---------------|
| **PV-0000** | `ALL_CLEAR` | Paper passed screening. Welcome to the lab. |
| **PV-0100** | `POSTER_DETECTED` | That's a poster, not a paper. We appreciate the aesthetic effort, but we need Methods, not bullet points on a corkboard. |
| **PV-0200** | `ABSTRACT_ONLY` | We got the trailer but not the movie. Where's the rest of the paper? |
| **PV-0300** | `JUNK_DETECTED` | That's not a paper, that's a cry for help. Pool party invitations, invoices, and fantasy football drafts do not constitute peer-reviewed research. |
| **PV-0010** | `AI_GENERATED` | Our classifier thinks a robot wrote this. Not necessarily disqualifying, but noted for the record. The Turing test starts at the Introduction. |
| **PV-0001** | `TOXIC_CONTENT` | Content flagged as potentially toxic. Science should be provocative, not offensive. |
| **PV-0310** | `JUNK_AND_AI` | AI-generated junk. Congratulations, you've automated mediocrity. |
| **PV-0301** | `JUNK_AND_TOXIC` | Toxic junk. This is somehow worse than a pool party flyer. |
| **PV-0311** | `JUNK_AI_TOXIC` | The trifecta. AI-generated toxic junk. We'd be impressed if we weren't horrified. |
| **PV-0110** | `POSTER_AND_AI` | An AI-generated poster. The future is here and it's making conference posters. |
| **PV-0210** | `ABSTRACT_AI` | An AI-generated abstract with no paper attached. Peak efficiency. |

### Composite Code Encoding

The three digits after the step digit `0` encode each classifier head's prediction index:

```
PV-0 [doc_type] [ai_detect] [toxicity]
      │          │           │
      │          │           └─ 0=clean, 1=toxic
      │          └───────────── 0=human, 1=ai_generated
      └──────────────────────── 0=scientific_paper, 1=poster,
                                2=abstract_only, 3=junk
```

So `PV-0000` = scientific_paper + human + clean = **PASS**.
Any non-zero digit in the doc_type position = **hard gate (blocked)**.
Non-zero in ai/toxicity = **soft flag (reported, not blocked by default)**.

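The encoding and gating rules above can be sketched in a few lines. This is a minimal illustration, not the PubGuard implementation: `pubguard_code` and `gate_decision` are hypothetical helper names, and the return labels `"pass"`, `"soft_flag"`, and `"blocked"` are assumptions chosen to mirror the wording of the rules.

```python
DOC_TYPES = ("scientific_paper", "poster", "abstract_only", "junk")


def pubguard_code(doc_type: int, ai_detect: int, toxicity: int) -> str:
    """Build the PV-0XYZ code from the three predicted class indices."""
    if doc_type not in range(len(DOC_TYPES)):
        raise ValueError(f"doc_type index out of range: {doc_type}")
    if ai_detect not in (0, 1) or toxicity not in (0, 1):
        raise ValueError("ai_detect and toxicity must be 0 or 1")
    return f"PV-0{doc_type}{ai_detect}{toxicity}"


def gate_decision(doc_type: int, ai_detect: int, toxicity: int) -> str:
    """Apply the gating rules: doc_type is a hard gate, the others are soft flags."""
    if doc_type != 0:
        return "blocked"    # hard gate: not a scientific paper
    if ai_detect or toxicity:
        return "soft_flag"  # reported, not blocked by default
    return "pass"


# Usage: AI-generated junk is encoded as PV-0310 and blocked by the doc_type gate.
print(pubguard_code(3, 1, 0), gate_decision(3, 1, 0))  # PV-0310 blocked
```

Keeping the gate decision separate from the code string means the soft-flag policy could change without touching the encoding.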
### Special PubGuard Codes

| Code | Name | Description |
|------|------|-------------|
| **PV-0900** | `EMPTY_INPUT` | You sent us nothing. Literally nothing. The void does not require peer review. |
| **PV-0901** | `UNREADABLE_PDF` | We can't read this PDF. Neither could PyMuPDF. If your PDF parser can't parse it, maybe it's not a PDF. |
| **PV-0902** | `MODELS_MISSING` | PubGuard models not found. Run training first: `cd pub_check && python scripts/train_pubguard.py` |
| **PV-0999** | `GATE_BYPASSED` | PubGuard screening was skipped (`PUBGUARD_STRICT=0`). Proceeding on faith. Good luck. |

---

## Step 1: PDF Feature Extraction

Where we turn your lovingly formatted PDF into tab-separated values.
The VLM does its best.

| Code | Name | Description |
|------|------|-------------|
| **PV-1000** | `EXTRACTION_OK` | Features extracted successfully. The VLM read your paper and didn't crash. |
| **PV-1100** | `EXTRACTION_FAILED` | VLM feature extraction failed. Your PDF defeated a 7-billion-parameter model. Impressive, actually. |
| **PV-1101** | `NO_TSV_OUTPUT` | Extraction ran but produced no output file. The VLM started reading and apparently gave up. |
| **PV-1102** | `VLM_TIMEOUT` | Feature extraction timed out. Your PDF is either very long or very confusing. |
| **PV-1200** | `VLM_NOT_FOUND` | Feature extraction script not found. Did someone move `qwen3_local_feature_extraction_cli.py`? |

---

## Step 2: PubVerse Analysis

The main event. Clustering, impact analysis, topic modeling: the works.

| Code | Name | Description |
|------|------|-------------|
| **PV-2000** | `ANALYSIS_OK` | PubVerse analysis completed. Your paper has been weighed, measured, and clustered. |
| **PV-2100** | `ANALYSIS_FAILED` | `App_v5.py` crashed. This is the big one; check the logs. |
| **PV-2101** | `STOPWORDS_MISSING` | Stopwords pickle not found. You can't do NLP without knowing which words to ignore. |
| **PV-2102** | `DATASET2_MISSING` | Reference dataset not found. We need something to compare your paper against. |
| **PV-2103** | `RECYCLE_MISSING` | Recycle overlay pickle not found. The cached cluster state is gone. Time to rebuild (grab coffee). |

---

## Step 3: Artifact Verification

Making sure Step 2 actually produced what it promised.

| Code | Name | Description |
|------|------|-------------|
| **PV-3000** | `ARTIFACTS_OK` | All expected files found. The pipeline is holding together. |
| **PV-3100** | `MATRIX_MISSING` | Unified adjacency matrix not found. PubVerse ran but didn't produce the matrix. Something went sideways in clustering. |
| **PV-3101** | `PICKLE_MISSING` | Impact analysis pickle not found. The most important intermediate file is AWOL. |

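A check of this kind might look like the sketch below. The artifact filenames are placeholders invented for illustration, since this document does not name the actual files, and `verify_artifacts` is a hypothetical helper rather than the pipeline's own code. Only the three codes and names come from the table above.

```python
import tempfile
from pathlib import Path

# Hypothetical filenames for illustration only; this document does not
# name the actual artifact files that Step 2 writes.
REQUIRED_ARTIFACTS = (
    ("PV-3100", "MATRIX_MISSING", "unified_adjacency_matrix.npz"),
    ("PV-3101", "PICKLE_MISSING", "impact_analysis.pkl"),
)


def verify_artifacts(output_dir):
    """Return (code, name) for the first missing artifact, or the OK code."""
    for code, name, filename in REQUIRED_ARTIFACTS:
        if not (Path(output_dir) / filename).is_file():
            return code, name
    return "PV-3000", "ARTIFACTS_OK"


# Usage: an empty directory fails on the first required artifact.
with tempfile.TemporaryDirectory() as tmp:
    print(verify_artifacts(tmp))  # ('PV-3100', 'MATRIX_MISSING')
```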
---

## Step 4: Graph Construction

Building the knowledge graph. Nodes, edges, the whole beautiful mess.

| Code | Name | Description |
|------|------|-------------|
| **PV-4000** | `GRAPH_OK` | Knowledge graph constructed. It's a beautiful web of science. |
| **PV-4100** | `GRAPH_FAILED` | Graph construction failed. `42dt_graph_clean_cli.py` didn't make it. |
| **PV-4101** | `GRAPH_PICKLE_MISSING` | Graph pickle not produced. The nodes existed briefly, like a postdoc's optimism. |

---

## Step 5: 42DeepThought Scoring

The GNN scores your paper. This is where GPUs earn their electricity bill.

| Code | Name | Description |
|------|------|-------------|
| **PV-5000** | `SCORING_OK` | 42DeepThought scored your paper. May the odds be in your favor. |
| **PV-5100** | `SCORING_FAILED` | GNN scoring crashed. Check CUDA, check the graph, check your assumptions. |
| **PV-5101** | `NO_GRAPH_PICKLE` | Graph pickle file not found for scoring. Step 4 must have failed silently. |
| **PV-5102** | `NO_SCORES_OUTPUT` | Scoring ran but produced no TSV. The GNN had nothing to say about your paper. |
| **PV-5103** | `CHECKPOINT_CORRUPT` | Model checkpoint failed to load. Delete `deepthought_model.pt` and retrain. |
| **PV-5200** | `DEEPTHOUGHT_MISSING` | 42DeepThought directory not found. The entire scoring engine is missing. |
| **PV-5201** | `LABELS_MISSING` | `42d_scoring.tsv` not found. No reference labels means no supervised scoring. |

---

## Step 6: Cluster Similarity Analysis

Finding your paper's neighbors in topic space.

| Code | Name | Description |
|------|------|-------------|
| **PV-6000** | `CLUSTER_OK` | Cluster analysis complete. Your paper's social circle has been mapped. |
| **PV-6100** | `CLUSTER_FAILED` | Cluster analysis crashed. Your paper is a loner, or the code is. |
| **PV-6101** | `NO_SANITY_PICKLE` | Sanitycheck pickle not found. Can't do cluster analysis without clusters. |
| **PV-6102** | `DB_TIMEOUT` | Cluster database population timed out (>1 hour). The LLM is still thinking. |
| **PV-6103** | `NO_QUERY_ID` | Could not extract query paper ID from TSV. Who are you, even? |
| **PV-6200** | `CLUSTER_SKIPPED` | Cluster analysis skipped (prerequisites not met). Not fatal, just lonely. |

---

## Step 7: Data Enrichment

Merging all the analysis results into one JSON payload.

| Code | Name | Description |
|------|------|-------------|
| **PV-7000** | `ENRICH_OK` | Data enrichment complete. Everything is unified and beautiful. |
| **PV-7100** | `ENRICH_FAILED` | Enrichment script crashed. The data refused to be unified. |
| **PV-7200** | `ENRICH_SKIPPED` | Enrichment skipped (cluster analysis didn't complete). Can't enrich what doesn't exist. |

---

## Step 8: Interactive Visualization

The grand finale. Turning data into something you can click on.

| Code | Name | Description |
|------|------|-------------|
| **PV-8000** | `VIZ_OK` | Visualization generated. Open the HTML and admire your knowledge graph. |
| **PV-8100** | `VIZ_FAILED` | Visualization generation failed. You'll have to imagine the graph. |
| **PV-8200** | `VIZ_SKIPPED` | Visualization skipped (enrichment didn't complete). No data, no graph, no glory. |

---

## Exit Code Summary

The pipeline exits with a code derived from the **step of the first fatal error**: the step number plus one, with Steps 7 and 8 sharing exit code `8`.

| Exit Code | Meaning |
|-----------|---------|
| `0` | Success: all steps completed (or non-fatal warnings only) |
| `1` | Step 0: PubGuard rejected the input |
| `2` | Step 1: Feature extraction failed |
| `3` | Step 2: PubVerse analysis failed |
| `4` | Step 3: Required artifacts missing |
| `5` | Step 4: Graph construction failed |
| `6` | Step 5: 42DeepThought scoring failed |
| `7` | Step 6: Cluster analysis failed (fatal only if no fallback) |
| `8` | Step 7/8: Enrichment or visualization failed |

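Read against the table above, the mapping from a fatal error code to the process exit code can be sketched as follows. `exit_code_for` is a hypothetical helper written for this illustration. It assumes the caller has already decided the code is fatal, so `OK` and `SKIPPED` codes should not be passed to it.

```python
def exit_code_for(error_code: str) -> int:
    """Map a fatal PV-SXNN code to the exit code listed in the table above."""
    if (
        len(error_code) != 7
        or not error_code.startswith("PV-")
        or not error_code[3:].isdigit()
    ):
        raise ValueError(f"not a PubVerse error code: {error_code!r}")
    step = int(error_code[3])
    if step > 8:
        raise ValueError(f"step digit out of range (0-8): {error_code!r}")
    # Steps 0-6 map to exit codes 1-7; Steps 7 and 8 share exit code 8.
    return min(step + 1, 8)


# Usage: a scoring failure in Step 5 yields exit code 6.
print(exit_code_for("PV-5100"))  # 6
```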
---

## Interpreting PubGuard Composite Codes

Quick decoder ring for the `PV-0XYZ` codes:

```
X (doc_type):  0=paper ✅  1=poster 📋  2=abstract 📄  3=junk 🗑️
Y (ai_detect): 0=human ✍️  1=ai 🤖
Z (toxicity):  0=clean ✅  1=toxic ☠️
```

Examples:
- `PV-0000` → Paper + Human + Clean → **PASS** ✅
- `PV-0300` → Junk + Human + Clean → *"That's not a paper"* 🗑️
- `PV-0011` → Paper + AI + Toxic → *"Human-passing but spicy"* ⚠️
- `PV-0311` → Junk + AI + Toxic → *"The absolute worst"* 🚫

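The decoder ring can also be written as a small lookup, sketched here under the assumption that only the digit values listed above are valid. `describe` is a hypothetical helper name. Special codes such as `PV-0900` fall outside the composite range and are rejected rather than decoded.

```python
DOC_TYPE = {"0": "Paper", "1": "Poster", "2": "Abstract", "3": "Junk"}
AI_DETECT = {"0": "Human", "1": "AI"}
TOXICITY = {"0": "Clean", "1": "Toxic"}


def describe(code: str) -> str:
    """Turn a PV-0XYZ composite code into a 'DocType + Author + Toxicity' string."""
    if len(code) != 7 or not code.startswith("PV-0"):
        raise ValueError(f"not a PubGuard composite code: {code!r}")
    x, y, z = code[4:]
    try:
        return f"{DOC_TYPE[x]} + {AI_DETECT[y]} + {TOXICITY[z]}"
    except KeyError:
        raise ValueError(f"digits outside the composite range: {code!r}") from None


# Usage: decode the four example codes from the list above.
for example in ("PV-0000", "PV-0300", "PV-0011", "PV-0311"):
    print(example, "->", describe(example))
```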
---

*PubGuard v0.1.0: Because science has standards, even if your PDF doesn't.*