jimnoneill committed on
Commit ec0dccd · verified · 1 Parent(s): 2ed413f

Remove ERRORS.md (internal only)

Files changed (1)
  1. ERRORS.md +0 -219
ERRORS.md DELETED
@@ -1,219 +0,0 @@
- # PubVerse Error Codes
-
- > *"Every rejection is a learning opportunity. Mostly for the person who tried to sneak a pool party flyer into a metagenomics pipeline."*
-
- ## Error Code Format
-
- ```
- PV-SXNN
- │  ││└┘
- │  ││ └─ Detail code (00-99)
- │  │└─── Sub-category
- │  └──── Step number (0-8)
- └────── PubVerse prefix
- ```
-
- All error codes are printed to **stdout** as a single line:
-
- ```
- PV-0300 | JUNK_DETECTED | That's not a paper, that's a cry for help (score=1.000)
- ```
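
A consumer of this output could split each status line back into its parts. The following is a minimal sketch, assuming the ` | ` separator and the `PV-SXNN` layout shown above hold for every status line; `parse_status_line` and `PubVerseStatus` are hypothetical names, not part of PubVerse.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the sample line: "CODE | NAME | message".
# The code itself is "PV-" + step digit (0-8) + sub-category digit + 2-digit detail.
LINE_RE = re.compile(r"^(PV-([0-8])(\d)(\d{2})) \| ([A-Z0-9_]+) \| (.*)$")


class PubVerseStatus(NamedTuple):
    code: str
    step: int
    sub_category: int
    detail: int
    name: str
    message: str


def parse_status_line(line: str) -> Optional[PubVerseStatus]:
    """Return the parsed status, or None if the line is not a status line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    code, step, sub, detail, name, message = match.groups()
    return PubVerseStatus(code, int(step), int(sub), int(detail), name, message)


status = parse_status_line("PV-1102 | VLM_TIMEOUT | Feature extraction timed out.")
if status is not None:
    print(status.step, status.name)
```

Lines that do not match return `None`, so ordinary log output mixed into stdout can simply be skipped.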
-
- ---
-
- ## Step 0: PubGuard Screening
-
- The bouncer at the door. If your PDF can't get past here, it definitely
- shouldn't be anywhere near a GNN.
-
- The PubGuard codes embed the classifier indices directly:
- `PV-0{doc_type}{ai_detect}{toxicity}` where each digit is the
- predicted class index.
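
As an illustration of that template, a code can be assembled from three class indices. This is a sketch only: `pubguard_code` is a hypothetical helper, and the index ranges (four document types, two AI classes, two toxicity classes) are taken from the encoding table elsewhere in this file rather than from the PubGuard source.

```python
# Number of classes per classifier head, as listed in this file's encoding table.
# These counts are an assumption about the real classifier heads.
NUM_CLASSES = {"doc_type": 4, "ai_detect": 2, "toxicity": 2}


def pubguard_code(doc_type: int, ai_detect: int, toxicity: int) -> str:
    """Build a PV-0XYZ code from the three predicted class indices."""
    indices = {"doc_type": doc_type, "ai_detect": ai_detect, "toxicity": toxicity}
    for head, index in indices.items():
        if not 0 <= index < NUM_CLASSES[head]:
            raise ValueError(f"{head} index {index} is out of range")
    return f"PV-0{doc_type}{ai_detect}{toxicity}"


print(pubguard_code(3, 1, 0))
```

Each head contributes exactly one digit, so the scheme only works while every head has at most ten classes.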
-
- | Code | Name | What Happened |
- |------|------|---------------|
- | **PV-0000** | `ALL_CLEAR` | Paper passed screening. Welcome to the lab. |
- | **PV-0100** | `POSTER_DETECTED` | That's a poster, not a paper. We appreciate the aesthetic effort, but we need Methods, not bullet points on a corkboard. |
- | **PV-0200** | `ABSTRACT_ONLY` | We got the trailer but not the movie. Where's the rest of the paper? |
- | **PV-0300** | `JUNK_DETECTED` | That's not a paper, that's a cry for help. Pool party invitations, invoices, and fantasy football drafts do not constitute peer-reviewed research. |
- | **PV-0010** | `AI_GENERATED` | Our classifier thinks a robot wrote this. Not necessarily disqualifying, but noted for the record. The Turing test starts at the Introduction. |
- | **PV-0001** | `TOXIC_CONTENT` | Content flagged as potentially toxic. Science should be provocative, not offensive. |
- | **PV-0310** | `JUNK_AND_AI` | AI-generated junk. Congratulations, you've automated mediocrity. |
- | **PV-0301** | `JUNK_AND_TOXIC` | Toxic junk. This is somehow worse than a pool party flyer. |
- | **PV-0311** | `JUNK_AI_TOXIC` | The trifecta. AI-generated toxic junk. We'd be impressed if we weren't horrified. |
- | **PV-0110** | `POSTER_AND_AI` | An AI-generated poster. The future is here and it's making conference posters. |
- | **PV-0210** | `ABSTRACT_AI` | An AI-generated abstract with no paper attached. Peak efficiency. |
-
- ### Composite Code Encoding
-
- The three digits after the step digit encode each classifier head's prediction index:
-
- ```
- PV-0 [doc_type] [ai_detect] [toxicity]
-           │          │           │
-           │          │           └─ 0=clean, 1=toxic
-           │          └───────────── 0=human, 1=ai_generated
-           └──────────────────────── 0=scientific_paper, 1=poster,
-                                     2=abstract_only, 3=junk
- ```
-
- So `PV-0000` = scientific_paper + human + clean = **PASS**.
- Any non-zero digit in the doc_type position = **hard gate (blocked)**.
- Non-zero in ai/toxicity = **soft flag (reported, not blocked by default)**.
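
The gating rule above can be sketched as a small decoder. This is an illustrative reading of the rules in this section, not PubGuard's actual implementation: `decode_pubguard` and `PubGuardVerdict` are hypothetical names, and the special `PV-09xx` codes are deliberately rejected because they do not follow the composite layout.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Class labels in index order, copied from the encoding diagram above.
DOC_TYPES = ("scientific_paper", "poster", "abstract_only", "junk")
AI_LABELS = ("human", "ai_generated")
TOXICITY_LABELS = ("clean", "toxic")

# Composite codes only: PV-0, then one digit per classifier head.
COMPOSITE_RE = re.compile(r"PV-0([0-3])([01])([01])")


@dataclass(frozen=True)
class PubGuardVerdict:
    doc_type: str
    ai_detect: str
    toxicity: str
    blocked: bool                # hard gate: any doc_type other than scientific_paper
    soft_flags: Tuple[str, ...]  # reported, but not blocking by default


def decode_pubguard(code: str) -> PubGuardVerdict:
    match = COMPOSITE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a composite PubGuard code: {code!r}")
    x, y, z = (int(digit) for digit in match.groups())
    soft_flags = tuple(
        label
        for index, label in ((y, AI_LABELS[1]), (z, TOXICITY_LABELS[1]))
        if index != 0
    )
    return PubGuardVerdict(
        doc_type=DOC_TYPES[x],
        ai_detect=AI_LABELS[y],
        toxicity=TOXICITY_LABELS[z],
        blocked=(x != 0),
        soft_flags=soft_flags,
    )


verdict = decode_pubguard("PV-0011")
print(verdict.blocked, verdict.soft_flags)
```

Keeping `blocked` separate from `soft_flags` mirrors the hard-gate versus soft-flag split: a caller can refuse blocked inputs while merely logging the flags.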
-
- ### Special PubGuard Codes
-
- | Code | Name | Description |
- |------|------|-------------|
- | **PV-0900** | `EMPTY_INPUT` | You sent us nothing. Literally nothing. The void does not require peer review. |
- | **PV-0901** | `UNREADABLE_PDF` | We can't read this PDF. Neither could PyMuPDF. If your PDF parser can't parse it, maybe it's not a PDF. |
- | **PV-0902** | `MODELS_MISSING` | PubGuard models not found. Run training first: `cd pub_check && python scripts/train_pubguard.py` |
- | **PV-0999** | `GATE_BYPASSED` | PubGuard screening was skipped (PUBGUARD_STRICT=0). Proceeding on faith. Good luck. |
-
- ---
-
- ## Step 1: PDF Feature Extraction
-
- Where we turn your lovingly formatted PDF into tab-separated values.
- The VLM does its best.
-
- | Code | Name | Description |
- |------|------|-------------|
- | **PV-1000** | `EXTRACTION_OK` | Features extracted successfully. The VLM read your paper and didn't crash. |
- | **PV-1100** | `EXTRACTION_FAILED` | VLM feature extraction failed. Your PDF defeated a 7-billion-parameter model. Impressive, actually. |
- | **PV-1101** | `NO_TSV_OUTPUT` | Extraction ran but produced no output file. The VLM started reading and apparently gave up. |
- | **PV-1102** | `VLM_TIMEOUT` | Feature extraction timed out. Your PDF is either very long or very confusing. |
- | **PV-1200** | `VLM_NOT_FOUND` | Feature extraction script not found. Did someone move `qwen3_local_feature_extraction_cli.py`? |
-
- ---
-
- ## Step 2: PubVerse Analysis
-
- The main event. Clustering, impact analysis, topic modeling: the works.
-
- | Code | Name | Description |
- |------|------|-------------|
- | **PV-2000** | `ANALYSIS_OK` | PubVerse analysis completed. Your paper has been weighed, measured, and clustered. |
- | **PV-2100** | `ANALYSIS_FAILED` | `App_v5.py` crashed. This is the big one; check the logs. |
- | **PV-2101** | `STOPWORDS_MISSING` | Stopwords pickle not found. You can't do NLP without knowing which words to ignore. |
- | **PV-2102** | `DATASET2_MISSING` | Reference dataset not found. We need something to compare your paper against. |
- | **PV-2103** | `RECYCLE_MISSING` | Recycle overlay pickle not found. The cached cluster state is gone. Time to rebuild (grab coffee). |
-
- ---
-
- ## Step 3: Artifact Verification
-
- Making sure Step 2 actually produced what it promised.
-
- | Code | Name | Description |
- |------|------|-------------|
- | **PV-3000** | `ARTIFACTS_OK` | All expected files found. The pipeline is holding together. |
- | **PV-3100** | `MATRIX_MISSING` | Unified adjacency matrix not found. PubVerse ran but didn't produce the matrix. Something went sideways in clustering. |
- | **PV-3101** | `PICKLE_MISSING` | Impact analysis pickle not found. The most important intermediate file is AWOL. |
-
- ---
-
- ## Step 4: Graph Construction
-
- Building the knowledge graph. Nodes, edges, the whole beautiful mess.
-
- | Code | Name | Description |
- |------|------|-------------|
- | **PV-4000** | `GRAPH_OK` | Knowledge graph constructed. It's a beautiful web of science. |
- | **PV-4100** | `GRAPH_FAILED` | Graph construction failed. `42dt_graph_clean_cli.py` didn't make it. |
- | **PV-4101** | `GRAPH_PICKLE_MISSING` | Graph pickle not produced. The nodes existed briefly, like a postdoc's optimism. |
-
- ---
-
- ## Step 5: 42DeepThought Scoring
-
- The GNN scores your paper. This is where GPUs earn their electricity bill.
-
- | Code | Name | Description |
- |------|------|-------------|
- | **PV-5000** | `SCORING_OK` | 42DeepThought scored your paper. May the odds be in your favor. |
- | **PV-5100** | `SCORING_FAILED` | GNN scoring crashed. Check CUDA, check the graph, check your assumptions. |
- | **PV-5101** | `NO_GRAPH_PICKLE` | Graph pickle file not found for scoring. Step 4 must have failed silently. |
- | **PV-5102** | `NO_SCORES_OUTPUT` | Scoring ran but produced no TSV. The GNN had nothing to say about your paper. |
- | **PV-5103** | `CHECKPOINT_CORRUPT` | Model checkpoint failed to load. Delete `deepthought_model.pt` and retrain. |
- | **PV-5200** | `DEEPTHOUGHT_MISSING` | 42DeepThought directory not found. The entire scoring engine is missing. |
- | **PV-5201** | `LABELS_MISSING` | `42d_scoring.tsv` not found. No reference labels means no supervised scoring. |
-
- ---
-
- ## Step 6: Cluster Similarity Analysis
-
- Finding your paper's neighbors in topic space.
-
- | Code | Name | Description |
- |------|------|-------------|
- | **PV-6000** | `CLUSTER_OK` | Cluster analysis complete. Your paper's social circle has been mapped. |
- | **PV-6100** | `CLUSTER_FAILED` | Cluster analysis crashed. Your paper is a loner, or the code is. |
- | **PV-6101** | `NO_SANITY_PICKLE` | Sanitycheck pickle not found. Can't do cluster analysis without clusters. |
- | **PV-6102** | `DB_TIMEOUT` | Cluster database population timed out (>1 hour). The LLM is still thinking. |
- | **PV-6103** | `NO_QUERY_ID` | Could not extract query paper ID from TSV. Who are you, even? |
- | **PV-6200** | `CLUSTER_SKIPPED` | Cluster analysis skipped (prerequisites not met). Not fatal, just lonely. |
-
- ---
-
- ## Step 7: Data Enrichment
-
- Merging all the analysis results into one JSON payload.
-
- | Code | Name | Description |
- |------|------|-------------|
- | **PV-7000** | `ENRICH_OK` | Data enrichment complete. Everything is unified and beautiful. |
- | **PV-7100** | `ENRICH_FAILED` | Enrichment script crashed. The data refused to be unified. |
- | **PV-7200** | `ENRICH_SKIPPED` | Enrichment skipped (cluster analysis didn't complete). Can't enrich what doesn't exist. |
-
- ---
-
- ## Step 8: Interactive Visualization
-
- The grand finale. Turning data into something you can click on.
-
- | Code | Name | Description |
- |------|------|-------------|
- | **PV-8000** | `VIZ_OK` | Visualization generated. Open the HTML and admire your knowledge graph. |
- | **PV-8100** | `VIZ_FAILED` | Visualization generation failed. You'll have to imagine the graph. |
- | **PV-8200** | `VIZ_SKIPPED` | Visualization skipped (enrichment didn't complete). No data, no graph, no glory. |
-
- ---
-
- ## Exit Code Summary
-
- The pipeline exit code is the step number of the **first fatal error code** plus one (Steps 7 and 8 share exit code `8`):
-
- | Exit Code | Meaning |
- |-----------|---------|
- | `0` | Success: all steps completed (or non-fatal warnings only) |
- | `1` | Step 0: PubGuard rejected the input |
- | `2` | Step 1: Feature extraction failed |
- | `3` | Step 2: PubVerse analysis failed |
- | `4` | Step 3: Required artifacts missing |
- | `5` | Step 4: Graph construction failed |
- | `6` | Step 5: 42DeepThought scoring failed |
- | `7` | Step 6: Cluster analysis failed (fatal only if no fallback) |
- | `8` | Step 7/8: Enrichment or visualization failed |
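
Read row by row, the table maps a fatal error in step N to exit code N + 1, with Steps 7 and 8 both landing on `8`. A minimal sketch of that mapping follows; `exit_code_for` is a hypothetical helper, and deciding which codes count as fatal is left to the caller because this file does not spell that out.

```python
import re
from typing import Optional

# "PV-" + step digit (0-8) + three more digits.
CODE_RE = re.compile(r"PV-([0-8])\d{3}")


def exit_code_for(first_fatal_code: Optional[str]) -> int:
    """Map the first fatal PV code (or None for success) to a process exit code."""
    if first_fatal_code is None:
        return 0  # success, or non-fatal warnings only
    match = CODE_RE.fullmatch(first_fatal_code)
    if match is None:
        raise ValueError(f"not a PubVerse code: {first_fatal_code!r}")
    step = int(match.group(1))
    # Step N exits with N + 1; Steps 7 and 8 share exit code 8.
    return min(step + 1, 8)


print(exit_code_for("PV-5100"))
```

A wrapper script could pass the result to `sys.exit` once the pipeline has stopped.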
-
- ---
-
- ## Interpreting PubGuard Composite Codes
-
- Quick decoder ring for the `PV-0XYZ` codes:
-
- ```
- X (doc_type):  0=paper ✅  1=poster 📋  2=abstract 📄  3=junk 🗑️
- Y (ai_detect): 0=human ✍️  1=ai 🤖
- Z (toxicity):  0=clean ✅  1=toxic ☠️
- ```
-
- Examples:
- - `PV-0000` → Paper + Human + Clean → **PASS** ✅
- - `PV-0300` → Junk + Human + Clean → *"That's not a paper"* 🗑️
- - `PV-0011` → Paper + AI + Toxic → *"Human-passing but spicy"* ⚠️
- - `PV-0311` → Junk + AI + Toxic → *"The absolute worst"* 🚫
-
- ---
-
- *PubGuard v0.1.0: Because science has standards, even if your PDF doesn't.*