Remove ERRORS.md (internal only)
ERRORS.md
DELETED
@@ -1,219 +0,0 @@
# PubVerse Error Codes

> *"Every rejection is a learning opportunity. Mostly for the person who tried to sneak a pool party flyer into a metagenomics pipeline."*

## Error Code Format

```
PV-SXNN
│  ││└─ Detail code (00-99)
│  │└── Sub-category
│  └─── Step number (0-8)
└────── PubVerse prefix
```
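
The layout above can be split mechanically. The following is a minimal sketch, not pipeline code: `ErrorCode` and `parse_code` are hypothetical names introduced here for illustration, and the only assumption is the `PV-SXNN` layout shown in the diagram.

```python
import re
from typing import NamedTuple


class ErrorCode(NamedTuple):
    """Hypothetical container for the three fields of a PV-SXNN code."""
    step: int          # S: pipeline step, 0-8
    sub_category: int  # X: sub-category digit
    detail: int        # NN: detail code, 00-99


# One step digit (0-8), one sub-category digit, two detail digits.
_CODE_RE = re.compile(r"PV-([0-8])(\d)(\d{2})")


def parse_code(code: str) -> ErrorCode:
    """Split a code such as 'PV-5103' into step, sub-category and detail."""
    match = _CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a PubVerse error code: {code!r}")
    step, sub, detail = match.groups()
    return ErrorCode(int(step), int(sub), int(detail))


print(parse_code("PV-5103"))
```

For `PV-5103` this yields step 5, sub-category 1, detail 3. Anything that does not fit the layout raises `ValueError` rather than being silently accepted.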

All error codes are printed to **stdout** as a single line:

```
PV-0300 | JUNK_DETECTED | That's not a paper, that's a cry for help (score=1.000)
```
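
A consumer of that output can split each line on the pipe separators. This is a sketch under the assumption that every line has exactly the three fields shown (code, name, message); `parse_error_line` is a hypothetical helper, and the sample line is assembled from the table entries for illustration rather than copied from a real run.

```python
from typing import Tuple


def parse_error_line(line: str) -> Tuple[str, str, str]:
    """Split 'CODE | NAME | message' into its three fields.

    Only the first two pipes are treated as separators, so a message
    that itself contains '|' is kept intact.
    """
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        raise ValueError(f"expected 'CODE | NAME | message', got: {line!r}")
    code, name, message = parts
    return code, name, message


# Illustrative sample line, not captured pipeline output.
sample = "PV-0100 | POSTER_DETECTED | That's a poster, not a paper."
print(parse_error_line(sample))
```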

---

## Step 0: PubGuard Screening

The bouncer at the door. If your PDF can't get past here, it definitely
shouldn't be anywhere near a GNN.

The PubGuard codes embed the classifier indices directly:
`PV-0{doc_type}{ai_detect}{toxicity}` where each digit is the
predicted class index.

| Code | Name | What Happened |
|------|------|---------------|
| **PV-0000** | `ALL_CLEAR` | Paper passed screening. Welcome to the lab. |
| **PV-0100** | `POSTER_DETECTED` | That's a poster, not a paper. We appreciate the aesthetic effort, but we need Methods, not bullet points on a corkboard. |
| **PV-0200** | `ABSTRACT_ONLY` | We got the trailer but not the movie. Where's the rest of the paper? |
| **PV-0300** | `JUNK_DETECTED` | That's not a paper, that's a cry for help. Pool party invitations, invoices, and fantasy football drafts do not constitute peer-reviewed research. |
| **PV-0010** | `AI_GENERATED` | Our classifier thinks a robot wrote this. Not necessarily disqualifying, but noted for the record. The Turing test starts at the Introduction. |
| **PV-0001** | `TOXIC_CONTENT` | Content flagged as potentially toxic. Science should be provocative, not offensive. |
| **PV-0310** | `JUNK_AND_AI` | AI-generated junk. Congratulations, you've automated mediocrity. |
| **PV-0301** | `JUNK_AND_TOXIC` | Toxic junk. This is somehow worse than a pool party flyer. |
| **PV-0311** | `JUNK_AI_TOXIC` | The trifecta. AI-generated toxic junk. We'd be impressed if we weren't horrified. |
| **PV-0110** | `POSTER_AND_AI` | An AI-generated poster. The future is here and it's making conference posters. |
| **PV-0210** | `ABSTRACT_AI` | An AI-generated abstract with no paper attached. Peak efficiency. |

### Composite Code Encoding

The three digits after the leading step digit `0` encode each classifier head's prediction index:

```
PV-0 [doc_type] [ai_detect] [toxicity]
         │           │          │
         │           │          └─ 0=clean, 1=toxic
         │           └──────────── 0=human, 1=ai_generated
         └──────────────────────── 0=scientific_paper, 1=poster,
                                   2=abstract_only, 3=junk
```

So `PV-0000` = scientific_paper + human + clean = **PASS**.
Any non-zero digit in the doc_type position = **hard gate (blocked)**.
A non-zero digit in the ai_detect or toxicity position = **soft flag (reported, not blocked by default)**.
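
The encoding and the gate rule can be sketched together. This is an illustration only, assuming the default behaviour described above (doc_type blocks, the other two heads only flag); `Verdict` and `screen` are hypothetical names, not PubGuard's API.

```python
from dataclasses import dataclass
from typing import Tuple

# Class labels in index order, as listed in the diagram above.
DOC_TYPES = ("scientific_paper", "poster", "abstract_only", "junk")
AI_LABELS = ("human", "ai_generated")
TOX_LABELS = ("clean", "toxic")


@dataclass(frozen=True)
class Verdict:
    code: str                # composite code, e.g. "PV-0310"
    blocked: bool            # True when the doc_type hard gate fires
    flags: Tuple[str, ...]   # soft flags that are reported but do not block


def screen(doc_type: int, ai_detect: int, toxicity: int) -> Verdict:
    """Build the composite code and apply the default gate rule."""
    if not (0 <= doc_type < len(DOC_TYPES)
            and 0 <= ai_detect < len(AI_LABELS)
            and 0 <= toxicity < len(TOX_LABELS)):
        raise ValueError("class index out of range")
    code = f"PV-0{doc_type}{ai_detect}{toxicity}"
    blocked = doc_type != 0  # hard gate: anything but scientific_paper
    flags = tuple(
        labels[index]
        for index, labels in ((ai_detect, AI_LABELS), (toxicity, TOX_LABELS))
        if index != 0        # soft flags: non-zero ai_detect or toxicity
    )
    return Verdict(code, blocked, flags)


print(screen(0, 1, 0))  # AI-flagged paper: reported, not blocked
print(screen(3, 1, 1))  # junk: blocked regardless of the other heads
```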

### Special PubGuard Codes

| Code | Name | Description |
|------|------|-------------|
| **PV-0900** | `EMPTY_INPUT` | You sent us nothing. Literally nothing. The void does not require peer review. |
| **PV-0901** | `UNREADABLE_PDF` | We can't read this PDF. Neither could PyMuPDF. If your PDF parser can't parse it, maybe it's not a PDF. |
| **PV-0902** | `MODELS_MISSING` | PubGuard models not found. Run training first: `cd pub_check && python scripts/train_pubguard.py` |
| **PV-0999** | `GATE_BYPASSED` | PubGuard screening was skipped (`PUBGUARD_STRICT=0`). Proceeding on faith. Good luck. |

---

## Step 1: PDF Feature Extraction

Where we turn your lovingly formatted PDF into tab-separated values.
The VLM does its best.

| Code | Name | Description |
|------|------|-------------|
| **PV-1000** | `EXTRACTION_OK` | Features extracted successfully. The VLM read your paper and didn't crash. |
| **PV-1100** | `EXTRACTION_FAILED` | VLM feature extraction failed. Your PDF defeated a 7-billion-parameter model. Impressive, actually. |
| **PV-1101** | `NO_TSV_OUTPUT` | Extraction ran but produced no output file. The VLM started reading and apparently gave up. |
| **PV-1102** | `VLM_TIMEOUT` | Feature extraction timed out. Your PDF is either very long or very confusing. |
| **PV-1200** | `VLM_NOT_FOUND` | Feature extraction script not found. Did someone move `qwen3_local_feature_extraction_cli.py`? |

---

## Step 2: PubVerse Analysis

The main event. Clustering, impact analysis, topic modeling: the works.

| Code | Name | Description |
|------|------|-------------|
| **PV-2000** | `ANALYSIS_OK` | PubVerse analysis completed. Your paper has been weighed, measured, and clustered. |
| **PV-2100** | `ANALYSIS_FAILED` | `App_v5.py` crashed. This is the big one; check the logs. |
| **PV-2101** | `STOPWORDS_MISSING` | Stopwords pickle not found. You can't do NLP without knowing which words to ignore. |
| **PV-2102** | `DATASET2_MISSING` | Reference dataset not found. We need something to compare your paper against. |
| **PV-2103** | `RECYCLE_MISSING` | Recycle overlay pickle not found. The cached cluster state is gone. Time to rebuild (grab coffee). |

---

## Step 3: Artifact Verification

Making sure Step 2 actually produced what it promised.

| Code | Name | Description |
|------|------|-------------|
| **PV-3000** | `ARTIFACTS_OK` | All expected files found. The pipeline is holding together. |
| **PV-3100** | `MATRIX_MISSING` | Unified adjacency matrix not found. PubVerse ran but didn't produce the matrix. Something went sideways in clustering. |
| **PV-3101** | `PICKLE_MISSING` | Impact analysis pickle not found. The most important intermediate file is AWOL. |

---

## Step 4: Graph Construction

Building the knowledge graph. Nodes, edges, the whole beautiful mess.

| Code | Name | Description |
|------|------|-------------|
| **PV-4000** | `GRAPH_OK` | Knowledge graph constructed. It's a beautiful web of science. |
| **PV-4100** | `GRAPH_FAILED` | Graph construction failed. `42dt_graph_clean_cli.py` didn't make it. |
| **PV-4101** | `GRAPH_PICKLE_MISSING` | Graph pickle not produced. The nodes existed briefly, like a postdoc's optimism. |

---

## Step 5: 42DeepThought Scoring

The GNN scores your paper. This is where GPUs earn their electricity bill.

| Code | Name | Description |
|------|------|-------------|
| **PV-5000** | `SCORING_OK` | 42DeepThought scored your paper. May the odds be in your favor. |
| **PV-5100** | `SCORING_FAILED` | GNN scoring crashed. Check CUDA, check the graph, check your assumptions. |
| **PV-5101** | `NO_GRAPH_PICKLE` | Graph pickle file not found for scoring. Step 4 must have failed silently. |
| **PV-5102** | `NO_SCORES_OUTPUT` | Scoring ran but produced no TSV. The GNN had nothing to say about your paper. |
| **PV-5103** | `CHECKPOINT_CORRUPT` | Model checkpoint failed to load. Delete `deepthought_model.pt` and retrain. |
| **PV-5200** | `DEEPTHOUGHT_MISSING` | 42DeepThought directory not found. The entire scoring engine is missing. |
| **PV-5201** | `LABELS_MISSING` | `42d_scoring.tsv` not found. No reference labels means no supervised scoring. |

---

## Step 6: Cluster Similarity Analysis

Finding your paper's neighbors in topic space.

| Code | Name | Description |
|------|------|-------------|
| **PV-6000** | `CLUSTER_OK` | Cluster analysis complete. Your paper's social circle has been mapped. |
| **PV-6100** | `CLUSTER_FAILED` | Cluster analysis crashed. Your paper is a loner, or the code is. |
| **PV-6101** | `NO_SANITY_PICKLE` | Sanitycheck pickle not found. Can't do cluster analysis without clusters. |
| **PV-6102** | `DB_TIMEOUT` | Cluster database population timed out (>1 hour). The LLM is still thinking. |
| **PV-6103** | `NO_QUERY_ID` | Could not extract query paper ID from TSV. Who are you, even? |
| **PV-6200** | `CLUSTER_SKIPPED` | Cluster analysis skipped (prerequisites not met). Not fatal, just lonely. |

---

## Step 7: Data Enrichment

Merging all the analysis results into one JSON payload.

| Code | Name | Description |
|------|------|-------------|
| **PV-7000** | `ENRICH_OK` | Data enrichment complete. Everything is unified and beautiful. |
| **PV-7100** | `ENRICH_FAILED` | Enrichment script crashed. The data refused to be unified. |
| **PV-7200** | `ENRICH_SKIPPED` | Enrichment skipped (cluster analysis didn't complete). Can't enrich what doesn't exist. |

---

## Step 8: Interactive Visualization

The grand finale. Turning data into something you can click on.

| Code | Name | Description |
|------|------|-------------|
| **PV-8000** | `VIZ_OK` | Visualization generated. Open the HTML and admire your knowledge graph. |
| **PV-8100** | `VIZ_FAILED` | Visualization generation failed. You'll have to imagine the graph. |
| **PV-8200** | `VIZ_SKIPPED` | Visualization skipped (enrichment didn't complete). No data, no graph, no glory. |

---

## Exit Code Summary

The pipeline exits with a code derived from the **step of the first fatal error**: the step number plus one, with Steps 7 and 8 sharing exit code `8`.

| Exit Code | Meaning |
|-----------|---------|
| `0` | Success: all steps completed (or non-fatal warnings only) |
| `1` | Step 0: PubGuard rejected the input |
| `2` | Step 1: Feature extraction failed |
| `3` | Step 2: PubVerse analysis failed |
| `4` | Step 3: Required artifacts missing |
| `5` | Step 4: Graph construction failed |
| `6` | Step 5: 42DeepThought scoring failed |
| `7` | Step 6: Cluster analysis failed (fatal only if no fallback) |
| `8` | Step 7/8: Enrichment or visualization failed |
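
Read as a rule, the table maps each step to its number plus one, capped at `8` so that Steps 7 and 8 share a value. A minimal sketch of that mapping follows. `exit_code_for` is a hypothetical helper, and deciding which codes count as fatal is assumed to happen elsewhere (for example, a Step 6 failure with a fallback would not be passed in).

```python
from typing import Optional


def exit_code_for(first_fatal_code: Optional[str]) -> int:
    """Map the first fatal error code (or None) to a process exit code."""
    if first_fatal_code is None:
        return 0  # success, or non-fatal warnings only
    if not first_fatal_code.startswith("PV-") or len(first_fatal_code) != 7:
        raise ValueError(f"not a PubVerse error code: {first_fatal_code!r}")
    step = int(first_fatal_code[3])  # the S digit in PV-SXNN
    if step > 8:
        raise ValueError(f"step digit out of range: {first_fatal_code!r}")
    return min(step + 1, 8)          # Steps 7 and 8 both exit with 8


print(exit_code_for(None))       # no fatal error
print(exit_code_for("PV-5100"))  # scoring failure in Step 5
```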

---

## Interpreting PubGuard Composite Codes

Quick decoder ring for the `PV-0XYZ` codes:

```
X (doc_type):   0=paper ✅  1=poster  2=abstract  3=junk
Y (ai_detect):  0=human  1=ai
Z (toxicity):   0=clean ✅  1=toxic ⚠️
```

Examples:
- `PV-0000` → Paper + Human + Clean → **PASS** ✅
- `PV-0300` → Junk + Human + Clean → *"That's not a paper"*
- `PV-0011` → Paper + AI + Toxic → *"Human-passing but spicy"* ⚠️
- `PV-0311` → Junk + AI + Toxic → *"The absolute worst"*
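
The decoder ring translates directly into lookup tables. The sketch below is illustrative only: `decode_pubguard` is a hypothetical name, and it deliberately rejects the special codes (`PV-0900` and friends), since their digits are not classifier indices.

```python
from typing import Tuple

# Lookup tables taken from the decoder ring above.
DOC_TYPE = {"0": "paper", "1": "poster", "2": "abstract", "3": "junk"}
AI_DETECT = {"0": "human", "1": "ai"}
TOXICITY = {"0": "clean", "1": "toxic"}


def decode_pubguard(code: str) -> Tuple[str, str, str]:
    """Turn a composite 'PV-0XYZ' code into (doc_type, ai_detect, toxicity)."""
    if len(code) != 7 or not code.startswith("PV-0"):
        raise ValueError(f"not a PubGuard code: {code!r}")
    x, y, z = code[4], code[5], code[6]
    try:
        return DOC_TYPE[x], AI_DETECT[y], TOXICITY[z]
    except KeyError:
        # Digits outside the tables, e.g. the special PV-09xx codes.
        raise ValueError(f"{code!r} is not a composite code") from None


print(" + ".join(decode_pubguard("PV-0311")))
```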

---

*PubGuard v0.1.0: Because science has standards, even if your PDF doesn't.*