aivolcano committed on
Commit 5330800 · 2 Parent(s): 3d83b62 d60a819

Merge branch 'main' of https://github.com/aivolcano/CiteScan

.gitignore CHANGED
@@ -41,6 +41,7 @@ env/
  # Project Specific Outputs
  *.md
  !README.md
+ !docs/**/*.md
  *_only_used_entry.bib

  # LaTeX and Bibliography (User Data)
@@ -58,4 +59,7 @@ env/
  *.fdb_latexmk

  # cache
- .cache
+ .cache
+
+ # Git worktrees
+ .worktrees/
app.py CHANGED
@@ -579,19 +579,19 @@ with gr.Blocks(title="CiteScan - Check References, Confirm Truth.", theme=gr.the
  *Case Study for False positive* in CiteScan:

  1. **Authors Mismatch**:
- - *Reason*: Different databases deal with a longer list of authors with different strategies, like truncation.
+ - *Observation*: Different databases deal with a longer list of authors with different strategies, like truncation.
  - *Action*: Verify if main authors match

  2. **Venues Mismatch**:
- - *Reason*: Abbreviations vs. full names, such as "ICLR" v.s. "International Conference on Learning Representations"
+ - *Observation*: Abbreviations vs. full names, such as "ICLR" v.s. "International Conference on Learning Representations"
  - *Action*: Both are correct.

  3. **Year GAP (±1 Year)**:
- - *Reason*: Delay between preprint (arXiv) and final version publication
+ - *Observation*: Delay between preprint (arXiv) and final version publication
  - *Action*: Verify which version you intend to cite, We recommend you to cite the version from the official press website. Less number of pre-print version bibs will make your submission more convincing.

  4. **Non-academic Sources**:
- - *Reason*: Blogs, and APIs are not indexed in academic databases.
+ - *Observation*: Blogs, and APIs are not indexed in academic databases.
  - *Action*: Verify URL, year, and title manually.
  ---
  **Supported Data Sources:** arXiv, CrossRef, DBLP, Semantic Scholar, ACL Anthology, ACM, theCVF,
docs/plans/2026-01-30-arxiv-to-official-conversion-design.md ADDED
@@ -0,0 +1,357 @@
+ # arXiv to Official Publication Conversion - Design Document
+
+ **Date:** 2026-01-30
+ **Feature:** Smart BibTeX Cleanup - arXiv → Official Publication Conversion
+ **Goal:** Transform CiteScan from a verification tool into an original smart bibliography management system
+
+ ---
+
+ ## 1. Overview
+
+ ### Problem Statement
+ Researchers often cite arXiv preprints in their bibliographies, but academic submissions prefer official publication citations (conference/journal versions). Manually finding and updating these citations is tedious and error-prone.
+
+ ### Solution
+ Extend CiteScan to automatically detect when an arXiv preprint has been officially published, and provide users with easy access to both versions so they can copy the official BibTeX from the venue website.
+
+ ### Key Constraint
+ **$0 maintenance cost** - No ongoing API costs, no server infrastructure. Pure algorithmic processing using existing free academic APIs.
+
+ ---
+
+ ## 2. Core Architecture
+
+ ### Current Flow
+ ```
+ Parse BibTeX → Fetch metadata → Compare & verify → Display results
+ ```
+
+ ### Enhanced Flow
+ ```
+ Parse BibTeX → Fetch metadata → Compare & verify → Detect arXiv preprints
+
+ If arXiv detected → Search for official version → Match validation
+
+ Display results with dual links (arXiv + Official)
+ ```
+
+ ### Components Modified
+ 1. **Data Model** (`src/analyzers/metadata_comparator.py`)
+ 2. **Detection Logic** (`src/analyzers/metadata_comparator.py`)
+ 3. **Workflow** (`app.py` - `process_single_entry()`)
+ 4. **UI Rendering** (`app.py` - `format_entry_card()`)
+ 5. **CSS Styling** (`app.py` - `REPORT_CSS`)
+
+ ---
+
+ ## 3. Detection Logic
+
+ ### Step 1: Identify arXiv Entries
+ An entry is considered an arXiv preprint if:
+ - Has `eprint` field with arXiv ID pattern (e.g., "2304.12345")
+ - `journal` or `booktitle` field contains "arXiv" or "preprint"
+ - URL contains "arxiv.org"
+
+ **Important:** If venue field contains known conference/journal names (ACL, NeurIPS, ICML, etc.), treat as official publication even if arXiv is mentioned in notes.
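
The Step 1 rule can be sketched roughly as follows. This is an illustrative sketch only: it assumes the three listed conditions are alternatives (any one is enough), it represents an entry as a plain dict of BibTeX fields rather than the project's `BibEntry`, and `KNOWN_VENUES` is a hypothetical shortlist standing in for the full venue table.

```python
import re

# Hypothetical shortlist; the design's "known conference/journal names".
KNOWN_VENUES = {"acl", "neurips", "icml", "iclr", "emnlp", "cvpr"}

# New-style arXiv identifiers such as "2304.12345", optionally versioned ("v2").
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def detect_arxiv_entry(fields: dict) -> bool:
    """Return True if the BibTeX fields look like an arXiv preprint."""
    venue = (fields.get("journal") or fields.get("booktitle") or "").lower()

    # A known venue takes precedence over any arXiv mention elsewhere.
    if set(re.findall(r"[a-z]+", venue)) & KNOWN_VENUES:
        return False

    if "arxiv" in venue or "preprint" in venue:
        return True
    if ARXIV_ID.match(fields.get("eprint", "").strip()):
        return True
    return "arxiv.org" in fields.get("url", "").lower()
```

Checking the venue first is what makes an entry such as `booktitle = {Proceedings of ACL 2025}` with `note = {arXiv:2304.12345}` count as official.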
+
+ ### Step 2: Search for Official Version
+ Use existing fetchers in priority order:
+ 1. **Semantic Scholar** - Query by arXiv ID (best for finding publication venue)
+ 2. **CrossRef** - Query by title (best for DOI)
+ 3. **DBLP** - Query by title (best for CS conferences)
+ 4. **OpenAlex** - Query by DOI or title (comprehensive coverage)
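
The priority order above amounts to a first-acceptable-hit loop. The sketch below assumes each fetcher can be wrapped as a callable that takes the entry and returns a metadata dict or `None`; the function name, the `(source, callable)` pairs, and the `is_official` predicate are hypothetical and are not the project's fetcher API.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical fetcher shape: entry fields in, metadata dict (or None) out.
Fetcher = Callable[[dict], Optional[dict]]


def search_official_version(
    entry: dict,
    fetchers: List[Tuple[str, Fetcher]],
    is_official: Callable[[dict], bool],
) -> Optional[dict]:
    """Try sources in priority order; return the first result that validates."""
    for source, fetch in fetchers:
        try:
            candidate = fetch(entry)
        except Exception:
            continue  # a failing source should not block the remaining ones
        if candidate is not None and is_official(candidate):
            return {"source": source, **candidate}
    return None
```

Because the loop stops at the first validated hit, lower-priority sources are only queried when the earlier ones return nothing usable, which keeps the number of API calls down.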
+
+ ### Step 3: Validate Official Publication
+ A fetched result is considered an official publication if ALL criteria met:
+ - ✓ Has DOI (and DOI is NOT arxiv.org)
+ - ✓ Venue is NOT "arXiv" or "preprint"
+ - ✓ Venue matches known conferences/journals (NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, ECCV, ACM, IEEE, etc.)
+ - ✓ Title similarity > 95%
+ - ✓ Author overlap > 70%
+ - ✓ Year is same or +1 year from arXiv (accounts for publication delay)
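
A hedged sketch of the six checks, applied in the order listed. Several choices here are assumptions rather than the project's implementation: title similarity uses `difflib.SequenceMatcher`, author overlap is the share of the entry's author last names found in the fetched record, metadata is passed as plain dicts, and `KNOWN_VENUES` is an abbreviated stand-in for the venue table.

```python
import re
from difflib import SequenceMatcher

# Abbreviated, hypothetical venue list.
KNOWN_VENUES = {"neurips", "icml", "iclr", "acl", "naacl", "emnlp",
                "cvpr", "iccv", "eccv", "acm", "ieee"}


def _last_names(authors):
    """Lower-cased last token of each author name."""
    return {a.split()[-1].lower() for a in authors if a.strip()}


def is_official_publication(fetched: dict, bib: dict) -> bool:
    doi = (fetched.get("doi") or "").lower()
    venue = (fetched.get("venue") or "").lower()

    # 1. A DOI must exist and must not be an arXiv-issued one (10.48550/arXiv.*).
    if not doi or "arxiv" in doi:
        return False
    # 2. The venue must not itself be arXiv or a preprint server.
    if "arxiv" in venue or "preprint" in venue:
        return False
    # 3. The venue must mention a known conference or journal.
    if not set(re.findall(r"[a-z]+", venue)) & KNOWN_VENUES:
        return False
    # 4. Title similarity > 95%.
    ratio = SequenceMatcher(None, fetched["title"].lower(), bib["title"].lower()).ratio()
    if ratio <= 0.95:
        return False
    # 5. Author overlap > 70%, measured against the entry's own author list.
    bib_names = _last_names(bib["authors"])
    if not bib_names:
        return False
    overlap = len(bib_names & _last_names(fetched["authors"])) / len(bib_names)
    if overlap <= 0.70:
        return False
    # 6. Same year, or one year after the arXiv posting.
    return fetched["year"] - bib["year"] in (0, 1)
```

The cheap string checks run first so that the similarity computation only happens for candidates that already have a plausible DOI and venue.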
+
+ ### Step 4: Extract Official URL
+ Priority order:
+ 1. **DOI link** - `https://doi.org/{doi}` (most reliable)
+ 2. **Venue-specific URL** - ACL Anthology, CVF, ACM Digital Library, IEEE Xplore
+ 3. **Paper URL from database** - Fallback from Semantic Scholar/OpenAlex
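
The URL priority can be expressed as a short fallback chain. In this sketch the dict keys `doi`, `venue_url`, and `url` are assumed names for illustration; the real fetched-metadata objects may expose these values differently.

```python
from typing import Optional


def extract_official_url(fetched: dict) -> Optional[str]:
    """Pick the most reliable link available for the official version."""
    # 1. DOI link, resolved through doi.org.
    doi = (fetched.get("doi") or "").strip()
    if doi:
        return f"https://doi.org/{doi}"
    # 2. Venue-specific URL, then 3. whatever URL the database returned.
    for key in ("venue_url", "url"):
        value = (fetched.get(key) or "").strip()
        if value:
            return value
    return None
```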
+
+ ---
+
+ ## 4. Data Model Changes
+
+ ### ComparisonResult Enhancements
+ Add new fields to `ComparisonResult` dataclass:
+
+ ```python
+ @dataclass
+ class ComparisonResult:
+     # ... existing fields ...
+
+     # New fields for arXiv conversion
+     is_arxiv_preprint: bool = False
+     has_official_version: bool = False
+     official_venue: Optional[str] = None  # e.g., "ACL 2025", "NeurIPS 2024"
+     arxiv_url: Optional[str] = None
+     official_url: Optional[str] = None
+ ```
+
+ ### New Methods
+ ```python
+ class MetadataComparator:
+     def detect_arxiv_entry(self, bib_entry: BibEntry) -> bool:
+         """Check if entry is an arXiv preprint."""
+
+     def is_official_publication(self, fetched_metadata, bib_entry: BibEntry) -> bool:
+         """Validate if fetched metadata represents official publication."""
+
+     def extract_venue_name(self, fetched_metadata) -> str:
+         """Extract clean venue name (e.g., 'ACL 2025')."""
+ ```
+
+ ---
+
+ ## 5. User Interface Design
+
+ ### Current UI (Verified Entry)
+ ```
+ ✓ [Entry Key] [Open paper]
+ Tags: ✓ Verified | Source: arxiv
+ ```
+
+ ### New UI (arXiv with Official Version Found)
+ ```
+ ✓ [Entry Key] [arXiv] [Official: ACL 2025]
+ Tags: ✓ Verified | ⬆️ Official Version Available | Source: semantic_scholar
+
+ Reference (from semantic_scholar):
+ Title: BERT: Pre-training of Deep Bidirectional Transformers
+ Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
+ Year: 2019
+ Venue: NAACL 2019
+ DOI: 10.18653/v1/N19-1423
+ ```
+
+ ### UI Elements
+
+ **1. Dual Button Layout**
+ - `[arXiv]` button - Gray background, links to arxiv.org
+ - `[Official: {venue}]` button - Blue background (prominent), links to official venue
+ - Both buttons side-by-side in header
+
+ **2. New Tag**
+ - `⬆️ Official Version Available` - Orange/blue color
+ - Indicates user should consider using official citation
+
+ **3. Reference Section**
+ - Shows official metadata (venue, year, DOI)
+ - This is what user will find when clicking official link
+
+ ### User Workflow
+ 1. User clicks `[Official: ACL 2025]` button
+ 2. Browser opens official venue page (e.g., https://aclanthology.org/2025.acl-long.272/)
+ 3. User clicks "Cite" button on venue website
+ 4. Popup/modal appears with BibTeX
+ 5. User copies official BibTeX
+ 6. User replaces arXiv entry in their .bib file
+
+ **Rationale:** Many academic websites (ACL Anthology, ACM, IEEE, CVF) use JavaScript modals for citations. The URL doesn't change, making automated fetching difficult without browser automation (which would break $0 cost goal). Manual copy is pragmatic and reliable.
+
+ ---
+
+ ## 6. Edge Cases & Handling
+
+ ### Case 1: arXiv Paper Not Yet Published
+ - **Scenario:** Paper still under review or recently posted
+ - **Behavior:** Show only `[arXiv]` button (current behavior)
+ - **Tags:** `✓ Verified | Source: arxiv` (no official version tag)
+
+ ### Case 2: Multiple Official Versions
+ - **Scenario:** Conference version + journal extension
+ - **Behavior:** Choose highest confidence match
+ - **Priority:** Journal > Conference > Workshop
+ - **Implementation:** Sort by venue prestige, pick first match
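
One way to read the Case 2 rule is as a rank lookup followed by a stable minimum. This is a sketch under assumptions: the `venue_type` key and its three labels are hypothetical, and how a candidate gets classified as journal, conference, or workshop is left out.

```python
from typing import List, Optional

# Lower rank wins: Journal > Conference > Workshop.
VENUE_RANK = {"journal": 0, "conference": 1, "workshop": 2}


def pick_official_version(candidates: List[dict]) -> Optional[dict]:
    """Return the candidate with the most preferred venue type.

    Unknown venue types rank last; ties keep the earliest candidate,
    because min() returns the first minimal element.
    """
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: VENUE_RANK.get(c.get("venue_type"), len(VENUE_RANK)),
    )
```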
+
+ ### Case 3: Publication Delay in Databases
+ - **Scenario:** Paper published but databases haven't indexed yet
+ - **Behavior:** Shows as arXiv only (no official version found)
+ - **Acceptable:** Databases update with 1-4 week delay, user can re-check later
+
+ ### Case 4: False Positive Match
+ - **Scenario:** Different paper with very similar title
+ - **Mitigation:** High thresholds (95% title, 70% author overlap)
+ - **User Verification:** User clicks both links to verify it's the same paper
+ - **Risk:** Low due to strict matching criteria
+
+ ### Case 5: Mixed Input (arXiv + Official)
+ - **Scenario:** User pastes BibTeX with both arXiv preprints and official publications
+ - **Behavior:**
+   - Official publications → Normal verification (no conversion logic)
+   - arXiv preprints → Trigger conversion detection
+ - **Detection Rule:** Check venue field first
+   - If venue = "arXiv" or "preprint" → arXiv preprint
+   - If venue = real conference/journal → Official publication
+
+ ### Case 6: Official Entry Mentioning arXiv
+ - **Scenario:** BibTeX has official venue but `note = {arXiv:2304.12345}`
+ - **Behavior:** Treat as official publication (no conversion needed)
+ - **Detection:** Venue field takes precedence over notes/comments
+
+ ---
+
+ ## 7. Implementation Breakdown
+
+ ### Phase 1: Data Model & Detection Logic
+ **Files:** `src/analyzers/metadata_comparator.py`
+
+ 1. Add new fields to `ComparisonResult` dataclass
+ 2. Implement `detect_arxiv_entry()` method
+ 3. Implement `is_official_publication()` method
+ 4. Implement `extract_venue_name()` method
+ 5. Update all `compare_with_*()` methods to populate new fields
+
+ ### Phase 2: Workflow Integration
+ **Files:** `app.py`
+
+ 1. Modify `process_single_entry()`:
+    - After finding best match, check if original entry is arXiv
+    - If arXiv, validate if match is official publication
+    - Store both arXiv URL and official URL
+ 2. Update comparison result creation logic
+
+ ### Phase 3: UI Rendering
+ **Files:** `app.py`
+
+ 1. Modify `format_entry_card()`:
+    - Check `has_official_version` flag
+    - Render dual buttons if true
+    - Add "Official Version Available" tag
+    - Update reference section to show official venue
+ 2. Update button generation logic in header
+
+ ### Phase 4: CSS Styling
+ **Files:** `app.py` - `REPORT_CSS`
+
+ 1. Add styles for dual-button layout
+ 2. Style official button prominently (blue background)
+ 3. Style arXiv button subtly (gray background)
+ 4. Add styling for new tag type (orange/blue)
+ 5. Ensure responsive layout for mobile
+
+ ### Phase 5: Testing
+ 1. Test with pure arXiv entries
+ 2. Test with pure official entries
+ 3. Test with mixed entries
+ 4. Test edge cases (not published, false positives, etc.)
+ 5. Test UI on different screen sizes
+
+ ---
+
+ ## 8. Known Venue Patterns
+
+ ### Top-tier AI/ML/CS Conferences
+ - NeurIPS, ICML, ICLR, AAAI, IJCAI
+ - ACL, EMNLP, NAACL, EACL, COLING
+ - CVPR, ICCV, ECCV, SIGGRAPH
+ - SIGIR, KDD, WWW, WSDM
+ - ICSE, FSE, ASE, ISSTA
+
+ ### Top-tier Journals
+ - JMLR, PAMI, IJCV, TACL
+ - CACM, TOCS, TODS, TKDE
+
+ ### Venue Name Variations
+ - Handle abbreviations: "ICLR" vs "International Conference on Learning Representations"
+ - Handle year formats: "ACL 2025" vs "ACL'25" vs "Proceedings of ACL 2025"
+ - Normalize for comparison
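
The three variation rules could be folded into one normalizer that returns a `(name, year)` pair. This is a rough sketch: `VENUE_ALIASES` is a hypothetical two-entry lookup, and mapping a two-digit year such as `'25` to 2025 assumes every such year falls in the 2000s.

```python
import re
from typing import Optional, Tuple

# Hypothetical alias table: full venue name (lower-cased) -> abbreviation.
VENUE_ALIASES = {
    "international conference on learning representations": "ICLR",
    "annual meeting of the association for computational linguistics": "ACL",
}


def normalize_venue(raw: str) -> Tuple[str, Optional[int]]:
    """Split a venue string into a canonical name and an optional year."""
    text = raw.strip()
    year = None

    # Four-digit year ("ACL 2025") first, then the apostrophe form ("ACL'25").
    match = re.search(r"\b(?:19|20)\d{2}\b", text)
    if match:
        year = int(match.group())
    else:
        match = re.search(r"'(\d{2})\b", text)
        if match:
            year = 2000 + int(match.group(1))  # assumption: 21st-century venues
    if match:
        text = text[:match.start()] + text[match.end():]

    # Drop a leading "Proceedings of (the)" and collapse whitespace.
    text = re.sub(r"^proceedings of( the)?\s+", "", text.strip(), flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text).strip()

    return VENUE_ALIASES.get(text.lower(), text), year
```

With this shape, two venue strings refer to the same venue when their normalized pairs are equal, so "ACL 2025", "ACL'25", and "Proceedings of ACL 2025" compare as one.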
+
+ ---
+
+ ## 9. Success Metrics
+
+ ### Functional Metrics
+ - **Detection Accuracy:** >95% of arXiv entries correctly identified
+ - **Match Accuracy:** >90% of official versions correctly matched
+ - **False Positive Rate:** <5% incorrect matches
+
+ ### User Experience Metrics
+ - **Conversion Rate:** % of users who click official links
+ - **Time Saved:** Estimated 2-5 minutes per converted citation
+ - **User Satisfaction:** Qualitative feedback on feature usefulness
+
+ ### Technical Metrics
+ - **API Cost:** $0 (using existing free APIs)
+ - **Performance:** <500ms additional processing per entry
+ - **Reliability:** No new external dependencies
+
+ ---
+
+ ## 10. Future Enhancements (Out of Scope)
+
+ ### Potential Extensions
+ 1. **One-click BibTeX replacement** - Fetch official BibTeX directly (requires browser automation or venue-specific parsers)
+ 2. **Batch export** - Export all official versions as a new .bib file
+ 3. **bioRxiv support** - Extend to biology/medical preprints
+ 4. **Preprint quality scoring** - Flag papers that should be updated to official versions
+ 5. **Citation style conversion** - Convert between BibTeX, BibLaTeX, EndNote, etc.
+
+ ### Why Not Now
+ - Maintain focus on core arXiv→Official conversion
+ - Keep $0 maintenance cost constraint
+ - Avoid scope creep during initial implementation
+
+ ---
+
+ ## 11. Risks & Mitigations
+
+ ### Risk 1: API Rate Limits
+ - **Impact:** Semantic Scholar, CrossRef have rate limits
+ - **Mitigation:** Already handled in existing fetchers with delays/retries
+ - **Status:** Low risk (existing system already manages this)
+
+ ### Risk 2: Database Coverage Gaps
+ - **Impact:** Some papers not indexed in any database
+ - **Mitigation:** Use multiple databases (Semantic Scholar, CrossRef, DBLP, OpenAlex)
+ - **Status:** Acceptable (user can manually verify)
+
+ ### Risk 3: Venue Website Changes
+ - **Impact:** Official venue websites change their "Cite" button UI
+ - **Mitigation:** We only provide links, not scraping - user handles the rest
+ - **Status:** Low risk (our system is decoupled from venue UI)
+
+ ### Risk 4: False Matches
+ - **Impact:** Suggest wrong official version
+ - **Mitigation:** High similarity thresholds + user verification via dual links
+ - **Status:** Low risk (strict matching criteria)
+
+ ---
+
+ ## 12. Summary
+
+ ### What Changes
+ - **From:** BibTeX verification tool
+ - **To:** Smart bibliography management system with arXiv→Official conversion
+
+ ### Key Innovation
+ - Automatic detection of arXiv preprints with official publications
+ - Dual-link UI for easy access to both versions
+ - Zero maintenance cost using existing free APIs
+
+ ### User Value
+ - Save 2-5 minutes per citation conversion
+ - Reduce errors in bibliography management
+ - Improve citation quality for paper submissions
+
+ ### Technical Approach
+ - Extend existing verification workflow
+ - Add detection logic for arXiv entries
+ - Enhance UI with dual buttons and tags
+ - Maintain $0 cost constraint
+
+ ---
+
+ **Next Steps:**
+ 1. Review and approve this design
+ 2. Set up git worktree for isolated development
+ 3. Create detailed implementation plan
+ 4. Begin Phase 1 implementation
src/analyzers/metadata_comparator.py CHANGED
@@ -56,7 +56,7 @@ class MetadataComparator:

  # Thresholds for matching
  TITLE_THRESHOLD = 0.99
- AUTHOR_THRESHOLD = 0.65
+ AUTHOR_THRESHOLD = 0.5

  def __init__(self):
  self.normalizer = TextNormalizer