Merge branch 'main' of https://github.com/aivolcano/CiteScan
- .gitignore +5 -1
- app.py +4 -4
- docs/plans/2026-01-30-arxiv-to-official-conversion-design.md +357 -0
- src/analyzers/metadata_comparator.py +1 -1
.gitignore
CHANGED
|
@@ -41,6 +41,7 @@ env/
|
|
| 41 |
# Project Specific Outputs
|
| 42 |
*.md
|
| 43 |
!README.md
|
|
|
|
| 44 |
*_only_used_entry.bib
|
| 45 |
|
| 46 |
# LaTeX and Bibliography (User Data)
|
|
@@ -58,4 +59,7 @@ env/
|
|
| 58 |
*.fdb_latexmk
|
| 59 |
|
| 60 |
# cache
|
| 61 |
-
.cache
|
| 41 |
# Project Specific Outputs
|
| 42 |
*.md
|
| 43 |
!README.md
|
| 44 |
+
!docs/**/*.md
|
| 45 |
*_only_used_entry.bib
|
| 46 |
|
| 47 |
# LaTeX and Bibliography (User Data)
|
|
|
|
| 59 |
*.fdb_latexmk
|
| 60 |
|
| 61 |
# cache
|
| 62 |
+
.cache
|
| 63 |
+
|
| 64 |
+
# Git worktrees
|
| 65 |
+
.worktrees/
|
app.py
CHANGED
|
@@ -579,19 +579,19 @@ with gr.Blocks(title="CiteScan - Check References, Confirm Truth.", theme=gr.the
|
|
| 579 |
*Case studies of false positives* in CiteScan:
|
| 580 |
|
| 581 |
1. **Authors Mismatch**:
|
| 582 |
-
- *
|
| 583 |
- *Action*: Verify if main authors match
|
| 584 |
|
| 585 |
2. **Venues Mismatch**:
|
| 586 |
-
- *
|
| 587 |
- *Action*: Both are correct.
|
| 588 |
|
| 589 |
3. **Year GAP (±1 Year)**:
|
| 590 |
-
- *
|
| 591 |
- *Action*: Verify which version you intend to cite. We recommend citing the version from the official publisher's website; fewer preprint entries will make your submission more convincing.
|
| 592 |
|
| 593 |
4. **Non-academic Sources**:
|
| 594 |
-
- *
|
| 595 |
- *Action*: Verify URL, year, and title manually.
|
| 596 |
---
|
| 597 |
**Supported Data Sources:** arXiv, CrossRef, DBLP, Semantic Scholar, ACL Anthology, ACM, theCVF,
|
|
|
|
| 579 |
*Case studies of false positives* in CiteScan:
|
| 580 |
|
| 581 |
1. **Authors Mismatch**:
|
| 582 |
+
- *Observation*: Databases handle long author lists differently, for example by truncating them.
|
| 583 |
- *Action*: Verify if main authors match
|
| 584 |
|
| 585 |
2. **Venues Mismatch**:
|
| 586 |
+
- *Observation*: Abbreviations vs. full names, such as "ICLR" vs. "International Conference on Learning Representations"
|
| 587 |
- *Action*: Both are correct.
|
| 588 |
|
| 589 |
3. **Year GAP (±1 Year)**:
|
| 590 |
+
- *Observation*: Delay between preprint (arXiv) and final version publication
|
| 591 |
- *Action*: Verify which version you intend to cite. We recommend citing the version from the official publisher's website; fewer preprint entries will make your submission more convincing.
|
| 592 |
|
| 593 |
4. **Non-academic Sources**:
|
| 594 |
+
- *Observation*: Blogs and APIs are not indexed in academic databases.
|
| 595 |
- *Action*: Verify URL, year, and title manually.
|
| 596 |
---
|
| 597 |
**Supported Data Sources:** arXiv, CrossRef, DBLP, Semantic Scholar, ACL Anthology, ACM, theCVF,
|
docs/plans/2026-01-30-arxiv-to-official-conversion-design.md
ADDED
|
@@ -0,0 +1,357 @@
|
|
| 1 |
+
# arXiv to Official Publication Conversion - Design Document
|
| 2 |
+
|
| 3 |
+
**Date:** 2026-01-30
|
| 4 |
+
**Feature:** Smart BibTeX Cleanup - arXiv → Official Publication Conversion
|
| 5 |
+
**Goal:** Transform CiteScan from a verification tool into a smart bibliography management system
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## 1. Overview
|
| 10 |
+
|
| 11 |
+
### Problem Statement
|
| 12 |
+
Researchers often cite arXiv preprints in their bibliographies, but academic submissions prefer official publication citations (conference/journal versions). Manually finding and updating these citations is tedious and error-prone.
|
| 13 |
+
|
| 14 |
+
### Solution
|
| 15 |
+
Extend CiteScan to automatically detect when an arXiv preprint has been officially published, and provide users with easy access to both versions so they can copy the official BibTeX from the venue website.
|
| 16 |
+
|
| 17 |
+
### Key Constraint
|
| 18 |
+
**$0 maintenance cost** - No ongoing API costs, no server infrastructure. Pure algorithmic processing using existing free academic APIs.
|
| 19 |
+
|
| 20 |
+
---
|
| 21 |
+
|
| 22 |
+
## 2. Core Architecture
|
| 23 |
+
|
| 24 |
+
### Current Flow
|
| 25 |
+
```
|
| 26 |
+
Parse BibTeX → Fetch metadata → Compare & verify → Display results
|
| 27 |
+
```
|
| 28 |
+
|
| 29 |
+
### Enhanced Flow
|
| 30 |
+
```
|
| 31 |
+
Parse BibTeX → Fetch metadata → Compare & verify → Detect arXiv preprints
|
| 32 |
+
↓
|
| 33 |
+
If arXiv detected → Search for official version → Match validation
|
| 34 |
+
↓
|
| 35 |
+
Display results with dual links (arXiv + Official)
|
| 36 |
+
```
|
| 37 |
+
|
| 38 |
+
### Components Modified
|
| 39 |
+
1. **Data Model** (`src/analyzers/metadata_comparator.py`)
|
| 40 |
+
2. **Detection Logic** (`src/analyzers/metadata_comparator.py`)
|
| 41 |
+
3. **Workflow** (`app.py` - `process_single_entry()`)
|
| 42 |
+
4. **UI Rendering** (`app.py` - `format_entry_card()`)
|
| 43 |
+
5. **CSS Styling** (`app.py` - `REPORT_CSS`)
|
| 44 |
+
|
| 45 |
+
---
|
| 46 |
+
|
| 47 |
+
## 3. Detection Logic
|
| 48 |
+
|
| 49 |
+
### Step 1: Identify arXiv Entries
|
| 50 |
+
An entry is considered an arXiv preprint if:
|
| 51 |
+
- Has `eprint` field with arXiv ID pattern (e.g., "2304.12345")
|
| 52 |
+
- `journal` or `booktitle` field contains "arXiv" or "preprint"
|
| 53 |
+
- URL contains "arxiv.org"
|
| 54 |
+
|
| 55 |
+
**Important:** If venue field contains known conference/journal names (ACL, NeurIPS, ICML, etc.), treat as official publication even if arXiv is mentioned in notes.
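The Step 1 rule can be sketched as follows. This is a minimal illustration, not CiteScan's actual code: it assumes entries are plain dicts of BibTeX fields, that the three signals are combined with OR, and that the known-venue check runs first. The names `looks_like_arxiv_preprint`, `KNOWN_VENUES`, and `ARXIV_ID` are hypothetical.

```python
import re

# Hypothetical, abbreviated venue list; the design's fuller list is in Section 8.
KNOWN_VENUES = {"acl", "neurips", "icml", "iclr", "emnlp", "cvpr"}

# New-style arXiv identifiers such as "2304.12345" (optionally with a version).
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def looks_like_arxiv_preprint(entry: dict) -> bool:
    """Apply the Step 1 signals; a known venue overrides any arXiv mention."""
    venue = (entry.get("journal") or entry.get("booktitle") or "").lower()
    venue_tokens = set(re.findall(r"[a-z]+", venue))

    # Venue takes precedence: an official venue means no conversion is needed.
    if venue_tokens & KNOWN_VENUES:
        return False

    has_arxiv_id = bool(ARXIV_ID.match(entry.get("eprint", "").strip()))
    venue_says_arxiv = "arxiv" in venue or "preprint" in venue
    url_is_arxiv = "arxiv.org" in entry.get("url", "").lower()
    return has_arxiv_id or venue_says_arxiv or url_is_arxiv
```

For example, an entry with `booktitle = {Proceedings of ACL 2025}` and `note = {arXiv:2304.12345}` is treated as an official publication, because the `note` field is never consulted and the venue check runs first.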
|
| 56 |
+
|
| 57 |
+
### Step 2: Search for Official Version
|
| 58 |
+
Use existing fetchers in priority order:
|
| 59 |
+
1. **Semantic Scholar** - Query by arXiv ID (best for finding publication venue)
|
| 60 |
+
2. **CrossRef** - Query by title (best for DOI)
|
| 61 |
+
3. **DBLP** - Query by title (best for CS conferences)
|
| 62 |
+
4. **OpenAlex** - Query by DOI or title (comprehensive coverage)
|
| 63 |
+
|
| 64 |
+
### Step 3: Validate Official Publication
|
| 65 |
+
A fetched result is considered an official publication if ALL of the following criteria are met:
|
| 66 |
+
- ✓ Has a DOI, and it is not an arXiv-issued DOI (prefix `10.48550/arXiv.`)
|
| 67 |
+
- ✓ Venue is NOT "arXiv" or "preprint"
|
| 68 |
+
- ✓ Venue matches known conferences/journals (NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, ECCV, ACM, IEEE, etc.)
|
| 69 |
+
- ✓ Title similarity > 95%
|
| 70 |
+
- ✓ Author overlap > 70%
|
| 71 |
+
- ✓ Year is same or +1 year from arXiv (accounts for publication delay)
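The title, author, and year criteria above might be checked along these lines. This is a sketch under stated assumptions: similarity is computed with the standard library's `difflib.SequenceMatcher` (the design does not name a metric), author names are in "First Last" order, and the function names and dict keys are hypothetical. The DOI and venue checks are left out.

```python
from difflib import SequenceMatcher
from typing import Sequence


def _norm(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def _surname(name: str) -> str:
    """Last whitespace-separated token, assuming 'First Last' order."""
    return name.split()[-1].lower()


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after normalization."""
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()


def author_overlap(bib_authors: Sequence[str], fetched_authors: Sequence[str]) -> float:
    """Fraction of BibTeX author surnames that also appear in the fetched record."""
    bib = {_surname(n) for n in bib_authors if n.strip()}
    fetched = {_surname(n) for n in fetched_authors if n.strip()}
    return len(bib & fetched) / len(bib) if bib else 0.0


def passes_match_checks(bib: dict, fetched: dict) -> bool:
    """Title, author, and year criteria from Step 3 (DOI and venue checks omitted)."""
    year_gap = fetched["year"] - bib["year"]
    return (
        title_similarity(bib["title"], fetched["title"]) > 0.95
        and author_overlap(bib["authors"], fetched["authors"]) > 0.70
        and year_gap in (0, 1)
    )
```

Measuring overlap against the BibTeX author list (rather than the union of both lists) is one possible choice; it tolerates a fetched record that lists extra authors but penalizes a truncated one.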
|
| 72 |
+
|
| 73 |
+
### Step 4: Extract Official URL
|
| 74 |
+
Priority order:
|
| 75 |
+
1. **DOI link** - `https://doi.org/{doi}` (most reliable)
|
| 76 |
+
2. **Venue-specific URL** - ACL Anthology, CVF, ACM Digital Library, IEEE Xplore
|
| 77 |
+
3. **Paper URL from database** - Fallback from Semantic Scholar/OpenAlex
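The priority order above amounts to a simple fallback chain. The sketch below assumes the three candidate values have already been pulled out of the fetched metadata; `pick_official_url` and its parameter names are hypothetical.

```python
from typing import Optional


def pick_official_url(doi: Optional[str] = None,
                      venue_url: Optional[str] = None,
                      database_url: Optional[str] = None) -> Optional[str]:
    """Return the first available URL in the Step 4 priority order."""
    if doi:
        # Strip a resolver prefix if the record already stores a full DOI URL.
        bare = doi.strip()
        for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
            if bare.lower().startswith(prefix):
                bare = bare[len(prefix):]
                break
        return f"https://doi.org/{bare}"
    # Fall back to the venue page, then to whatever the database reported.
    return venue_url or database_url
```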
|
| 78 |
+
|
| 79 |
+
---
|
| 80 |
+
|
| 81 |
+
## 4. Data Model Changes
|
| 82 |
+
|
| 83 |
+
### ComparisonResult Enhancements
|
| 84 |
+
Add new fields to `ComparisonResult` dataclass:
|
| 85 |
+
|
| 86 |
+
```python
|
| 87 |
+
@dataclass
|
| 88 |
+
class ComparisonResult:
|
| 89 |
+
    # ... existing fields ...
|
| 90 |
+
|
| 91 |
+
    # New fields for arXiv conversion
|
| 92 |
+
    is_arxiv_preprint: bool = False
|
| 93 |
+
    has_official_version: bool = False
|
| 94 |
+
    official_venue: Optional[str] = None  # e.g., "ACL 2025", "NeurIPS 2024"
|
| 95 |
+
    arxiv_url: Optional[str] = None
|
| 96 |
+
    official_url: Optional[str] = None
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
### New Methods
|
| 100 |
+
```python
|
| 101 |
+
class MetadataComparator:
|
| 102 |
+
    def detect_arxiv_entry(self, bib_entry: BibEntry) -> bool:
|
| 103 |
+
        """Check if entry is an arXiv preprint."""
|
| 104 |
+
|
| 105 |
+
    def is_official_publication(self, fetched_metadata, bib_entry: BibEntry) -> bool:
|
| 106 |
+
        """Validate if fetched metadata represents official publication."""
|
| 107 |
+
|
| 108 |
+
    def extract_venue_name(self, fetched_metadata) -> str:
|
| 109 |
+
        """Extract clean venue name (e.g., 'ACL 2025')."""
|
| 110 |
+
```
|
| 111 |
+
|
| 112 |
+
---
|
| 113 |
+
|
| 114 |
+
## 5. User Interface Design
|
| 115 |
+
|
| 116 |
+
### Current UI (Verified Entry)
|
| 117 |
+
```
|
| 118 |
+
✓ [Entry Key] [Open paper]
|
| 119 |
+
Tags: ✓ Verified | Source: arxiv
|
| 120 |
+
```
|
| 121 |
+
|
| 122 |
+
### New UI (arXiv with Official Version Found)
|
| 123 |
+
```
|
| 124 |
+
✓ [Entry Key] [arXiv] [Official: ACL 2025]
|
| 125 |
+
Tags: ✓ Verified | ⬆️ Official Version Available | Source: semantic_scholar
|
| 126 |
+
|
| 127 |
+
Reference (from semantic_scholar):
|
| 128 |
+
Title: BERT: Pre-training of Deep Bidirectional Transformers
|
| 129 |
+
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
|
| 130 |
+
Year: 2019
|
| 131 |
+
Venue: NAACL 2019
|
| 132 |
+
DOI: 10.18653/v1/N19-1423
|
| 133 |
+
```
|
| 134 |
+
|
| 135 |
+
### UI Elements
|
| 136 |
+
|
| 137 |
+
**1. Dual Button Layout**
|
| 138 |
+
- `[arXiv]` button - Gray background, links to arxiv.org
|
| 139 |
+
- `[Official: {venue}]` button - Blue background (prominent), links to official venue
|
| 140 |
+
- Both buttons side-by-side in header
|
| 141 |
+
|
| 142 |
+
**2. New Tag**
|
| 143 |
+
- `⬆️ Official Version Available` - Orange/blue color
|
| 144 |
+
- Indicates user should consider using official citation
|
| 145 |
+
|
| 146 |
+
**3. Reference Section**
|
| 147 |
+
- Shows official metadata (venue, year, DOI)
|
| 148 |
+
- This is what user will find when clicking official link
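The dual-button header described above could be generated roughly like this. It is only a sketch: the function name `render_header_links` and the CSS class names are placeholders, not CiteScan's actual `format_entry_card()` code or styles.

```python
from html import escape
from typing import Optional


def render_header_links(arxiv_url: str, official_url: Optional[str],
                        official_venue: Optional[str]) -> str:
    """Build the header buttons: arXiv only, or arXiv plus the official venue."""
    # Class names are placeholders for the gray (arXiv) and blue (official) styles.
    parts = [f'<a class="btn-arxiv" href="{escape(arxiv_url, quote=True)}">arXiv</a>']
    if official_url and official_venue:
        parts.append(
            f'<a class="btn-official" href="{escape(official_url, quote=True)}">'
            f"Official: {escape(official_venue)}</a>"
        )
    return " ".join(parts)
```

Escaping both the URL and the venue label keeps metadata returned by external databases from injecting markup into the report.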
|
| 149 |
+
|
| 150 |
+
### User Workflow
|
| 151 |
+
1. User clicks `[Official: ACL 2025]` button
|
| 152 |
+
2. Browser opens official venue page (e.g., https://aclanthology.org/2025.acl-long.272/)
|
| 153 |
+
3. User clicks "Cite" button on venue website
|
| 154 |
+
4. Popup/modal appears with BibTeX
|
| 155 |
+
5. User copies official BibTeX
|
| 156 |
+
6. User replaces arXiv entry in their .bib file
|
| 157 |
+
|
| 158 |
+
**Rationale:** Many academic websites (ACL Anthology, ACM, IEEE, CVF) use JavaScript modals for citations. The URL doesn't change when the modal opens, making automated fetching difficult without browser automation (which would break the $0 cost goal). Manual copying is pragmatic and reliable.
|
| 159 |
+
|
| 160 |
+
---
|
| 161 |
+
|
| 162 |
+
## 6. Edge Cases & Handling
|
| 163 |
+
|
| 164 |
+
### Case 1: arXiv Paper Not Yet Published
|
| 165 |
+
- **Scenario:** Paper still under review or recently posted
|
| 166 |
+
- **Behavior:** Show only `[arXiv]` button (current behavior)
|
| 167 |
+
- **Tags:** `✓ Verified | Source: arxiv` (no official version tag)
|
| 168 |
+
|
| 169 |
+
### Case 2: Multiple Official Versions
|
| 170 |
+
- **Scenario:** Conference version + journal extension
|
| 171 |
+
- **Behavior:** Choose highest confidence match
|
| 172 |
+
- **Priority:** Journal > Conference > Workshop
|
| 173 |
+
- **Implementation:** Sort by venue prestige, pick first match
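One way to read the selection rule for Case 2 is to rank by venue type first and break ties by match confidence. The sketch below assumes each candidate is a dict with hypothetical `venue_type` and `confidence` keys; the rank table is illustrative.

```python
# Hypothetical ranks: lower is preferred (Journal > Conference > Workshop).
VENUE_TYPE_RANK = {"journal": 0, "conference": 1, "workshop": 2}


def choose_official_version(candidates: list) -> dict:
    """Pick the preferred candidate; ties on venue type go to higher confidence."""
    return min(
        candidates,
        key=lambda c: (
            # Unknown venue types sort after all known ones.
            VENUE_TYPE_RANK.get(c["venue_type"], len(VENUE_TYPE_RANK)),
            -c["confidence"],
        ),
    )
```

The function raises `ValueError` on an empty list, so a caller would check for candidates first (Case 1 covers the no-match path).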
|
| 174 |
+
|
| 175 |
+
### Case 3: Publication Delay in Databases
|
| 176 |
+
- **Scenario:** Paper published but databases haven't indexed yet
|
| 177 |
+
- **Behavior:** Shows as arXiv only (no official version found)
|
| 178 |
+
- **Acceptable:** Databases typically update with a 1-4 week delay; the user can re-check later
|
| 179 |
+
|
| 180 |
+
### Case 4: False Positive Match
|
| 181 |
+
- **Scenario:** Different paper with very similar title
|
| 182 |
+
- **Mitigation:** High thresholds (95% title, 70% author overlap)
|
| 183 |
+
- **User Verification:** User clicks both links to verify it's the same paper
|
| 184 |
+
- **Risk:** Low due to strict matching criteria
|
| 185 |
+
|
| 186 |
+
### Case 5: Mixed Input (arXiv + Official)
|
| 187 |
+
- **Scenario:** User pastes BibTeX with both arXiv preprints and official publications
|
| 188 |
+
- **Behavior:**
|
| 189 |
+
- Official publications → Normal verification (no conversion logic)
|
| 190 |
+
- arXiv preprints → Trigger conversion detection
|
| 191 |
+
- **Detection Rule:** Check venue field first
|
| 192 |
+
- If venue = "arXiv" or "preprint" → arXiv preprint
|
| 193 |
+
- If venue = real conference/journal → Official publication
|
| 194 |
+
|
| 195 |
+
### Case 6: Official Entry Mentioning arXiv
|
| 196 |
+
- **Scenario:** BibTeX has official venue but `note = {arXiv:2304.12345}`
|
| 197 |
+
- **Behavior:** Treat as official publication (no conversion needed)
|
| 198 |
+
- **Detection:** Venue field takes precedence over notes/comments
|
| 199 |
+
|
| 200 |
+
---
|
| 201 |
+
|
| 202 |
+
## 7. Implementation Breakdown
|
| 203 |
+
|
| 204 |
+
### Phase 1: Data Model & Detection Logic
|
| 205 |
+
**Files:** `src/analyzers/metadata_comparator.py`
|
| 206 |
+
|
| 207 |
+
1. Add new fields to `ComparisonResult` dataclass
|
| 208 |
+
2. Implement `detect_arxiv_entry()` method
|
| 209 |
+
3. Implement `is_official_publication()` method
|
| 210 |
+
4. Implement `extract_venue_name()` method
|
| 211 |
+
5. Update all `compare_with_*()` methods to populate new fields
|
| 212 |
+
|
| 213 |
+
### Phase 2: Workflow Integration
|
| 214 |
+
**Files:** `app.py`
|
| 215 |
+
|
| 216 |
+
1. Modify `process_single_entry()`:
|
| 217 |
+
- After finding best match, check if original entry is arXiv
|
| 218 |
+
- If arXiv, validate if match is official publication
|
| 219 |
+
- Store both arXiv URL and official URL
|
| 220 |
+
2. Update comparison result creation logic
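The Phase 2 steps can be sketched as a small function that fills in the new fields. `ConversionInfo` here is a stand-in for the new `ComparisonResult` fields from Section 4, and `annotate_conversion` with its parameters is hypothetical; the real `process_single_entry()` is not shown in this diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversionInfo:
    """Stand-in holding only the new fields proposed for ComparisonResult."""
    is_arxiv_preprint: bool = False
    has_official_version: bool = False
    official_venue: Optional[str] = None
    arxiv_url: Optional[str] = None
    official_url: Optional[str] = None


def annotate_conversion(entry_is_arxiv: bool, arxiv_id: Optional[str],
                        match_is_official: bool, venue: Optional[str],
                        official_url: Optional[str]) -> ConversionInfo:
    """Populate the conversion fields after the best match has been found."""
    info = ConversionInfo(is_arxiv_preprint=entry_is_arxiv)
    if not entry_is_arxiv:
        return info  # official entries go through normal verification only
    if arxiv_id:
        info.arxiv_url = f"https://arxiv.org/abs/{arxiv_id}"
    if match_is_official and official_url:
        info.has_official_version = True
        info.official_venue = venue
        info.official_url = official_url
    return info
```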
|
| 221 |
+
|
| 222 |
+
### Phase 3: UI Rendering
|
| 223 |
+
**Files:** `app.py`
|
| 224 |
+
|
| 225 |
+
1. Modify `format_entry_card()`:
|
| 226 |
+
- Check `has_official_version` flag
|
| 227 |
+
- Render dual buttons if true
|
| 228 |
+
- Add "Official Version Available" tag
|
| 229 |
+
- Update reference section to show official venue
|
| 230 |
+
2. Update button generation logic in header
|
| 231 |
+
|
| 232 |
+
### Phase 4: CSS Styling
|
| 233 |
+
**Files:** `app.py` - `REPORT_CSS`
|
| 234 |
+
|
| 235 |
+
1. Add styles for dual-button layout
|
| 236 |
+
2. Style official button prominently (blue background)
|
| 237 |
+
3. Style arXiv button subtly (gray background)
|
| 238 |
+
4. Add styling for new tag type (orange/blue)
|
| 239 |
+
5. Ensure responsive layout for mobile
|
| 240 |
+
|
| 241 |
+
### Phase 5: Testing
|
| 242 |
+
1. Test with pure arXiv entries
|
| 243 |
+
2. Test with pure official entries
|
| 244 |
+
3. Test with mixed entries
|
| 245 |
+
4. Test edge cases (not published, false positives, etc.)
|
| 246 |
+
5. Test UI on different screen sizes
|
| 247 |
+
|
| 248 |
+
---
|
| 249 |
+
|
| 250 |
+
## 8. Known Venue Patterns
|
| 251 |
+
|
| 252 |
+
### Top-tier AI/ML/CS Conferences
|
| 253 |
+
- NeurIPS, ICML, ICLR, AAAI, IJCAI
|
| 254 |
+
- ACL, EMNLP, NAACL, EACL, COLING
|
| 255 |
+
- CVPR, ICCV, ECCV, SIGGRAPH
|
| 256 |
+
- SIGIR, KDD, WWW, WSDM
|
| 257 |
+
- ICSE, FSE, ASE, ISSTA
|
| 258 |
+
|
| 259 |
+
### Top-tier Journals
|
| 260 |
+
- JMLR, PAMI, IJCV, TACL
|
| 261 |
+
- CACM, TOCS, TODS, TKDE
|
| 262 |
+
|
| 263 |
+
### Venue Name Variations
|
| 264 |
+
- Handle abbreviations: "ICLR" vs "International Conference on Learning Representations"
|
| 265 |
+
- Handle year formats: "ACL 2025" vs "ACL'25" vs "Proceedings of ACL 2025"
|
| 266 |
+
- Normalize for comparison
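A normalization pass for the variations listed above might look like the following. It is a sketch with assumptions: the alias table is a tiny hypothetical sample, two-digit years after an apostrophe are assumed to mean 20xx, and `normalize_venue` is not an existing CiteScan function.

```python
import re

# Hypothetical alias table; a real one would cover every venue in this section.
VENUE_ALIASES = {
    "international conference on learning representations": "ICLR",
    "annual meeting of the association for computational linguistics": "ACL",
}


def normalize_venue(raw: str) -> str:
    """Map spellings such as "ACL'25" or "Proceedings of ACL 2025" to "ACL 2025"."""
    text = raw.strip()
    year = None
    # Prefer a 4-digit year; otherwise accept a 2-digit year after an apostrophe.
    m = re.search(r"\b(19|20)\d{2}\b", text)
    if m:
        year = m.group(0)
    else:
        m = re.search(r"'(\d{2})\b", text)
        if m:
            year = "20" + m.group(1)  # assumption: two-digit years are 20xx
    if m:
        text = text[:m.start()] + text[m.end():]
    # Drop a leading "Proceedings of (the)" and collapse whitespace.
    text = re.sub(r"^proceedings of (the )?", "", text.strip(), flags=re.IGNORECASE)
    text = " ".join(text.split())
    name = VENUE_ALIASES.get(text.lower(), text)
    return f"{name} {year}" if year else name
```

With this sketch, the three year formats from the list above all reduce to the same string, so they can be compared with plain equality.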
|
| 267 |
+
|
| 268 |
+
---
|
| 269 |
+
|
| 270 |
+
## 9. Success Metrics
|
| 271 |
+
|
| 272 |
+
### Functional Metrics
|
| 273 |
+
- **Detection Accuracy:** >95% of arXiv entries correctly identified
|
| 274 |
+
- **Match Accuracy:** >90% of official versions correctly matched
|
| 275 |
+
- **False Positive Rate:** <5% incorrect matches
|
| 276 |
+
|
| 277 |
+
### User Experience Metrics
|
| 278 |
+
- **Conversion Rate:** % of users who click official links
|
| 279 |
+
- **Time Saved:** Estimated 2-5 minutes per converted citation
|
| 280 |
+
- **User Satisfaction:** Qualitative feedback on feature usefulness
|
| 281 |
+
|
| 282 |
+
### Technical Metrics
|
| 283 |
+
- **API Cost:** $0 (using existing free APIs)
|
| 284 |
+
- **Performance:** <500ms additional processing per entry
|
| 285 |
+
- **Reliability:** No new external dependencies
|
| 286 |
+
|
| 287 |
+
---
|
| 288 |
+
|
| 289 |
+
## 10. Future Enhancements (Out of Scope)
|
| 290 |
+
|
| 291 |
+
### Potential Extensions
|
| 292 |
+
1. **One-click BibTeX replacement** - Fetch official BibTeX directly (requires browser automation or venue-specific parsers)
|
| 293 |
+
2. **Batch export** - Export all official versions as a new .bib file
|
| 294 |
+
3. **bioRxiv support** - Extend to biology/medical preprints
|
| 295 |
+
4. **Preprint quality scoring** - Flag papers that should be updated to official versions
|
| 296 |
+
5. **Citation style conversion** - Convert between BibTeX, BibLaTeX, EndNote, etc.
|
| 297 |
+
|
| 298 |
+
### Why Not Now
|
| 299 |
+
- Maintain focus on core arXiv→Official conversion
|
| 300 |
+
- Keep $0 maintenance cost constraint
|
| 301 |
+
- Avoid scope creep during initial implementation
|
| 302 |
+
|
| 303 |
+
---
|
| 304 |
+
|
| 305 |
+
## 11. Risks & Mitigations
|
| 306 |
+
|
| 307 |
+
### Risk 1: API Rate Limits
|
| 308 |
+
- **Impact:** Semantic Scholar, CrossRef have rate limits
|
| 309 |
+
- **Mitigation:** Already handled in existing fetchers with delays/retries
|
| 310 |
+
- **Status:** Low risk (existing system already manages this)
|
| 311 |
+
|
| 312 |
+
### Risk 2: Database Coverage Gaps
|
| 313 |
+
- **Impact:** Some papers not indexed in any database
|
| 314 |
+
- **Mitigation:** Use multiple databases (Semantic Scholar, CrossRef, DBLP, OpenAlex)
|
| 315 |
+
- **Status:** Acceptable (user can manually verify)
|
| 316 |
+
|
| 317 |
+
### Risk 3: Venue Website Changes
|
| 318 |
+
- **Impact:** Official venue websites change their "Cite" button UI
|
| 319 |
+
- **Mitigation:** We only provide links, not scraping - user handles the rest
|
| 320 |
+
- **Status:** Low risk (our system is decoupled from venue UI)
|
| 321 |
+
|
| 322 |
+
### Risk 4: False Matches
|
| 323 |
+
- **Impact:** Suggest wrong official version
|
| 324 |
+
- **Mitigation:** High similarity thresholds + user verification via dual links
|
| 325 |
+
- **Status:** Low risk (strict matching criteria)
|
| 326 |
+
|
| 327 |
+
---
|
| 328 |
+
|
| 329 |
+
## 12. Summary
|
| 330 |
+
|
| 331 |
+
### What Changes
|
| 332 |
+
- **From:** BibTeX verification tool
|
| 333 |
+
- **To:** Smart bibliography management system with arXiv→Official conversion
|
| 334 |
+
|
| 335 |
+
### Key Innovation
|
| 336 |
+
- Automatic detection of arXiv preprints with official publications
|
| 337 |
+
- Dual-link UI for easy access to both versions
|
| 338 |
+
- Zero maintenance cost using existing free APIs
|
| 339 |
+
|
| 340 |
+
### User Value
|
| 341 |
+
- Save 2-5 minutes per citation conversion
|
| 342 |
+
- Reduce errors in bibliography management
|
| 343 |
+
- Improve citation quality for paper submissions
|
| 344 |
+
|
| 345 |
+
### Technical Approach
|
| 346 |
+
- Extend existing verification workflow
|
| 347 |
+
- Add detection logic for arXiv entries
|
| 348 |
+
- Enhance UI with dual buttons and tags
|
| 349 |
+
- Maintain $0 cost constraint
|
| 350 |
+
|
| 351 |
+
---
|
| 352 |
+
|
| 353 |
+
**Next Steps:**
|
| 354 |
+
1. Review and approve this design
|
| 355 |
+
2. Set up git worktree for isolated development
|
| 356 |
+
3. Create detailed implementation plan
|
| 357 |
+
4. Begin Phase 1 implementation
|
src/analyzers/metadata_comparator.py
CHANGED
|
@@ -56,7 +56,7 @@ class MetadataComparator:
|
|
| 56 |
|
| 57 |
# Thresholds for matching
|
| 58 |
TITLE_THRESHOLD = 0.99
|
| 59 |
-
AUTHOR_THRESHOLD = 0.
|
| 60 |
|
| 61 |
def __init__(self):
|
| 62 |
self.normalizer = TextNormalizer
|
|
|
|
| 56 |
|
| 57 |
# Thresholds for matching
|
| 58 |
TITLE_THRESHOLD = 0.99
|
| 59 |
+
AUTHOR_THRESHOLD = 0.5
|
| 60 |
|
| 61 |
def __init__(self):
|
| 62 |
self.normalizer = TextNormalizer
|