Gurusha committed
Commit 94eec12 · 1 Parent(s): 8722515

Initial deploy - HaRC reference checker

Files changed (3)
  1. README.md +29 -5
  2. app.py +378 -0
  3. requirements.txt +3 -0
README.md CHANGED
@@ -1,12 +1,36 @@
 ---
-title: Harc
+title: HaRC - Hallucinated Reference Checker
 emoji: 📚
-colorFrom: gray
-colorTo: green
+colorFrom: purple
+colorTo: indigo
 sdk: gradio
-sdk_version: 6.4.0
+sdk_version: 4.44.0
 app_file: app.py
 pinned: false
+license: mit
+short_description: Verify academic references against databases
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# HaRC - Hallucinated Reference Checker
+
+Verify your paper's references against academic databases before submission.
+
+## Features
+
+- **Upload PDF** - Extract and check references automatically
+- **Paste BibTeX** - Check your `.bib` file directly
+- **Multiple databases** - Semantic Scholar, DBLP, Google Scholar, Open Library
+- **Fuzzy matching** - Handles author name variations and typos
+
+## How it works
+
+1. Upload your PDF or paste your BibTeX
+2. References are extracted and parsed
+3. Each reference is checked against academic databases
+4. Results show verified references and any issues found
+
+## Links
+
+- [GitHub Repository](https://github.com/gurusha01/HaRC)
+- [PyPI Package](https://pypi.org/project/harcx/)
+- [CLI Documentation](https://github.com/gurusha01/HaRC#readme)
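The "fuzzy matching" feature listed in the README can be sketched with Python's standard `difflib`. This is a hypothetical illustration of the idea (normalize, then compare by similarity ratio); the actual matching inside the `harcx` package may differ, and `titles_match` and its `0.85` threshold are invented here:

```python
from difflib import SequenceMatcher


def titles_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Return True when two titles are near-duplicates after normalization.

    Hypothetical helper: lowercase and collapse whitespace so case and
    spacing variants compare equal, then use difflib's similarity ratio.
    """
    a = " ".join(a.lower().split())
    b = " ".join(b.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold


print(titles_match("Attention Is All You Need", "attention is  all you need"))  # True
print(titles_match("Attention Is All You Need", "A Completely Different Paper"))  # False
```

A tolerance like this is what lets a checker accept minor typos and capitalization differences while still flagging titles that match nothing in any database.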
app.py ADDED
@@ -0,0 +1,378 @@
+"""HaRC - Hallucinated Reference Checker (Hugging Face Spaces version)."""
+
+import re
+import tempfile
+from pathlib import Path
+
+import gradio as gr
+import pymupdf  # PyMuPDF
+
+from reference_checker import check_citations
+
+
+def extract_references_section(text: str) -> str:
+    """Extract the references/bibliography section from paper text."""
+    # Common section headers for references
+    patterns = [
+        r'\n\s*References\s*\n',
+        r'\n\s*REFERENCES\s*\n',
+        r'\n\s*Bibliography\s*\n',
+        r'\n\s*BIBLIOGRAPHY\s*\n',
+        r'\n\s*Works Cited\s*\n',
+        r'\n\s*Literature Cited\s*\n',
+    ]
+
+    for pattern in patterns:
+        match = re.search(pattern, text, re.IGNORECASE)
+        if match:
+            return text[match.end():]
+
+    # If no header found, return last 30% of document (often contains refs)
+    return text[int(len(text) * 0.7):]
+
+
+def parse_references_from_text(text: str) -> list[dict]:
+    """Parse individual references from extracted text.
+
+    Uses heuristics to identify reference boundaries and extract metadata.
+    """
+    references = []
+
+    # Clean up text
+    text = re.sub(r'\s+', ' ', text)
+
+    # Try to split by common reference patterns
+    # Pattern 1: [1], [2], etc.
+    numbered_refs = re.split(r'\[\d+\]\s*', text)
+    if len(numbered_refs) > 3:
+        refs_list = [r.strip() for r in numbered_refs if r.strip()]
+    else:
+        # Pattern 2: 1. 2. 3. etc. at start of line
+        refs_list = re.split(r'(?:^|\n)\d+\.\s+', text)
+        refs_list = [r.strip() for r in refs_list if r.strip()]
+
+    if len(refs_list) < 3:
+        # Pattern 3: Split by author name patterns (Name, Initial.)
+        refs_list = re.split(r'(?<=[.?!])\s+(?=[A-Z][a-z]+,?\s+[A-Z]\.)', text)
+        refs_list = [r.strip() for r in refs_list if r.strip() and len(r) > 30]
+
+    for ref_text in refs_list[:100]:  # Limit to 100 refs
+        ref = parse_single_reference(ref_text)
+        if ref and ref.get('title'):
+            references.append(ref)
+
+    return references
+
+
+def parse_single_reference(text: str) -> dict | None:
+    """Parse a single reference string into structured data."""
+    if len(text) < 20:
+        return None
+
+    ref = {}
+
+    # Extract year (4 digits, typically 1900-2099)
+    year_match = re.search(r'\b(19|20)\d{2}\b', text)
+    if year_match:
+        ref['year'] = year_match.group()
+
+    # Extract DOI if present
+    doi_match = re.search(r'10\.\d{4,}/[^\s]+', text)
+    if doi_match:
+        ref['doi'] = doi_match.group().rstrip('.')
+
+    # Extract arXiv ID if present
+    arxiv_match = re.search(r'arXiv:(\d{4}\.\d{4,5})', text, re.IGNORECASE)
+    if arxiv_match:
+        ref['arxiv'] = arxiv_match.group(1)
+
+    # Try to extract title (usually in quotes or after authors, before journal)
+    # Pattern: Look for text in quotes
+    title_match = re.search(r'["\u201c]([^"\u201d]+)["\u201d]', text)
+    if title_match:
+        ref['title'] = title_match.group(1).strip()
+    else:
+        # Heuristic: title is often after year and authors, before journal/venue
+        # Take a reasonable chunk after the year
+        if year_match:
+            after_year = text[year_match.end():].strip()
+            # Remove leading punctuation
+            after_year = re.sub(r'^[.,)\]]\s*', '', after_year)
+            # Take first sentence-like chunk
+            title_candidate = re.split(r'[.!?]', after_year)[0].strip()
+            if 10 < len(title_candidate) < 200:
+                ref['title'] = title_candidate
+
+    # If still no title, try beginning of text (before year)
+    if not ref.get('title') and year_match:
+        before_year = text[:year_match.start()].strip()
+        # Look for the last period-separated segment before the year as a potential title
+        parts = before_year.rsplit('.', 1)
+        if len(parts) > 1 and len(parts[-1].strip()) > 10:
+            ref['title'] = parts[-1].strip()
+
+    # Extract authors (usually at the beginning)
+    if year_match:
+        author_text = text[:year_match.start()].strip()
+        # Clean up and extract author names
+        author_text = re.sub(r'[,.]$', '', author_text)
+        if author_text and len(author_text) < 500:
+            # Split by 'and' or comma
+            author_parts = re.split(r'\s+and\s+|,\s*', author_text)
+            authors = []
+            for part in author_parts:
+                part = part.strip()
+                # Filter out non-name parts
+                if part and len(part) > 2 and not part.isdigit():
+                    # Check if it looks like a name (has capital letter)
+                    if re.search(r'[A-Z]', part):
+                        authors.append(part)
+            if authors:
+                ref['authors'] = authors[:10]  # Limit to 10 authors
+
+    return ref if ref.get('title') else None
+
+
+def references_to_bibtex(references: list[dict]) -> str:
+    """Convert references to BibTeX format."""
+    entries = []
+
+    for i, ref in enumerate(references):
+        key = f"ref{i+1}"
+        entry_type = "article"
+
+        fields = []
+        if ref.get('title'):
+            # Escape special characters
+            title = ref['title'].replace('{', '\\{').replace('}', '\\}')
+            fields.append(f' title = {{{title}}}')
+        if ref.get('authors'):
+            authors_str = ' and '.join(ref['authors'])
+            fields.append(f' author = {{{authors_str}}}')
+        if ref.get('year'):
+            fields.append(f' year = {{{ref["year"]}}}')
+        if ref.get('doi'):
+            fields.append(f' doi = {{{ref["doi"]}}}')
+        if ref.get('arxiv'):
+            fields.append(f' eprint = {{{ref["arxiv"]}}}')
+            fields.append(' archiveprefix = {arXiv}')
+
+        if fields:
+            entry = f"@{entry_type}{{{key},\n"
+            entry += ",\n".join(fields)
+            entry += "\n}"
+            entries.append(entry)
+
+    return "\n\n".join(entries)
+
+
+def process_pdf(pdf_file) -> tuple[str, str, str]:
+    """Process uploaded PDF and check references.
+
+    Returns: (summary, issues_text, verified_text)
+    """
+    if pdf_file is None:
+        return "Please upload a PDF file.", "", ""
+
+    try:
+        # Extract text from PDF
+        doc = pymupdf.open(pdf_file.name)
+        full_text = ""
+        for page in doc:
+            full_text += page.get_text()
+        doc.close()
+
+        if not full_text.strip():
+            return "Could not extract text from PDF. The file might be scanned/image-based.", "", ""
+
+        # Extract references section
+        refs_text = extract_references_section(full_text)
+
+        # Parse references
+        references = parse_references_from_text(refs_text)
+
+        if not references:
+            return "No references could be extracted from the PDF.", "", ""
+
+        # Convert to BibTeX
+        bibtex = references_to_bibtex(references)
+
+        # Save to temp file and check
+        with tempfile.NamedTemporaryFile(mode='w', suffix='.bib', delete=False) as f:
+            f.write(bibtex)
+            bib_path = f.name
+
+        try:
+            issues = check_citations(bib_path, verbose=False)
+            issue_keys = {r.entry.key for r in issues}
+        finally:
+            Path(bib_path).unlink(missing_ok=True)
+
+        # Build results
+        verified = []
+        problems = []
+
+        for i, ref in enumerate(references):
+            key = f"ref{i+1}"
+            title = ref.get('title', 'Unknown')
+            authors = ', '.join(ref.get('authors', [])[:3])
+            if len(ref.get('authors', [])) > 3:
+                authors += ' et al.'
+            year = ref.get('year', '')
+
+            if key in issue_keys:
+                issue = next(r for r in issues if r.entry.key == key)
+                problems.append(f"**{title}**\n {authors} ({year})\n *Issue: {issue.message}*")
+            else:
+                verified.append(f"**{title}**\n {authors} ({year})")
+
+        # Summary
+        total = len(references)
+        verified_count = len(verified)
+        issues_count = len(problems)
+
+        summary = "## Results\n\n"
+        summary += f"- **Total references found:** {total}\n"
+        summary += f"- **Verified:** {verified_count}\n"
+        summary += f"- **Issues found:** {issues_count}\n"
+
+        if issues_count == 0:
+            summary += "\nAll references verified successfully!"
+        elif issues_count > total * 0.5:
+            summary += "\nMany issues found - some may be due to parsing errors."
+
+        issues_text = "\n\n".join(problems) if problems else "No issues found!"
+        verified_text = "\n\n".join(verified) if verified else "No verified references."
+
+        return summary, issues_text, verified_text
+
+    except Exception as e:
+        return f"Error processing PDF: {str(e)}", "", ""
+
+
+def process_bibtex(bibtex_text: str) -> tuple[str, str, str]:
+    """Process pasted BibTeX and check references."""
+    if not bibtex_text.strip():
+        return "Please paste your BibTeX content.", "", ""
+
+    try:
+        # Save to temp file
+        with tempfile.NamedTemporaryFile(mode='w', suffix='.bib', delete=False) as f:
+            f.write(bibtex_text)
+            bib_path = f.name
+
+        try:
+            from reference_checker.parser import parse_bib_file
+            entries = parse_bib_file(bib_path)
+            issues = check_citations(bib_path, verbose=False)
+            issue_keys = {r.entry.key for r in issues}
+        finally:
+            Path(bib_path).unlink(missing_ok=True)
+
+        # Build results
+        verified = []
+        problems = []
+
+        for entry in entries:
+            authors = ', '.join(entry.authors[:3])
+            if len(entry.authors) > 3:
+                authors += ' et al.'
+
+            if entry.key in issue_keys:
+                issue = next(r for r in issues if r.entry.key == entry.key)
+                problems.append(f"**[{entry.key}] {entry.title}**\n {authors} ({entry.year})\n *Issue: {issue.message}*")
+            else:
+                verified.append(f"**[{entry.key}] {entry.title}**\n {authors} ({entry.year})")
+
+        # Summary
+        total = len(entries)
+        verified_count = len(verified)
+        issues_count = len(problems)
+
+        summary = "## Results\n\n"
+        summary += f"- **Total entries:** {total}\n"
+        summary += f"- **Verified:** {verified_count}\n"
+        summary += f"- **Issues found:** {issues_count}\n"
+
+        if issues_count == 0:
+            summary += "\nAll references verified successfully!"
+
+        issues_text = "\n\n".join(problems) if problems else "No issues found!"
+        verified_text = "\n\n".join(verified) if verified else "No verified references."
+
+        return summary, issues_text, verified_text
+
+    except Exception as e:
+        return f"Error processing BibTeX: {str(e)}", "", ""
+
+
+# Build Gradio interface
+with gr.Blocks(
+    title="HaRC - Hallucinated Reference Checker",
+    theme=gr.themes.Soft(primary_hue="purple"),
+) as demo:
+    gr.Markdown("""
+    # HaRC - Hallucinated Reference Checker
+
+    Verify your paper's references against academic databases.
+    Catches fake, misspelled, or incorrect citations before submission.
+
+    **Checks against:** Semantic Scholar, DBLP, Google Scholar, Open Library
+    """)
+
+    with gr.Tabs():
+        with gr.TabItem("Upload PDF"):
+            gr.Markdown("Upload your paper and we'll extract and verify the references.")
+            pdf_input = gr.File(label="Upload PDF", file_types=[".pdf"])
+            pdf_button = gr.Button("Check References", variant="primary")
+
+            with gr.Row():
+                pdf_summary = gr.Markdown(label="Summary")
+
+            with gr.Row():
+                with gr.Column():
+                    pdf_issues = gr.Markdown(label="Issues Found")
+                with gr.Column():
+                    pdf_verified = gr.Markdown(label="Verified References")
+
+            pdf_button.click(
+                fn=process_pdf,
+                inputs=[pdf_input],
+                outputs=[pdf_summary, pdf_issues, pdf_verified],
+            )
+
+        with gr.TabItem("Paste BibTeX"):
+            gr.Markdown("Paste your `.bib` file contents directly.")
+            bib_input = gr.Textbox(
+                label="BibTeX Content",
+                placeholder="@article{example2023,\n title = {Example Paper},\n author = {John Doe},\n year = {2023}\n}",
+                lines=10,
+            )
+            bib_button = gr.Button("Check References", variant="primary")
+
+            with gr.Row():
+                bib_summary = gr.Markdown(label="Summary")
+
+            with gr.Row():
+                with gr.Column():
+                    bib_issues = gr.Markdown(label="Issues Found")
+                with gr.Column():
+                    bib_verified = gr.Markdown(label="Verified References")
+
+            bib_button.click(
+                fn=process_bibtex,
+                inputs=[bib_input],
+                outputs=[bib_summary, bib_issues, bib_verified],
+            )
+
+    gr.Markdown("""
+    ---
+    **Note:** PDF reference extraction uses heuristics and may not be 100% accurate.
+    For best results, use the BibTeX tab with your actual `.bib` file.
+
+    [GitHub](https://github.com/gurusha01/HaRC) | [PyPI](https://pypi.org/project/harcx/)
+    """)
+
+
+if __name__ == "__main__":
+    demo.launch()
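The extraction heuristics in `parse_single_reference` can be exercised in isolation. A minimal sketch using the same regexes as the diff above, run against a made-up reference string (the string itself and its values are hypothetical test data):

```python
import re

# A hypothetical reference string exercising the year, arXiv, and
# quoted-title regexes from parse_single_reference.
ref = 'J. Doe and A. Smith. 2023. "A Study of Things". arXiv:2301.01234.'

year = re.search(r'\b(19|20)\d{2}\b', ref)
arxiv = re.search(r'arXiv:(\d{4}\.\d{4,5})', ref, re.IGNORECASE)
title = re.search(r'["\u201c]([^"\u201d]+)["\u201d]', ref)

print(year.group())    # 2023
print(arxiv.group(1))  # 2301.01234
print(title.group(1))  # A Study of Things
```

Note that the word boundaries (`\b`) keep the year regex from matching the digits inside the arXiv identifier, so the first true 4-digit year wins.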
requirements.txt ADDED
@@ -0,0 +1,3 @@
+gradio>=4.0.0
+pymupdf>=1.23.0
+harcx>=0.2.0
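For reference, the intermediate BibTeX that `references_to_bibtex` builds from a parsed reference looks like this. A standalone sketch with hypothetical data (the `ref1` key and the field values are invented; the real function also handles `doi` and `eprint` fields):

```python
# One parsed reference, as produced by the parsing heuristics in app.py.
ref = {"title": "A Study of Things", "authors": ["J. Doe", "A. Smith"], "year": "2023"}

# Assemble the fields and wrap them in an @article entry, mirroring
# the string-building in references_to_bibtex.
fields = [
    f' title = {{{ref["title"]}}}',
    f' author = {{{" and ".join(ref["authors"])}}}',
    f' year = {{{ref["year"]}}}',
]
entry = "@article{ref1,\n" + ",\n".join(fields) + "\n}"
print(entry)
```

Expected shape of the output:

```
@article{ref1,
 title = {A Study of Things},
 author = {J. Doe and A. Smith},
 year = {2023}
}
```

This generated `.bib` text is what gets written to a temp file and handed to `check_citations` for verification.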