---
title: BibGuard
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
---

# BibGuard: Bibliography & LaTeX Quality Auditor

**BibGuard** is a quality assurance tool for academic papers. It validates bibliography entries against scholarly databases and checks LaTeX submission quality so you can catch errors before you submit.

AI coding assistants and writing tools often hallucinate plausible-sounding but non-existent references. **BibGuard** checks every entry against multiple databases (arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, Google Scholar) and can optionally use an LLM to check that cited papers actually support your claims.

## 🛡 Why BibGuard?

-   **🚫 Stop Hallucinations**: Instantly flag citations that don't exist or have mismatched metadata
-   **📋 LaTeX Quality Checks**: Detect formatting issues, weak writing patterns, and submission compliance problems
-   **🔒 Safe & Non-Destructive**: Your original files are **never modified** - only detailed reports are generated
-   **🧠 Contextual Relevance**: Ensure cited papers actually discuss what you claim (with LLM)
-   **⚡ Efficiency Boost**: Drastically reduce time needed to manually verify hundreds of citations

## 🚀 Features

### Bibliography Validation
-   **🔍 Multi-Source Verification**: Validates metadata against arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, and Google Scholar
-   **🤖 AI Relevance Check**: Uses LLMs to verify citations match their context (optional)
-   **📊 Preprint Detection**: Warns if >50% of references are preprints (arXiv, bioRxiv, etc.)
-   **👀 Usage Analysis**: Highlights missing citations and unused bib entries
-   **👯 Duplicate Detector**: Identifies duplicate entries with fuzzy matching
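
To make "fuzzy matching" concrete, here is a minimal sketch of duplicate detection on entry titles. It is not BibGuard's actual implementation: the helper names, the use of `difflib.SequenceMatcher`, and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
import re


def normalize_title(title):
    """Lowercase, drop BibTeX braces and punctuation, collapse whitespace."""
    title = re.sub(r"[{}]", "", title.lower())
    title = re.sub(r"[^a-z0-9\s]", " ", title)
    return " ".join(title.split())


def find_duplicates(entries, threshold=0.9):
    """Return (key_a, key_b, similarity) for title pairs at or above threshold.

    `entries` maps citation key -> title. The threshold is a hypothetical
    default, not a value taken from BibGuard.
    """
    normalized = {key: normalize_title(title) for key, title in entries.items()}
    pairs = []
    for a, b in combinations(sorted(normalized), 2):
        ratio = SequenceMatcher(None, normalized[a], normalized[b]).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs


entries = {
    "vaswani2017": "Attention Is All You Need",
    "vaswani17attn": "{Attention} is all you need.",
    "devlin2019": "BERT: Pre-training of Deep Bidirectional Transformers",
}
print(find_duplicates(entries))
```

Normalizing before comparing means that case, braces, and trailing punctuation do not hide an otherwise identical title. The pairwise loop is quadratic, which is usually acceptable for a bibliography of a few hundred entries.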

### LaTeX Quality Checks
-   **📏 Format Validation**: Caption placement, cross-references, citation spacing, equation punctuation
-   **✍️ Writing Quality**: Weak sentence starters, hedging language, redundant phrases
-   **🔤 Consistency**: Spelling variants (US/UK English), hyphenation, terminology
-   **🤖 AI Artifact Detection**: Conversational AI responses, placeholder text, Markdown remnants
-   **🔠 Acronym Validation**: Ensures acronyms are defined before use (smart matching)
-   **🎭 Anonymization**: Checks for identity leaks in double-blind submissions
-   **📅 Citation Age**: Flags references older than 30 years
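
As an illustration of the acronym check, the sketch below flags acronyms of three or more capital letters whose first use comes before any parenthesized definition such as `(RNN)`. The regular expressions and the function name are assumptions. BibGuard's "smart matching" is likely more involved than this.

```python
import re

# Hypothetical patterns: an acronym is 3+ capital letters, and a
# definition is that acronym in parentheses, e.g. "(RNN)".
ACRONYM = re.compile(r"\b[A-Z]{3,}\b")
DEFINITION = re.compile(r"\(([A-Z]{3,})\)")


def undefined_acronyms(text):
    """Return acronyms whose first use precedes (or lacks) a definition."""
    defined_at = {}
    for match in DEFINITION.finditer(text):
        # Remember only the first definition of each acronym.
        defined_at.setdefault(match.group(1), match.start(1))

    issues = []
    seen = set()
    for match in ACRONYM.finditer(text):
        acronym = match.group(0)
        if acronym in seen:
            continue  # only the first occurrence matters
        seen.add(acronym)
        if acronym not in defined_at or match.start() < defined_at[acronym]:
            issues.append(acronym)
    return issues


sample = (
    "We train a CNN on images. A recurrent neural network (RNN) is slower. "
    "The RNN baseline needs more GPU memory."
)
print(undefined_acronyms(sample))
```

A real checker would also need an allowlist for common acronyms and would have to skip LaTeX commands and math, which this sketch does not attempt.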

## 📦 Installation

```bash
git clone git@github.com:thinkwee/BibGuard.git
cd BibGuard
pip install -r requirements.txt
```

## ⚑ Quick Start

### 1. Initialize Configuration

```bash
python main.py --init
```

This creates `config.yaml`. Edit it to set your file paths. You have two modes:

#### Option A: Single File Mode
Best for individual papers.
```yaml
files:
  bib: "paper.bib"
  tex: "paper.tex"
  output_dir: "bibguard_output"
```

#### Option B: Directory Scan Mode
Best for large projects or a collection of papers. BibGuard will recursively search for all `.tex` and `.bib` files.
```yaml
files:
  input_dir: "./my_project_dir"
  output_dir: "bibguard_output"
```

### 2. Run Full Check

```bash
python main.py
```

**Output** (in `bibguard_output/`):
- `bibliography_report.md` - Bibliography validation results
- `latex_quality_report.md` - Writing and formatting issues
- `line_by_line_report.md` - All issues sorted by line number
- `*_only_used.bib` - Clean bibliography (used entries only)
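
The idea behind the usage analysis and the `*_only_used.bib` output can be sketched as a set comparison between the keys cited in the `.tex` file and the keys defined in the `.bib` file. The regular expressions and function names below are assumptions for illustration and do not cover every citation command or BibTeX edge case.

```python
import re

# Matches \cite, \citep, \citet*, ... with optional [..] arguments (assumed subset).
CITE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches "@type{key," at the start of a BibTeX entry.
ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
NON_ENTRIES = {"comment", "string", "preamble"}


def cited_keys(tex):
    """Collect every key that appears inside a citation command."""
    keys = set()
    for match in CITE.finditer(tex):
        keys.update(k.strip() for k in match.group(1).split(",") if k.strip())
    return keys


def bib_keys(bib):
    """Collect the keys of all real entries in a .bib source."""
    return {key for kind, key in ENTRY.findall(bib) if kind.lower() not in NON_ENTRIES}


def usage_report(tex, bib):
    """Missing = cited but not defined; unused = defined but never cited."""
    cited, defined = cited_keys(tex), bib_keys(bib)
    return {"missing": sorted(cited - defined), "unused": sorted(defined - cited)}


tex = r"See \cite{smith2020, lee2021} and \citep[p.~3]{ghost2019}."
bib = "@article{smith2020,\n title={A}\n}\n@inproceedings{lee2021,\n title={B}\n}\n@book{old1999,\n title={C}\n}\n"
print(usage_report(tex, bib))
```

A cleaned bibliography then keeps only the entries whose keys are not in the `unused` list. The original `.bib` file stays untouched.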

## 🛠 Configuration

Edit `config.yaml` to customize checks:

```yaml
bibliography:
  check_metadata: true        # Validate against online databases (takes time)
  check_usage: true           # Find unused/missing entries
  check_duplicates: true      # Detect duplicate entries
  check_preprint_ratio: true  # Warn if >50% are preprints
  check_relevance: false      # LLM-based relevance check (requires API key)

submission:
  # Format checks
  caption: true               # Table/figure caption placement
  reference: true             # Cross-reference integrity
  formatting: true            # Citation spacing, blank lines
  equation: true              # Equation punctuation, numbering
  
  # Writing quality
  sentence: true              # Weak starters, hedging language
  consistency: true           # Spelling, hyphenation, terminology
  acronym: true               # Acronym definitions (3+ letters)
  
  # Submission compliance
  ai_artifacts: true          # AI-generated text detection
  anonymization: true         # Double-blind compliance
  citation_quality: true      # Old citations (>30 years)
  number: true                # Percentage formatting
```
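
For intuition about what `check_preprint_ratio` measures, here is a rough sketch of a preprint ratio computed over parsed entries. The marker list, the fields inspected, and the function names are assumptions. Only the 50% threshold comes from the configuration comment above.

```python
# Hypothetical markers and fields; BibGuard may detect preprints differently.
PREPRINT_MARKERS = ("arxiv", "biorxiv", "medrxiv")
FIELDS = ("journal", "archiveprefix", "eprinttype", "howpublished", "url")


def is_preprint(entry):
    """Treat an entry as a preprint if any inspected field names a preprint server."""
    text = " ".join(str(entry.get(field, "")) for field in FIELDS).lower()
    return any(marker in text for marker in PREPRINT_MARKERS)


def preprint_ratio(entries):
    """Fraction of entries that look like preprints (0.0 for an empty list)."""
    if not entries:
        return 0.0
    return sum(is_preprint(entry) for entry in entries) / len(entries)


entries = [
    {"journal": "arXiv preprint arXiv:1706.03762"},
    {"howpublished": "bioRxiv"},
    {"journal": "Nature"},
]
ratio = preprint_ratio(entries)
if ratio > 0.5:
    print(f"Warning: {ratio:.0%} of references are preprints")
```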

## 🤖 LLM-Based Relevance Check

To verify citations match their context using AI:

```yaml
bibliography:
  check_relevance: true

llm:
  backend: "gemini"  # Options: gemini, openai, anthropic, deepseek, ollama, vllm
  api_key: ""        # Or use environment variable (e.g., GEMINI_API_KEY)
```

**Supported Backends:**
- **Gemini** (Google): `GEMINI_API_KEY`
- **OpenAI**: `OPENAI_API_KEY`
- **Anthropic**: `ANTHROPIC_API_KEY`
- **DeepSeek**: `DEEPSEEK_API_KEY` (recommended for cost/performance)
- **Ollama**: Local models (no API key needed)
- **vLLM**: Custom endpoint

Then run:
```bash
python main.py
```
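
The key lookup described above can be pictured roughly as follows. This is a sketch, not BibGuard's code: the function name is hypothetical, and the precedence (a non-empty `api_key` in the config wins over the environment variable) is an assumption. The environment variable names are the ones listed under Supported Backends.

```python
import os

# Environment variable names as listed under "Supported Backends".
ENV_VARS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def resolve_api_key(backend, configured_key="", env=None):
    """Return the API key for a backend, or None if no key variable is listed.

    Assumed precedence: config value first, then the environment variable.
    """
    env = os.environ if env is None else env
    name = ENV_VARS.get(backend.lower())
    if name is None:
        # ollama and vllm have no key variable listed, so nothing is looked up.
        return configured_key or None
    key = configured_key or env.get(name, "")
    if not key:
        raise RuntimeError(f"No API key for '{backend}': set api_key or {name}")
    return key


# Passing `env` explicitly keeps the example independent of the real environment.
print(resolve_api_key("gemini", env={"GEMINI_API_KEY": "demo-key"}))
print(resolve_api_key("ollama", env={}))
```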

## 📝 Understanding Reports

### Bibliography Report
Shows for each entry:
- ✅ **Verified**: Metadata matches online databases
- ⚠️ **Issues**: Mismatches, missing entries, duplicates
- 📊 **Statistics**: Usage, duplicates, preprint ratio

### LaTeX Quality Report
Organized by severity:
- 🔴 **Errors**: Critical issues (e.g., undefined references)
- 🟡 **Warnings**: Important issues (e.g., inconsistent spelling)
- 🔵 **Suggestions**: Style improvements (e.g., weak sentence starters)

### Line-by-Line Report
All LaTeX issues sorted by line number for easy fixing.

## 🧐 Understanding Mismatches

BibGuard is strict, but false positives happen:

1.  **Year Discrepancy (Β±1 Year)**:
    - *Reason*: Delay between preprint (arXiv) and official publication
    - *Action*: Verify which version you intend to cite

2.  **Author List Variations**:
    - *Reason*: Different databases handle large author lists differently
    - *Action*: Check if primary authors match

3.  **Venue Name Differences**:
    - *Reason*: Abbreviations vs. full names (e.g., "NeurIPS" vs. "Neural Information Processing Systems")
    - *Action*: Both are usually correct

4.  **Non-Academic Sources**:
    - *Reason*: Blogs, documentation not indexed by academic databases
    - *Action*: Manually verify URL and title
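
When reviewing flagged entries by hand, the first two cases can be triaged with a small helper like the sketch below. The function names, the labels, and the choice to compare only the first few last names are assumptions for illustration. They do not describe how BibGuard itself scores mismatches.

```python
def classify_year(bib_year, found_year):
    """Label a year difference: exact, off by one (likely preprint lag), or larger."""
    diff = abs(bib_year - found_year)
    if diff == 0:
        return "match"
    if diff == 1:
        return "check version (preprint vs. published)"
    return "mismatch"


def last_name(author):
    """Handle both 'Last, First' and 'First Last' forms."""
    author = author.strip()
    if "," in author:
        return author.split(",")[0].strip().lower()
    return author.split()[-1].lower()


def primary_authors_match(bib_authors, found_authors, n=3):
    """Compare the last names of the first n authors that both lists have."""
    k = min(n, len(bib_authors), len(found_authors))
    if k == 0:
        return False
    bib_names = [last_name(a) for a in bib_authors[:k]]
    found_names = [last_name(a) for a in found_authors[:k]]
    return bib_names == found_names


print(classify_year(2017, 2018))
print(primary_authors_match(
    ["Vaswani, Ashish", "Shazeer, Noam"],
    ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar"],
))
```

Comparing only the leading authors sidesteps truncated lists ("et al.") in some databases, at the cost of missing differences further down the list.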

## 🔧 Advanced Options

```bash
python main.py --help              # Show all options
python main.py --list-templates    # List conference templates
python main.py --config my.yaml    # Use custom config file
```

## 🤝 Contributing

Contributions welcome! Please open an issue or pull request.

## 🙏 Acknowledgments

BibGuard uses multiple data sources:
- arXiv API
- CrossRef API
- Semantic Scholar API
- DBLP API
- OpenAlex API
- Google Scholar (via scholarly)

---

**Made with ❤️ for researchers who care about their submission**