---
title: BibGuard
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
---
# BibGuard: Bibliography & LaTeX Quality Auditor
**BibGuard** is your comprehensive quality assurance tool for academic papers. It validates bibliography entries against real-world databases and checks LaTeX submission quality to catch errors before you submit.
AI coding assistants and writing tools often hallucinate plausible-sounding but non-existent references. **BibGuard** checks every entry against multiple databases (arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, Google Scholar) and can optionally use an LLM to check that cited papers actually support your claims.
## 🛡️ Why BibGuard?
- **🚫 Stop Hallucinations**: Instantly flag citations that don't exist or have mismatched metadata
- **📋 LaTeX Quality Checks**: Detect formatting issues, weak writing patterns, and submission compliance problems
- **🔒 Safe & Non-Destructive**: Your original files are **never modified** - only detailed reports are generated
- **🧠 Contextual Relevance**: Ensure cited papers actually discuss what you claim (with LLM)
- **⚡ Efficiency Boost**: Drastically reduce time needed to manually verify hundreds of citations
## 🚀 Features
### Bibliography Validation
- **🔍 Multi-Source Verification**: Validates metadata against arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, and Google Scholar
- **🤖 AI Relevance Check**: Uses LLMs to verify citations match their context (optional)
- **📊 Preprint Detection**: Warns if >50% of references are preprints (arXiv, bioRxiv, etc.)
- **👀 Usage Analysis**: Highlights missing citations and unused bib entries
- **👯 Duplicate Detector**: Identifies duplicate entries with fuzzy matching
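
The README does not say which fuzzy-matching method the duplicate detector uses. As a rough illustration of the idea only, the sketch below compares normalized titles with the standard library's `difflib.SequenceMatcher`. The function names and the `0.9` threshold are assumptions made for this example, not BibGuard's actual implementation.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize_title(title: str) -> str:
    """Lowercase, drop BibTeX braces and punctuation, collapse whitespace."""
    title = re.sub(r"[{}]", "", title.lower())
    title = re.sub(r"[^a-z0-9\s]", " ", title)
    return " ".join(title.split())


def find_duplicates(entries: dict[str, str], threshold: float = 0.9):
    """Return (key_a, key_b, similarity) for titles that are near-identical.

    `entries` maps citation keys to titles. The threshold is a hypothetical
    default chosen for illustration.
    """
    pairs = []
    for (key_a, title_a), (key_b, title_b) in combinations(entries.items(), 2):
        ratio = SequenceMatcher(
            None, normalize_title(title_a), normalize_title(title_b)
        ).ratio()
        if ratio >= threshold:
            pairs.append((key_a, key_b, round(ratio, 3)))
    return pairs


entries = {
    "vaswani2017": "Attention Is All You Need",
    "vaswani17attn": "{Attention} is all you need.",
    "devlin2019": "BERT: Pre-training of Deep Bidirectional Transformers",
}
duplicates = find_duplicates(entries)
print(duplicates)
```

Comparing every pair is quadratic in the number of entries, which is usually acceptable for a bibliography of a few hundred items.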
### LaTeX Quality Checks
- **📐 Format Validation**: Caption placement, cross-references, citation spacing, equation punctuation
- **✍️ Writing Quality**: Weak sentence starters, hedging language, redundant phrases
- **🔤 Consistency**: Spelling variants (US/UK English), hyphenation, terminology
- **🤖 AI Artifact Detection**: Conversational AI responses, placeholder text, Markdown remnants
- **🔠 Acronym Validation**: Ensures acronyms are defined before use (smart matching)
- **🎭 Anonymization**: Checks for identity leaks in double-blind submissions
- **📅 Citation Age**: Flags references older than 30 years
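
To make the "defined before use" idea concrete, here is a minimal sketch of an acronym check. It treats an acronym as defined once it appears in parentheses, as in "convolutional neural network (CNN)", and flags any earlier bare use. The regex, the function name, and this definition rule are assumptions for illustration. BibGuard's "smart matching" is likely more involved.

```python
import re

# Three or more capital letters, matching the "3+ letters" rule in the config.
ACRONYM_RE = re.compile(r"\b[A-Z]{3,}\b")


def undefined_acronyms(text: str) -> list[tuple[str, int]]:
    """Return (acronym, line_number) for acronyms used before a '(ACRONYM)' definition."""
    defined: set[str] = set()
    flagged: dict[str, int] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ACRONYM_RE.finditer(line):
            acronym = match.group()
            before = line[max(match.start() - 1, 0):match.start()]
            after = line[match.end():match.end() + 1]
            if before == "(" and after == ")":
                defined.add(acronym)          # parenthesized form counts as the definition
            elif acronym not in defined and acronym not in flagged:
                flagged[acronym] = lineno     # record only the first offending use
    return sorted(flagged.items(), key=lambda item: item[1])


sample = (
    "We train a CNN on images.\n"
    "A convolutional neural network (CNN) is used.\n"
    "Large language models (LLM) help. The LLM is big."
)
issues = undefined_acronyms(sample)
print(issues)
```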
## 📦 Installation
```bash
git clone git@github.com:thinkwee/BibGuard.git
cd BibGuard
pip install -r requirements.txt
```
## ⚑ Quick Start
### 1. Initialize Configuration
```bash
python main.py --init
```
This creates `config.yaml`. Edit it to set your file paths. You have two modes:
#### Option A: Single File Mode
Best for individual papers.
```yaml
files:
  bib: "paper.bib"
  tex: "paper.tex"
  output_dir: "bibguard_output"
```
#### Option B: Directory Scan Mode
Best for large projects or a collection of papers. BibGuard will recursively search for all `.tex` and `.bib` files.
```yaml
files:
  input_dir: "./my_project_dir"
  output_dir: "bibguard_output"
```
### 2. Run Full Check
```bash
python main.py
```
**Output** (in `bibguard_output/`):
- `bibliography_report.md` - Bibliography validation results
- `latex_quality_report.md` - Writing and formatting issues
- `line_by_line_report.md` - All issues sorted by line number
- `*_only_used.bib` - Clean bibliography (used entries only)
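
The `*_only_used.bib` output and the usage analysis both depend on knowing which keys the `.tex` sources cite. The sketch below shows one simple way to compute missing and unused keys with regular expressions. The patterns and function names are assumptions for illustration and do not handle every BibTeX or LaTeX edge case (for example, commented-out citations).

```python
import re

# Matches \cite, \citep, \citet, \autocite, ... with optional [..] arguments.
CITE_RE = re.compile(r"\\[a-zA-Z]*cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the key in entries such as "@article{key,".
BIB_KEY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex: str) -> set[str]:
    """Collect every citation key used in the LaTeX source."""
    keys: set[str] = set()
    for group in CITE_RE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set[str]:
    """Collect every entry key defined in the .bib source."""
    return set(BIB_KEY_RE.findall(bib))


def usage_report(tex: str, bib: str) -> dict[str, list[str]]:
    """Keys cited but not defined ("missing") and defined but never cited ("unused")."""
    cited, defined = cited_keys(tex), bib_keys(bib)
    return {"missing": sorted(cited - defined), "unused": sorted(defined - cited)}


tex = r"See \cite{a, b} and \citep[p.~3]{c}."
bib = "@article{a,\n  title={A}\n}\n@inproceedings{d,\n  title={D}\n}"
report = usage_report(tex, bib)
print(report)
```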
## 🛠 Configuration
Edit `config.yaml` to customize checks:
```yaml
bibliography:
  check_metadata: true        # Validate against online databases (takes time)
  check_usage: true           # Find unused/missing entries
  check_duplicates: true      # Detect duplicate entries
  check_preprint_ratio: true  # Warn if >50% are preprints
  check_relevance: false      # LLM-based relevance check (requires API key)
submission:
  # Format checks
  caption: true               # Table/figure caption placement
  reference: true             # Cross-reference integrity
  formatting: true            # Citation spacing, blank lines
  equation: true              # Equation punctuation, numbering
  # Writing quality
  sentence: true              # Weak starters, hedging language
  consistency: true           # Spelling, hyphenation, terminology
  acronym: true               # Acronym definitions (3+ letters)
  # Submission compliance
  ai_artifacts: true          # AI-generated text detection
  anonymization: true         # Double-blind compliance
  citation_quality: true      # Old citations (>30 years)
  number: true                # Percentage formatting
```
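
The two numeric thresholds in this config (more than 50% preprints, citations older than 30 years) are simple arithmetic. The sketch below shows how such checks could be computed over parsed entries. The entry layout (plain dicts with `key`, `year`, and venue fields) and the substring-based preprint test are assumptions for illustration, not BibGuard's internals.

```python
# Hypothetical venue hints; a real check may use more signals (e.g. eprint fields).
PREPRINT_HINTS = ("arxiv", "biorxiv", "medrxiv")


def is_preprint(entry: dict) -> bool:
    """Guess preprint status from venue-like fields."""
    fields = ("journal", "booktitle", "archiveprefix", "howpublished")
    venue = " ".join(str(entry.get(f, "")) for f in fields).lower()
    return any(hint in venue for hint in PREPRINT_HINTS)


def preprint_ratio(entries: list[dict]) -> float:
    """Fraction of entries that look like preprints (0.0 for an empty list)."""
    if not entries:
        return 0.0
    return sum(is_preprint(e) for e in entries) / len(entries)


def old_citations(entries: list[dict], current_year: int, max_age: int = 30) -> list[str]:
    """Keys of entries published more than `max_age` years before `current_year`."""
    return [
        e["key"]
        for e in entries
        if "year" in e and current_year - int(e["year"]) > max_age
    ]


entries = [
    {"key": "a", "journal": "arXiv preprint arXiv:1706.03762", "year": "2017"},
    {"key": "b", "booktitle": "NeurIPS", "year": "2017"},
    {"key": "c", "journal": "bioRxiv", "year": "2020"},
    {"key": "d", "journal": "Bell System Technical Journal", "year": "1948"},
]
ratio = preprint_ratio(entries)
print(ratio, ratio > 0.5, old_citations(entries, current_year=2025))
```

Here exactly half of the entries are preprints, so a strict ">50%" rule would not warn.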
## 🤖 LLM-Based Relevance Check
To verify citations match their context using AI:
```yaml
bibliography:
  check_relevance: true
llm:
  backend: "gemini"  # Options: gemini, openai, anthropic, deepseek, ollama, vllm
  api_key: ""        # Or use environment variable (e.g., GEMINI_API_KEY)
```
**Supported Backends:**
- **Gemini** (Google): `GEMINI_API_KEY`
- **OpenAI**: `OPENAI_API_KEY`
- **Anthropic**: `ANTHROPIC_API_KEY`
- **DeepSeek**: `DEEPSEEK_API_KEY` (recommended for cost/performance)
- **Ollama**: Local models (no API key needed)
- **vLLM**: Custom endpoint
Then run:
```bash
python main.py
```
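
If the check fails with an authentication error, it can help to confirm which key would be picked up. The sketch below mirrors the lookup described above: a key in `config.yaml` if present, otherwise the backend's environment variable. The precedence order, the function name, and the error message are assumptions for illustration. Only the variable names come from the list above.

```python
import os

# Environment variable per backend, as listed under "Supported Backends".
ENV_VARS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def resolve_api_key(backend: str, config_key: str = "", environ=None):
    """Return the API key for `backend`, or None for backends that need no key."""
    environ = os.environ if environ is None else environ
    if config_key:                      # assumed: explicit config value wins
        return config_key
    var = ENV_VARS.get(backend)
    if var is None:                     # e.g. ollama or vllm: local or custom endpoint
        return None
    key = environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} or fill in llm.api_key for backend '{backend}'")
    return key


print(resolve_api_key("gemini", environ={"GEMINI_API_KEY": "example-key"}))
```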
## 📝 Understanding Reports
### Bibliography Report
Shows for each entry:
- ✅ **Verified**: Metadata matches online databases
- ⚠️ **Issues**: Mismatches, missing entries, duplicates
- 📊 **Statistics**: Usage, duplicates, preprint ratio
### LaTeX Quality Report
Organized by severity:
- 🔴 **Errors**: Critical issues (e.g., undefined references)
- 🟡 **Warnings**: Important issues (e.g., inconsistent spelling)
- 🔵 **Suggestions**: Style improvements (e.g., weak sentence starters)
### Line-by-Line Report
All LaTeX issues sorted by line number for easy fixing.
## 🧐 Understanding Mismatches
BibGuard is strict, but false positives happen:
1. **Year Discrepancy (±1 Year)**:
   - *Reason*: Delay between preprint (arXiv) and official publication
   - *Action*: Verify which version you intend to cite
2. **Author List Variations**:
   - *Reason*: Different databases handle large author lists differently
   - *Action*: Check if primary authors match
3. **Venue Name Differences**:
   - *Reason*: Abbreviations vs. full names (e.g., "NeurIPS" vs. "Neural Information Processing Systems")
   - *Action*: Both are usually correct
4. **Non-Academic Sources**:
   - *Reason*: Blogs, documentation not indexed by academic databases
   - *Action*: Manually verify URL and title
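
When triaging flagged entries by hand, the first two cases can be checked mechanically. The sketch below treats a one-year gap as acceptable and compares only the last names of the first few authors. The helper names, the tolerance, and the choice of three primary authors are assumptions for illustration, not the rules BibGuard applies.

```python
def year_matches(bib_year: int, db_year: int, tolerance: int = 1) -> bool:
    """Accept a small gap, e.g. arXiv preprint year vs. publication year."""
    return abs(bib_year - db_year) <= tolerance


def last_name(author: str) -> str:
    """Extract a lowercase last name from 'Last, First' or 'First Last'."""
    author = author.strip()
    if "," in author:
        return author.split(",")[0].strip().lower()
    parts = author.split()
    return parts[-1].lower() if parts else ""


def primary_authors_match(bib_authors: list[str], db_authors: list[str], n: int = 3) -> bool:
    """Compare the last names of the first `n` authors, ignoring the rest of the list."""
    k = min(n, len(bib_authors), len(db_authors))
    if k == 0:
        return False
    return [last_name(a) for a in bib_authors[:k]] == [last_name(a) for a in db_authors[:k]]


print(year_matches(2017, 2018))
print(primary_authors_match(
    ["Vaswani, Ashish", "Shazeer, Noam"],
    ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar"],
))
```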
## 🔧 Advanced Options
```bash
python main.py --help # Show all options
python main.py --list-templates # List conference templates
python main.py --config my.yaml # Use custom config file
```
## 🀝 Contributing
Contributions welcome! Please open an issue or pull request.
## 🙏 Acknowledgments
BibGuard uses multiple data sources:
- arXiv API
- CrossRef API
- Semantic Scholar API
- DBLP API
- OpenAlex API
- Google Scholar (via scholarly)
---
**Made with ❤️ for researchers who care about their submissions**