|
|
--- |
|
|
title: BibGuard |
|
|
emoji: π‘οΈ |
|
|
colorFrom: blue |
|
|
colorTo: purple |
|
|
sdk: gradio |
|
|
sdk_version: 6.3.0 |
|
|
app_file: app.py |
|
|
pinned: false |
|
|
--- |
|
|
|
|
|
# BibGuard: Bibliography & LaTeX Quality Auditor |
|
|
|
|
|
**BibGuard** is your comprehensive quality assurance tool for academic papers. It validates bibliography entries against real-world databases and checks LaTeX submission quality to catch errors before you submit. |
|
|
|
|
|
AI coding assistants and writing tools often hallucinate plausible-sounding but non-existent references. **BibGuard** verifies the existence of every entry against multiple databases (arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, Google Scholar) and uses advanced LLMs to ensure cited papers actually support your claims. |
|
|
|
|
|
## π‘ Why BibGuard? |
|
|
|
|
|
- **π« Stop Hallucinations**: Instantly flag citations that don't exist or have mismatched metadata |
|
|
- **π LaTeX Quality Checks**: Detect formatting issues, weak writing patterns, and submission compliance problems |
|
|
- **π Safe & Non-Destructive**: Your original files are **never modified** - only detailed reports are generated |
|
|
- **π§ Contextual Relevance**: Ensure cited papers actually discuss what you claim (with LLM) |
|
|
- **β‘ Efficiency Boost**: Drastically reduce time needed to manually verify hundreds of citations |
|
|
|
|
|
## π Features |
|
|
|
|
|
### Bibliography Validation |
|
|
- **π Multi-Source Verification**: Validates metadata against arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, and Google Scholar |
|
|
- **π€ AI Relevance Check**: Uses LLMs to verify citations match their context (optional) |
|
|
- **π Preprint Detection**: Warns if >50% of references are preprints (arXiv, bioRxiv, etc.) |
|
|
- **π Usage Analysis**: Highlights missing citations and unused bib entries |
|
|
- **π― Duplicate Detector**: Identifies duplicate entries with fuzzy matching |
|
|
|
|
|
### LaTeX Quality Checks |
|
|
- **π Format Validation**: Caption placement, cross-references, citation spacing, equation punctuation |
|
|
- **βοΈ Writing Quality**: Weak sentence starters, hedging language, redundant phrases |
|
|
- **π€ Consistency**: Spelling variants (US/UK English), hyphenation, terminology |
|
|
- **π€ AI Artifact Detection**: Conversational AI responses, placeholder text, Markdown remnants |
|
|
- **π Acronym Validation**: Ensures acronyms are defined before use (smart matching) |
|
|
- **π Anonymization**: Checks for identity leaks in double-blind submissions |
|
|
- **π
Citation Age**: Flags references older than 30 years |
|
|
|
|
|
## π¦ Installation |
|
|
|
|
|
```bash |
|
|
git clone git@github.com:thinkwee/BibGuard.git |
|
|
cd BibGuard |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
## β‘ Quick Start |
|
|
|
|
|
### 1. Initialize Configuration |
|
|
|
|
|
```bash |
|
|
python main.py --init |
|
|
``` |
|
|
|
|
|
This creates `config.yaml`. Edit it to set your file paths. You have two modes: |
|
|
|
|
|
#### Option A: Single File Mode |
|
|
Best for individual papers. |
|
|
```yaml |
|
|
files: |
|
|
bib: "paper.bib" |
|
|
tex: "paper.tex" |
|
|
output_dir: "bibguard_output" |
|
|
``` |
|
|
|
|
|
#### Option B: Directory Scan Mode |
|
|
Best for large projects or a collection of papers. BibGuard will recursively search for all `.tex` and `.bib` files. |
|
|
```yaml |
|
|
files: |
|
|
input_dir: "./my_project_dir" |
|
|
output_dir: "bibguard_output" |
|
|
``` |
|
|
|
|
|
### 2. Run Full Check |
|
|
|
|
|
```bash |
|
|
python main.py |
|
|
``` |
|
|
|
|
|
**Output** (in `bibguard_output/`): |
|
|
- `bibliography_report.md` - Bibliography validation results |
|
|
- `latex_quality_report.md` - Writing and formatting issues |
|
|
- `line_by_line_report.md` - All issues sorted by line number |
|
|
- `*_only_used.bib` - Clean bibliography (used entries only) |
|
|
|
|
|
## π Configuration |
|
|
|
|
|
Edit `config.yaml` to customize checks: |
|
|
|
|
|
```yaml |
|
|
bibliography: |
|
|
check_metadata: true # Validate against online databases (takes time) |
|
|
check_usage: true # Find unused/missing entries |
|
|
check_duplicates: true # Detect duplicate entries |
|
|
check_preprint_ratio: true # Warn if >50% are preprints |
|
|
check_relevance: false # LLM-based relevance check (requires API key) |
|
|
|
|
|
submission: |
|
|
# Format checks |
|
|
caption: true # Table/figure caption placement |
|
|
reference: true # Cross-reference integrity |
|
|
formatting: true # Citation spacing, blank lines |
|
|
equation: true # Equation punctuation, numbering |
|
|
|
|
|
# Writing quality |
|
|
sentence: true # Weak starters, hedging language |
|
|
consistency: true # Spelling, hyphenation, terminology |
|
|
acronym: true # Acronym definitions (3+ letters) |
|
|
|
|
|
# Submission compliance |
|
|
ai_artifacts: true # AI-generated text detection |
|
|
anonymization: true # Double-blind compliance |
|
|
citation_quality: true # Old citations (>30 years) |
|
|
number: true # Percentage formatting |
|
|
``` |
|
|
|
|
|
## π€ LLM-Based Relevance Check |
|
|
|
|
|
To verify citations match their context using AI: |
|
|
|
|
|
```yaml |
|
|
bibliography: |
|
|
check_relevance: true |
|
|
|
|
|
llm: |
|
|
backend: "gemini" # Options: gemini, openai, anthropic, deepseek, ollama, vllm |
|
|
api_key: "" # Or use environment variable (e.g., GEMINI_API_KEY) |
|
|
``` |
|
|
|
|
|
**Supported Backends:** |
|
|
- **Gemini** (Google): `GEMINI_API_KEY` |
|
|
- **OpenAI**: `OPENAI_API_KEY` |
|
|
- **Anthropic**: `ANTHROPIC_API_KEY` |
|
|
- **DeepSeek**: `DEEPSEEK_API_KEY` (recommended for cost/performance) |
|
|
- **Ollama**: Local models (no API key needed) |
|
|
- **vLLM**: Custom endpoint |
|
|
|
|
|
Then run: |
|
|
```bash |
|
|
python main.py |
|
|
``` |
|
|
|
|
|
## π Understanding Reports |
|
|
|
|
|
### Bibliography Report |
|
|
Shows for each entry: |
|
|
- β
**Verified**: Metadata matches online databases |
|
|
- β οΈ **Issues**: Mismatches, missing entries, duplicates |
|
|
- π **Statistics**: Usage, duplicates, preprint ratio |
|
|
|
|
|
### LaTeX Quality Report |
|
|
Organized by severity: |
|
|
- π΄ **Errors**: Critical issues (e.g., undefined references) |
|
|
- π‘ **Warnings**: Important issues (e.g., inconsistent spelling) |
|
|
- π΅ **Suggestions**: Style improvements (e.g., weak sentence starters) |
|
|
|
|
|
### Line-by-Line Report |
|
|
All LaTeX issues sorted by line number for easy fixing. |
|
|
|
|
|
## π§ Understanding Mismatches |
|
|
|
|
|
BibGuard is strict, but false positives happen: |
|
|
|
|
|
1. **Year Discrepancy (Β±1 Year)**: |
|
|
- *Reason*: Delay between preprint (arXiv) and official publication |
|
|
- *Action*: Verify which version you intend to cite |
|
|
|
|
|
2. **Author List Variations**: |
|
|
- *Reason*: Different databases handle large author lists differently |
|
|
- *Action*: Check if primary authors match |
|
|
|
|
|
3. **Venue Name Differences**: |
|
|
- *Reason*: Abbreviations vs. full names (e.g., "NeurIPS" vs. "Neural Information Processing Systems") |
|
|
- *Action*: Both are usually correct |
|
|
|
|
|
4. **Non-Academic Sources**: |
|
|
- *Reason*: Blogs, documentation not indexed by academic databases |
|
|
- *Action*: Manually verify URL and title |
|
|
|
|
|
## π§ Advanced Options |
|
|
|
|
|
```bash |
|
|
python main.py --help # Show all options |
|
|
python main.py --list-templates # List conference templates |
|
|
python main.py --config my.yaml # Use custom config file |
|
|
``` |
|
|
|
|
|
## π€ Contributing |
|
|
|
|
|
Contributions welcome! Please open an issue or pull request. |
|
|
|
|
|
## π Acknowledgments |
|
|
|
|
|
BibGuard uses multiple data sources: |
|
|
- arXiv API |
|
|
- CrossRef API |
|
|
- Semantic Scholar API |
|
|
- DBLP API |
|
|
- OpenAlex API |
|
|
- Google Scholar (via scholarly) |
|
|
|
|
|
--- |
|
|
|
|
|
**Made with β€οΈ for researchers who care about their submission** |
|
|
|