| title: BibGuard | |
| emoji: 🛡️ | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: gradio | |
| sdk_version: 6.3.0 | |
| app_file: app.py | |
| pinned: false | |
| # BibGuard: Bibliography & LaTeX Quality Auditor | |
| **BibGuard** is a comprehensive quality-assurance tool for academic papers. It validates every bibliography entry against real-world databases, checks LaTeX submission quality, flags retracted DOIs and broken URLs, and can optionally use an LLM to verify that cited papers actually support your claims. | |
| AI coding assistants and writing tools often hallucinate plausible-sounding but non-existent references. **BibGuard** verifies the existence of every entry against multiple databases (arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, Google Scholar) and produces a single, beautiful, self-contained HTML report you can open offline. | |
| ## Why BibGuard? | |
| - **Stop Hallucinations**: Instantly flag citations that don't exist or have mismatched metadata | |
| - **Catch Retractions**: Detect references to papers that have been retracted or are under an "expression of concern" | |
| - **Detect Broken URLs**: HEAD-check `entry.url` to find dead links before reviewers do | |
| - **LaTeX Quality Checks**: Detect formatting issues, weak writing patterns, double-blind compliance problems, and AI-text artifacts | |
| - **Safe & Non-Destructive**: Your original files are **never modified**; only reports are generated | |
| - **Contextual Relevance** *(optional, with LLM)*: Score each citation 1-5 and tag its role (baseline/method/dataset/counterexample/survey/motivation/other) | |
| - **Fast Re-runs**: An SQLite-backed HTTP cache plus auto-retry mean the second run on the same paper completes in seconds | |
| ## Features | |
| ### Bibliography Validation | |
| - **Multi-Source Verification**: Validates metadata against arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, and Google Scholar | |
| - **Retraction Detection**: Flags retracted/withdrawn DOIs via CrossRef's `update-to` relation | |
| - **URL Liveness Check**: Optional HEAD-then-GET check on every `entry.url` | |
| - **Preprint Detection**: Warns if >50% of references are preprints, and suggests published versions when arXiv records them | |
| - **Usage Analysis**: Highlights missing citations and unused bib entries | |
| - **Duplicate Detection**: Identifies duplicate entries with fuzzy matching | |
| - **AI Relevance + Role Tagging** *(optional)*: 1-5 relevance score plus citation role classification | |
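The retraction check above relies on CrossRef's `update-to` relation. As a rough illustration only, the sketch below scans one CrossRef-style work record for `update-to` items whose type signals a retraction. It assumes the record is an editorial notice whose `update-to` list names the DOIs it updates; the `RETRACTION_TYPES` set, the function name, and the sample DOIs are hypothetical, and BibGuard's actual lookup may query and interpret CrossRef differently.

```python
from __future__ import annotations

from typing import Any

# Update types treated as a retraction signal (an assumption for this sketch).
RETRACTION_TYPES = {"retraction", "withdrawal", "expression_of_concern"}


def retraction_updates(work: dict[str, Any]) -> list[dict[str, str]]:
    """Return the `update-to` items of a work record that signal a retraction."""
    hits = []
    for update in work.get("update-to", []):
        kind = str(update.get("type", "")).lower()
        if kind in RETRACTION_TYPES:
            hits.append({"doi": str(update.get("DOI", "")), "type": kind})
    return hits


# Illustrative record shaped like a CrossRef work; the DOIs are made up.
sample_notice = {
    "DOI": "10.1234/notice.1",
    "update-to": [
        {"DOI": "10.1234/original.1", "type": "retraction", "label": "Retraction"},
        {"DOI": "10.1234/original.2", "type": "correction", "label": "Correction"},
    ],
}

print(retraction_updates(sample_notice))
```

Only the first item is reported, because a correction is not treated as a retraction signal here.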
| ### LaTeX Quality Checks | |
| - **Format Validation**: Caption placement, cross-references, citation spacing, equation punctuation | |
| - **Writing Quality**: Weak sentence starters, hedging language, redundant phrases | |
| - **Consistency**: Spelling variants (US/UK English), hyphenation, terminology; can be extended via the project glossary | |
| - **AI Artifact Detection**: Conversational AI responses, placeholder text, Markdown remnants | |
| - **Acronym Validation**: Ensures acronyms are defined before use, with a project-glossary skip list | |
| - **Anonymization**: Checks for identity leaks in double-blind submissions | |
| - **Citation Age**: Flags references older than 30 years | |
| - **Conference Templates**: Mandatory-section and style-package checks for ACL, EMNLP, NAACL, CVPR, ICCV, ECCV, NeurIPS, ICML, ICLR | |
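To make the acronym rule in the list above concrete, here is a minimal sketch of a "defined before use" check with a skip list. It assumes a definition looks like a parenthesised acronym such as `(NLP)` and that acronyms are 2 to 6 capital letters; both patterns and the function name are illustrative, not BibGuard's actual checker.

```python
from __future__ import annotations

import re

# Illustrative patterns: an acronym is 2-6 capitals; a definition is "(ACRONYM)".
ACRONYM = re.compile(r"\b[A-Z]{2,6}\b")
DEFINITION = re.compile(r"\(([A-Z]{2,6})\)")


def undefined_acronyms(text: str, skip: frozenset[str] = frozenset()) -> list[str]:
    """List acronyms used before (or without) a parenthesised definition."""
    defined_at: dict[str, int] = {}
    for match in DEFINITION.finditer(text):
        # Remember only the first definition of each acronym.
        defined_at.setdefault(match.group(1), match.start(1))

    flagged: list[str] = []
    for match in ACRONYM.finditer(text):
        name = match.group(0)
        if name in skip or name in flagged:
            continue
        first_def = defined_at.get(name)
        if first_def is None or match.start() < first_def:
            flagged.append(name)
    return flagged


text = (
    "We use NLP models. Natural Language Processing (NLP) is broad. "
    "A Large Language Model (LLM) helps; the LLM runs on GPU."
)
print(undefined_acronyms(text, skip=frozenset({"GPU"})))
```

Entries from a project glossary would be passed in through `skip`, so well-known terms are not reported.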
| ### Outputs | |
| - **Markdown reports**: bibliography validation + LaTeX quality issues | |
| - **Self-contained HTML**: dark mode, full-text search, per-section severity filters, inline highlighting of the offending span on each LaTeX issue. Opens offline, no server required | |
| - **JSON** for CI / scripts / custom dashboards | |
| - **Cleaned `.bib`** containing only entries actually cited in the paper | |
| ## Installation | |
| ```bash | |
| git clone git@github.com:thinkwee/BibGuard.git | |
| cd BibGuard | |
| pip install -r requirements.txt | |
| ``` | |
| ## Quick Start | |
| ### 1. Initialize Configuration | |
| ```bash | |
| python main.py --init | |
| ``` | |
| This creates `config.yaml`. Edit it to point at your `.bib` and `.tex` files. | |
| #### Single File Mode | |
| ```yaml | |
| files: | |
|   bib: "paper.bib" | |
|   tex: "paper.tex" | |
|   output_dir: "bibguard_output" | |
| ``` | |
| #### Directory Scan Mode | |
| For projects with multiple `.tex` and `.bib` files: | |
| ```yaml | |
| files: | |
|   input_dir: "./my_project_dir" | |
|   output_dir: "bibguard_output" | |
| ``` | |
| ### 2. Run a Check | |
| ```bash | |
| python main.py # full check using config.yaml / bibguard.yaml | |
| python main.py --quick # local-only checks (no network, instant) | |
| python main.py --format json,html # pick output formats | |
| python main.py --verbose # DEBUG logs to stderr | |
| python main.py --config my.yaml # custom config path | |
| python main.py --list-templates # list conference templates | |
| ``` | |
| **Default outputs** (in `bibguard_output/`): | |
| - `report.html`: single self-contained HTML file, opens offline, dark-mode aware | |
| - `report.json`: full machine-readable dump (only when `json` is in `output.formats`) | |
| - `bibliography_report.md`: bibliography validation, with corroboration notes | |
| - `latex_quality_report.md`: LaTeX quality issues (errors / warnings / suggestions), with the full line content and the offending span in bold | |
| - `<bibname>_only_used.bib`: clean bibliography of cited entries only | |
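The `_only_used.bib` output and the usage analysis both depend on comparing the keys cited in the `.tex` source with the keys defined in the `.bib` file. The sketch below shows one simple way to do that comparison with regular expressions. The patterns and function names are assumptions for illustration; a real parser has to handle more LaTeX and BibTeX syntax than this.

```python
from __future__ import annotations

import re

# \cite, \citep, \citet*, ... with optional [..] arguments, then {key1,key2}.
CITE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# @type{key, at the start of a BibTeX entry.
BIB_ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
NON_ENTRIES = {"comment", "string", "preamble"}


def cited_keys(tex: str) -> set[str]:
    """Collect every citation key used in the LaTeX source."""
    keys: set[str] = set()
    for match in CITE.finditer(tex):
        keys.update(k.strip() for k in match.group(1).split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set[str]:
    """Collect the keys of all real entries in the BibTeX source."""
    return {
        m.group(2) for m in BIB_ENTRY.finditer(bib)
        if m.group(1).lower() not in NON_ENTRIES
    }


def usage_report(tex: str, bib: str) -> dict[str, list[str]]:
    """Split keys into used, unused (bib only), and missing (tex only)."""
    cited, defined = cited_keys(tex), bib_keys(bib)
    return {
        "used": sorted(defined & cited),
        "unused": sorted(defined - cited),
        "missing": sorted(cited - defined),
    }


tex = r"See \cite{a,b} and \citep[p.~3]{c}."
bib = "@article{a,\n  title={A}\n}\n@inproceedings{d,\n  title={D}\n}\n"
print(usage_report(tex, bib))
```

A cleaned bibliography would keep only the entries listed under `used`.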
| ## Configuration | |
| `bibguard.yaml` (or `config.yaml`) contains the following sections: | |
| ```yaml | |
| files: | |
|   bib: "paper.bib" | |
|   tex: "paper.tex" | |
|   output_dir: "bibguard_output" | |
| network: | |
|   contact_email: "" # used in polite-pool User-Agent for arXiv/CrossRef/OpenAlex | |
|   cache_enabled: true # local SQLite cache for HTTP responses (~/.cache/bibguard) | |
|   cache_ttl_hours: 24 | |
|   retry_total: 5 # auto-retry on 429/5xx with exponential backoff | |
|   retry_backoff_factor: 1.5 | |
| template: "" # acl | emnlp | naacl | cvpr | iccv | eccv | neurips | icml | iclr | |
| bibliography: | |
|   check_metadata: true # verify against online databases (slow on first run, fast on repeats) | |
|   check_usage: true # find unused entries / missing citations | |
|   check_duplicates: true | |
|   check_preprint_ratio: true # warn if >50% of references are preprints | |
|   check_relevance: false # LLM-based relevance check (requires API key) | |
| submission_extra: | |
|   url_liveness: false # HEAD-check every entry.url field (slow) | |
|   retraction: true # flag retracted DOIs via CrossRef | |
| submission: # 11 LaTeX checkers, each toggled independently | |
|   caption: true | |
|   reference: true | |
|   formatting: true | |
|   equation: true | |
|   ai_artifacts: true | |
|   sentence: true | |
|   consistency: true | |
|   acronym: true | |
|   number: true | |
|   citation_quality: true | |
|   anonymization: true | |
| # Project glossary feeds the consistency / acronym checkers. | |
| glossary: | |
|   preferred: | |
|     - "Transformer" | |
|     - "fine-tuning" | |
|   acronyms: | |
|     NLP: "Natural Language Processing" | |
|     LLM: "Large Language Model" | |
| llm: | |
|   backend: "gemini" # gemini | openai | anthropic | deepseek | ollama | vllm | |
|   model: "" # leave empty for sensible default per backend | |
|   api_key: "" # PREFER env var: $GEMINI_API_KEY / $OPENAI_API_KEY / etc. | |
| output: | |
|   quiet: false | |
|   minimal_verified: false | |
|   formats: [markdown, html] # any of: markdown, html, json | |
| ``` | |
| ## LLM-Based Relevance + Role Tagging | |
| When `bibliography.check_relevance` is `true`, BibGuard sends each citation's surrounding context plus the cited paper's abstract to your chosen LLM. The model returns a 1-5 relevance score, an `is_relevant` boolean, a one-sentence explanation, and a **citation role**: | |
| - `baseline`: cited as a comparison/baseline | |
| - `method`: cited paper introduces a method this one builds on | |
| - `dataset`: provides a dataset/benchmark used here | |
| - `counterexample`: cited to argue against | |
| - `survey`: cited as a survey/overview | |
| - `motivation`: cited to motivate the problem | |
| - `other` | |
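A response with the four fields described above can be validated before it reaches the report. The sketch below assumes the model replies with a JSON object; apart from `is_relevant`, the field names (`score`, `explanation`, `role`) and the fallback of unknown roles to `other` are hypothetical choices, not BibGuard's documented schema.

```python
import json
from dataclasses import dataclass

# The seven citation roles listed above.
ROLES = {"baseline", "method", "dataset", "counterexample", "survey", "motivation", "other"}


@dataclass
class RelevanceVerdict:
    score: int
    is_relevant: bool
    explanation: str
    role: str


def parse_verdict(raw: str) -> RelevanceVerdict:
    """Parse and sanity-check one LLM reply (assumed to be a JSON object)."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    role = str(data.get("role", "other")).strip().lower()
    if role not in ROLES:
        role = "other"  # assumption: unknown roles degrade gracefully
    return RelevanceVerdict(
        score=score,
        is_relevant=bool(data["is_relevant"]),
        explanation=str(data.get("explanation", "")),
        role=role,
    )


reply = '{"score": 4, "is_relevant": true, "explanation": "Used as a baseline.", "role": "Baseline"}'
print(parse_verdict(reply))
```

Rejecting out-of-range scores early keeps a malformed reply from silently skewing the report.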
| **Supported backends**: Gemini, OpenAI, Anthropic, DeepSeek, Ollama (local), vLLM (custom endpoint). | |
| **API keys**: by convention, keys are read from the environment variables `GEMINI_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `DEEPSEEK_API_KEY`. Set them in your shell rather than committing `api_key:` to `bibguard.yaml`. | |
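One plausible way to resolve a key under this convention is sketched below. The precedence (environment variable first, then the `api_key` value from the config file) and the function name are assumptions made for illustration; the README does not state how BibGuard orders the two.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Environment variable per backend, as named above. Local backends need no key.
ENV_VARS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def resolve_api_key(
    backend: str,
    config_key: str = "",
    env: Mapping[str, str] | None = None,
) -> str | None:
    """Prefer the backend's environment variable, fall back to the config value."""
    env = os.environ if env is None else env
    var = ENV_VARS.get(backend.lower())
    if var and env.get(var):
        return env[var]
    return config_key or None


print(resolve_api_key("openai", env={"OPENAI_API_KEY": "sk-example"}))
```

Passing `env` explicitly makes the lookup easy to test without touching the real environment.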
| ## Web UI | |
| ```bash | |
| python app.py | |
| ``` | |
| Opens at `http://localhost:7860`. The web UI mirrors the CLI but with a streaming status panel and three presets: | |
| - **Quick**: local checks only, no network, instant | |
| - **Standard**: local + retraction lookup (CrossRef) | |
| - **Strict**: adds multi-source metadata fetch + URL liveness (slow on first run; subsequent runs are cached) | |
| The toolbar fits in one row: file uploads, preset chips, and Run / Stop. Per-check overrides live in the **Advanced** accordion. The report renders inline as a self-contained iframe so the page stays stable while entries stream in. Downloads (HTML, Markdown bib, JSON, cleaned `.bib`, `bibguard.log`) appear in the **Downloads** accordion below. | |
| Set `BIBGUARD_CONTACT_EMAIL=you@example.com` in your shell to use a real contact in the polite-pool User-Agent. | |
| ## Pre-commit Hook | |
| To run BibGuard automatically before each commit that touches `.tex` or `.bib`: | |
| ```bash | |
| cd /path/to/your-paper-repo | |
| bash /path/to/BibGuard/scripts/install-hook.sh | |
| ``` | |
| Skip the hook for one commit with `git commit --no-verify`. | |
| ## Understanding Reports | |
| ### Self-Contained HTML (`report.html`) | |
| The recommended output. Single file, no external assets, dark-mode aware. Includes: | |
| - Three tabs: **Bibliography** · **LaTeX Quality** · **Retractions / URLs** | |
| - **Per-section filter chips**: bibliography filters by Verified / Unverified / Unused; LaTeX quality filters by Errors / Warnings / Info | |
| - **Full-text search** across titles, authors, keys, and messages; works inside the active tab | |
| - **Inline span highlighting**: for LaTeX issues that come from a regex (e.g., `\cite{}` without `~`), the offending substring is wrapped in `<mark>` so you can see exactly *where* in the line to look | |
| - **Honest empty states**: Retractions / URL liveness panels report how many entries actually carried a `doi=` / `url=` field, so an empty result no longer looks like the check failed silently | |
| - Theme toggle that overrides system preference | |
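The inline span highlighting in the list above can be sketched as a regex match whose span is wrapped in `<mark>` after HTML-escaping the rest of the line. The pattern here (a plain space before `\cite` instead of `~`) and the function name are illustrative assumptions, not the rule BibGuard ships.

```python
import html
import re
from typing import Optional

# Illustrative rule: a plain space before \cite where a tie (~) is expected.
SPACE_BEFORE_CITE = re.compile(r" +\\cite[a-zA-Z]*\{[^}]*\}")


def highlight_issue(line: str) -> Optional[str]:
    """Return the line as HTML with the offending span in <mark>, or None if clean."""
    match = SPACE_BEFORE_CITE.search(line)
    if match is None:
        return None
    before = line[: match.start()]
    span = match.group(0)
    after = line[match.end():]
    # Escape each part separately so only the <mark> tags are real markup.
    return f"{html.escape(before)}<mark>{html.escape(span)}</mark>{html.escape(after)}"


print(highlight_issue(r"as shown by Smith \cite{smith2020} & others"))
print(highlight_issue(r"as shown by Smith~\cite{smith2020}"))
```

Escaping before wrapping matters because LaTeX lines often contain `&`, `<`, or `>`, which would otherwise break the report's HTML.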
| ### Markdown Reports | |
| Two files for granular review and code review tooling: | |
| - `bibliography_report.md`: every entry with metadata-match status, including positive **corroboration notes** when a second source agreed | |
| - `latex_quality_report.md`: issues grouped by checker and severity, with the full line content and the offending span in bold | |
| ### JSON Output | |
| Machine-readable dump for CI integration. Top-level keys: `meta`, `summary`, `entries`, `submission_results`, `retractions`, `url_findings`, `duplicates`, `missing_citations`. | |
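For CI use, a small gate script can load `report.json` and turn it into an exit code. The sketch below checks only the top-level keys named above. It assumes `retractions` and `missing_citations` are lists, and the blocking policy and function names are hypothetical; adapt both to the actual report contents.

```python
from __future__ import annotations

import json

# Top-level keys listed in the README.
EXPECTED_KEYS = (
    "meta", "summary", "entries", "submission_results",
    "retractions", "url_findings", "duplicates", "missing_citations",
)


def load_report(path: str) -> dict:
    """Read a report.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def missing_top_level_keys(report: dict) -> list[str]:
    """Names of expected top-level keys that are absent from the report."""
    return [key for key in EXPECTED_KEYS if key not in report]


def ci_exit_code(report: dict) -> int:
    """0 = clean, 1 = blocking findings, 2 = report has an unexpected shape."""
    if missing_top_level_keys(report):
        return 2
    # Assumed policy: any retraction or missing citation blocks the build.
    blocking = len(report["retractions"]) + len(report["missing_citations"])
    return 1 if blocking else 0


clean = {key: [] for key in EXPECTED_KEYS}
print(ci_exit_code(clean))
```

In a pipeline this would typically end with `sys.exit(ci_exit_code(load_report("bibguard_output/report.json")))`.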
| ## Understanding Mismatches | |
| BibGuard is strict, but false positives happen: | |
| 1. **Year Discrepancy (±1 Year)**: preprint vs. official publication. Verify which version you intend to cite. | |
| 2. **Author List Variations**: different databases truncate large author lists differently. Check primary authors. | |
| 3. **Venue Name Differences**: abbreviations vs. full names (e.g., "NeurIPS" vs. "Neural Information Processing Systems"). Both are usually correct. | |
| 4. **Non-Academic Sources**: blogs and documentation aren't indexed by academic databases. Verify URL and title manually. | |
| ## Performance Notes | |
| - **First run** with `check_metadata: true` on ~100 entries: 1-3 minutes (rate-limited by arXiv/CrossRef). | |
| - **Re-runs**: seconds, thanks to the SQLite HTTP cache at `~/.cache/bibguard/http_cache.sqlite` (TTL 24h by default). | |
| - **Quick mode** (`python main.py --quick`) bypasses all network calls; runs in <1 second on most papers. | |
| - **Retraction lookup** is concurrent; ~5-10 seconds for 100 entries with cache cold. | |
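The fast re-runs come from caching HTTP responses in SQLite with a time-to-live. A minimal sketch of that idea follows. The table layout, class name, and method names are hypothetical, and the real cache file at `~/.cache/bibguard/http_cache.sqlite` may be organised differently.

```python
from __future__ import annotations

import sqlite3
import time


class HttpCache:
    """Tiny URL -> body cache with a TTL (illustrative schema)."""

    def __init__(self, path: str = ":memory:", ttl_hours: float = 24.0) -> None:
        self.ttl_seconds = ttl_hours * 3600
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "url TEXT PRIMARY KEY, body TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def put(self, url: str, body: str, now: float | None = None) -> None:
        """Store or overwrite the cached body for a URL."""
        stored_at = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
            (url, body, stored_at),
        )
        self.db.commit()

    def get(self, url: str, now: float | None = None) -> str | None:
        """Return the cached body, or None if absent or older than the TTL."""
        current = time.time() if now is None else now
        row = self.db.execute(
            "SELECT body, stored_at FROM responses WHERE url = ?", (url,)
        ).fetchone()
        if row is None or current - row[1] > self.ttl_seconds:
            return None
        return row[0]


cache = HttpCache(ttl_hours=24)
cache.put("https://api.crossref.org/works/10.1234/example", '{"status": "ok"}', now=0.0)
print(cache.get("https://api.crossref.org/works/10.1234/example", now=3600.0))
```

The optional `now` argument exists only so expiry can be exercised without waiting; normal callers omit it.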
| ### Hostile networks (HF Spaces, restricted egress) | |
| BibGuard's networking is tuned for "fail fast, then circuit-break": | |
| - urllib3 retries are restricted to genuine HTTP 5xx responses; connection resets and read timeouts are **not** retried, so a blocked source fails in 1-3 s instead of 20+ s. | |
| - The application-level circuit breaker trips after **2** consecutive failures and skips that source for the rest of the run. | |
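The circuit-breaker behaviour described above can be pictured with a small per-source counter. This is a sketch under stated assumptions: the class and method names are hypothetical, and only the threshold of 2 consecutive failures comes from the text.

```python
from __future__ import annotations


class CircuitBreaker:
    """Skip a source for the rest of the run after N consecutive failures."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.failures: dict[str, int] = {}
        self.open_sources: set[str] = set()

    def allow(self, source: str) -> bool:
        """True while the source has not been tripped."""
        return source not in self.open_sources

    def record_success(self, source: str) -> None:
        # A success resets the streak, so only consecutive failures count.
        self.failures[source] = 0

    def record_failure(self, source: str) -> None:
        self.failures[source] = self.failures.get(source, 0) + 1
        if self.failures[source] >= self.threshold:
            self.open_sources.add(source)


breaker = CircuitBreaker(threshold=2)
breaker.record_failure("dblp")
breaker.record_failure("dblp")
print(breaker.allow("dblp"), breaker.allow("crossref"))
```

Callers would check `allow(source)` before each request and skip the lookup when it returns `False`.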
| If you know in advance that some sources won't work from your deployment (e.g. HF Spaces' egress IPs are routinely blocked by DBLP and `export.arxiv.org`), disable them up front so the run never even tries: | |
| ```bash | |
| export BIBGUARD_DISABLE_SOURCES="dblp,arxiv" | |
| python app.py # or main.py | |
| ``` | |
| Comma- or space-separated, case-insensitive. Other sources (CrossRef, Semantic Scholar, OpenAlex) keep working. | |
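The accepted format (comma- or space-separated, case-insensitive) is easy to mirror in a few lines. The parser below is an illustrative sketch with a hypothetical function name, not BibGuard's own code.

```python
from __future__ import annotations

import os
import re
from collections.abc import Mapping


def disabled_sources(env: Mapping[str, str] | None = None) -> set[str]:
    """Parse BIBGUARD_DISABLE_SOURCES into a set of lower-case source names."""
    env = os.environ if env is None else env
    raw = env.get("BIBGUARD_DISABLE_SOURCES", "")
    # Split on commas and/or whitespace; drop empty tokens.
    return {token.lower() for token in re.split(r"[,\s]+", raw) if token}


print(sorted(disabled_sources({"BIBGUARD_DISABLE_SOURCES": "DBLP, arXiv"})))
```

Normalising to lower case once at parse time means later membership checks such as `"dblp" in disabled` need no extra handling.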
| ## Contributing | |
| Contributions welcome. Open an issue or pull request. | |
| ## Acknowledgments | |
| BibGuard uses the following data sources: | |
| - [arXiv API](https://info.arxiv.org/help/api/index.html) | |
| - [CrossRef REST API](https://api.crossref.org) | |
| - [Semantic Scholar Graph API](https://api.semanticscholar.org) | |
| - [DBLP API](https://dblp.org/faq/How+to+use+the+dblp+search+API.html) | |
| - [OpenAlex API](https://docs.openalex.org) | |
| - Google Scholar (via scraping; rate-limited) | |
| --- | |
| **Made with ❤️ for researchers who care about their submission** | |