title: CiteScan
emoji: ๐
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
CiteScan: Check References, Confirm Truth.
CiteScan is a free, open-source tool designed to detect hallucinated references in academic writing. As AI coding assistants and writing tools become more prevalent, they sometimes generate plausible-sounding citations that do not actually exist. CiteScan addresses this by validating every bibliography entry against multiple authoritative academic databases (arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, and Google Scholar) to confirm its authenticity.
Going beyond simple verification, CiteScan uses rule-based algorithms to analyze whether the cited papers genuinely support the claims made in your text. Because the academic databases covering CS and AI are free to access, the system costs $0 to maintain after development.
Quick Start
Option 1: Web Interface (Gradio)
# Install dependencies
pip install -r requirements.txt
# Run Gradio interface
python app.py
Access at http://localhost:7860
Option 2: API Service (FastAPI)
# Install dependencies
pip install -r requirements.txt
# Run API service
python main.py
Access API at http://localhost:8000
API Documentation at http://localhost:8000/docs
Option 3: Docker
# Run both services with Docker Compose
docker-compose up -d
# Gradio: http://localhost:7860
# API: http://localhost:8000
Documentation
- API Documentation - Complete API reference and examples
- Deployment Guide - Production deployment instructions
Why CiteScan?
No Hallucinations: Flags citations that don't exist or whose metadata (year, authors, or title) doesn't match.
Ground Truth Reference: Provides a link to the real entry for each flagged citation. Click the Open paper or DOI button to view the real-world metadata, then copy the BibTeX from the publisher's website.
Top-tier Research Organizations: Developed in cooperation with National University of Singapore (NUS) and Shanghai Jiao Tong University (SJTU).
RESTful API: Production-ready API for integration with other tools and services.
Features
Web Interface (Gradio)
- User-friendly interface for manual verification
- Real-time progress tracking
- Interactive filtering by verification status
- Visual presentation of results
API Service (FastAPI)
- RESTful API for programmatic access
- Automatic OpenAPI documentation
- JSON responses for easy integration
- Health checks and monitoring endpoints
- Structured logging
- Caching for improved performance
References Validation
Multi-Source Verification: Validates metadata against arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, and Google Scholar.
Convert citations from the preprint version to the official version: Click the blue button (Open paper or DOI) to open the official website, then click the cite button to copy the official BibTeX.
Verification Workflow
- Parse BibTeX: Extract entries and metadata
- Priority-based Search: Query databases in priority order
- Metadata Comparison: Compare title, authors, year, venue
- Duplicate Detection: Identify duplicate entries
- Result Generation: Provide detailed verification report
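The priority-based search and duplicate detection steps above can be sketched roughly as follows. This is a minimal illustration, not CiteScan's actual implementation: the function names, the `(name, fetch)` pair convention, and the dictionary shape of an entry (`key`, `title`) are all assumptions made for the example.

```python
import re


def normalize_title(title):
    """Lowercase a title and drop braces/punctuation so near-identical titles compare equal."""
    letters_only = re.sub(r"[^a-z0-9\s]", "", title.lower())
    return re.sub(r"\s+", " ", letters_only).strip()


def search_in_priority_order(entry, fetchers):
    """Try each (name, fetch) pair in order and return the first source that finds a record.

    `fetch` is a hypothetical callable that returns a metadata dict or None.
    """
    for name, fetch in fetchers:
        record = fetch(entry)
        if record is not None:
            return name, record
    return None, None


def find_duplicates(entries):
    """Group the keys of entries that share the same normalized title."""
    by_title = {}
    for entry in entries:
        by_title.setdefault(normalize_title(entry["title"]), []).append(entry["key"])
    return [keys for keys in by_title.values() if len(keys) > 1]


if __name__ == "__main__":
    entries = [
        {"key": "vaswani2017attention", "title": "{Attention} Is All You Need"},
        {"key": "vaswani2017dup", "title": "Attention is all you need"},
    ]
    # Stub fetchers stand in for real database clients.
    fetchers = [("first", lambda e: None), ("second", lambda e: {"year": 2017})]
    print(search_in_priority_order(entries[0], fetchers))
    print(find_duplicates(entries))
```

Stopping at the first source that returns a record keeps the number of outbound requests low; a real system might instead merge records from several sources.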
API Usage Examples
Python
import requests
url = "http://localhost:8000/api/v1/verify"
bibtex = """
@article{vaswani2017attention,
title={Attention is all you need},
author={Vaswani, Ashish and Shazeer, Noam},
year={2017}
}
"""
response = requests.post(url, json={"bibtex_content": bibtex})
result = response.json()
print(f"Verified: {result['verified_count']}/{result['total_count']}")
cURL
curl -X POST "http://localhost:8000/api/v1/verify" \
-H "Content-Type: application/json" \
-d '{"bibtex_content": "@article{example,title={Test},year={2023}}"}'
See API_DOCS.md for complete API documentation.
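For verifying a whole .bib file rather than an inline string, a small helper along these lines may be convenient. It is a sketch that uses only the standard library (`urllib` instead of `requests`) and assumes the same endpoint and `bibtex_content` field shown in the examples above; the helper names `build_request` and `verify_file` are hypothetical.

```python
import json
import urllib.request
from pathlib import Path

# Endpoint taken from the examples above; adjust if your API_PORT differs.
API_URL = "http://localhost:8000/api/v1/verify"


def build_request(bib_path, url=API_URL):
    """Read a .bib file and wrap its text in a JSON POST request."""
    bibtex = Path(bib_path).read_text(encoding="utf-8")
    body = json.dumps({"bibtex_content": bibtex}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def verify_file(bib_path, timeout=60):
    """Send the file to a running API service and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(bib_path), timeout=timeout) as response:
        return json.load(response)
```

Calling `verify_file("references.bib")` requires the API service to be running; `urlopen` raises `urllib.error.HTTPError` on a non-2xx status, so failures are not silently ignored.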
Configuration
Create a .env file from the template:
cp .env.example .env
Key configuration options:
# Server ports
API_PORT=8000
GRADIO_PORT=7860
# Performance
MAX_WORKERS=10
CACHE_ENABLED=true
CACHE_TTL=3600
# Logging
LOG_LEVEL=INFO
LOG_FORMAT=json
See DEPLOYMENT.md for the complete configuration guide.
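As an illustration of how the options listed above might be read at startup, here is a minimal sketch using only the standard library. The `Settings` class and `load_settings` function are hypothetical names, the defaults simply mirror the values shown above, and treating `CACHE_TTL` as a number of seconds is an assumption. It reads process environment variables only; loading the .env file itself is left to whatever mechanism the project uses.

```python
import os
from dataclasses import dataclass


def _as_bool(value):
    """Interpret common truthy strings such as 'true' or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    api_port: int
    gradio_port: int
    max_workers: int
    cache_enabled: bool
    cache_ttl: int  # assumed to be seconds
    log_level: str
    log_format: str


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ), falling back to the documented values."""
    env = os.environ if env is None else env
    return Settings(
        api_port=int(env.get("API_PORT", "8000")),
        gradio_port=int(env.get("GRADIO_PORT", "7860")),
        max_workers=int(env.get("MAX_WORKERS", "10")),
        cache_enabled=_as_bool(env.get("CACHE_ENABLED", "true")),
        cache_ttl=int(env.get("CACHE_TTL", "3600")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        log_format=env.get("LOG_FORMAT", "json"),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.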
Architecture
CiteScan/
├── src/
│   ├── api/              # FastAPI routes and schemas
│   ├── services/         # Business logic layer
│   ├── core/             # Configuration, logging, cache
│   ├── fetchers/         # Database API clients
│   ├── analyzers/        # Metadata comparison
│   ├── parsers/          # BibTeX parsing
│   └── utils/            # Utilities
├── app.py                # Gradio interface
├── main.py               # FastAPI application
├── Dockerfile            # Container configuration
└── docker-compose.yml    # Multi-service setup
Development
Setup Development Environment
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Copy environment template
cp .env.example .env
# Run in development mode
ENVIRONMENT=development python main.py
Project Structure
- Services Layer: Reusable business logic
- API Layer: RESTful endpoints with FastAPI
- UI Layer: Gradio interface
- Core: Configuration, logging, caching
- Fetchers: Database API integrations
Monitoring
Health Check
curl http://localhost:8000/api/v1/health
Statistics
curl http://localhost:8000/api/v1/stats
Logs
Logs are stored in logs/citescan.log in JSON format:
tail -f logs/citescan.log | jq '.'
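Where jq is not available, the same JSON-lines log can be filtered with a few lines of Python. This is only a sketch: the log schema is not documented here, so the `level` field name is an assumption and is exposed as a parameter.

```python
import json


def iter_log_records(lines):
    """Yield one dict per valid JSON line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def filter_by_level(lines, level="ERROR", field="level"):
    """Return records whose (assumed) level field matches, ignoring case."""
    wanted = level.upper()
    return [r for r in iter_log_records(lines) if str(r.get(field, "")).upper() == wanted]


# Example usage against the log file mentioned above:
# with open("logs/citescan.log", encoding="utf-8") as fh:
#     for record in filter_by_level(fh, level="ERROR"):
#         print(record)
```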
Case Study for False Positives
Authors Mismatch:
- Reason: Databases handle long author lists differently, for example by truncating them.
- Action: Verify that the main authors match.
Venues Mismatch:
- Reason: Abbreviations vs. full names, such as "ICLR" vs. "International Conference on Learning Representations"
- Action: Both are correct.
Year Gap (±1 Year):
- Reason: Delay between preprint (arXiv) and final version publication
- Action: Verify which version you intend to cite. We recommend citing the version from the official publisher's website. Fewer preprint entries in your bibliography make your submission more convincing.
Non-academic Sources:
- Reason: Blogs and APIs are not indexed in academic databases.
- Action: Verify URL, year, and title manually.
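The mismatch cases above suggest what a tolerant comparison could look like. The sketch below is illustrative only and does not reproduce CiteScan's analyzers: the alias table, the one-year tolerance, and the choice to compare only the first few family names are all assumptions made for the example.

```python
# Illustrative alias table; a real one would cover many more venues.
VENUE_ALIASES = {
    "iclr": "international conference on learning representations",
}


def canonical_venue(venue):
    """Map a known abbreviation to its full name, otherwise just normalize case."""
    key = venue.strip().lower()
    return VENUE_ALIASES.get(key, key)


def venues_match(a, b):
    return canonical_venue(a) == canonical_venue(b)


def years_match(a, b, tolerance=1):
    """Treat years within `tolerance` as matching (preprint vs. final publication)."""
    return abs(int(a) - int(b)) <= tolerance


def family_names(author_field):
    """Extract lowercase family names from a BibTeX-style 'A and B and C' author field."""
    names = []
    for person in author_field.split(" and "):
        person = person.strip()
        if not person or person.lower() == "others":
            continue
        family = person.split(",")[0] if "," in person else person.split()[-1]
        names.append(family.strip().lower())
    return names


def lead_authors_match(a, b, n=3):
    """Compare only the leading authors so truncated author lists do not count as mismatches."""
    fa, fb = family_names(a), family_names(b)
    k = min(n, len(fa), len(fb))
    return k > 0 and fa[:k] == fb[:k]
```

For example, `lead_authors_match("Vaswani, Ashish and Shazeer, Noam", "Ashish Vaswani and Noam Shazeer and Niki Parmar")` compares only the two names both lists share and treats them as matching.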
Acknowledgments
CiteScan uses multiple data sources:
- arXiv API
- CrossRef API
- Semantic Scholar API
- DBLP API
- OpenAlex API
- Google Scholar (web scraping)
License
[Add your license here]
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Contact
For questions and support:
- Email: e1143641@u.nus.edu
- GitHub Issues: [Repository URL]
ModelScope Deployment
To deploy on ModelScope Studio (创空间):
# Add ModelScope remote
git remote add modelscope "http://oauth2:YOUR_TOKEN@www.modelscope.cn/studios/YOUR_USERNAME/CiteScan.git"
# Push to ModelScope
git push modelscope main
# Or force push if needed
git push modelscope main --force
After pushing, visit your ModelScope studio and click "Launch Space" (上线空间展示) or "Publish Now" (立即发布) to deploy the Gradio application.
Hugging Face Spaces Deployment
To push the code to Hugging Face Spaces:
Install the Hugging Face CLI and log in (if not already installed):
pip install huggingface_hub
huggingface-cli login
Add the Hugging Face remote:
git remote add hf https://huggingface.co/spaces/yancan/CiteScan
Push to Spaces (HF rejects binary files in a regular git push, so use the image-free branch hf-main):
- Important: HF shows the code that has been committed to main. If you have uncommitted local changes (for example to main.py or src/), commit them to main first, then update and push hf-main.
- One-step script: ./scripts/push_to_hf.sh (it prompts you to commit any uncommitted changes, then rebuilds hf-main and pushes it).
- Or manually: first run git add -A && git commit -m "message", then either run the script or follow the steps inside it to rebuild hf-main and run git push hf hf-main:main --force.
After the push completes, wait for the build to finish on the Space page, then open the Gradio application.
Note: The YAML configuration at the top of the README (title, sdk, app_file, etc.) is required by Spaces; do not delete it.

