title: Bitcheck Document
emoji: 📊
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false

BitCheck Document Verification API

BitCheck Document Verification API is a FastAPI service for risk-based document verification. It accepts PDF and image documents, runs a sequence of local analysis modules, optionally adds DeepSeek reasoning when configured, and returns a structured trust report for review workflows.

The service is designed to run locally or on Hugging Face Spaces using Docker on port 7860.

Architecture

Client
  |
  v
FastAPI /verify/document
  |
  +--> File validation and safe storage
  |
  +--> PDF processor or image processor
  |
  +--> Metadata analyzer
  |
  +--> OCR service
  |
  +--> PDF/OCR text consistency checker
  |
  +--> QR decoder and structural URL analyzer
  |
  +--> Visual forensic analyzer
  |
  +--> Rule-based field extractor
  |
  +--> Content risk analyzer
  |       |
  |       +--> Optional DeepSeek reasoning and document context
  |
  +--> LLM-guided field refinement
  |
  +--> Context-aware dynamic trust scorer
  |
  +--> Report builder
  |
  v
Risk-based BitCheck JSON report

Features

FastAPI document verification endpoint.
Safe upload validation by extension and file signature.
PDF text-layer extraction and page rendering.
Image normalization for OCR, QR, and forensic modules.
Metadata analysis for editing tools, AI tools, timestamps, camera data, and GPS.
OCR with graceful fallback when Tesseract is unavailable.
PDF text versus OCR text consistency scoring.
QR code detection and structural URL risk analysis.
Visual forensic risk signals and annotated output images.
Rule-based document type inference and structured field extraction.
Heuristic content risk analysis for fraud-like wording, with LLM context when available.
Optional DeepSeek reasoning when an API key is configured.
LLM-guided document-type refinement for cases where local rules infer the wrong category.
Context-aware trust scoring with normalized module weights.
Final report builder with limitations, warnings, risk flags, recommended actions, and relative output paths.

Supported File Types

PDF: .pdf
Images: .jpg, .jpeg, .png, .webp

Unsupported file types are rejected with an HTTP 400 response whose body includes status: "failed".

How Each Module Works

File Validation

The file validator checks the filename extension, reads the file with a size limit, validates the file signature, computes a SHA-256 hash, and stores the upload using a generated safe filename.
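The validation steps above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the function name validate_upload, the returned keys, and the 20 MB constant (borrowed from MAX_UPLOAD_MB in .env.example) are assumptions.

```python
import hashlib
import uuid
from pathlib import Path

# Magic-byte prefixes for the supported types (WebP is checked separately).
SIGNATURES = {
    ".pdf": (b"%PDF-",),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
    ".png": (b"\x89PNG\r\n\x1a\n",),
}
MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # assumption: mirrors MAX_UPLOAD_MB=20


def validate_upload(filename: str, data: bytes) -> dict:
    """Hypothetical validator: extension, size, signature, hash, safe name."""
    ext = Path(filename).suffix.lower()
    if ext not in SIGNATURES and ext != ".webp":
        return {"valid": False, "reason": "unsupported extension"}
    if len(data) > MAX_UPLOAD_BYTES:
        return {"valid": False, "reason": "file too large"}
    if ext == ".webp":
        # WebP is a RIFF container: "RIFF" + 4 size bytes + "WEBP".
        signature_ok = data[:4] == b"RIFF" and data[8:12] == b"WEBP"
    else:
        signature_ok = data.startswith(SIGNATURES[ext])
    if not signature_ok:
        return {"valid": False, "reason": "signature mismatch"}
    return {
        "valid": True,
        "extension": ext,
        "sha256": hashlib.sha256(data).hexdigest(),
        # The stored name never reuses user-supplied path components.
        "stored_filename": f"{uuid.uuid4().hex}{ext}",
        "file_size_bytes": len(data),
    }
```

Generating the stored filename from a UUID rather than the original name is one simple way to avoid path traversal and name collisions.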

PDF Processing

PDF files are opened with PyMuPDF. The service extracts embedded page text, renders each processed page to an output image, records document structure signals, and limits processing to MAX_PDF_PAGES.

Image Processing

Image files are opened with Pillow, EXIF metadata is collected where available, and the image is normalized to RGB PNG for downstream modules.

Metadata Analysis

Metadata is checked for known editing software, AI-generation tool names, creation/modification timestamps, camera metadata, and GPS fields. These signals affect risk but do not prove manipulation by themselves.
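As a rough sketch of how such metadata signals might be turned into flags: the tool lists, the metadata key names (producer, creator, software, created, modified), and the per-flag score increment below are all illustrative assumptions, not the service's real rules.

```python
from datetime import datetime

EDITING_TOOLS = ("photoshop", "gimp", "canva")           # illustrative names
AI_TOOLS = ("midjourney", "dall-e", "stable diffusion")  # illustrative names


def analyze_metadata(meta: dict) -> dict:
    """Hypothetical analyzer: flag tool names and odd timestamps."""
    if not meta:
        return {
            "metadata_found": False,
            "metadata_risk_score": 0.0,
            "flags": [],
            "warnings": ["No metadata found. This is a low-risk signal, not proof of authenticity."],
        }
    flags = []
    tools = " ".join(str(meta.get(key, "")) for key in ("producer", "creator", "software")).lower()
    if any(tool in tools for tool in EDITING_TOOLS):
        flags.append("editing_software")
    if any(tool in tools for tool in AI_TOOLS):
        flags.append("ai_generation_tool")
    created, modified = meta.get("created"), meta.get("modified")
    if isinstance(created, datetime) and isinstance(modified, datetime) and modified < created:
        flags.append("modified_before_created")
    # Assumed scoring: each flag adds a fixed increment, capped at 1.0.
    score = min(1.0, 0.3 * len(flags))
    return {"metadata_found": True, "metadata_risk_score": score, "flags": flags, "warnings": []}
```

The flags raise the risk score but, as noted above, none of them proves manipulation on its own.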

OCR

OCR uses pytesseract when the Tesseract binary is available. If OCR is disabled or unavailable, BitCheck records a warning and continues.
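The fallback behavior might look something like the sketch below. The wrapper name ocr_or_warn, the result keys, and the warning texts are hypothetical; only pytesseract.image_to_string and Pillow's Image.open are real library calls.

```python
import shutil


def ocr_or_warn(image_path: str, run_ocr: bool = True) -> dict:
    """Hypothetical wrapper: return OCR text, or a warning when OCR cannot run."""
    if not run_ocr:
        return {"checked": False, "text": "", "warnings": ["OCR was disabled for this request."]}
    if shutil.which("tesseract") is None:
        return {"checked": False, "text": "", "warnings": ["Tesseract binary not found; OCR was skipped."]}
    try:
        # Imported lazily so the service can still start without OCR dependencies.
        import pytesseract
        from PIL import Image

        with Image.open(image_path) as img:
            text = pytesseract.image_to_string(img)
        return {"checked": True, "text": text, "warnings": []}
    except Exception as exc:  # an OCR failure should not abort verification
        return {"checked": False, "text": "", "warnings": [f"OCR failed: {exc}"]}
```

Returning a warning instead of raising lets the remaining modules run and the report builder surface the gap.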

Text Consistency

For PDFs, embedded text is compared against visible OCR text. A low match raises a review signal because it can indicate an image overlay, replacement, or extraction mismatch.
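One simple way to score this comparison is a normalized string similarity. The sketch below uses difflib from the standard library; the service's actual metric is not documented here, so treat the function name and the risk = 1 - similarity mapping as assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not count."""
    return re.sub(r"\s+", " ", text).strip().lower()


def text_consistency(pdf_text: str, ocr_text: str) -> dict:
    """Hypothetical scorer comparing the PDF text layer with OCR output."""
    a, b = normalize(pdf_text), normalize(ocr_text)
    if not a or not b:
        # Nothing to compare: report no score rather than a misleading one.
        return {"checked": False, "similarity": None, "risk_score": None}
    similarity = SequenceMatcher(None, a, b).ratio()
    return {
        "checked": True,
        "similarity": round(similarity, 3),
        "risk_score": round(1.0 - similarity, 3),
    }
```

SequenceMatcher can be slow on very long inputs, so a real pipeline would likely compare per page or on a truncated excerpt.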

QR Code Checks

BitCheck decodes QR codes and barcodes from rendered page images. For QR URL payloads, it analyzes structure only:

HTTPS versus HTTP
shortened URLs
IP address hosts
private or internal addresses
suspicious keywords such as login, payment, wallet, password, OTP, or claim
suspicious TLDs
punycode domains
excessive hyphenation
unusually deep subdomains

BitCheck does not browse or open QR destinations in the verification pipeline. QR detection does not mean the linked source is authentic unless the issuer is verified through an official channel.
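The structural checks listed above can be approximated with the standard library alone, without making any network request. In this sketch the shortener list, TLD list, thresholds, and flag names are illustrative assumptions rather than the service's real configuration.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative lists only; the real lists are not shown in this README.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}
KEYWORDS = ("login", "payment", "wallet", "password", "otp", "claim")
SUSPICIOUS_TLDS = {"zip", "top", "xyz"}


def analyze_qr_url(url: str) -> list:
    """Return structural risk flags for a URL without ever requesting it."""
    flags = []
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https":
        flags.append("not_https")
    if host in SHORTENERS:
        flags.append("shortened_url")
    try:
        ip = ipaddress.ip_address(host)
        flags.append("ip_address_host")
        if ip.is_private or ip.is_loopback:
            flags.append("private_or_internal_address")
    except ValueError:
        # Not an IP literal, so inspect the domain labels instead.
        labels = host.split(".")
        if labels[-1] in SUSPICIOUS_TLDS:
            flags.append("suspicious_tld")
        if any(label.startswith("xn--") for label in labels):
            flags.append("punycode_domain")
        if host.count("-") >= 3:
            flags.append("excessive_hyphenation")
        if len(labels) > 4:
            flags.append("deep_subdomains")
    if any(word in url.lower() for word in KEYWORDS):
        flags.append("suspicious_keyword")
    return flags
```

For example, analyze_qr_url("http://192.168.1.10/login") flags the plain HTTP scheme, the IP host, the private address, and the login keyword.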

Visual Forensics

The forensic analyzer uses OpenCV and NumPy to estimate visual inconsistency risk from local sharpness, noise, edge density, compression artifacts, brightness, and contrast. It can generate annotated review images. These are risk signals, not court-grade evidence.

Field Extraction

The field extractor uses regex and keyword rules to infer a document type and extract expected fields. Supported types include:

certificate
academic_result
invoice
receipt
business_registration
identity_document
bank_statement
admission_letter
result_slip
contract
academic_publication
report
general

It computes missing expected fields, field confidence, and field risk.

When DeepSeek is available, BitCheck can use deepseek_analysis.document_type_inferred to re-run field extraction with a better document type. For example, if local rules initially treat an academic article like an identity document, the LLM-inferred academic_publication type prevents irrelevant missing-field penalties such as missing date of birth or expiry date.
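The effect of re-running extraction with a corrected type can be illustrated with a tiny rule table. The field names, regex patterns, and risk formula below are hypothetical; they only show why the same text scores very differently under identity_document and academic_publication.

```python
import re

# Hypothetical rule table: expected fields and one pattern per field.
RULES = {
    "identity_document": {
        "date_of_birth": r"date of birth\s*[:\-]?\s*(\S+)",
        "expiry_date": r"expiry date\s*[:\-]?\s*(\S+)",
    },
    "academic_publication": {
        "abstract": r"\b(abstract)\b",
        "doi": r"\b(10\.\d{4,9}/\S+)",
    },
}


def extract_fields(text: str, document_type: str) -> dict:
    """Extract expected fields for a type and report which ones are missing."""
    rules = RULES.get(document_type, {})
    extracted = {}
    for name, pattern in rules.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            extracted[name] = match.group(1)
    missing = [name for name in rules if name not in extracted]
    # Assumed risk: the share of expected fields that are missing.
    risk = round(len(missing) / len(rules), 2) if rules else 0.0
    return {
        "document_type": document_type,
        "extracted_fields": extracted,
        "missing_expected_fields": missing,
        "field_risk_score": risk,
    }


paper = "Abstract\nWe study document fraud detection.\nDOI: 10.1234/abcd.5678"
local_guess = extract_fields(paper, "identity_document")   # both fields missing
refined = extract_fields(paper, "academic_publication")    # nothing missing
```

Under these assumed rules the local guess yields a field risk of 1.0, while the refined type yields 0.0 for the same text.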

Content Risk

The content risk analyzer always runs local heuristics for:

urgency wording
suspicious payment instructions
fake grant or scholarship wording
BVN, NIN, OTP, password, PIN, or verification-code requests
attempts to bypass official channels
unrealistic claims

When DeepSeek classifies a document as a contextual long-form document, such as academic_publication or report, BitCheck reduces keyword-only false positives. Terms such as bank, transfer, NIN, PIN, or fraud may be valid research/reporting context rather than direct fraud instructions.
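A minimal sketch of keyword heuristics with long-form dampening is shown below. The patterns, the per-signal weight of 0.35, and the 0.3 dampening factor are assumptions chosen for illustration; the README names the signal categories but not the exact rules.

```python
import re

# Illustrative patterns for three of the categories listed above.
SIGNALS = {
    "urgency": r"\b(urgent|immediately|act now|within 24 hours)\b",
    "credential_request": r"\b(send|share|provide|enter)\b[^.]{0,40}\b(otp|pin|password|bvn|nin)\b",
    "payment_instruction": r"\b(pay|transfer)\b[^.]{0,40}\b(fee|account|wallet)\b",
}
LONG_FORM_TYPES = {"academic_publication", "report"}


def content_risk(text: str, document_type: str = "general") -> dict:
    """Hypothetical heuristic scorer with dampening for long-form documents."""
    hits = [
        name
        for name, pattern in SIGNALS.items()
        if re.search(pattern, text, flags=re.IGNORECASE)
    ]
    score = min(1.0, 0.35 * len(hits))
    if document_type in LONG_FORM_TYPES:
        # Assumed dampening: keyword hits count for less in research/report context.
        score *= 0.3
    return {"signals": hits, "fraud_risk_score": round(score, 2)}
```

Requiring a verb such as "share" or "transfer" near the sensitive term, instead of matching the bare keyword, is one way to cut down on keyword-only false positives even before any LLM context is applied.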

Trust Scoring

The trust scorer combines available numeric module risks using normalized weights:

metadata_risk:         0.12
pdf_structure_risk:    0.10
text_consistency_risk: 0.18
qr_risk:               0.15
forensic_risk:         0.20
field_risk:            0.12
content_risk:          0.13

Only modules that report a numeric risk score are included, and the weights of those modules are renormalized, so missing modules are not over-penalized. If too little evidence is available, the decision is capped at review.
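The weighted combination can be sketched as below. The weights come from the table above; the conversion trust = 100 * (1 - risk), the MIN_EVIDENCE threshold of 3 modules, and the cap value of 59 are assumptions made for this example.

```python
WEIGHTS = {
    "metadata_risk": 0.12,
    "pdf_structure_risk": 0.10,
    "text_consistency_risk": 0.18,
    "qr_risk": 0.15,
    "forensic_risk": 0.20,
    "field_risk": 0.12,
    "content_risk": 0.13,
}
MIN_EVIDENCE = 3  # assumed threshold; the real one is not stated here
REVIEW_CAP = 59   # top of the "Suspicious, review" band


def combine_risks(risks: dict) -> dict:
    """Hypothetical scorer: weighted mean over modules that reported a number."""
    available = {
        name: float(value)
        for name, value in risks.items()
        if name in WEIGHTS and isinstance(value, (int, float))
    }
    if not available:
        return {"risk_score": None, "trust_score": None, "available_modules": [], "capped": True}
    # Renormalize so that missing modules do not shift the score either way.
    total_weight = sum(WEIGHTS[name] for name in available)
    risk = sum(WEIGHTS[name] * value for name, value in available.items()) / total_weight
    trust = round(100 * (1.0 - risk))
    capped = len(available) < MIN_EVIDENCE
    if capped:
        trust = min(trust, REVIEW_CAP)
    return {
        "risk_score": round(risk, 2),
        "trust_score": trust,
        "available_modules": sorted(available),
        "capped": capped,
    }
```

Under these assumptions, two low-risk modules alone would score 88 before the cap and 59 after it, which keeps thin evidence from producing an automatic approval.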

For contextual long-form documents, the trust scorer also considers DeepSeek's inferred document type. If the text layer, OCR consistency, metadata, and content-risk context are low risk, visual forensic findings are downgraded to manual-review warnings instead of forcing a high-risk block by themselves. This avoids treating normal tables, embedded figures, or image grids in academic PDFs as proof of tampering.

Trust levels:

80-100: Likely Authentic, approve
60-79:  Low Risk, approve
40-59:  Suspicious, review
20-39:  High Risk, block_or_manual_review
0-19:   Very High Risk, block_or_manual_review
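The bands above map directly to a small lookup. The function name trust_level is hypothetical; the thresholds, labels, and decisions are taken from the table.

```python
# (lower bound, risk level, decision), checked from the highest band down.
BANDS = [
    (80, "Likely Authentic", "approve"),
    (60, "Low Risk", "approve"),
    (40, "Suspicious", "review"),
    (20, "High Risk", "block_or_manual_review"),
    (0, "Very High Risk", "block_or_manual_review"),
]


def trust_level(trust_score: int) -> tuple:
    """Return (risk_level, decision) for a trust score between 0 and 100."""
    if not 0 <= trust_score <= 100:
        raise ValueError("trust_score must be between 0 and 100")
    for lower_bound, level, decision in BANDS:
        if trust_score >= lower_bound:
            return level, decision
    raise AssertionError("unreachable: the last band starts at 0")
```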

DeepSeek Usage

DeepSeek is optional. When DEEPSEEK_API_KEY is configured and run_llm_analysis=true, BitCheck sends a structured prompt containing:

document text excerpt
metadata summary
QR summary
field extraction results
heuristic risk signals

The prompt instructs DeepSeek to return JSON only, avoid certainty claims, avoid inventing external facts, avoid browsing the web, and mark external verification needs when official issuer confirmation is required.

DeepSeek is used as a context and reconciliation layer, not as absolute proof. Its inferred document type can refine field extraction, contextualize keyword hits, and help the trust scorer resolve conflicts between local modules. Local validation, metadata checks, OCR, QR analysis, forensic signals, and trust scoring still run.
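Because the model is asked for JSON only but may not always comply, the reply is best parsed defensively. The sketch below is an assumption about how that could be done; the function name and fallback warning text are hypothetical, while the keys follow the deepseek_analysis section of the example response.

```python
import json


def parse_llm_json(raw: str) -> dict:
    """Parse a JSON-only reply; fall back to an 'unused' result on bad output."""
    fallback = {
        "used": False,
        "document_type_inferred": None,
        "summary": "",
        "external_verification_required": True,
        "warnings": ["LLM reply was not valid JSON; reasoning was skipped."],
    }
    # Tolerate stray text or code fences around the JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    return {
        "used": True,
        "document_type_inferred": data.get("document_type_inferred"),
        "summary": str(data.get("summary", "")),
        # Default to requiring external verification unless told otherwise.
        "external_verification_required": bool(data.get("external_verification_required", True)),
        "warnings": [],
    }
```

Falling back to used: false keeps the local modules authoritative whenever the LLM output cannot be trusted structurally.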

Operation Without DeepSeek

The service works without DeepSeek. If DEEPSEEK_API_KEY is missing:

local validation still runs
PDF/image processing still runs
metadata, OCR, QR, forensic, field, content-risk, trust scoring, and report building still run
deepseek_analysis.used is false
the response includes a warning that DeepSeek reasoning was skipped

Local Setup

cd bitcheck-document/bitcheck-document-service
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

If you use uv:

uv venv .venv
uv pip install --python .venv/bin/python -r requirements.txt

For OCR support, install the Tesseract binary on the host or use the Docker image, which installs it.

Environment Variables

See .env.example.

DEEPSEEK_API_KEY=your_deepseek_api_key_here
DEEPSEEK_BASE_URL=https://api.deepseek.com
DEEPSEEK_MODEL=deepseek-chat
MAX_UPLOAD_MB=20
MAX_PDF_PAGES=5
LOG_LEVEL=INFO

DEEPSEEK_API_KEY is optional. Do not commit real keys.
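One way these variables could be read at startup is sketched below. The Settings class and load_settings function are hypothetical, and the defaults simply reuse the values from the listing above, which may differ from the service's real defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    deepseek_api_key: Optional[str]
    deepseek_base_url: str
    deepseek_model: str
    max_upload_mb: int
    max_pdf_pages: int
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from the environment, with assumed defaults."""
    env = os.environ if env is None else env
    return Settings(
        # An empty or missing key means DeepSeek reasoning is skipped.
        deepseek_api_key=env.get("DEEPSEEK_API_KEY") or None,
        deepseek_base_url=env.get("DEEPSEEK_BASE_URL", "https://api.deepseek.com"),
        deepseek_model=env.get("DEEPSEEK_MODEL", "deepseek-chat"),
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", "20")),
        max_pdf_pages=int(env.get("MAX_PDF_PAGES", "5")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Accepting an explicit mapping makes the loader easy to test without touching the real process environment.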

Running The App

uvicorn main:app --host 0.0.0.0 --port 7860

Health check:

curl http://localhost:7860/health

Generated output files are served from:

http://localhost:7860/outputs/{filename}

Example Curl Requests

Verify a PDF:

curl -X POST http://localhost:7860/verify/document \
  -F "file=@sample.pdf" \
  -F "document_type=general" \
  -F "run_ocr=true" \
  -F "run_forensics=true" \
  -F "run_qr=true" \
  -F "run_live_qr_check=false" \
  -F "run_llm_analysis=true" \
  -F "max_pages=5"

Verify an image without OCR:

curl -X POST http://localhost:7860/verify/document \
  -F "file=@sample.png" \
  -F "run_ocr=false" \
  -F "run_forensics=true" \
  -F "run_qr=true"
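The same requests can be made from Python. This is a sketch that assumes the third-party requests package is installed; the helper names build_form and verify_document are hypothetical, and the form fields mirror the curl examples above.

```python
def build_form(
    document_type: str = "general",
    run_ocr: bool = True,
    run_forensics: bool = True,
    run_qr: bool = True,
    run_live_qr_check: bool = False,
    run_llm_analysis: bool = True,
    max_pages: int = 5,
) -> dict:
    """Encode options as the string form fields used in the curl examples."""
    def flag(value: bool) -> str:
        return "true" if value else "false"

    return {
        "document_type": document_type,
        "run_ocr": flag(run_ocr),
        "run_forensics": flag(run_forensics),
        "run_qr": flag(run_qr),
        "run_live_qr_check": flag(run_live_qr_check),
        "run_llm_analysis": flag(run_llm_analysis),
        "max_pages": str(max_pages),
    }


def verify_document(path: str, base_url: str = "http://localhost:7860", **options) -> dict:
    """POST a file to /verify/document and return the parsed JSON report."""
    import requests  # third-party dependency: pip install requests

    with open(path, "rb") as handle:
        response = requests.post(
            f"{base_url}/verify/document",
            files={"file": handle},
            data=build_form(**options),
            timeout=120,
        )
    return response.json()
```

For example, verify_document("sample.png", run_ocr=False) corresponds to the second curl command, with the remaining options left at the defaults assumed here.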

Example Response

{
  "verification_id": "6a6e7b6f-4df4-4d4f-9318-59b01f55f970",
  "service": "BitCheck",
  "file_type": "document",
  "status": "completed_with_warnings",
  "processing_time_ms": 3244,
  "input": {
    "document_type": "general",
    "run_ocr": true,
    "run_forensics": true,
    "run_qr": true,
    "run_live_qr_check": false,
    "run_llm_analysis": true,
    "max_pages": 5,
    "page_count": 1,
    "pages_processed": 1
  },
  "file_validation": {
    "valid": true,
    "original_filename": "sample.pdf",
    "stored_filename": "generated-name.pdf",
    "stored_path": "uploads/generated-name.pdf",
    "sha256": "hash",
    "mime_type": "application/pdf",
    "extension": ".pdf",
    "file_size_bytes": 12345,
    "warnings": []
  },
  "metadata": {
    "checked": true,
    "metadata_found": false,
    "metadata_risk_score": 0.0,
    "flags": [],
    "warnings": ["No metadata found. This is a low-risk signal, not proof of authenticity."]
  },
  "fields": {
    "checked": true,
    "document_type": "certificate",
    "extracted_fields": {},
    "missing_expected_fields": [],
    "field_confidence": 0.72,
    "field_risk_score": 0.21,
    "field_flags": [],
    "warnings": []
  },
  "content_risk": {
    "checked": true,
    "fraud_risk_score": 0.0,
    "ai_generated_text_likelihood": 0.0,
    "suspicious_claims": [],
    "signals": [],
    "summary": "No high-risk content wording was detected by heuristic checks.",
    "warnings": []
  },
  "deepseek_analysis": {
    "used": false,
    "model": "deepseek-chat",
    "document_type_inferred": null,
    "summary": "",
    "external_verification_required": true,
    "warnings": ["DeepSeek API key is not configured; LLM reasoning was skipped."]
  },
  "trust": {
    "trust_score": 59,
    "risk_score": 0.12,
    "risk_level": "Suspicious",
    "decision": "review",
    "available_modules": ["metadata_risk", "pdf_structure_risk"],
    "applied_overrides": ["Too little evidence was available for a confident automated decision."],
    "evidence_count": 2
  },
  "risk_flags": [],
  "recommended_actions": ["Verify the document directly with the issuing authority or official portal."],
  "limitations": ["BitCheck provides a risk-based estimate, not legal proof of forgery or authenticity."],
  "warnings": []
}

The example above is abridged. The actual response includes a detailed section for each module.

Hugging Face Spaces Deployment Guide

1. Create a new Hugging Face Space.
2. Select Docker as the Space SDK.
3. Upload this repository content to the Space.
4. In Space settings, add optional secrets:
   - DEEPSEEK_API_KEY
   - DEEPSEEK_BASE_URL
   - DEEPSEEK_MODEL
5. Keep the exposed port as 7860.
6. The Dockerfile installs Python dependencies, Tesseract, and runtime libraries.
7. Hugging Face will build and start the app with:

uvicorn main:app --host 0.0.0.0 --port 7860

Test the deployed Space:

curl https://YOUR-SPACE-URL/health

Submit a sample document to:

POST https://YOUR-SPACE-URL/verify/document

Testing Instructions

Run syntax compilation:

python -m compileall .

Run tests:

pytest -q

The test suite covers validators, PDF/image processors, metadata analysis, QR URL analysis, live verifier helper behavior, field extraction, content risk, DeepSeek JSON parsing, trust scoring, context-aware long-form document handling, and route-level document verification smoke tests.

Limitations

BitCheck provides a risk-based estimate, not legal proof of forgery or authenticity.
Missing metadata does not prove a document is fake.
Editing software metadata does not automatically prove manipulation.
OCR may be inaccurate on low-quality scans.
QR code detection does not mean the linked source is authentic unless externally verified.
QR URLs are analyzed structurally but are not opened or browsed.
Forensic visual analysis is not court-grade evidence.
DeepSeek analysis does not perform live web or issuer database verification.
High-stakes documents should be manually verified with the issuing authority.

Future Improvements

live issuer verification
school/company database verification
official certificate number lookup
QR destination live validation
digital signature validation
C2PA document provenance
template matching per institution
logo verification
stamp/signature detection model
document layout transformer
Supabase storage and audit history

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference