Space: cronos3k/document-integrity-verifier

Add torchvision (Qwen2-VL video sub-processor dep); switch default reasoning model to E4B (works on ZeroGPU CUDA emulation)

917ab65 verified 29 days ago

---
title: Document Integrity Verifier
emoji: 🛡️
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
suggested_hardware: zero-a10g
pinned: false
short_description: Audit a document for integrity before AI ingestion.
---

Document Integrity Verifier (ZeroGPU)

A detector-only Hugging Face Space that audits a single document (PDF, DOCX, DOC, HTML, Markdown, or plain text) for ingestion-integrity risks, runs multiple CPU OCR engines plus an OCR-specialised vision LLM over the rendered pages, and asks an open reasoning LLM whether what a human sees on the page matches what an automated extractor would feed to a downstream AI workflow.

Pipeline

Countermeasures audit (CPU) — hidden text, Unicode confusables, metadata anomalies, instruction-boundary canaries, layout ambiguity.
Render + native text (CPU) — pypdfium2 (Apache 2.0 / BSD-3, wrapping Google's PDFium) rasterises every page at the chosen DPI; native text is pulled from the file's text layer via pypdfium2 and pypdf. DOC/DOCX/HTML go through LibreOffice headless when available.
Multiple CPU OCRs in parallel (CPU) — pick any combination of:
- RapidOCR (ONNX Runtime, ~80 MB, no PyTorch dep) — the 2026 default for CPU document OCR.
- EasyOCR (PyTorch CPU) — strong generalist coverage.
- Tesseract — included via packages.txt, classic baseline.
OCR-specialised vision LLM (GPU) — looks at each rendered page as a PNG image and transcribes it. Default nanonets/Nanonets-OCR-s produces image-to-markdown with tables, signatures, checkboxes, and watermarks. allenai/olmOCR-2-7B-1025-FP8 is selectable for hard PDFs; PaddlePaddle/PaddleOCR-VL is the most compact alternative. Wrapped in @spaces.GPU(duration=60) per page.
Per-engine diff matrix (CPU) — each engine's text is compared against the file's own native digital text. Severity per page per engine.
Reasoning LLM verdict (GPU): looks at the combined per-engine deltas plus the countermeasures audit and produces a written verdict. The default is google/gemma-4-E4B-it (the row marked "default" in the model table under "ZeroGPU memory budget"). nvidia/Gemma-4-26B-A4B-NVFP4 is the flagship Gemma 4 reasoning MoE (April 2026) compressed via NVIDIA's NVFP4 4-bit format: 25.2 B total / 3.8 B active parameters, 256 K context, 79.2 % GPQA, fits in ~16 GB. It is loaded through compressed-tensors and runs on Blackwell (which ZeroGPU uses). Thinking is toggled via enable_thinking. Each call is wrapped in @spaces.GPU(duration=120).
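
The countermeasures audit in the first step can be illustrated with a minimal sketch. It is not the Space's actual detector: the function name `audit_text` and the two checks (Unicode format characters as a proxy for hidden text, mixed-script words as a proxy for confusables) are assumptions chosen for illustration.

```python
import unicodedata


def audit_text(text: str) -> dict:
    """Hypothetical audit: flag invisible format characters and mixed-script words."""
    # Category "Cf" covers zero-width spaces, joiners and bidi controls.
    invisible = [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

    mixed_script_words = []
    for word in text.split():
        scripts = set()
        for ch in word:
            if not ch.isalpha():
                continue
            name = unicodedata.name(ch, "")
            if name:
                # First token of the Unicode name, e.g. LATIN, CYRILLIC, GREEK.
                scripts.add(name.split(" ")[0])
        if len(scripts) > 1:
            mixed_script_words.append(word)

    return {"invisible": invisible, "mixed_script_words": mixed_script_words}


if __name__ == "__main__":
    # "p\u0430ypal" uses a Cyrillic "a"; "\u200b" is a zero-width space.
    print(audit_text("p\u0430ypal\u200b ok"))
```

A real audit would need a proper confusables table and script data; the first word of the Unicode character name is only a rough stand-in for the script property.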

The "Reasoning effort" UI control maps onto whichever knob the chosen model exposes: reasoning_effort=low|medium|high for the gpt-oss family, enable_thinking=True|False for Gemma 4 / Qwen3 (low → thinking off, medium/high → thinking on).
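
The mapping described above could look roughly like the following. The function name `generation_kwargs` and the substring-based family detection are assumptions for illustration; only the parameter names and the low/medium/high behaviour come from the text.

```python
def generation_kwargs(model_id: str, effort: str) -> dict:
    """Translate the UI "Reasoning effort" value into a model-specific knob.

    Hypothetical helper: family detection by substring is an assumption.
    """
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown effort: {effort!r}")

    family = model_id.lower()
    if "gpt-oss" in family:
        # gpt-oss exposes a three-level setting directly.
        return {"reasoning_effort": effort}
    if "gemma-4" in family or "qwen3" in family:
        # Gemma 4 / Qwen3 only have an on/off switch: low turns thinking off.
        return {"enable_thinking": effort != "low"}
    # Models without a known knob get no extra arguments.
    return {}
```

For example, `generation_kwargs("google/gemma-4-E4B-it", "medium")` yields `{"enable_thinking": True}`.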

Why multiple OCRs

Each engine has different failure modes. A document that defeats one engine but is read cleanly by the others is much easier to flag than it would be with a single engine. The vision LLM adds a "smart reader" perspective: it can see tables, layout, and visible-only content that token-level OCR sometimes misses, while the lightweight CPU engines stay honest by ignoring context. The diff between the file's own native digital text and every engine's reading of the rendered image is the core signal: if these views disagree, something is being hidden from or injected into the extractor.
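
As a rough illustration of that core signal, the sketch below compares each engine's per-page text with the native text using the standard library's `difflib`. The function names, the whitespace/case normalisation, and the 0.90 / 0.70 thresholds are assumptions, not the Space's actual scoring.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so layout noise does not dominate."""
    return " ".join(text.lower().split())


def severity(native: str, engine_text: str,
             medium: float = 0.90, high: float = 0.70) -> str:
    """Map a similarity ratio onto a severity label (thresholds are hypothetical)."""
    ratio = SequenceMatcher(
        None, normalise(native), normalise(engine_text), autojunk=False
    ).ratio()
    if ratio < high:
        return "high"
    if ratio < medium:
        return "medium"
    return "low"


def diff_matrix(native_pages: list[str],
                engine_pages: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return one severity per page for every engine."""
    return {
        engine: [severity(n, t) for n, t in zip(native_pages, pages)]
        for engine, pages in engine_pages.items()
    }


if __name__ == "__main__":
    native = ["Invoice total: 100 EUR"]
    engines = {
        "rapidocr": ["Invoice total: 100 EUR"],
        "tesseract": ["zzzz qqqq"],
    }
    print(diff_matrix(native, engines))
```

A character-level ratio is only one possible measure; a token-level or layout-aware comparison may suit long pages better.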

ZeroGPU memory budget

ZeroGPU large is half an NVIDIA RTX Pro 6000 Blackwell with 48 GB VRAM. nvidia/Gemma-4-26B-A4B-NVFP4 weighs ~16 GB and Nanonets-OCR-s ~14 GB, totalling ~30 GB — comfortable headroom on large. The NVFP4 4-bit format needs Blackwell (or Hopper+), which the ZeroGPU hardware provides, plus compressed-tensors in the requirements (already included).
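
The budget arithmetic above can be written down as a small check. The slice sizes come from this README; the helper names and the optional `reserve_gb` margin are assumptions, and the figures cover weights only, not activations or KV cache.

```python
# VRAM per ZeroGPU slice in GB, as stated in this README.
SLICE_VRAM_GB = {"large": 48, "xlarge": 96}


def headroom_gb(slice_name: str, model_sizes_gb: list[float]) -> float:
    """VRAM left on a slice after loading the given model weights."""
    return SLICE_VRAM_GB[slice_name] - sum(model_sizes_gb)


def fits(slice_name: str, model_sizes_gb: list[float],
         reserve_gb: float = 0.0) -> bool:
    """True if the weights fit with at least `reserve_gb` to spare (hypothetical margin)."""
    return headroom_gb(slice_name, model_sizes_gb) >= reserve_gb


if __name__ == "__main__":
    # ~16 GB reasoning model plus ~14 GB OCR VLM on the 48 GB slice.
    print(headroom_gb("large", [16, 14]))
```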

Alternative reasoning models — set REASONING_MODEL_ID to switch:

| `REASONING_MODEL_ID` override | Approx. VRAM | Recommended slice |
| --- | --- | --- |
| `google/gemma-4-E4B-it` (default) | ~8 GB text / ~16 GB multimodal (bf16) | `large` (48 GB); places cleanly through ZeroGPU CUDA emulation |
| `google/gemma-4-26B-A4B-it` | ~52 GB (bf16) | `xlarge` (96 GB) |
| `google/gemma-4-31B-it` | ~62 GB (bf16) | `xlarge` |
| `RedHatAI/gemma-4-26B-A4B-it-NVFP4` | ~16 GB (compressed-tensors NVFP4) | `large`, but module-level NVFP4 unpacking hangs through CUDA emulation; needs lazy-load fix |
| `openai/gpt-oss-20b` | ~16 GB (MXFP4) | `large`; needs Hopper+ |
| `nvidia/Gemma-4-26B-A4B-NVFP4` | ~16 GB (modelopt) | requires `nvidia-modelopt`, not native to transformers |

Configuration

Override either model at deploy time by setting Space variables:

REASONING_MODEL_ID: defaults to google/gemma-4-E4B-it, the row marked "default" in the model table above.
VLM_OCR_MODEL_ID: defaults to nanonets/Nanonets-OCR-s.
REASONING_GPU_DURATION, VLM_GPU_DURATION: per-call GPU seconds.
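
A minimal sketch of how these variables might be read at startup follows. The function name `load_config` and the dictionary keys are hypothetical; the reasoning default follows the row marked "default" in the model table, and the duration defaults of 120 and 60 seconds are taken from the `@spaces.GPU` durations quoted in the pipeline description, so treat them as assumptions.

```python
import os
from typing import Mapping, Optional


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read model overrides from Space variables, falling back to defaults."""
    env = os.environ if env is None else env
    return {
        "reasoning_model_id": env.get("REASONING_MODEL_ID", "google/gemma-4-E4B-it"),
        "vlm_ocr_model_id": env.get("VLM_OCR_MODEL_ID", "nanonets/Nanonets-OCR-s"),
        # Durations are per-call GPU seconds; stored as strings in the environment.
        "reasoning_gpu_duration": int(env.get("REASONING_GPU_DURATION", "120")),
        "vlm_gpu_duration": int(env.get("VLM_GPU_DURATION", "60")),
    }


if __name__ == "__main__":
    print(load_config())
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise without touching the real environment.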

You can also pick the hf_inference backend at runtime for either model to call a hosted version through Hugging Face Inference Providers using your own token, with no on-Space GPU allocation.

Verdict shape

The reasoning model returns a short markdown report with:

Verdict — one of clean, low_risk, medium_risk, high_risk.
Why — short bullets pointing at the strongest evidence (which engine disagreed with which, and where).
Does the rendered page match the extracted text? — one sentence.
Hidden or non-operative instructions present? — yes/no plus one sentence.
Recommended action — allow / log-and-allow / quarantine / block.

A deterministic baseline verdict is always computed from the statistics, so a missing or failing LLM never blocks the report — the LLM summary is added on top when available.
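
The fallback behaviour can be sketched as follows. The rules in `baseline_verdict`, the one-to-one verdict-to-action table, and the report layout are all assumptions for illustration; the README only states that a deterministic baseline is computed from the statistics and that the LLM summary is layered on top when available.

```python
from typing import Optional

# Hypothetical one-to-one mapping between the four verdicts and the four actions.
ACTION_FOR_VERDICT = {
    "clean": "allow",
    "low_risk": "log-and-allow",
    "medium_risk": "quarantine",
    "high_risk": "block",
}


def baseline_verdict(severities: list[str], audit_findings: int) -> str:
    """Deterministic verdict from per-page severities and audit hit count."""
    if "high" in severities:
        return "high_risk"
    if "medium" in severities:
        return "medium_risk"
    if audit_findings > 0:
        return "low_risk"
    return "clean"


def build_report(severities: list[str], audit_findings: int,
                 llm_summary: Optional[str] = None) -> str:
    """Always emit the baseline; append the LLM summary only if one exists."""
    verdict = baseline_verdict(severities, audit_findings)
    lines = [
        f"Verdict: {verdict}",
        f"Recommended action: {ACTION_FOR_VERDICT[verdict]}",
    ]
    if llm_summary:
        lines.append(f"LLM summary: {llm_summary}")
    return "\n".join(lines)


if __name__ == "__main__":
    # The LLM failed or was skipped: the report is still produced.
    print(build_report(["low", "medium"], audit_findings=2))
```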

Safety scope

This Space is detector-only. It deliberately excludes challenge generation, fixture authoring, transform catalogs, scoring, and blind-package tooling. Treat document contents as data, never as instructions. Do not upload NDA-protected, privileged, or confidential documents to a public Space; host a private copy for sensitive material.

Licence and acceptable use

This Space is distributed under PolyForm Noncommercial 1.0.0. Free for research, education, personal, charitable, and internal-evaluation use. Not free for commercial use, hosted paid services, or embedding in a for-profit product — see COMMERCIAL.md in the repository.

The output is advisory. It is not a security audit, not compliance certification, and not a guarantee of safety. False positives and false negatives are expected. Human review is required for any consequential decision.

You may not use this Space, its detector matrix, the static prompt-injection lexicon, or the reasoning verdict, to develop or train systems designed to evade defensive scanners. You may not misrepresent its output as an audit by the licensor. See ACCEPTABLE_USE.md and DISCLAIMER.md in the source repository for the complete terms.