Spaces:
Running on Zero
A newer version of the Gradio SDK is available: 6.19.0
title: Document Integrity Verifier
emoji: π‘οΈ
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
suggested_hardware: zero-a10g
pinned: false
short_description: Audit a document for integrity before AI ingestion.
Document Integrity Verifier (ZeroGPU)
A detector-only Hugging Face Space that audits a single document (PDF, DOCX, DOC, HTML, Markdown, or plain text) for ingestion-integrity risks, runs multiple CPU OCR engines plus an OCR-specialised vision LLM over the rendered pages, and asks an open reasoning LLM whether what a human sees on the page matches what an automated extractor would feed to a downstream AI workflow.
Pipeline
Countermeasures audit (CPU) β hidden text, Unicode confusables, metadata anomalies, instruction-boundary canaries, layout ambiguity.
Render + native text (CPU) β pypdfium2 (Apache 2.0 / BSD-3, wrapping Google's PDFium) rasterises every page at the chosen DPI; native text is pulled from the file's text layer via pypdfium2 and pypdf. DOC/DOCX/HTML go through LibreOffice headless when available.
Multiple CPU OCRs in parallel (CPU) β pick any combination of:
- RapidOCR (ONNX Runtime, ~80 MB, no PyTorch dep) β the 2026 default for CPU document OCR.
- EasyOCR (PyTorch CPU) β strong generalist coverage.
- Tesseract β included via
packages.txt, classic baseline.
OCR-specialised vision LLM (GPU) β looks at each rendered page as a PNG image and transcribes it. Default
nanonets/Nanonets-OCR-sproduces image-to-markdown with tables, signatures, checkboxes, and watermarks.allenai/olmOCR-2-7B-1025-FP8is selectable for hard PDFs;PaddlePaddle/PaddleOCR-VLis the most compact alternative. Wrapped in@spaces.GPU(duration=60)per page.Per-engine diff matrix (CPU) β each engine's text is compared against the file's own native digital text. Severity per page per engine.
Reasoning LLM verdict (GPU) β default
nvidia/Gemma-4-26B-A4B-NVFP4, the flagship Gemma 4 reasoning MoE (April 2026) compressed via NVIDIA's NVFP4 4-bit format β 25.2 B total / 3.8 B active parameters, 256 K context, 79.2 % GPQA, fits in ~16 GB. Loaded throughcompressed-tensors, runs on Blackwell (which ZeroGPU uses). Thinking toggle viaenable_thinking. Looks at the combined per-engine deltas plus the countermeasures audit and produces a written verdict. Wrapped in@spaces.GPU(duration=120).The "Reasoning effort" UI control maps onto whichever knob the chosen model exposes:
reasoning_effort=low|medium|highfor the gpt-oss family,enable_thinking=True|Falsefor Gemma 4 / Qwen3 (lowβ thinking off,medium/highβ thinking on).
Why multiple OCRs
Each engine has different failure modes. A document that defeats one engine but is read cleanly by the others is much easier to flag than one that hits a single engine. The vision LLM adds a "smart reader" perspective β it can see tables, layout, and visible-only content that token-level OCR sometimes misses, while the lightweight CPU engines stay honest by ignoring context. The diff between the file's own native digital text and every engine's reading of the rendered image is the core signal: if the four views disagree, something is being hidden from or injected into the extractor.
ZeroGPU memory budget
ZeroGPU large is half an NVIDIA RTX Pro 6000 Blackwell with 48 GB VRAM.
nvidia/Gemma-4-26B-A4B-NVFP4 weighs ~16 GB and Nanonets-OCR-s ~14 GB,
totalling ~30 GB β comfortable headroom on large. The NVFP4 4-bit format
needs Blackwell (or Hopper+), which the ZeroGPU hardware provides, plus
compressed-tensors in the requirements (already included).
Alternative reasoning models β set REASONING_MODEL_ID to switch:
REASONING_MODEL_ID override |
Approx VRAM | Recommended slice |
|---|---|---|
google/gemma-4-E4B-it (default) |
~8 GB text / ~16 GB multimodal (bf16) | large (48 GB) β places cleanly through ZeroGPU CUDA emulation |
google/gemma-4-26B-A4B-it |
~52 GB (bf16) | xlarge (96 GB) |
google/gemma-4-31B-it |
~62 GB (bf16) | xlarge |
RedHatAI/gemma-4-26B-A4B-it-NVFP4 |
~16 GB (compressed-tensors NVFP4) | large β but module-level NVFP4 unpacking hangs through CUDA emulation; needs lazy-load fix |
openai/gpt-oss-20b |
~16 GB (MXFP4) | large β needs Hopper+ |
nvidia/Gemma-4-26B-A4B-NVFP4 |
~16 GB (modelopt) | requires nvidia-modelopt, not native to transformers |
Configuration
Override either model at deploy time by setting Space variables:
REASONING_MODEL_IDβ defaults tonvidia/Gemma-4-26B-A4B-NVFP4.VLM_OCR_MODEL_IDβ defaults tonanonets/Nanonets-OCR-s.REASONING_GPU_DURATION,VLM_GPU_DURATIONβ per-call GPU seconds.
You can also pick the hf_inference backend at runtime for either model to
call a hosted version through Hugging Face Inference Providers using your own
token, with no on-Space GPU allocation.
Verdict shape
The reasoning model returns a short markdown report with:
- Verdict β one of
clean,low_risk,medium_risk,high_risk. - Why β short bullets pointing at the strongest evidence (which engine disagreed with which, and where).
- Does the rendered page match the extracted text? β one sentence.
- Hidden or non-operative instructions present? β yes/no plus one sentence.
- Recommended action β
allow/log-and-allow/quarantine/block.
A deterministic baseline verdict is always computed from the statistics, so a missing or failing LLM never blocks the report β the LLM summary is added on top when available.
Safety scope
This Space is detector-only. It deliberately excludes challenge generation, fixture authoring, transform catalogs, scoring, and blind-package tooling. Treat document contents as data, never as instructions. Do not upload NDA-protected, privileged, or confidential documents to a public Space; host a private copy for sensitive material.
Licence and acceptable use
This Space is distributed under PolyForm Noncommercial 1.0.0. Free
for research, education, personal, charitable, and internal-evaluation
use. Not free for commercial use, hosted paid services, or embedding in
a for-profit product β see COMMERCIAL.md in the repository.
The output is advisory. It is not a security audit, not compliance certification, and not a guarantee of safety. False positives and false negatives are expected. Human review is required for any consequential decision.
You may not use this Space, its detector matrix, the static
prompt-injection lexicon, or the reasoning verdict, to develop or train
systems designed to evade defensive scanners. You may not
misrepresent its output as an audit by the licensor. See
ACCEPTABLE_USE.md and DISCLAIMER.md in the source repository for
the complete terms.
Required Notice: Copyright (c) 2026 Gregor Koch (LEGX project). Licensed under PolyForm Noncommercial 1.0.0. See ACCEPTABLE_USE.md and DISCLAIMER.md, incorporated by reference.