---
title: Document Integrity Verifier
emoji: 🛡️
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
suggested_hardware: zero-a10g
pinned: false
short_description: Audit a document for integrity before AI ingestion.
---

# Document Integrity Verifier (ZeroGPU)

A detector-only Hugging Face Space that audits a single document (PDF, DOCX, DOC, HTML, Markdown, or plain text) for ingestion-integrity risks, runs **multiple CPU OCR engines plus an OCR-specialised vision LLM** over the rendered pages, and asks an open reasoning LLM whether what a human sees on the page matches what an automated extractor would feed to a downstream AI workflow.

## Pipeline

1. **Countermeasures audit (CPU)**: hidden text, Unicode confusables, metadata anomalies, instruction-boundary canaries, layout ambiguity.
2. **Render + native text (CPU)**: [pypdfium2](https://github.com/pypdfium2-team/pypdfium2) (Apache 2.0 / BSD-3, wrapping Google's PDFium) rasterises every page at the chosen DPI; native text is pulled from the file's text layer via pypdfium2 and pypdf. DOC/DOCX/HTML go through LibreOffice headless when available.
3. **Multiple CPU OCRs in parallel (CPU)**: pick any combination of:
   * [RapidOCR](https://github.com/RapidAI/RapidOCR) (ONNX Runtime, ~80 MB, no PyTorch dep): the 2026 default for CPU document OCR.
   * EasyOCR (PyTorch CPU): strong generalist coverage.
   * Tesseract: included via `packages.txt`, classic baseline.
4. **OCR-specialised vision LLM (GPU)**: looks at each rendered page as a PNG image and transcribes it. The default, [`nanonets/Nanonets-OCR-s`](https://huggingface.co/nanonets/Nanonets-OCR-s), produces image-to-markdown with tables, signatures, checkboxes, and watermarks. `allenai/olmOCR-2-7B-1025-FP8` is selectable for hard PDFs; `PaddlePaddle/PaddleOCR-VL` is the most compact alternative. Wrapped in `@spaces.GPU(duration=60)` per page.
5. **Per-engine diff matrix (CPU)**: each engine's text is compared against the file's own native digital text, giving a severity per page per engine.
6. **Reasoning LLM verdict (GPU)**: the default is [`nvidia/Gemma-4-26B-A4B-NVFP4`](https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4), the flagship Gemma 4 reasoning MoE (April 2026) compressed via NVIDIA's NVFP4 4-bit format: 25.2 B total / 3.8 B active parameters, 256 K context, 79.2 % GPQA, fits in ~16 GB. Loaded through `compressed-tensors`, runs on Blackwell (which ZeroGPU uses). Thinking toggle via `enable_thinking`. It looks at the combined per-engine deltas plus the countermeasures audit and produces a written verdict. Wrapped in `@spaces.GPU(duration=120)`.

The "Reasoning effort" UI control maps onto whichever knob the chosen model exposes: `reasoning_effort=low|medium|high` for the gpt-oss family, `enable_thinking=True|False` for Gemma 4 / Qwen3 (`low` → thinking off, `medium`/`high` → thinking on).

## Why multiple OCRs

Each engine has different failure modes. A document that defeats one engine but is read cleanly by the others is much easier to flag than one checked by a single engine. The vision LLM adds a "smart reader" perspective: it can see tables, layout, and visible-only content that token-level OCR sometimes misses, while the lightweight CPU engines stay honest by ignoring context.

The diff between the file's own native digital text and every engine's reading of the rendered image is the core signal: if the views disagree, something is being hidden from or injected into the extractor.

## ZeroGPU memory budget

ZeroGPU `large` is half an NVIDIA RTX Pro 6000 Blackwell with 48 GB VRAM. `nvidia/Gemma-4-26B-A4B-NVFP4` weighs ~16 GB and `Nanonets-OCR-s` ~14 GB, totalling ~30 GB, which leaves comfortable headroom on `large`. The NVFP4 4-bit format needs Blackwell (or Hopper+), which the ZeroGPU hardware provides, plus `compressed-tensors` in the requirements (already included).

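
As a rough check on the memory arithmetic above, here is a minimal sketch. The dictionaries and the `vram_headroom` helper are hypothetical names used for illustration and are not part of the Space's code; the figures are the approximate ones quoted in this section.

```python
# Approximate footprints in GB, copied from the figures quoted in this section.
# These are estimates, not measured values.
MODEL_VRAM_GB = {
    "nvidia/Gemma-4-26B-A4B-NVFP4": 16,
    "nanonets/Nanonets-OCR-s": 14,
}

# Capacity of the ZeroGPU slice discussed above, in GB.
SLICE_VRAM_GB = {"large": 48}


def vram_headroom(model_ids, slice_name="large"):
    """Return (total_gb, headroom_gb) for holding all models on one slice."""
    total = sum(MODEL_VRAM_GB[model_id] for model_id in model_ids)
    return total, SLICE_VRAM_GB[slice_name] - total


total, headroom = vram_headroom(list(MODEL_VRAM_GB))
print(f"~{total} GB for both models, ~{headroom} GB left on 'large'")
```

A negative headroom would mean the chosen pair does not fit on the slice and a smaller model or a larger slice is needed. Activations and KV cache also consume memory, so the real margin is smaller than this static sum suggests.
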

Alternative reasoning models (set `REASONING_MODEL_ID` to switch):

| `REASONING_MODEL_ID` override | Approx VRAM | Recommended slice |
|---|---|---|
| `google/gemma-4-E4B-it` *(default)* | ~8 GB text / ~16 GB multimodal (bf16) | `large` (48 GB); places cleanly through ZeroGPU CUDA emulation |
| `google/gemma-4-26B-A4B-it` | ~52 GB (bf16) | `xlarge` (96 GB) |
| `google/gemma-4-31B-it` | ~62 GB (bf16) | `xlarge` |
| `RedHatAI/gemma-4-26B-A4B-it-NVFP4` | ~16 GB (compressed-tensors NVFP4) | `large`, but module-level NVFP4 unpacking hangs through CUDA emulation; needs lazy-load fix |
| `openai/gpt-oss-20b` | ~16 GB (MXFP4) | `large`; needs Hopper+ |
| `nvidia/Gemma-4-26B-A4B-NVFP4` | ~16 GB (modelopt) | requires `nvidia-modelopt`, not native to transformers |

## Configuration

Override either model at deploy time by setting Space variables:

* `REASONING_MODEL_ID`: defaults to `nvidia/Gemma-4-26B-A4B-NVFP4`.
* `VLM_OCR_MODEL_ID`: defaults to `nanonets/Nanonets-OCR-s`.
* `REASONING_GPU_DURATION`, `VLM_GPU_DURATION`: per-call GPU seconds.

You can also pick the `hf_inference` backend at runtime for either model to call a hosted version through Hugging Face Inference Providers using your own token, with no on-Space GPU allocation.

## Verdict shape

The reasoning model returns a short markdown report with:

1. Verdict: one of `clean`, `low_risk`, `medium_risk`, `high_risk`.
2. Why: short bullets pointing at the strongest evidence (which engine disagreed with which, and where).
3. Does the rendered page match the extracted text? One sentence.
4. Hidden or non-operative instructions present? Yes/no plus one sentence.
5. Recommended action: `allow` / `log-and-allow` / `quarantine` / `block`.

A deterministic baseline verdict is always computed from the statistics, so a missing or failing LLM never blocks the report; the LLM summary is added on top when available.

## Safety scope

This Space is detector-only.
It deliberately excludes challenge generation, fixture authoring, transform catalogs, scoring, and blind-package tooling. Treat document contents as data, never as instructions.

Do not upload NDA-protected, privileged, or confidential documents to a public Space; host a private copy for sensitive material.

## Licence and acceptable use

This Space is distributed under **PolyForm Noncommercial 1.0.0**. Free for research, education, personal, charitable, and internal-evaluation use. Not free for commercial use, hosted paid services, or embedding in a for-profit product; see `COMMERCIAL.md` in the repository.

The output is **advisory**. It is not a security audit, not compliance certification, and not a guarantee of safety. False positives and false negatives are expected. Human review is required for any consequential decision.

You may **not** use this Space, its detector matrix, the static prompt-injection lexicon, or the reasoning verdict to develop or train systems designed to evade defensive scanners. You may **not** misrepresent its output as an audit by the licensor.

See `ACCEPTABLE_USE.md` and `DISCLAIMER.md` in the source repository for the complete terms.

Required Notice: Copyright (c) 2026 Gregor Koch (LEGX project). Licensed under PolyForm Noncommercial 1.0.0. See ACCEPTABLE_USE.md and DISCLAIMER.md, incorporated by reference.