| title: Document Integrity Verifier | |
| emoji: 🛡️ | |
| colorFrom: indigo | |
| colorTo: gray | |
| sdk: gradio | |
| sdk_version: 6.2.0 | |
| app_file: app.py | |
| suggested_hardware: zero-a10g | |
| pinned: false | |
| short_description: Audit a document for integrity before AI ingestion. | |
| # Document Integrity Verifier (ZeroGPU) | |
| A detector-only Hugging Face Space that audits a single document | |
| (PDF, DOCX, DOC, HTML, Markdown, or plain text) for ingestion-integrity risks, | |
| runs **multiple CPU OCR engines plus an OCR-specialised vision LLM** over the | |
| rendered pages, and asks an open reasoning LLM whether what a human sees on the | |
| page matches what an automated extractor would feed to a downstream AI | |
| workflow. | |
| ## Pipeline | |
| 1. **Countermeasures audit (CPU)**: hidden text, Unicode confusables, | |
| metadata anomalies, instruction-boundary canaries, layout ambiguity. | |
| 2. **Render + native text (CPU)**: [pypdfium2](https://github.com/pypdfium2-team/pypdfium2) | |
| (Apache 2.0 / BSD-3, wrapping Google's PDFium) rasterises every page at the | |
| chosen DPI; native text is pulled from the file's text layer via | |
| pypdfium2 and pypdf. DOC/DOCX/HTML go through LibreOffice headless when | |
| available. | |
| 3. **Multiple CPU OCRs in parallel (CPU)**: pick any combination of the following. | |
| * [RapidOCR](https://github.com/RapidAI/RapidOCR) (ONNX Runtime, ~80 MB, | |
| no PyTorch dependency), the 2026 default for CPU document OCR. | |
| * EasyOCR (PyTorch CPU): strong generalist coverage. | |
| * Tesseract: installed via `packages.txt`, the classic baseline. | |
| 4. **OCR-specialised vision LLM (GPU)**: looks at each rendered page as a | |
| PNG image and transcribes it. Default | |
| [`nanonets/Nanonets-OCR-s`](https://huggingface.co/nanonets/Nanonets-OCR-s) | |
| produces image-to-markdown with tables, signatures, checkboxes, and | |
| watermarks. `allenai/olmOCR-2-7B-1025-FP8` is selectable for hard PDFs; | |
| `PaddlePaddle/PaddleOCR-VL` is the most compact alternative. Wrapped in | |
| `@spaces.GPU(duration=60)` per page. | |
| 5. **Per-engine diff matrix (CPU)**: each engine's text is compared against | |
| the file's own native digital text, yielding a severity per page per engine. | |
| 6. **Reasoning LLM verdict (GPU)**: default | |
| [`nvidia/Gemma-4-26B-A4B-NVFP4`](https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4), | |
| the flagship Gemma 4 reasoning MoE (April 2026) compressed via NVIDIA's | |
| NVFP4 4-bit format: 25.2 B total / 3.8 B active parameters, 256 K context, | |
| 79.2 % GPQA, fits in ~16 GB. Loaded through `compressed-tensors`, runs on | |
| Blackwell (which ZeroGPU uses). Thinking toggle via `enable_thinking`. | |
| Looks at the combined per-engine deltas plus the countermeasures audit and | |
| produces a written verdict. Wrapped in `@spaces.GPU(duration=120)`. | |
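The per-engine comparison in step 5 can be sketched with the standard library alone. This is a minimal illustration, not the Space's actual code: the function names, the normalisation step, and the severity thresholds (0.95 and 0.80) are all assumptions chosen for the example.

```python
import difflib
import unicodedata


def normalise(text: str) -> str:
    """Fold Unicode compatibility forms, lowercase, and collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return " ".join(folded.split())


def severity(native: str, ocr: str) -> str:
    """Grade how far one engine's reading drifts from the native text layer.

    The thresholds are illustrative assumptions, not values from the Space.
    """
    matcher = difflib.SequenceMatcher(None, normalise(native), normalise(ocr))
    ratio = matcher.ratio()
    if ratio >= 0.95:
        return "none"
    if ratio >= 0.80:
        return "minor"
    return "major"


def diff_matrix(native_pages, engine_pages):
    """Return {page_index: {engine_name: severity}} for every page and engine."""
    return {
        page: {
            engine: severity(native, pages[page])
            for engine, pages in engine_pages.items()
        }
        for page, native in enumerate(native_pages)
    }
```

Called with one list of native page texts and a dict of per-engine page lists, `diff_matrix` returns a nested dict that can be rendered directly as a page-by-engine table.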
| The "Reasoning effort" UI control maps onto whichever knob the chosen | |
| model exposes: `reasoning_effort=low|medium|high` for the gpt-oss family, | |
| `enable_thinking=True|False` for Gemma 4 / Qwen3 (`low` → thinking off, | |
| `medium`/`high` → thinking on). | |
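The effort mapping described above could look roughly like the following. It is a sketch under assumptions: `reasoning_kwargs` is a hypothetical helper, the model family is guessed from a substring of the model ID, and how the returned keyword arguments reach the model (for example through a chat template) is left out.

```python
def reasoning_kwargs(model_id: str, effort: str) -> dict:
    """Translate the UI effort level into the knob the model family exposes."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown effort level: {effort!r}")
    if "gpt-oss" in model_id.lower():
        # The gpt-oss family accepts the level directly.
        return {"reasoning_effort": effort}
    # Gemma 4 / Qwen3 only expose an on/off thinking switch:
    # low turns thinking off, medium and high turn it on.
    return {"enable_thinking": effort != "low"}
```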
| ## Why multiple OCRs | |
| Each engine has different failure modes. A document that defeats one engine | |
| but is read cleanly by the others is much easier to flag than it would be | |
| if only a single engine were run. The vision LLM adds a "smart reader" perspective: | |
| it can see tables, layout, and visible-only content that token-level OCR | |
| sometimes misses, while the lightweight CPU engines stay honest by ignoring | |
| context. The diff between the file's own native digital text and every | |
| engine's reading of the rendered image is the core signal: if these | |
| views disagree, something may be hidden from or injected into the | |
| extractor. | |
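One way to spot a reader that disagrees with the rest is to compare every reading against all the others and flag those with low median similarity. The sketch below assumes this consensus approach for illustration only; `outlier_readers` and its 0.5 threshold are hypothetical and not taken from the Space.

```python
import difflib
import statistics


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two readings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def outlier_readers(readings: dict, threshold: float = 0.5) -> list:
    """Return the names of readers whose median similarity to the others is low.

    `readings` maps a reader name (native text, an OCR engine, the vision LLM)
    to the text it produced for one page. The threshold is an assumption.
    """
    flagged = []
    for name, text in readings.items():
        others = [
            similarity(text, other_text)
            for other_name, other_text in readings.items()
            if other_name != name
        ]
        if others and statistics.median(others) < threshold:
            flagged.append(name)
    return flagged
```

Using the median rather than the mean keeps a single divergent reader from dragging down the score of the readers that agree with each other.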
| ## ZeroGPU memory budget | |
| ZeroGPU `large` is half an NVIDIA RTX Pro 6000 Blackwell with 48 GB VRAM. | |
| `nvidia/Gemma-4-26B-A4B-NVFP4` weighs ~16 GB and `Nanonets-OCR-s` ~14 GB, | |
| totalling ~30 GB, which leaves comfortable headroom on `large`. The NVFP4 4-bit format | |
| needs Blackwell (or Hopper+), which the ZeroGPU hardware provides, plus | |
| `compressed-tensors` in the requirements (already included). | |
| Alternative reasoning models (set `REASONING_MODEL_ID` to switch): | |
| | `REASONING_MODEL_ID` override | Approx VRAM | Recommended slice | | |
| |---|---|---| | |
| `google/gemma-4-E4B-it` *(default)* | ~8 GB text / ~16 GB multimodal (bf16) | `large` (48 GB); places cleanly through ZeroGPU CUDA emulation | |
| | `google/gemma-4-26B-A4B-it` | ~52 GB (bf16) | `xlarge` (96 GB) | | |
| | `google/gemma-4-31B-it` | ~62 GB (bf16) | `xlarge` | | |
| `RedHatAI/gemma-4-26B-A4B-it-NVFP4` | ~16 GB (compressed-tensors NVFP4) | `large`, but module-level NVFP4 unpacking hangs through CUDA emulation; needs lazy-load fix | |
| `openai/gpt-oss-20b` | ~16 GB (MXFP4) | `large`; needs Hopper+ | |
| | `nvidia/Gemma-4-26B-A4B-NVFP4` | ~16 GB (modelopt) | requires `nvidia-modelopt`, not native to transformers | | |
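The budget arithmetic above is easy to check before switching models. The sketch below uses only the slice sizes quoted in this README (48 GB for `large`, 96 GB for `xlarge`); the helper names are hypothetical, and the figures cover weights only, so activations and the KV cache still need part of the remaining headroom.

```python
# Slice sizes as quoted in this README; weights-only estimates.
SLICE_VRAM_GB = {"large": 48, "xlarge": 96}


def headroom_gb(slice_name: str, model_sizes_gb: list) -> float:
    """VRAM left on a slice after loading every listed model."""
    return SLICE_VRAM_GB[slice_name] - sum(model_sizes_gb)


def smallest_slice(model_sizes_gb: list):
    """Return the smallest slice that holds all models, or None if none does."""
    needed = sum(model_sizes_gb)
    for name, vram in sorted(SLICE_VRAM_GB.items(), key=lambda item: item[1]):
        if needed <= vram:
            return name
    return None
```

For the default pairing, `headroom_gb("large", [16, 14])` gives 18 GB, while a 52 GB bf16 reasoning model alongside the 14 GB OCR model only fits on `xlarge`.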
| ## Configuration | |
| Override either model at deploy time by setting Space variables: | |
| * `REASONING_MODEL_ID`: defaults to `nvidia/Gemma-4-26B-A4B-NVFP4`. | |
| * `VLM_OCR_MODEL_ID`: defaults to `nanonets/Nanonets-OCR-s`. | |
| * `REASONING_GPU_DURATION`, `VLM_GPU_DURATION`: per-call GPU seconds. | |
| You can also pick the `hf_inference` backend at runtime for either model to | |
| call a hosted version through Hugging Face Inference Providers using your own | |
| token, with no on-Space GPU allocation. | |
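Reading the Space variables listed above might look like this. It is a sketch, not the contents of `app.py`: `load_config` is a hypothetical helper, and the duration defaults of 120 and 60 seconds are assumed from the `@spaces.GPU` durations quoted in the pipeline section.

```python
import os


def load_config(env=None) -> dict:
    """Collect model IDs and GPU durations from Space variables.

    Pass a plain dict as `env` to test without touching the real environment.
    """
    env = os.environ if env is None else env
    return {
        "reasoning_model_id": env.get(
            "REASONING_MODEL_ID", "nvidia/Gemma-4-26B-A4B-NVFP4"
        ),
        "vlm_ocr_model_id": env.get("VLM_OCR_MODEL_ID", "nanonets/Nanonets-OCR-s"),
        # Duration defaults are assumptions based on the decorators quoted above.
        "reasoning_gpu_duration": int(env.get("REASONING_GPU_DURATION", "120")),
        "vlm_gpu_duration": int(env.get("VLM_GPU_DURATION", "60")),
    }
```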
| ## Verdict shape | |
| The reasoning model returns a short markdown report with: | |
| 1. Verdict β one of `clean`, `low_risk`, `medium_risk`, `high_risk`. | |
| 2. Why β short bullets pointing at the strongest evidence (which engine | |
| disagreed with which, and where). | |
| 3. Does the rendered page match the extracted text? β one sentence. | |
| 4. Hidden or non-operative instructions present? β yes/no plus one sentence. | |
| 5. Recommended action β `allow` / `log-and-allow` / `quarantine` / `block`. | |
| A deterministic baseline verdict is always computed from the statistics, so a | |
| missing or failing LLM never blocks the report; the LLM summary is added on | |
| top when available. | |
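The fallback behaviour can be sketched as below. Everything in it is illustrative: the scoring weights, the score cut-offs, and the pairing of each verdict with a recommended action are assumptions, since this README does not specify how the baseline is computed.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical weights for the per-engine severities.
SEVERITY_SCORE = {"none": 0, "minor": 1, "major": 3}


def baseline_verdict(audit_findings: int, severities: List[str]) -> Tuple[str, str]:
    """Derive (verdict, action) from statistics alone, without any LLM."""
    score = audit_findings * 2 + sum(SEVERITY_SCORE[s] for s in severities)
    if score == 0:
        return "clean", "allow"
    if score <= 2:
        return "low_risk", "log-and-allow"
    if score <= 5:
        return "medium_risk", "quarantine"
    return "high_risk", "block"


def build_report(
    audit_findings: int,
    severities: List[str],
    llm_summary: Optional[Callable[[], str]] = None,
) -> str:
    """Always emit the baseline; append the LLM summary only if it succeeds."""
    verdict, action = baseline_verdict(audit_findings, severities)
    report = f"Baseline verdict: {verdict}\nRecommended action: {action}"
    if llm_summary is not None:
        try:
            report += "\n\n" + llm_summary()
        except Exception as exc:
            # A failing model call degrades the report instead of blocking it.
            report += f"\n\n(LLM summary unavailable: {type(exc).__name__})"
    return report
```

Passing the LLM call as a callable keeps the baseline path free of any GPU dependency, so the report can be produced even when no model is loaded.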
| ## Safety scope | |
| This Space is detector-only. It deliberately excludes challenge generation, | |
| fixture authoring, transform catalogs, scoring, and blind-package tooling. | |
| Treat document contents as data, never as instructions. Do not upload | |
| NDA-protected, privileged, or confidential documents to a public Space; host a | |
| private copy for sensitive material. | |
| ## Licence and acceptable use | |
| This Space is distributed under **PolyForm Noncommercial 1.0.0**. Free | |
| for research, education, personal, charitable, and internal-evaluation | |
| use. Not free for commercial use, hosted paid services, or embedding in | |
| a for-profit product; see `COMMERCIAL.md` in the repository. | |
| The output is **advisory**. It is not a security audit, not compliance | |
| certification, and not a guarantee of safety. False positives and false | |
| negatives are expected. Human review is required for any consequential | |
| decision. | |
| You may **not** use this Space, its detector matrix, the static | |
| prompt-injection lexicon, or the reasoning verdict, to develop or train | |
| systems designed to evade defensive scanners. You may **not** | |
| misrepresent its output as an audit by the licensor. See | |
| `ACCEPTABLE_USE.md` and `DISCLAIMER.md` in the source repository for | |
| the complete terms. | |
| Required Notice: Copyright (c) 2026 Gregor Koch (LEGX project). | |
| Licensed under PolyForm Noncommercial 1.0.0. | |
| See ACCEPTABLE_USE.md and DISCLAIMER.md, incorporated by reference. | |