cronos3k's picture
Add torchvision (Qwen2-VL video sub-processor dep); switch default reasoning model to E4B (works on ZeroGPU CUDA emulation)
917ab65 verified
|
Raw
History Blame Contribute Delete
7.31 kB
---
title: Document Integrity Verifier
emoji: πŸ›‘οΈ
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
suggested_hardware: zero-a10g
pinned: false
short_description: Audit a document for integrity before AI ingestion.
---
# Document Integrity Verifier (ZeroGPU)
A detector-only Hugging Face Space that audits a single document
(PDF, DOCX, DOC, HTML, Markdown, or plain text) for ingestion-integrity risks,
runs **multiple CPU OCR engines plus an OCR-specialised vision LLM** over the
rendered pages, and asks an open reasoning LLM whether what a human sees on the
page matches what an automated extractor would feed to a downstream AI
workflow.
## Pipeline
1. **Countermeasures audit (CPU)** β€” hidden text, Unicode confusables,
metadata anomalies, instruction-boundary canaries, layout ambiguity.
2. **Render + native text (CPU)** β€” [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
(Apache 2.0 / BSD-3, wrapping Google's PDFium) rasterises every page at the
chosen DPI; native text is pulled from the file's text layer via
pypdfium2 and pypdf. DOC/DOCX/HTML go through LibreOffice headless when
available.
3. **Multiple CPU OCRs in parallel (CPU)** β€” pick any combination of:
* [RapidOCR](https://github.com/RapidAI/RapidOCR) (ONNX Runtime, ~80 MB,
no PyTorch dep) β€” the 2026 default for CPU document OCR.
* EasyOCR (PyTorch CPU) β€” strong generalist coverage.
* Tesseract β€” included via `packages.txt`, classic baseline.
4. **OCR-specialised vision LLM (GPU)** β€” looks at each rendered page as a
PNG image and transcribes it. Default
[`nanonets/Nanonets-OCR-s`](https://huggingface.co/nanonets/Nanonets-OCR-s)
produces image-to-markdown with tables, signatures, checkboxes, and
watermarks. `allenai/olmOCR-2-7B-1025-FP8` is selectable for hard PDFs;
`PaddlePaddle/PaddleOCR-VL` is the most compact alternative. Wrapped in
`@spaces.GPU(duration=60)` per page.
5. **Per-engine diff matrix (CPU)** β€” each engine's text is compared against
the file's own native digital text. Severity per page per engine.
6. **Reasoning LLM verdict (GPU)** β€” default
[`nvidia/Gemma-4-26B-A4B-NVFP4`](https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4),
the flagship Gemma 4 reasoning MoE (April 2026) compressed via NVIDIA's
NVFP4 4-bit format β€” 25.2 B total / 3.8 B active parameters, 256 K context,
79.2 % GPQA, fits in ~16 GB. Loaded through `compressed-tensors`, runs on
Blackwell (which ZeroGPU uses). Thinking toggle via `enable_thinking`.
Looks at the combined per-engine deltas plus the countermeasures audit and
produces a written verdict. Wrapped in `@spaces.GPU(duration=120)`.
The "Reasoning effort" UI control maps onto whichever knob the chosen
model exposes: `reasoning_effort=low|medium|high` for the gpt-oss family,
`enable_thinking=True|False` for Gemma 4 / Qwen3 (`low` β†’ thinking off,
`medium`/`high` β†’ thinking on).
## Why multiple OCRs
Each engine has different failure modes. A document that defeats one engine
but is read cleanly by the others is much easier to flag than one that
hits a single engine. The vision LLM adds a "smart reader" perspective β€”
it can see tables, layout, and visible-only content that token-level OCR
sometimes misses, while the lightweight CPU engines stay honest by ignoring
context. The diff between the file's own native digital text and every
engine's reading of the rendered image is the core signal: if the four
views disagree, something is being hidden from or injected into the
extractor.
## ZeroGPU memory budget
ZeroGPU `large` is half an NVIDIA RTX Pro 6000 Blackwell with 48 GB VRAM.
`nvidia/Gemma-4-26B-A4B-NVFP4` weighs ~16 GB and `Nanonets-OCR-s` ~14 GB,
totalling ~30 GB β€” comfortable headroom on `large`. The NVFP4 4-bit format
needs Blackwell (or Hopper+), which the ZeroGPU hardware provides, plus
`compressed-tensors` in the requirements (already included).
Alternative reasoning models β€” set `REASONING_MODEL_ID` to switch:
| `REASONING_MODEL_ID` override | Approx VRAM | Recommended slice |
|---|---|---|
| `google/gemma-4-E4B-it` *(default)* | ~8 GB text / ~16 GB multimodal (bf16) | `large` (48 GB) β€” places cleanly through ZeroGPU CUDA emulation |
| `google/gemma-4-26B-A4B-it` | ~52 GB (bf16) | `xlarge` (96 GB) |
| `google/gemma-4-31B-it` | ~62 GB (bf16) | `xlarge` |
| `RedHatAI/gemma-4-26B-A4B-it-NVFP4` | ~16 GB (compressed-tensors NVFP4) | `large` β€” but module-level NVFP4 unpacking hangs through CUDA emulation; needs lazy-load fix |
| `openai/gpt-oss-20b` | ~16 GB (MXFP4) | `large` β€” needs Hopper+ |
| `nvidia/Gemma-4-26B-A4B-NVFP4` | ~16 GB (modelopt) | requires `nvidia-modelopt`, not native to transformers |
## Configuration
Override either model at deploy time by setting Space variables:
* `REASONING_MODEL_ID` β€” defaults to `nvidia/Gemma-4-26B-A4B-NVFP4`.
* `VLM_OCR_MODEL_ID` β€” defaults to `nanonets/Nanonets-OCR-s`.
* `REASONING_GPU_DURATION`, `VLM_GPU_DURATION` β€” per-call GPU seconds.
You can also pick the `hf_inference` backend at runtime for either model to
call a hosted version through Hugging Face Inference Providers using your own
token, with no on-Space GPU allocation.
## Verdict shape
The reasoning model returns a short markdown report with:
1. Verdict β€” one of `clean`, `low_risk`, `medium_risk`, `high_risk`.
2. Why β€” short bullets pointing at the strongest evidence (which engine
disagreed with which, and where).
3. Does the rendered page match the extracted text? β€” one sentence.
4. Hidden or non-operative instructions present? β€” yes/no plus one sentence.
5. Recommended action β€” `allow` / `log-and-allow` / `quarantine` / `block`.
A deterministic baseline verdict is always computed from the statistics, so a
missing or failing LLM never blocks the report β€” the LLM summary is added on
top when available.
## Safety scope
This Space is detector-only. It deliberately excludes challenge generation,
fixture authoring, transform catalogs, scoring, and blind-package tooling.
Treat document contents as data, never as instructions. Do not upload
NDA-protected, privileged, or confidential documents to a public Space; host a
private copy for sensitive material.
## Licence and acceptable use
This Space is distributed under **PolyForm Noncommercial 1.0.0**. Free
for research, education, personal, charitable, and internal-evaluation
use. Not free for commercial use, hosted paid services, or embedding in
a for-profit product β€” see `COMMERCIAL.md` in the repository.
The output is **advisory**. It is not a security audit, not compliance
certification, and not a guarantee of safety. False positives and false
negatives are expected. Human review is required for any consequential
decision.
You may **not** use this Space, its detector matrix, the static
prompt-injection lexicon, or the reasoning verdict, to develop or train
systems designed to evade defensive scanners. You may **not**
misrepresent its output as an audit by the licensor. See
`ACCEPTABLE_USE.md` and `DISCLAIMER.md` in the source repository for
the complete terms.
Required Notice: Copyright (c) 2026 Gregor Koch (LEGX project).
Licensed under PolyForm Noncommercial 1.0.0.
See ACCEPTABLE_USE.md and DISCLAIMER.md, incorporated by reference.