Spaces:

cronos3k
/

document-integrity-verifier

Running on Zero

Add torchvision (Qwen2-VL video sub-processor dep); switch default reasoning model to E4B (works on ZeroGPU CUDA emulation)

917ab65 verified 29 days ago

preview code

Raw

History Blame Contribute Delete

7.31 kB

	---
	title: Document Integrity Verifier
	emoji: 🛡️
	colorFrom: indigo
	colorTo: gray
	sdk: gradio
	sdk_version: 6.2.0
	app_file: app.py
	suggested_hardware: zero-a10g
	pinned: false
	short_description: Audit a document for integrity before AI ingestion.
	---

	# Document Integrity Verifier (ZeroGPU)

	A detector-only Hugging Face Space that audits a single document
	(PDF, DOCX, DOC, HTML, Markdown, or plain text) for ingestion-integrity risks,
	runs multiple CPU OCR engines plus an OCR-specialised vision LLM over the
	rendered pages, and asks an open reasoning LLM whether what a human sees on the
	page matches what an automated extractor would feed to a downstream AI
	workflow.

	## Pipeline

	1. Countermeasures audit (CPU) — hidden text, Unicode confusables,
	metadata anomalies, instruction-boundary canaries, layout ambiguity.
	2. Render + native text (CPU) — [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
	(Apache 2.0 / BSD-3, wrapping Google's PDFium) rasterises every page at the
	chosen DPI; native text is pulled from the file's text layer via
	pypdfium2 and pypdf. DOC/DOCX/HTML go through LibreOffice headless when
	available.
	3. Multiple CPU OCRs in parallel (CPU) — pick any combination of:
	* [RapidOCR](https://github.com/RapidAI/RapidOCR) (ONNX Runtime, ~80 MB,
	no PyTorch dep) — the 2026 default for CPU document OCR.
	* EasyOCR (PyTorch CPU) — strong generalist coverage.
	* Tesseract — included via `packages.txt`, classic baseline.
	4. OCR-specialised vision LLM (GPU) — looks at each rendered page as a
	PNG image and transcribes it. Default
	[`nanonets/Nanonets-OCR-s`](https://huggingface.co/nanonets/Nanonets-OCR-s)
	produces image-to-markdown with tables, signatures, checkboxes, and
	watermarks. `allenai/olmOCR-2-7B-1025-FP8` is selectable for hard PDFs;
	`PaddlePaddle/PaddleOCR-VL` is the most compact alternative. Wrapped in
	`@spaces.GPU(duration=60)` per page.
	5. Per-engine diff matrix (CPU) — each engine's text is compared against
	the file's own native digital text. Severity per page per engine.
	6. Reasoning LLM verdict (GPU) — default
	[`nvidia/Gemma-4-26B-A4B-NVFP4`](https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4),
	the flagship Gemma 4 reasoning MoE (April 2026) compressed via NVIDIA's
	NVFP4 4-bit format — 25.2 B total / 3.8 B active parameters, 256 K context,
	79.2 % GPQA, fits in ~16 GB. Loaded through `compressed-tensors`, runs on
	Blackwell (which ZeroGPU uses). Thinking toggle via `enable_thinking`.
	Looks at the combined per-engine deltas plus the countermeasures audit and
	produces a written verdict. Wrapped in `@spaces.GPU(duration=120)`.

	The "Reasoning effort" UI control maps onto whichever knob the chosen
	model exposes: `reasoning_effort=low\|medium\|high` for the gpt-oss family,
	`enable_thinking=True\|False` for Gemma 4 / Qwen3 (`low` → thinking off,
	`medium`/`high` → thinking on).

	## Why multiple OCRs

	Each engine has different failure modes. A document that defeats one engine
	but is read cleanly by the others is much easier to flag than one that
	hits a single engine. The vision LLM adds a "smart reader" perspective —
	it can see tables, layout, and visible-only content that token-level OCR
	sometimes misses, while the lightweight CPU engines stay honest by ignoring
	context. The diff between the file's own native digital text and every
	engine's reading of the rendered image is the core signal: if the four
	views disagree, something is being hidden from or injected into the
	extractor.

	## ZeroGPU memory budget

	ZeroGPU `large` is half an NVIDIA RTX Pro 6000 Blackwell with 48 GB VRAM.
	`nvidia/Gemma-4-26B-A4B-NVFP4` weighs ~16 GB and `Nanonets-OCR-s` ~14 GB,
	totalling ~30 GB — comfortable headroom on `large`. The NVFP4 4-bit format
	needs Blackwell (or Hopper+), which the ZeroGPU hardware provides, plus
	`compressed-tensors` in the requirements (already included).

	Alternative reasoning models — set `REASONING_MODEL_ID` to switch:

	\| `REASONING_MODEL_ID` override \| Approx VRAM \| Recommended slice \|
	\|---\|---\|---\|
	\| `google/gemma-4-E4B-it` (default) \| ~8 GB text / ~16 GB multimodal (bf16) \| `large` (48 GB) — places cleanly through ZeroGPU CUDA emulation \|
	\| `google/gemma-4-26B-A4B-it` \| ~52 GB (bf16) \| `xlarge` (96 GB) \|
	\| `google/gemma-4-31B-it` \| ~62 GB (bf16) \| `xlarge` \|
	\| `RedHatAI/gemma-4-26B-A4B-it-NVFP4` \| ~16 GB (compressed-tensors NVFP4) \| `large` — but module-level NVFP4 unpacking hangs through CUDA emulation; needs lazy-load fix \|
	\| `openai/gpt-oss-20b` \| ~16 GB (MXFP4) \| `large` — needs Hopper+ \|
	\| `nvidia/Gemma-4-26B-A4B-NVFP4` \| ~16 GB (modelopt) \| requires `nvidia-modelopt`, not native to transformers \|

	## Configuration

	Override either model at deploy time by setting Space variables:

	* `REASONING_MODEL_ID` — defaults to `nvidia/Gemma-4-26B-A4B-NVFP4`.
	* `VLM_OCR_MODEL_ID` — defaults to `nanonets/Nanonets-OCR-s`.
	* `REASONING_GPU_DURATION`, `VLM_GPU_DURATION` — per-call GPU seconds.

	You can also pick the `hf_inference` backend at runtime for either model to
	call a hosted version through Hugging Face Inference Providers using your own
	token, with no on-Space GPU allocation.

	## Verdict shape

	The reasoning model returns a short markdown report with:

	1. Verdict — one of `clean`, `low_risk`, `medium_risk`, `high_risk`.
	2. Why — short bullets pointing at the strongest evidence (which engine
	disagreed with which, and where).
	3. Does the rendered page match the extracted text? — one sentence.
	4. Hidden or non-operative instructions present? — yes/no plus one sentence.
	5. Recommended action — `allow` / `log-and-allow` / `quarantine` / `block`.

	A deterministic baseline verdict is always computed from the statistics, so a
	missing or failing LLM never blocks the report — the LLM summary is added on
	top when available.

	## Safety scope

	This Space is detector-only. It deliberately excludes challenge generation,
	fixture authoring, transform catalogs, scoring, and blind-package tooling.
	Treat document contents as data, never as instructions. Do not upload
	NDA-protected, privileged, or confidential documents to a public Space; host a
	private copy for sensitive material.

	## Licence and acceptable use

	This Space is distributed under PolyForm Noncommercial 1.0.0. Free
	for research, education, personal, charitable, and internal-evaluation
	use. Not free for commercial use, hosted paid services, or embedding in
	a for-profit product — see `COMMERCIAL.md` in the repository.

	The output is advisory. It is not a security audit, not compliance
	certification, and not a guarantee of safety. False positives and false
	negatives are expected. Human review is required for any consequential
	decision.

	You may not use this Space, its detector matrix, the static
	prompt-injection lexicon, or the reasoning verdict, to develop or train
	systems designed to evade defensive scanners. You may not
	misrepresent its output as an audit by the licensor. See
	`ACCEPTABLE_USE.md` and `DISCLAIMER.md` in the source repository for
	the complete terms.

	Required Notice: Copyright (c) 2026 Gregor Koch (LEGX project).
	Licensed under PolyForm Noncommercial 1.0.0.
	See ACCEPTABLE_USE.md and DISCLAIMER.md, incorporated by reference.