LEGX toolkit — Document Integrity Verifier
Copyright (c) 2026 Gregor Koch.
Licensed under PolyForm Noncommercial 1.0.0 — see LICENSE.
Supplementary terms: ACCEPTABLE_USE.md and DISCLAIMER.md.
Commercial licensing: COMMERCIAL.md.
================================================================================
Third-party software incorporated, vendored, or required at runtime
================================================================================
The following third-party libraries are required to run the Software.
They retain their own copyright and licence. The Software's licence
does not override theirs; you must comply with each.
------------------------------------------------------------------------
gradio                  https://github.com/gradio-app/gradio
                        Apache License 2.0
spaces                  https://huggingface.co/docs/hub/en/spaces-zerogpu
                        Apache License 2.0
transformers            https://github.com/huggingface/transformers
                        Apache License 2.0
accelerate              https://github.com/huggingface/accelerate
                        Apache License 2.0
huggingface_hub         https://github.com/huggingface/huggingface_hub
                        Apache License 2.0
kernels                 https://github.com/huggingface/kernels
                        Apache License 2.0
compressed-tensors      https://github.com/neuralmagic/compressed-tensors
                        Apache License 2.0
torch                   https://pytorch.org
                        BSD 3-Clause
onnxruntime             https://onnxruntime.ai
                        MIT
rapidocr-onnxruntime    https://github.com/RapidAI/RapidOCR
                        Apache License 2.0
easyocr                 https://github.com/JaidedAI/EasyOCR
                        Apache License 2.0
pytesseract             https://github.com/madmaze/pytesseract
                        Apache License 2.0
                        (wraps the Tesseract OCR binary, also Apache 2.0)
Pillow                  https://python-pillow.org
                        Historical Permission Notice and Disclaimer (HPND)
pypdf                   https://github.com/py-pdf/pypdf
                        BSD 3-Clause
reportlab               https://www.reportlab.com
                        BSD-style
beautifulsoup4          https://www.crummy.com/software/BeautifulSoup
                        MIT
Jinja2                  https://palletsprojects.com/p/jinja
                        BSD 3-Clause
------------------------------------------------------------------------
pypdfium2               https://github.com/pypdfium2-team/pypdfium2
                        Apache License 2.0 OR BSD-3-Clause (your choice)
    Wraps PDFium (Google, BSD-3-Clause), which is used for PDF rendering
    and page-level text extraction in the detector path.
PDFium                  (vendored by pypdfium2)
                        BSD 3-Clause
------------------------------------------------------------------------
PyMuPDF (fitz)          https://pymupdf.readthedocs.io
                        DUAL: GNU AGPL v3.0 OR Artifex Commercial Licence
    PyMuPDF is referenced ONLY by the authoring-side modules
    (`legal_doc_redteam/fixtures.py`) used to generate synthetic
    red-team challenge documents. It is NOT shipped with the Document
    Integrity Verifier Space; the export script
    (`scripts/export_zerogpu_space.ps1`) deliberately excludes
    `fixtures.py` and the entire authoring side. The detector path
    uses pypdfium2 (Apache 2.0 / BSD-3) and pypdf (BSD-3) instead.

    If you install the full LEGX package locally and use the
    authoring side, you do so under PyMuPDF's AGPL v3 licence (or a
    commercial PyMuPDF licence from Artifex Software, Inc., if your
    use is commercial). Authoring is already restricted to
    noncommercial use by the LEGX project's own PolyForm Noncommercial
    licence, so the AGPL inheritance is moot for permitted use.
------------------------------------------------------------------------
System packages (declared in `hf_zerogpu_space/packages.txt`):
libreoffice             Mozilla Public License 2.0 (LibreOffice core)
poppler-utils           GPL v2 / GPL v3 (Poppler)
tesseract-ocr           Apache License 2.0
When the Software runs on a host that provides these binaries, it
invokes them as separate processes via subprocess (LibreOffice headless
conversion, Poppler rendering, Tesseract CLI). Only their published
command-line interfaces are used; their code is not linked into the
Software.
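
As an illustration of that process boundary, the following is a minimal
sketch of how such tools are commonly driven from Python. It is not the
Software's actual code: the function names, the timeout, and the default
DPI and language are assumptions made for this example. The command-line
flags shown are the standard ones for `soffice`, `pdftoppm`, and
`tesseract`.

```python
import shutil
import subprocess
from pathlib import Path


def libreoffice_to_pdf_cmd(src: Path, out_dir: Path) -> list[str]:
    # LibreOffice headless conversion; writes <out_dir>/<src stem>.pdf
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(src)]


def poppler_render_cmd(pdf: Path, out_prefix: Path, dpi: int = 150) -> list[str]:
    # Poppler's pdftoppm renders every page to a PNG named after out_prefix
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def tesseract_cmd(image: Path, out_base: Path, lang: str = "eng") -> list[str]:
    # Tesseract CLI; writes the recognised text to <out_base>.txt
    return ["tesseract", str(image), str(out_base), "-l", lang]


def run_external(cmd: list[str], timeout: float = 120.0) -> subprocess.CompletedProcess:
    # The tool runs as its own OS process; nothing is linked into Python.
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not installed or not on PATH")
    return subprocess.run(cmd, capture_output=True, text=True,
                          timeout=timeout, check=True)
```

Building the argument list separately from running it keeps the exact
invocation easy to log and audit, and passing a list (rather than a shell
string) avoids shell interpretation of file names.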
================================================================================
Model weights at runtime
================================================================================
The Software loads open model weights from Hugging Face at runtime.
Each carries its own licence; please read each model card before
production use.
nvidia/Gemma-4-26B-A4B-NVFP4    Gemma Terms of Use (Google) +
                                Gemma 4 Acceptable Use Policy
google/gemma-4-E4B-it           Gemma Terms of Use (Google) +
                                Gemma 4 Acceptable Use Policy
nanonets/Nanonets-OCR-s         See model card
allenai/olmOCR-2-7B-1025-FP8    Apache License 2.0
PaddlePaddle/PaddleOCR-VL       See model card
openai/gpt-oss-20b              Apache License 2.0
                                + OpenAI usage policies
The Software does not redistribute these weights. It refers to them
only by their Hugging Face identifiers; the weights are downloaded from
Hugging Face on first use.
================================================================================
Research sources for the static lexicon
================================================================================
The static prompt-injection lexicon (`legal_doc_redteam/injection_lexicon.py`
and `injection_lexicon_multilingual.py`) was assembled from public
research and freely available databases. Each pattern carries an
inline `source` field; see those files for per-pattern attribution.
Notable sources include:
OWASP LLM Top 10 (LLM01:2025)
MITRE ATLAS — Adversarial Threat Landscape for AI Systems
Meta PurpleLlama / Llama-Prompt-Guard
USENIX Security 2024-2025 prompt-injection papers
NIST AI safety guidance
JailbreakHub / TrustAIRLab in-the-wild prompts
ChatGPT_DAN repository (0xk1h0)
HackAPrompt 2024-2025
Tensor Trust dataset
NVIDIA garak probes
deepset/prompt-injections dataset
Lakera, Snyk Labs, Unit 42, CrowdStrike, Microsoft published research
The patterns themselves are facts about how attacks are phrased and
are not subject to copyright. Attribution is preserved out of
academic courtesy and to make it easy for users to audit provenance.
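
For readers who want to audit provenance, here is a minimal sketch of
how per-pattern `source` fields could be tallied. The `LexiconPattern`
class, its `pattern` field, the helper functions, and the sample entries
are all hypothetical placeholders; the real lexicon files may use a
different structure, so adapt the field access accordingly.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconPattern:
    # Hypothetical shape: only the `source` field is described in this notice.
    pattern: str
    source: str


def count_by_source(patterns):
    # Number of patterns attributed to each source label.
    return Counter(p.source for p in patterns)


def missing_source(patterns):
    # Patterns whose attribution is empty or whitespace only.
    return [p for p in patterns if not p.source.strip()]


# Placeholder entries for illustration; not taken from the actual lexicon.
SAMPLE = [
    LexiconPattern("placeholder phrase one", "example-source-1"),
    LexiconPattern("placeholder phrase two", "example-source-2"),
    LexiconPattern("placeholder phrase three", "example-source-1"),
    LexiconPattern("placeholder phrase four", ""),
]

for source, n in count_by_source(SAMPLE).most_common():
    print(f"{source or '<missing>'}: {n}")
print("unattributed:", [p.pattern for p in missing_source(SAMPLE)])
```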