AGENTS.md
Context for AI coding agents working on doc_redaction (PII redaction for PDFs, images, Word, and tabular files). Human-oriented docs: README.md. User guide: doc_redaction user guide.
Project overview
- Stack: Python 3.10+, Gradio UI (app.py), optional FastAPI when RUN_FASTAPI is enabled, AWS/LLM integrations via tools/config.py and env files under config/.
- License: AGPL-3.0-only (see pyproject.toml). Respect license terms when adding dependencies.
- Accuracy: Outputs are not guaranteed complete; downstream use should assume human review of redacted material.
Cursor skills: redaction workflow (optional)
For agents operating the deployed app (Gradio Client, review CSV, /review_apply), these repo-local playbooks are a suggested ladder:
- skills/doc-redaction-task-prompt/TASK_PROMPT_TEMPLATE.md: copy-paste user task prompt (Pass 1 default, Pass 2 gated); user redaction requirements go at the end of the prompt.
- skills/doc-redaction-app/SKILL.md: first-pass redaction (/doc_redact / /redact_document) and downloading artifacts.
- skills/doc-redact-page-review/SKILL.md: after outputs exist: parallel per-page child agents, merge into one full-document *_review_file.csv, single /review_apply from the parent.
- skills/doc-redaction-modifications/SKILL.md: CSV mechanics, preview_redaction_boxes, /review_apply patterns, verification, VLM and PyMuPDF fallbacks (single-thread edits and the technical reference for page-review children).
Setup
- System: Install Tesseract and Poppler (required for OCR/PDF). See README.md (Windows/Linux sections).
- Python: Create a venv, then install the project (e.g. pip install -e ".[dev]" or follow README).
- Configuration: Copy or edit environment/config as described in README / config/ (e.g. app_config.env). Do not commit secrets.
Run locally
- Gradio/FastAPI entrypoint is app.py. With FastAPI enabled, the typical pattern is uvicorn app:app --host 0.0.0.0 --port 7860 (exact host/port from your config).
- OpenAPI docs: /docs when the FastAPI app is mounted.
Tests
- Run from repo root: pytest (optional: pytest test/).
- Fix failures related to your changes before opening a PR.
Line order (local OCR and simple text extraction)
Multi-column layouts use shared logic in tools/ocr_reading_order.py. The mode is controlled by LOCAL_OCR_READING_ORDER: column (default) or legacy (the previous top-left behaviour).
Local OCR (Paddle/Tesseract)
Word boxes are merged into line-level CSV rows in combine_ocr_results.
- column: detect text columns, assign line numbers down each column left-to-right; full-width lines (headers) first. Stops cross-column merging that produced wide erroneous lines on multi-column PDFs.
- Auto-fallback: the page is treated as single-column unless a consecutive cluster of gutter rows (y-gap between adjacent rows ≤ OCR_COLUMN_MAX_CONSECUTIVE_GUTTER_GAP, default 0.06 of page height) has ≥ OCR_COLUMN_MIN_GUTTER_ROWS (default 3) rows and the cluster's topmost row is above the footer zone (OCR_COLUMN_FOOTER_ZONE_FRACTION, default 0.75). This prevents isolated header bands (logo | title, 1 gutter row), signature-only blocks at the page bottom (cluster starts at y ≥ 0.75), or the combination of both, from forcing column mode on the single-column body text between them.
- PADDLE_PRESERVE_LINE_BOXES=True or CONVERT_LINE_TO_WORD_LEVEL=False with Paddle: keep Paddle line boxes (skip word split + regrouping); line numbers still use column reading order.
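The auto-fallback rule above can be sketched roughly as follows. This is a minimal illustration, not the code in tools/ocr_reading_order.py: the function names and the input shape (a list of gutter-row y positions normalised to page height) are assumptions made for the example; only the three defaults come from the text above.

```python
from typing import List, Sequence


def cluster_gutter_rows(row_ys: Sequence[float], max_gap: float) -> List[List[float]]:
    """Group row positions into runs whose adjacent y-gap is <= max_gap."""
    clusters: List[List[float]] = []
    for y in sorted(row_ys):
        if clusters and y - clusters[-1][-1] <= max_gap:
            clusters[-1].append(y)
        else:
            clusters.append([y])
    return clusters


def use_column_mode(
    gutter_row_ys: Sequence[float],
    max_consecutive_gap: float = 0.06,  # OCR_COLUMN_MAX_CONSECUTIVE_GUTTER_GAP
    min_gutter_rows: int = 3,           # OCR_COLUMN_MIN_GUTTER_ROWS
    footer_zone: float = 0.75,          # OCR_COLUMN_FOOTER_ZONE_FRACTION
) -> bool:
    """Return True only if some cluster is long enough and starts above the footer zone."""
    for cluster in cluster_gutter_rows(gutter_row_ys, max_consecutive_gap):
        starts_above_footer = cluster[0] < footer_zone
        if len(cluster) >= min_gutter_rows and starts_above_footer:
            return True
    return False  # fall back to single-column ordering
```

With these defaults, a lone header band such as `[0.05]` or a signature block such as `[0.80, 0.84, 0.88]` leaves the page in single-column mode, while a body cluster such as `[0.30, 0.34, 0.38, 0.42]` switches it to column mode.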
Simple text extraction (PyMuPDF)
redact_text_pdf → process_page_to_structured_ocr_pymupdf calls reorder_structured_text_lines after collecting lines, using page.mediabox width/height for full-span header detection.
reorder_structured_text_lines now mirrors build_line_groups (local OCR route):
1. Column-aware sort (sort_reading_order / assign_layout_boxes / detect_column_split_xpoints), or legacy top-left for single-column pages.
2. Y-band grouping (group_into_lines): merges any same-row PyMuPDF lines that were emitted as separate objects (e.g. mixed-font spans) and splits horizontally-disparate boxes via _finalize_line. Column mode only.
3. Secondary sub-column pass (_reorder_lines_column_major): ensures correct column-major order when sub-columns sit within a single macro-column. Column mode only.
4. When a group contains more than one box, constituent boxes are merged into a single OCRResult (union bbox, joined text, concatenated chars/words).
In single-column / legacy mode only step 1 is applied; PyMuPDF lines are pre-formed so no merging is needed.
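The merge of several same-row boxes into one result (union bbox, joined text) can be pictured with the sketch below. `LineBox` is a hypothetical stand-in for the project's OCRResult, whose real fields are not shown in this file, and the sketch leaves out the concatenation of chars/words.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineBox:
    """Hypothetical stand-in for OCRResult: text plus a left/top/width/height box."""
    text: str
    left: float
    top: float
    width: float
    height: float


def merge_boxes(boxes: List[LineBox]) -> LineBox:
    """Merge boxes from one row into a single box with a union bbox and joined text."""
    if not boxes:
        raise ValueError("merge_boxes needs at least one box")
    ordered = sorted(boxes, key=lambda b: b.left)  # read left-to-right
    left = min(b.left for b in ordered)
    top = min(b.top for b in ordered)
    right = max(b.left + b.width for b in ordered)
    bottom = max(b.top + b.height for b in ordered)
    return LineBox(
        text=" ".join(b.text for b in ordered),
        left=left,
        top=top,
        width=right - left,
        height=bottom - top,
    )
```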
Tunables (both routes)
- OCR_FULL_SPAN_WIDTH_RATIO
- OCR_COLUMN_GAP_MIN_FRACTION
- OCR_COLUMN_GUTTER_MIN_FRACTION
- OCR_COLUMN_SUBGUTTER_MIN_FRACTION (default 0.015): fine-grained gutter scan in assign_layout_boxes; lower values detect narrower sub-column boundaries.
- OCR_COLUMN_MIN_GUTTER_ROWS
- OCR_COLUMN_MAX_BOX_HEIGHT_RATIO
- OCR_COLUMN_MAX_CONSECUTIVE_GUTTER_GAP
- OCR_COLUMN_FOOTER_ZONE_FRACTION
- OCR_LINE_SPLIT_GAP_FRACTION (default 0.025): horizontal gap fraction that forces a line split; must be below the narrowest column gutter (~0.030 for two-page spreads); also used as the gap threshold for the secondary sub-column sort in build_line_groups.
- OCR_LINE_Y_THRESHOLD_FRACTION (default 0.013): row-alignment tolerance as a fraction of page height; reduced from 0.015 to correctly separate tightly-set 10 pt body text whose row spacing is ~0.014.
- OCR_LINE_Y_THRESHOLD_MIN_PX
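To see why the row tolerance matters, here is a rough sketch of y-band grouping. It is not the project's group_into_lines: the tuple shape, the chaining rule (compare each box with the previous one), the `min_px` default, and the assumption that OCR_LINE_Y_THRESHOLD_MIN_PX acts as a pixel floor on the threshold are all illustrative guesses.

```python
from typing import List, Tuple

# Hypothetical box shape for the example: (text, left_px, vertical_centre_px)
Box = Tuple[str, float, float]


def group_rows(
    boxes: List[Box],
    page_height: float,
    y_fraction: float = 0.013,  # OCR_LINE_Y_THRESHOLD_FRACTION
    min_px: float = 2.0,        # assumed floor; the real default is not stated here
) -> List[List[Box]]:
    """Group boxes into rows when consecutive vertical centres are within the tolerance."""
    threshold = max(y_fraction * page_height, min_px)
    rows: List[List[Box]] = []
    prev_centre = None
    for box in sorted(boxes, key=lambda b: b[2]):
        if prev_centre is not None and box[2] - prev_centre <= threshold:
            rows[-1].append(box)
        else:
            rows.append([box])
        prev_centre = box[2]
    # Within a row, read left-to-right
    return [sorted(row, key=lambda b: b[1]) for row in rows]
```

On a 1000 px page, rows spaced 14 px apart (about 0.014 of page height) stay separate at the 0.013 default but collapse into one row at 0.015, which is the failure the reduced default is meant to avoid.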
Sub-column ordering (build_line_groups): after the primary word-level column sort, a second pass (_reorder_lines_column_major) clusters the produced line groups by their leftmost x-position using OCR_LINE_SPLIT_GAP_FRACTION as the gap threshold. This ensures that adjacent narrow sub-columns whose word-level centre gap is below column_gap_threshold (e.g. two columns on a spread where each page is already one macro-column) are still output in left-to-right column-major order rather than interleaved by y-position.
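A minimal sketch of that second pass, assuming each line group is reduced to a hypothetical (text, left, top) tuple; the real _reorder_lines_column_major works on the project's own line objects and may differ in detail.

```python
from typing import List, Tuple

# Hypothetical line shape for the example: (text, left_px, top_px)
Line = Tuple[str, float, float]


def reorder_column_major(
    lines: List[Line],
    page_width: float,
    gap_fraction: float = 0.025,  # OCR_LINE_SPLIT_GAP_FRACTION
) -> List[Line]:
    """Cluster lines by leftmost x, then emit each cluster top-to-bottom, left-to-right."""
    if not lines:
        return []
    gap = gap_fraction * page_width
    by_left = sorted(lines, key=lambda line: line[1])
    columns: List[List[Line]] = [[by_left[0]]]
    for line in by_left[1:]:
        # A jump in left edge larger than the gap threshold starts a new sub-column
        if line[1] - columns[-1][-1][1] > gap:
            columns.append([line])
        else:
            columns[-1].append(line)
    ordered: List[Line] = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda line: line[2]))
    return ordered
```

Two sub-columns whose lines share the same y positions come out as left column first, then right column, rather than alternating row by row.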
Fine-grained gutter-based column assignment (assign_layout_boxes): before falling back to centre-gap clustering, detect_column_split_xpoints scans the page for structural gutters at the finer OCR_COLUMN_SUBGUTTER_MIN_FRACTION threshold (default 0.015). Each qualifying gutter cluster produces a (split_x, y_min) pair: the split point is only applied to boxes whose top ≥ y_min, preventing a narrow sub-column gutter (visible only in the lower two-column section) from mis-splitting a full-width introductory paragraph that sits above it. This correctly separates narrow adjacent columns (e.g. 1.9 % gutter on a two-page spread) without fragmenting full-width headings or paragraphs.
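The y-gated split can be illustrated with a small sketch. The function name, the use of a box's centre x, and the normalised coordinates are assumptions for the example; only the (split_x, y_min) pair and the top ≥ y_min rule come from the description above.

```python
from typing import List, Tuple

# (split_x, y_min): a split only applies to boxes whose top is at or below y_min
Split = Tuple[float, float]


def column_index(centre_x: float, top: float, splits: List[Split]) -> int:
    """Count the applicable splits lying at or to the left of the box centre."""
    return sum(
        1
        for split_x, y_min in splits
        if top >= y_min and centre_x >= split_x
    )


# One gutter at x = 0.5 that only exists from y = 0.4 downwards
splits = [(0.5, 0.4)]
intro_word = column_index(centre_x=0.7, top=0.1, splits=splits)   # above the gutter
lower_left = column_index(centre_x=0.25, top=0.6, splits=splits)
lower_right = column_index(centre_x=0.75, top=0.6, splits=splits)
```

Here a word on the right-hand side of the full-width introduction stays in column 0, because the split does not reach that high, while boxes in the lower section are divided into columns 0 and 1.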
Changing line order affects PII page text, duplicate-page detection, and review CSV line indices on multi-column documents; re-review after upgrading.
Agentic / programmatic access (two surfaces)
1. FastAPI Agent API (recommended for LLM agents: small JSON bodies)
When RUN_FASTAPI is true, routes are mounted under /agent (agent_routes.py).
- Catalog: GET /agent/operations, which maps each Gradio api_name to an HTTP path and notes whether the route is implemented via CLI or returns HTTP 501 for Gradio-only flows.
- Implemented POST routes (CLI- or tools/simplified_api.py-backed where noted): redact_document, redact_data, find_duplicate_pages, find_duplicate_tabular, summarise_document, combine_review_pdfs, combine_review_csvs, export_review_redaction_overlay, export_review_page_ocr_visualisation, apply_review_redactions, verify_redaction_coverage (Pass 1 QA: must_redact / must_not_redact regex lists, optional redacted_pdf_path, optional auto_prune_suspicious + pruned_output_path; returns pass_strict, pass_with_cleanup, pages_flagged_for_vlm, pages_needing_csv_cleanup), word_level_ocr_text_search (headless word OCR search with optional review-box overlap flags).
Optional post-redaction Pass 1 QA (main app / CLI): When POST_REDACT_PASS1_QA=True in tools/config.py (or config/app_config.env), initial redaction emits *_coverage_report.json beside the review CSV and optionally *_review_file_pruned.csv (sibling, when POST_REDACT_PASS1_AUTO_PRUNE=True). Uses deny/allow lists and/or POST_REDACT_PASS1_MUST_REDACT_PATH / POST_REDACT_PASS1_MUST_NOT_REDACT_PATH. CLI overrides: --post-redact-pass1-qa, --post-redact-pass1-auto-prune. This is pre-review-apply sanity QA only; agent Pass 1 (policy edits + /review_apply) remains separate.
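As a rough picture of what must_redact / must_not_redact regex lists can check, the sketch below scans already-redacted text. It is an illustration of the idea only, under the assumption that must_redact patterns should no longer match and must_not_redact patterns should still match; the function name and the returned keys are hypothetical and do not mirror the fields of *_coverage_report.json or the verify_redaction_coverage response.

```python
import re
from typing import Dict, List


def check_coverage(
    redacted_text: str,
    must_redact: List[str],
    must_not_redact: List[str],
) -> Dict[str, List[str]]:
    """Report patterns that leaked through and patterns that were expected but are gone."""
    # A must_redact pattern that still matches means the redaction missed it
    still_present = [p for p in must_redact if re.search(p, redacted_text)]
    # A must_not_redact pattern with no match may have been over-redacted
    missing = [p for p in must_not_redact if not re.search(p, redacted_text)]
    return {"still_present": still_present, "missing": missing}


report = check_coverage(
    "Contact [REDACTED] at Acme Ltd, ref 12345",
    must_redact=[r"\bJane Doe\b", r"\b\d{5}\b"],
    must_not_redact=[r"Acme Ltd", r"Invoice"],
)
```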
Note: on Gradio (app.py), the Review-tab visual exports use api_name page_redaction_review_image and page_ocr_review_image; the /agent routes above keep the explicit export_review_* names for the same operations.
- Gradio-only stubs (501 + JSON hint): load_and_prepare_documents_or_data.
- Auth: If AGENT_API_KEY is set in the environment, send header X-Agent-API-Key with that value.
- Paths: Inputs must resolve to files under the repo root, INPUT_FOLDER, or OUTPUT_FOLDER (see router validation).
Implementation uses cli_redact.main(direct_mode_args=...) where a CLI task exists (same behaviour as cli_redact.py); apply_review_redactions calls tools/simplified_api.py instead.
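A minimal sketch of building requests for the /agent routes with the standard library. The helper name is hypothetical, the base URL assumes the local host/port pattern mentioned under Run locally, and no request body fields are shown because they are not listed in this file; take them from /docs or GET /agent/operations.

```python
import json
import os
import urllib.request
from typing import Optional


def build_agent_request(
    base_url: str,
    path: str,
    payload: Optional[dict] = None,
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build a GET (no payload) or JSON POST request for an /agent route."""
    headers = {"Accept": "application/json"}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    if api_key:
        # Only needed when AGENT_API_KEY is set on the server
        headers["X-Agent-API-Key"] = api_key
    url = base_url.rstrip("/") + "/agent/" + path.lstrip("/")
    method = "POST" if payload is not None else "GET"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Catalog request; send it with urllib.request.urlopen(catalog_request) against a running app
catalog_request = build_agent_request(
    "http://localhost:7860", "operations", api_key=os.environ.get("AGENT_API_KEY")
)
```

Building the request separately from sending it keeps the header and path logic easy to test without a live server.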
2. Gradio Client API (e.g. Hugging Face Spaces)
For remote Spaces or any Gradio deployment exposing the HTTP API:
- Schema: GET https://<host>/gradio_api/info
- Call: POST https://<host>/gradio_api/call/{api_name} with body {"data":[...]} (argument order matches the named endpoint's component list).
- Poll: GET https://<host>/gradio_api/call/{api_name}/{event_id}
- Hugging Face: Authorization: Bearer $HF_TOKEN
Named api_name values in this app include:
- redact_document
- load_and_prepare_documents_or_data
- apply_review_redactions
- doc_redact (simple gr.api: one PDF/image + optional OCR/PII knobs; returns (output_paths, message); api_name='/doc_redact'; parameters include document_file, redact_entities, output_dir, ocr_method, pii_method, allow_list, deny_list, page_min, page_max, and handwrite_signature_checkbox for AWS Textract extraction options such as Extract handwriting / Extract signatures)
- review_apply (simple gr.api: PDF + *_review_file.csv; returns (output_paths, message); api_name='/review_apply')
- preview_boxes (simple gr.api: PDF + *_review_file.csv; renders proposed boxes onto the original PDF and returns (zip_path, message); use to verify coordinates before calling review_apply, no redaction applied; api_name='/preview_boxes')
- pdf_summarise (simple gr.api: PDF + optional summarisation/OCR knobs; returns (output_paths, status_message, summary_text); api_name='/pdf_summarise')
- tabular_redact (simple gr.api: one tabular file (CSV/XLSX/Parquet/DOCX) + optional knobs; returns (output_paths, message); api_name='/tabular_redact')
- page_redaction_review_image (short review overlay export; api_name='/page_redaction_review_image')
- page_ocr_review_image (short OCR visualisation export; api_name='/page_ocr_review_image')
- word_level_ocr_text_search, redact_data, find_duplicate_pages, find_duplicate_tabular, summarise_document, combine_review_csvs, combine_review_pdfs

The matching POST /agent names for the two visual exports (page_redaction_review_image and page_ocr_review_image) are export_review_redaction_overlay and export_review_page_ocr_visualisation (§1). Many endpoints require many positional arguments (full Gradio state); prefer the short gr.api routes above or POST /agent/apply_review_redactions where applicable instead of building the full data array from /gradio_api/info.
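The call-then-poll pattern above can be sketched with the standard library as follows. The helper names and the example host are hypothetical, the `data` list is left to the caller because its order must be read from /gradio_api/info, and the sketch assumes the POST response carries the event_id used in the poll URL.

```python
import json
import urllib.request
from typing import Any, List, Optional


def gradio_call_request(
    host: str,
    api_name: str,
    data: List[Any],
    hf_token: Optional[str] = None,
) -> urllib.request.Request:
    """Build the POST that starts a call; data must follow the endpoint's argument order."""
    url = f"https://{host}/gradio_api/call/{api_name.lstrip('/')}"
    headers = {"Content-Type": "application/json"}
    if hf_token:
        headers["Authorization"] = f"Bearer {hf_token}"  # Hugging Face Spaces
    body = json.dumps({"data": data}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def gradio_poll_url(host: str, api_name: str, event_id: str) -> str:
    """Build the GET URL used to fetch the result of a started call."""
    return f"https://{host}/gradio_api/call/{api_name.lstrip('/')}/{event_id}"


# "example.hf.space" is a placeholder host; the empty data list is not a valid call
start = gradio_call_request("example.hf.space", "/doc_redact", data=[])
```

Accepting the api_name with or without its leading slash lets callers paste the value exactly as it appears in the list above.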
CLI parity
For scripting and tests, python cli_redact.py with flags is authoritative; programmatic merges use get_cli_default_args_dict() in cli_redact.py.
Security and data handling
- Do not commit API keys, tokens, or customer data.
- Treat paths as untrusted outside validated roots (see tools/secure_path_utils.py).
- Optional instruction / LLM fields must not be passed into shell or unconstrained config keys.
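The rule about validated roots can be pictured with the sketch below. It is a generic illustration using pathlib, not the logic in tools/secure_path_utils.py or the router; the function name and the example root folders are assumptions.

```python
from pathlib import Path
from typing import Iterable


def resolve_under_roots(candidate: str, allowed_roots: Iterable[str]) -> Path:
    """Resolve a user-supplied path and reject it unless it sits under an allowed root."""
    # resolve() collapses ".." segments and follows symlinks before the check
    resolved = Path(candidate).resolve()
    for root in allowed_roots:
        if resolved.is_relative_to(Path(root).resolve()):
            return resolved
    raise ValueError(f"Path is outside the allowed roots: {candidate}")
```

Checking the resolved path rather than the raw string is what stops inputs such as output/../secrets.env from escaping the allowed folders.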