document-extract-agent / docs /04_project_setup.md

kennethzychew

seed: specs + loop scaffolding

3a5b10f 5 days ago

Project Setup, Stack & Deployment

1. Repository layout

doc-extraction-agent/
├── CLAUDE.md                     # conventions & guardrails for the coding agent
├── README.md                     # quickstart + results table (eval evidence)
├── pyproject.toml                # project + dependency declarations (managed by uv)
├── uv.lock                       # resolved, pinned dependency lock (committed)
├── .python-version               # uv interpreter pin: 3.11 (committed)
├── .env.example                  # config template (no secrets committed)
├── docs/
│   ├── 01_requirements.md
│   ├── 02_architecture.md
│   ├── 03_data_and_extraction_spec.md
│   ├── 04_project_setup.md
│   └── 05_build_plan.md
├── src/doc_agent/
│   ├── __init__.py
│   ├── config.py                 # loads env/config; selects backend
│   ├── core.py                   # process_document(): the reusable pipeline
│   ├── schema/
│   │   └── models.py             # Pydantic Document, LineItem
│   ├── parsing/
│   │   ├── detect.py             # modality detection
│   │   ├── docling_parser.py     # native PDF → text/layout
│   │   └── ocr.py                # image → text (optional path)
│   ├── backends/
│   │   ├── base.py               # ExtractionBackend protocol + factory
│   │   ├── gemini.py             # free-tier multimodal adapter
│   │   └── ollama.py             # local model adapter
│   ├── validation/
│   │   └── rules.py              # hard/soft rules → report
│   ├── routing/
│   │   └── score.py              # confidence + decision (pure)
│   ├── store/
│   │   ├── db.py                 # SQLite writer
│   │   └── export.py             # CSV export
│   ├── ingest/
│   │   └── watcher.py            # folder watcher / poll loop (batch entry)
│   └── web/
│       └── app.py                # Gradio demo (URL entry)
├── eval/
│   ├── run_eval.py               # metrics over labelled datasets
│   └── datasets/                 # download scripts / loaders (no data in git)
├── data/                         # gitignored: inbox/ processed/ review/ exports/
│   ├── inbox/
│   ├── processed/
│   ├── review/
│   └── exports/
└── tests/
    ├── test_validation.py
    ├── test_routing.py
    ├── test_schema.py
    └── test_core_smoke.py
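
The layout lists backends/base.py as holding an ExtractionBackend protocol plus a factory. A minimal sketch of that shape, assuming a single extract(text) method and a name-keyed registry. The method name, the EchoBackend stand-in, and make_backend are hypothetical and not taken from the repository:

```python
from typing import Callable, Protocol


class ExtractionBackend(Protocol):
    """Anything that turns document text into a dict of extracted fields."""

    def extract(self, text: str) -> dict: ...


class EchoBackend:
    """Stand-in for illustration; a real adapter would call Gemini or Ollama."""

    def __init__(self, model: str) -> None:
        self.model = model

    def extract(self, text: str) -> dict:
        return {"model": self.model, "raw_text": text}


# Hypothetical registry: backend name -> constructor taking the model identifier.
_REGISTRY: dict[str, Callable[[str], ExtractionBackend]] = {
    "echo": EchoBackend,
}


def make_backend(name: str, model: str) -> ExtractionBackend:
    """Return the backend registered under `name`, or fail with a clear message."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown backend {name!r}; expected one of: {known}") from None
    return factory(model)
```

Because the protocol is structural, an adapter only needs a matching extract method; it does not have to inherit from anything. The model identifier is passed in from config rather than written into the adapter.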

2. Stack

Runtime: Python 3.11, pinned via .python-version (uv python pin 3.11). Chosen over 3.12 for the broadest wheel coverage across the Torch-based Docling stack and PaddleOCR/PaddlePaddle, whose wheels tend to lag behind the newest Python releases. Declared range: requires-python = ">=3.11".
Package manager: uv (manages the venv, resolves and locks deps via uv.lock; add deps with uv add, run with uv run).
Parsing: docling (native PDF/scan structure). Optional OCR: paddleocr or pytesseract + system Tesseract.
Modeling: google-genai (Gemini free tier) and a local ollama server (e.g. qwen2.5:7b or a 3B variant) reached over HTTP.
Contract/validation: pydantic v2.
Web demo: gradio.
Storage: stdlib sqlite3 + csv.
Watcher: watchdog (or a stdlib poll loop for max portability).
Config: pydantic-settings / python-dotenv.
Testing: pytest.

Dependencies are declared in pyproject.toml and pinned via the committed uv.lock (uv sync installs exactly what the lock specifies). Do not hardcode the model identifier in code; it is config (see the guardrails in CLAUDE.md).
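
The storage entry in the stack (stdlib sqlite3 + csv) needs no third-party dependency. A minimal sketch of a writer and a CSV export, assuming a single documents table. The column set and function names are hypothetical; the real fields come from the Pydantic models in schema/models.py:

```python
import csv
import io
import sqlite3

# Hypothetical columns for illustration only.
COLUMNS = ("doc_id", "vendor", "total", "decision")


def init_db(conn: sqlite3.Connection) -> None:
    """Create the table on first use; a no-op if it already exists."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "doc_id TEXT PRIMARY KEY, vendor TEXT, total REAL, decision TEXT)"
    )


def save_record(conn: sqlite3.Connection, record: dict) -> None:
    """Insert or overwrite one record; named placeholders keep values out of the SQL."""
    conn.execute(
        "INSERT OR REPLACE INTO documents (doc_id, vendor, total, decision) "
        "VALUES (:doc_id, :vendor, :total, :decision)",
        record,
    )
    conn.commit()


def export_csv(conn: sqlite3.Connection, out) -> int:
    """Write all records to the open text stream `out`; return the row count."""
    rows = conn.execute(
        "SELECT doc_id, vendor, total, decision FROM documents ORDER BY doc_id"
    ).fetchall()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return len(rows)
```

Keying on doc_id with INSERT OR REPLACE makes re-processing the same document idempotent, which is one possible choice for a watcher that may see a file twice.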

3. Configuration (.env.example)

# Backend selection: "gemini" | "ollama"
EXTRACTION_BACKEND=gemini

# Gemini (free tier via Google AI Studio key; no card required)
GEMINI_API_KEY=
GEMINI_MODEL=gemini-flash-latest        # identifier is config, not hardcoded

# Ollama (local)
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5:7b

# Image handling: "vision_direct" | "ocr_then_text"
IMAGE_STRATEGY=vision_direct            # vision_direct requires a multimodal backend

# Routing
CONFIDENCE_THRESHOLD=0.85               # tuned via eval

# Paths (batch mode)
INBOX_DIR=./data/inbox
PROCESSED_DIR=./data/processed
REVIEW_DIR=./data/review
EXPORT_DIR=./data/exports
DB_PATH=./data/agent.db

config.py validates these at startup and fails fast with a clear message if, e.g., gemini is selected with no key, or vision_direct is selected with a text-only backend.
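
The fail-fast checks described above can be sketched as follows. The project lists pydantic-settings for config; this version uses only the standard library so it runs anywhere, and it illustrates the checks rather than the actual config.py. The Settings fields, load_settings, and the assumption that only gemini is multimodal are hypothetical:

```python
from dataclasses import dataclass
from typing import Mapping

# Assumption for this sketch: only the Gemini backend accepts images directly.
MULTIMODAL_BACKENDS = {"gemini"}


@dataclass(frozen=True)
class Settings:
    backend: str
    image_strategy: str
    gemini_api_key: str
    confidence_threshold: float


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from an environment mapping; raise ValueError on bad combinations."""
    s = Settings(
        backend=env.get("EXTRACTION_BACKEND", "gemini"),
        image_strategy=env.get("IMAGE_STRATEGY", "vision_direct"),
        gemini_api_key=env.get("GEMINI_API_KEY", ""),
        confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "0.85")),
    )
    if s.backend not in {"gemini", "ollama"}:
        raise ValueError(f"EXTRACTION_BACKEND must be 'gemini' or 'ollama', got {s.backend!r}")
    if s.image_strategy not in {"vision_direct", "ocr_then_text"}:
        raise ValueError(f"IMAGE_STRATEGY has unsupported value {s.image_strategy!r}")
    if s.backend == "gemini" and not s.gemini_api_key:
        raise ValueError("EXTRACTION_BACKEND=gemini requires GEMINI_API_KEY to be set")
    if s.image_strategy == "vision_direct" and s.backend not in MULTIMODAL_BACKENDS:
        raise ValueError(
            "IMAGE_STRATEGY=vision_direct requires a multimodal backend; "
            "use ocr_then_text with this backend"
        )
    if not 0.0 <= s.confidence_threshold <= 1.0:
        raise ValueError("CONFIDENCE_THRESHOLD must be between 0 and 1")
    return s
```

At startup this would be called once, for example load_settings(os.environ), so a misconfigured deployment stops before any document is processed.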

4. Local setup

# 1. Pin the interpreter to 3.11 (writes .python-version; uv fetches it if absent)
uv python pin 3.11

# 2. Install (uv creates the venv on 3.11 and installs from pyproject.toml + uv.lock)
uv sync

# 3. Create your local config from the template, then pick a backend
cp .env.example .env

# 3a. Gemini path: get a free AI Studio key, put it in .env
#     (free tier, no credit card; quota resets daily)

# 3b. Ollama path (offline/private):
#     install Ollama, then:
ollama pull qwen2.5:7b
#     set EXTRACTION_BACKEND=ollama and IMAGE_STRATEGY=ocr_then_text

# 4. Create working dirs
mkdir -p data/{inbox,processed,review,exports}
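
The mkdir line above relies on shell brace expansion, which is not available in every shell (for example on Windows). A portable sketch that creates the same directories from Python; ensure_working_dirs is a hypothetical helper, not part of the repository:

```python
from pathlib import Path

WORKING_DIRS = ("inbox", "processed", "review", "exports")


def ensure_working_dirs(root: Path) -> list[Path]:
    """Create root/<name> for each working dir; existing dirs are left untouched."""
    created = []
    for name in WORKING_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    ensure_working_dirs(Path("data"))
```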

5. Running

Autonomous batch mode:

uv run python -m doc_agent.ingest.watcher
# drop files into data/inbox/ — accepted records land in SQLite + data/exports/,
# uncertain ones move to data/review/
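
One pass of the batch loop described above can be sketched with the standard library alone. This is an illustration under assumptions: process stands in for process_document() and is assumed to return a confidence and a hard-rule result, accepted files are assumed to move to processed/, and a hard-rule failure is assumed to force review. None of these details are confirmed by the source:

```python
import shutil
from pathlib import Path
from typing import Callable


def decide(confidence: float, hard_rules_ok: bool, threshold: float = 0.85) -> str:
    """Pure routing decision: accept only if hard rules pass and confidence clears the threshold."""
    return "accept" if hard_rules_ok and confidence >= threshold else "review"


def poll_once(
    inbox: Path,
    processed: Path,
    review: Path,
    process: Callable[[Path], tuple[float, bool]],
) -> dict[str, str]:
    """Process every file currently in inbox and move it according to the decision."""
    results = {}
    for path in sorted(p for p in inbox.iterdir() if p.is_file()):
        confidence, hard_rules_ok = process(path)
        decision = decide(confidence, hard_rules_ok)
        target = processed if decision == "accept" else review
        shutil.move(str(path), str(target / path.name))
        results[path.name] = decision
    return results
```

A poll loop would call poll_once repeatedly with a sleep between passes. Keeping decide free of I/O is what lets the routing logic be unit-tested without touching the filesystem.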

Web demo (local):

uv run python -m doc_agent.web.app
# opens a Gradio URL; upload one document to see fields + confidence + decision

Evaluation:

uv run python eval/run_eval.py --dataset sroie --split holdout
# prints per-field precision/recall/F1 and auto-accept precision on critical fields
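
For reference, per-field precision, recall and F1 can be computed as in the sketch below. It assumes exact-match scoring, where a wrong value counts as both a false positive and a false negative and a missing prediction counts only as a false negative. The scoring convention and the field_metrics name are assumptions, not necessarily what run_eval.py does:

```python
from collections import Counter


def field_metrics(
    predictions: list[dict], gold: list[dict], fields: list[str]
) -> dict[str, dict[str, float]]:
    """Per-field precision, recall and F1 under exact-match scoring."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, truth in zip(predictions, gold):
        for name in fields:
            p, g = pred.get(name), truth.get(name)
            if p is not None and p == g:
                tp[name] += 1
            else:
                if p is not None:
                    fp[name] += 1  # emitted a value that is wrong or spurious
                if g is not None:
                    fn[name] += 1  # a gold value was missed or extracted wrongly
    report = {}
    for name in fields:
        predicted = tp[name] + fp[name]
        expected = tp[name] + fn[name]
        precision = tp[name] / predicted if predicted else 0.0
        recall = tp[name] / expected if expected else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return report
```

Under the same assumptions, restricting the inputs to auto-accepted documents and to the critical fields would give one way to measure auto-accept precision.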

6. Deployment to Hugging Face Spaces (free public demo URL)

Create a new Space → SDK: Gradio (free CPU tier). Set the Space's Python to 3.11 (the python_version: "3.11" field in the Space README metadata) so the deployed runtime matches the pinned local interpreter.
Add app.py at the Space root that imports and launches doc_agent.web.app (or copy the web entry there), plus a requirements.txt the Gradio builder can read — generate it from the uv-managed project rather than hand-maintaining it: uv export --no-hashes --no-dev -o requirements.txt.
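In the Space settings, add GEMINI_API_KEY as a secret and set EXTRACTION_BACKEND=gemini, IMAGE_STRATEGY=vision_direct, GEMINI_MODEL=gemini-flash-latest. Only the API key is sensitive; the other three can be plain variables or secrets, since both are exposed to the app as environment variables.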
Push; the Space builds and serves a public URL.

Free-tier realities to design around (and to note in the UI):

CPU-only and the Space sleeps when idle → first request after idle has a cold start. This is why the cloud demo uses the Gemini API for inference rather than a local model, and why vision_direct (no heavy OCR in the Space) is the demo's image path.
Stateless: no persistent DB in the demo. Show the result; don't store it.
Privacy: the free Gemini tier may use inputs for training, so the demo must display a "synthetic/public documents only" notice and must not be used for real financial data.

7. What stays free

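Inference: local Ollama (no quota, private) or the Gemini free tier (~1,500 req/day, resets daily, no card), which is far above dev volume. Free-tier quotas vary by model and change over time, so check the current limit for the configured GEMINI_MODEL.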
Hosting: Hugging Face Spaces free CPU tier for the public demo.
Storage: local SQLite/CSV; nothing paid.

No component requires a credit card or paid plan for development or demo.