Project Setup, Stack & Deployment

1. Repository layout

doc-extraction-agent/
├── CLAUDE.md                     # conventions & guardrails for the coding agent
├── README.md                     # quickstart + results table (eval evidence)
├── pyproject.toml                # project + dependency declarations (managed by uv)
├── uv.lock                       # resolved, pinned dependency lock (committed)
├── .python-version               # uv interpreter pin: 3.11 (committed)
├── .env.example                  # config template (no secrets committed)
├── docs/
│   ├── 01_requirements.md
│   ├── 02_architecture.md
│   ├── 03_data_and_extraction_spec.md
│   └── 05_build_plan.md
├── src/doc_agent/
│   ├── __init__.py
│   ├── config.py                 # loads env/config; selects backend
│   ├── core.py                   # process_document(): the reusable pipeline
│   ├── schema/
│   │   └── models.py             # Pydantic Document, LineItem
│   ├── parsing/
│   │   ├── detect.py             # modality detection
│   │   ├── docling_parser.py     # native PDF → text/layout
│   │   └── ocr.py                # image → text (optional path)
│   ├── backends/
│   │   ├── base.py               # ExtractionBackend protocol + factory
│   │   ├── gemini.py             # free-tier multimodal adapter
│   │   └── ollama.py             # local model adapter
│   ├── validation/
│   │   └── rules.py              # hard/soft rules → report
│   ├── routing/
│   │   └── score.py              # confidence + decision (pure)
│   ├── store/
│   │   ├── db.py                 # SQLite writer
│   │   └── export.py             # CSV export
│   ├── ingest/
│   │   └── watcher.py            # folder watcher / poll loop (batch entry)
│   └── web/
│       └── app.py                # Gradio demo (URL entry)
├── eval/
│   ├── run_eval.py               # metrics over labelled datasets
│   └── datasets/                 # download scripts / loaders (no data in git)
├── data/                         # gitignored: inbox/ processed/ review/ exports/
│   ├── inbox/
│   ├── processed/
│   ├── review/
│   └── exports/
└── tests/
    ├── test_validation.py
    ├── test_routing.py
    ├── test_schema.py
    └── test_core_smoke.py

2. Stack

  • Runtime: Python 3.11, pinned via .python-version (uv python pin 3.11). Chosen over 3.12 for the broadest wheel coverage across the Torch-based Docling stack and PaddleOCR/PaddlePaddle, which lag behind the newest Python releases. Declared range: requires-python = ">=3.11".
  • Package manager: uv (manages the venv, resolves and locks deps via uv.lock; add deps with uv add, run with uv run).
  • Parsing: docling (native PDF/scan structure). Optional OCR: paddleocr or pytesseract + system Tesseract.
  • Modeling: google-genai (Gemini free tier) and a local ollama server (e.g. qwen2.5:7b or a 3B variant) reached over HTTP.
  • Contract/validation: pydantic v2.
  • Web demo: gradio.
  • Storage: stdlib sqlite3 + csv.
  • Watcher: watchdog (or a stdlib poll loop for max portability).
  • Config: pydantic-settings / python-dotenv.
  • Testing: pytest.

Dependencies are declared in pyproject.toml and pinned via the committed uv.lock (uv sync installs exactly that lock). Do not float the model identifier in code; it is config (see guardrails).

3. Configuration (.env.example)

# Backend selection: "gemini" | "ollama"
EXTRACTION_BACKEND=gemini

# Gemini (free tier via Google AI Studio key; no card required)
GEMINI_API_KEY=
GEMINI_MODEL=gemini-flash-latest        # identifier is config, not hardcoded

# Ollama (local)
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5:7b

# Image handling: "vision_direct" | "ocr_then_text"
IMAGE_STRATEGY=vision_direct            # vision_direct requires a multimodal backend

# Routing
CONFIDENCE_THRESHOLD=0.85               # tuned via eval

# Paths (batch mode)
INBOX_DIR=./data/inbox
PROCESSED_DIR=./data/processed
REVIEW_DIR=./data/review
EXPORT_DIR=./data/exports
DB_PATH=./data/agent.db

config.py validates these at startup and fails fast with a clear message if, e.g., gemini is selected with no key, or vision_direct is selected with a text-only backend.
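
The fail-fast checks described above could look roughly like the sketch below. It uses only the standard library rather than pydantic-settings so that it runs on its own. The names Settings, ConfigError, load_settings and MULTIMODAL_BACKENDS are hypothetical, and treating the Ollama backend as text-only is an assumption drawn from the setup notes, not something the real config.py is known to do.

```python
from dataclasses import dataclass
from typing import Mapping

# Assumption: only the Gemini adapter accepts images directly; the Ollama
# path is treated as text-only because the setup notes pair it with
# IMAGE_STRATEGY=ocr_then_text.
MULTIMODAL_BACKENDS = {"gemini"}
VALID_BACKENDS = {"gemini", "ollama"}
VALID_STRATEGIES = {"vision_direct", "ocr_then_text"}


class ConfigError(ValueError):
    """Raised at startup when the configuration is inconsistent."""


@dataclass(frozen=True)
class Settings:
    backend: str
    image_strategy: str
    gemini_api_key: str
    confidence_threshold: float


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from an environment mapping, failing fast on bad combos."""
    backend = env.get("EXTRACTION_BACKEND", "gemini")
    strategy = env.get("IMAGE_STRATEGY", "vision_direct")
    api_key = env.get("GEMINI_API_KEY", "")
    threshold = float(env.get("CONFIDENCE_THRESHOLD", "0.85"))

    if backend not in VALID_BACKENDS:
        raise ConfigError(f"EXTRACTION_BACKEND must be one of {sorted(VALID_BACKENDS)}, got {backend!r}")
    if strategy not in VALID_STRATEGIES:
        raise ConfigError(f"IMAGE_STRATEGY must be one of {sorted(VALID_STRATEGIES)}, got {strategy!r}")
    if backend == "gemini" and not api_key:
        raise ConfigError("EXTRACTION_BACKEND=gemini requires GEMINI_API_KEY to be set")
    if strategy == "vision_direct" and backend not in MULTIMODAL_BACKENDS:
        raise ConfigError(f"IMAGE_STRATEGY=vision_direct needs a multimodal backend, got {backend!r}")
    if not 0.0 <= threshold <= 1.0:
        raise ConfigError(f"CONFIDENCE_THRESHOLD must be within [0, 1], got {threshold}")

    return Settings(backend, strategy, api_key, threshold)
```

Passing the environment in as a mapping (for example os.environ at startup) keeps the function pure, so each failure mode can be unit-tested without touching the real environment.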

4. Local setup

# 1. Pin the interpreter to 3.11 (writes .python-version; uv fetches it if absent)
uv python pin 3.11

# 2. Install (uv creates the venv on 3.11 and installs from pyproject.toml + uv.lock)
uv sync

# 3a. Gemini path: get a free AI Studio key, put it in .env
#     (free tier, no credit card; quota resets daily)

# 3b. Ollama path (offline/private):
#     install Ollama, then:
ollama pull qwen2.5:7b
#     set EXTRACTION_BACKEND=ollama and IMAGE_STRATEGY=ocr_then_text

# 4. Create working dirs
mkdir -p data/{inbox,processed,review,exports}
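
The brace expansion in step 4 is a bash/zsh feature and is not available in every shell. A portable alternative, sketched here with a hypothetical helper named ensure_working_dirs, creates the same four directories from the path variables in .env.example. The defaults mirror that file; everything else is an illustrative assumption.

```python
import os
from pathlib import Path
from typing import Mapping

# Defaults copied from .env.example; the helper itself is hypothetical.
DEFAULT_DIRS = {
    "INBOX_DIR": "./data/inbox",
    "PROCESSED_DIR": "./data/processed",
    "REVIEW_DIR": "./data/review",
    "EXPORT_DIR": "./data/exports",
}


def ensure_working_dirs(env: Mapping[str, str] = os.environ, root: Path = Path(".")) -> list[Path]:
    """Create each working directory if missing and return the resolved paths."""
    created = []
    for var, default in DEFAULT_DIRS.items():
        path = Path(root) / env.get(var, default)
        path.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
        created.append(path)
    return created
```

Calling ensure_working_dirs() from the project root has the same effect as the mkdir -p line, and it is safe to run repeatedly.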

5. Running

Autonomous batch mode:

uv run python -m doc_agent.ingest.watcher
# drop files into data/inbox/: accepted records land in SQLite + data/exports/,
# uncertain ones move to data/review/
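
The accept-or-review split described in the comments above can be pictured as a small pure function. This is only a sketch: the Decision enum, the decide function, the rule that any hard-rule failure forces review, and the inclusive >= comparison are all assumptions. The source states only that routing is pure and driven by CONFIDENCE_THRESHOLD.

```python
from enum import Enum


class Decision(str, Enum):
    ACCEPT = "accept"   # record is stored and exported
    REVIEW = "review"   # file is moved to the review folder for a human


def decide(confidence: float, hard_rules_passed: bool, threshold: float = 0.85) -> Decision:
    """Route a document from its confidence score and validation outcome.

    Assumptions: a failed hard rule always sends the document to review,
    and a confidence equal to the threshold counts as accepted.
    """
    if hard_rules_passed and confidence >= threshold:
        return Decision.ACCEPT
    return Decision.REVIEW
```

Because the function has no I/O, the watcher can call it and then perform the file move or database write itself, which keeps the routing logic trivially testable.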

Web demo (local):

uv run python -m doc_agent.web.app
# opens a Gradio URL; upload one document to see fields + confidence + decision

Evaluation:

uv run python eval/run_eval.py --dataset sroie --split holdout
# prints per-field precision/recall/F1 and auto-accept precision on critical fields
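
For readers unfamiliar with the per-field metrics, here is a minimal sketch of one way to compute them. It assumes exact string match per field, counts a wrong prediction as both a false positive and a false negative, and uses a hypothetical function name per_field_prf. The actual eval/run_eval.py may normalise values or score differently, and the auto-accept precision figure is not covered here.

```python
from collections import defaultdict


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def per_field_prf(gold_docs: list[dict], pred_docs: list[dict]) -> dict[str, dict[str, float]]:
    """Compute precision, recall and F1 for each field across aligned documents."""
    if len(gold_docs) != len(pred_docs):
        raise ValueError("gold and predicted document lists must have the same length")

    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold, pred in zip(gold_docs, pred_docs):
        for field in gold.keys() | pred.keys():
            g, p = gold.get(field), pred.get(field)
            c = counts[field]
            if p is not None and p == g:
                c["tp"] += 1
            else:
                if p is not None:
                    c["fp"] += 1  # predicted value is wrong or spurious
                if g is not None:
                    c["fn"] += 1  # gold value was not recovered

    results = {}
    for field, c in counts.items():
        precision = _ratio(c["tp"], c["tp"] + c["fp"])
        recall = _ratio(c["tp"], c["tp"] + c["fn"])
        f1 = _ratio(2 * precision * recall, precision + recall)
        results[field] = {"precision": precision, "recall": recall, "f1": f1}
    return results
```

Each document is represented as a flat dict of field name to extracted string; a missing key or a None value means the field was not extracted.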

6. Deployment to Hugging Face Spaces (free public demo URL)

  1. Create a new Space → SDK: Gradio (free CPU tier). Set the Space's Python to 3.11 (the python_version: "3.11" field in the Space README metadata) so the deployed runtime matches the pinned local interpreter.
  2. Add app.py at the Space root that imports and launches doc_agent.web.app (or copy the web entry there), plus a requirements.txt the Gradio builder can read. Generate it from the uv-managed project rather than hand-maintaining it: uv export --no-hashes --no-dev -o requirements.txt.
  3. Set Repository secrets in the Space: GEMINI_API_KEY, EXTRACTION_BACKEND=gemini, IMAGE_STRATEGY=vision_direct, GEMINI_MODEL=gemini-flash-latest.
  4. Push; the Space builds and serves a public URL.

Free-tier realities to design around (and to note in the UI):

  • CPU-only and the Space sleeps when idle → first request after idle has a cold start. This is why the cloud demo uses the Gemini API for inference rather than a local model, and why vision_direct (no heavy OCR in the Space) is the demo's image path.
  • Stateless: no persistent DB in the demo. Show the result; don't store it.
  • Privacy: the free Gemini tier may use inputs for training, so the demo must display a "synthetic/public documents only" notice and must not be used for real financial data.

7. What stays free

  • Inference: local Ollama (no quota, private) or Gemini free tier (~1,500 req/day, resets daily, no card), which is far above dev volume.
  • Hosting: Hugging Face Spaces free CPU tier for the public demo.
  • Storage: local SQLite/CSV; nothing paid.

No component requires a credit card or paid plan for development or demo.