# Project Setup, Stack & Deployment
## 1. Repository layout
```
doc-extraction-agent/
├── CLAUDE.md                  # conventions & guardrails for the coding agent
├── README.md                  # quickstart + results table (eval evidence)
├── pyproject.toml             # project + dependency declarations (managed by uv)
├── uv.lock                    # resolved, pinned dependency lock (committed)
├── .python-version            # uv interpreter pin: 3.11 (committed)
├── .env.example               # config template (no secrets committed)
├── docs/
│   ├── 01_requirements.md
│   ├── 02_architecture.md
│   ├── 03_data_and_extraction_spec.md
│   └── 05_build_plan.md
├── src/doc_agent/
│   ├── __init__.py
│   ├── config.py              # loads env/config; selects backend
│   ├── core.py                # process_document(): the reusable pipeline
│   ├── schema/
│   │   └── models.py          # Pydantic Document, LineItem
│   ├── parsing/
│   │   ├── detect.py          # modality detection
│   │   ├── docling_parser.py  # native PDF → text/layout
│   │   └── ocr.py             # image → text (optional path)
│   ├── backends/
│   │   ├── base.py            # ExtractionBackend protocol + factory
│   │   ├── gemini.py          # free-tier multimodal adapter
│   │   └── ollama.py          # local model adapter
│   ├── validation/
│   │   └── rules.py           # hard/soft rules → report
│   ├── routing/
│   │   └── score.py           # confidence + decision (pure)
│   ├── store/
│   │   ├── db.py              # SQLite writer
│   │   └── export.py          # CSV export
│   ├── ingest/
│   │   └── watcher.py         # folder watcher / poll loop (batch entry)
│   └── web/
│       └── app.py             # Gradio demo (URL entry)
├── eval/
│   ├── run_eval.py            # metrics over labelled datasets
│   └── datasets/              # download scripts / loaders (no data in git)
├── data/                      # gitignored: inbox/ processed/ review/ exports/
│   ├── inbox/
│   ├── processed/
│   ├── review/
│   └── exports/
└── tests/
    ├── test_validation.py
    ├── test_routing.py
    ├── test_schema.py
    └── test_core_smoke.py
```
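
The `backends/base.py` entry in the layout describes an `ExtractionBackend` protocol plus a factory. A minimal sketch of that shape might look like the following. It assumes a single `extract(text)` method and a name-keyed registry; the method signature, `EchoBackend`, `_REGISTRY`, and `make_backend` are illustrative names, not taken from the project.

```python
from typing import Callable, Protocol


class ExtractionBackend(Protocol):
    """Anything that can turn document text into a dict of raw fields."""

    name: str

    def extract(self, text: str) -> dict: ...


class EchoBackend:
    """Hypothetical stand-in adapter; a real one would call Gemini or Ollama."""

    name = "echo"

    def extract(self, text: str) -> dict:
        return {"raw_text": text}


# Registry mapping a backend name (e.g. the EXTRACTION_BACKEND value) to a constructor.
_REGISTRY: dict[str, Callable[[], ExtractionBackend]] = {"echo": EchoBackend}


def make_backend(name: str) -> ExtractionBackend:
    """Return the adapter registered under `name`, or fail listing the valid names."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {sorted(_REGISTRY)}"
        ) from None
```

Because the protocol is structural, the adapters in `gemini.py` and `ollama.py` would not need to inherit from anything; registering them under their names would be enough for the factory to select one from config.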
## 2. Stack
- **Runtime:** Python **3.11**, pinned via `.python-version` (`uv python pin
3.11`). Chosen over 3.12 for the broadest wheel coverage across the Torch-based
Docling stack and PaddleOCR/PaddlePaddle, which lag behind the newest Python
releases. Declared range: `requires-python = ">=3.11"`.
- **Package manager:** `uv` (manages the venv, resolves and locks deps via
`uv.lock`; add deps with `uv add`, run with `uv run`).
- **Parsing:** `docling` (native PDF/scan structure). Optional OCR:
`paddleocr` or `pytesseract` + system Tesseract.
- **Modeling:** `google-genai` (Gemini free tier) and a local `ollama` server
(e.g. `qwen2.5:7b` or a 3B variant) reached over HTTP.
- **Contract/validation:** `pydantic` v2.
- **Web demo:** `gradio`.
- **Storage:** stdlib `sqlite3` + `csv`.
- **Watcher:** `watchdog` (or a stdlib poll loop for max portability).
- **Config:** `pydantic-settings` / `python-dotenv`.
- **Testing:** `pytest`.

Dependencies are declared in `pyproject.toml` and pinned via the committed
`uv.lock`. `uv sync` installs from that lock; `uv sync --locked` additionally
fails if the lock is out of date with `pyproject.toml`. Do not float the model
identifier in code: it is config (see guardrails).
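
To make the stdlib-only storage choice concrete, here is a small sketch of a SQLite writer and a CSV export. The table name and the four columns are hypothetical placeholders (the real fields come from the `Document` schema), and an in-memory database stands in for `DB_PATH`.

```python
import csv
import io
import sqlite3

# Hypothetical columns for illustration; the real ones come from the Document schema.
COLUMNS = ("doc_id", "vendor", "total", "decision")


def init_db(conn: sqlite3.Connection) -> None:
    """Create the table on first use."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "doc_id TEXT PRIMARY KEY, vendor TEXT, total REAL, decision TEXT)"
    )


def save_record(conn: sqlite3.Connection, record: dict) -> None:
    """Insert one record, overwriting any earlier row with the same doc_id."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO documents "
            "VALUES (:doc_id, :vendor, :total, :decision)",
            record,
        )


def export_csv(conn: sqlite3.Connection, out) -> int:
    """Write a header plus all rows to the file-like `out`; return the row count."""
    rows = conn.execute(
        f"SELECT {', '.join(COLUMNS)} FROM documents ORDER BY doc_id"
    ).fetchall()
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return len(rows)


conn = sqlite3.connect(":memory:")  # a file path such as DB_PATH in practice
init_db(conn)
save_record(conn, {"doc_id": "a1", "vendor": "ACME", "total": 12.5, "decision": "accept"})
save_record(conn, {"doc_id": "a1", "vendor": "ACME", "total": 13.0, "decision": "accept"})
buffer = io.StringIO()
row_count = export_csv(conn, buffer)
```

Keying on `doc_id` with `INSERT OR REPLACE` makes re-processing the same document idempotent, which is one reasonable choice for a watcher that may see a file twice.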
## 3. Configuration (`.env.example`)
```
# Backend selection: "gemini" | "ollama"
EXTRACTION_BACKEND=gemini
# Gemini (free tier via Google AI Studio key; no card required)
GEMINI_API_KEY=
GEMINI_MODEL=gemini-flash-latest # identifier is config, not hardcoded
# Ollama (local)
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5:7b
# Image handling: "vision_direct" | "ocr_then_text"
IMAGE_STRATEGY=vision_direct # vision_direct requires a multimodal backend
# Routing
CONFIDENCE_THRESHOLD=0.85 # tuned via eval
# Paths (batch mode)
INBOX_DIR=./data/inbox
PROCESSED_DIR=./data/processed
REVIEW_DIR=./data/review
EXPORT_DIR=./data/exports
DB_PATH=./data/agent.db
```
`config.py` validates these at startup and fails fast with a clear message if,
e.g., `gemini` is selected with no key, or `vision_direct` is selected with a
text-only backend.
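
The fail-fast checks described above could be sketched as follows. This version uses only the standard library so it runs anywhere; the project itself lists `pydantic-settings`, so treat the `Settings` dataclass, `ConfigError`, and `load_settings` as illustrative names. It also assumes that only the Gemini adapter is multimodal and that the threshold must lie in [0, 1].

```python
import os
from dataclasses import dataclass

BACKENDS = {"gemini", "ollama"}
IMAGE_STRATEGIES = {"vision_direct", "ocr_then_text"}
# Assumption for this sketch: only the Gemini adapter accepts images directly.
MULTIMODAL_BACKENDS = {"gemini"}


class ConfigError(ValueError):
    """Raised at startup when the environment is inconsistent."""


@dataclass(frozen=True)
class Settings:
    backend: str
    image_strategy: str
    gemini_api_key: str
    confidence_threshold: float


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (default: os.environ) and validate them."""
    env = os.environ if env is None else env
    backend = env.get("EXTRACTION_BACKEND", "gemini")
    strategy = env.get("IMAGE_STRATEGY", "vision_direct")
    key = env.get("GEMINI_API_KEY", "")
    raw_threshold = env.get("CONFIDENCE_THRESHOLD", "0.85")

    if backend not in BACKENDS:
        raise ConfigError(f"EXTRACTION_BACKEND must be one of {sorted(BACKENDS)}")
    if strategy not in IMAGE_STRATEGIES:
        raise ConfigError(f"IMAGE_STRATEGY must be one of {sorted(IMAGE_STRATEGIES)}")
    if backend == "gemini" and not key:
        raise ConfigError("EXTRACTION_BACKEND=gemini requires GEMINI_API_KEY")
    if strategy == "vision_direct" and backend not in MULTIMODAL_BACKENDS:
        raise ConfigError(
            f"IMAGE_STRATEGY=vision_direct needs a multimodal backend, got {backend!r}"
        )
    try:
        threshold = float(raw_threshold)
    except ValueError:
        raise ConfigError("CONFIDENCE_THRESHOLD must be a number") from None
    if not 0.0 <= threshold <= 1.0:
        raise ConfigError("CONFIDENCE_THRESHOLD must be between 0 and 1")
    return Settings(backend, strategy, key, threshold)
```

Passing the environment in as a plain mapping keeps the validation testable without touching the real process environment.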
## 4. Local setup
```bash
# 1. Pin the interpreter to 3.11 (writes .python-version; uv fetches it if absent)
uv python pin 3.11
# 2. Install (uv creates the venv on 3.11 and installs from pyproject.toml + uv.lock)
uv sync
# 3a. Gemini path: get a free AI Studio key, put it in .env
# (free tier, no credit card; quota resets daily)
# 3b. Ollama path (offline/private):
# install Ollama, then:
ollama pull qwen2.5:7b
# set EXTRACTION_BACKEND=ollama and IMAGE_STRATEGY=ocr_then_text
# 4. Create working dirs
mkdir -p data/{inbox,processed,review,exports}
```
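
Before switching to the Ollama path, it can help to confirm that the server is reachable and the model has been pulled. The sketch below queries Ollama's `/api/tags` endpoint, which lists local models; the helper name `ollama_has_model` and the two-second timeout are assumptions for illustration.

```python
import http.client
import json
import urllib.request


def ollama_has_model(host: str, model: str, timeout: float = 2.0) -> bool:
    """Return True if the Ollama server at `host` lists `model` as available."""
    url = host.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        # Server down, bad host, timeout, or a non-JSON reply: treat as "not ready".
        return False
    if not isinstance(payload, dict):
        return False
    models = payload.get("models")
    if not isinstance(models, list):
        return False
    return any(isinstance(m, dict) and m.get("name") == model for m in models)
```

With the defaults from `.env.example`, the call would be `ollama_has_model("http://localhost:11434", "qwen2.5:7b")`.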
## 5. Running
**Autonomous batch mode:**
```bash
uv run python -m doc_agent.ingest.watcher
# drop files into data/inbox/; accepted records land in SQLite + data/exports/,
# uncertain ones move to data/review/
```
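
One pass of a stdlib poll loop for the batch flow above might look like this sketch. The `score` callable is a stand-in for the real pipeline (`process_document()` plus routing). Two details are assumptions, not taken from the source: moving accepted files to `processed/`, and treating a confidence equal to the threshold as an accept.

```python
import shutil
from pathlib import Path
from typing import Callable


def route(confidence: float, threshold: float) -> str:
    """Pure decision: accept at or above the threshold, otherwise send to review."""
    return "accept" if confidence >= threshold else "review"


def poll_once(
    inbox: Path,
    processed: Path,
    review: Path,
    score: Callable[[Path], float],
    threshold: float = 0.85,
) -> dict[str, str]:
    """Score every file currently in `inbox` and move it according to the decision."""
    decisions: dict[str, str] = {}
    for path in sorted(p for p in inbox.iterdir() if p.is_file()):
        decision = route(score(path), threshold)
        target_dir = processed if decision == "accept" else review
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        decisions[path.name] = decision
    return decisions
```

Calling `poll_once` repeatedly with a `time.sleep(...)` between passes gives the portable loop; a `watchdog` observer could trigger the same function on file-created events instead.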
**Web demo (local):**
```bash
uv run python -m doc_agent.web.app
# opens a Gradio URL; upload one document to see fields + confidence + decision
```
**Evaluation:**
```bash
uv run python eval/run_eval.py --dataset sroie --split holdout
# prints per-field precision/recall/F1 and auto-accept precision on critical fields
```
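
The metrics named in the comment above can be illustrated with a short sketch. It assumes exact-match scoring per field and reads "auto-accept precision" as the share of auto-accepted documents whose critical fields all match the label; both readings, and the function names, are assumptions about how `run_eval.py` might work.

```python
def field_prf(predictions: list[dict], labels: list[dict], field: str) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 for one field over aligned documents."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    tp = fp = fn = 0
    for pred, gold in zip(predictions, labels):
        p, g = pred.get(field), gold.get(field)
        if p is not None and p == g:
            tp += 1
        else:
            if p is not None:
                fp += 1  # extracted a value that is wrong or spurious
            if g is not None:
                fn += 1  # a labelled value was missed or extracted wrongly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auto_accept_precision(
    predictions: list[dict],
    labels: list[dict],
    accepted: list[bool],
    critical_fields: list[str],
) -> float:
    """Share of auto-accepted documents whose critical fields all match the label."""
    hits = total = 0
    for pred, gold, was_accepted in zip(predictions, labels, accepted):
        if not was_accepted:
            continue
        total += 1
        hits += all(pred.get(f) == gold.get(f) for f in critical_fields)
    return hits / total if total else 0.0
```

Auto-accept precision is the number to watch when tuning `CONFIDENCE_THRESHOLD`: raising the threshold should trade fewer auto-accepted documents for fewer wrong ones.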
## 6. Deployment to Hugging Face Spaces (free public demo URL)
1. Create a new **Space** → SDK: **Gradio** (free CPU tier). Set the Space's
Python to **3.11** (the `python_version: "3.11"` field in the Space README
metadata) so the deployed runtime matches the pinned local interpreter.
2. Add `app.py` at the Space root that imports and launches
`doc_agent.web.app` (or copy the web entry there), plus a `requirements.txt`
the Gradio builder can read. Generate it from the uv-managed project rather
than hand-maintaining it: `uv export --no-hashes --no-dev -o requirements.txt`.
3. Set **Repository secrets** in the Space: `GEMINI_API_KEY`,
`EXTRACTION_BACKEND=gemini`, `IMAGE_STRATEGY=vision_direct`,
`GEMINI_MODEL=gemini-flash-latest`.
4. Push; the Space builds and serves a public URL.

**Free-tier realities to design around (and to note in the UI):**
- CPU-only and the Space **sleeps when idle**, so the first request after idle
has a cold start. This is why the cloud demo uses the **Gemini API** for
inference rather than a local model, and why `vision_direct` (no heavy OCR in
the Space) is the demo's image path.
- **Stateless:** no persistent DB in the demo. Show the result; don't store it.
- **Privacy:** the free Gemini tier may use inputs for training, so the demo
must display a "synthetic/public documents only" notice and must not be used
for real financial data.
## 7. What stays free
- **Inference:** local Ollama (no quota, private) or Gemini free tier
(~1,500 req/day, resets daily, no card), far above dev volume.
- **Hosting:** Hugging Face Spaces free CPU tier for the public demo.
- **Storage:** local SQLite/CSV; nothing paid.

No component requires a credit card or paid plan for development or demo.