
document-extract-agent / docs /04_project_setup.md

kennethzychew

seed: specs + loop scaffolding

3a5b10f 5 days ago


	# Project Setup, Stack & Deployment

	## 1. Repository layout

	```
	doc-extraction-agent/
	├── CLAUDE.md                   # conventions & guardrails for the coding agent
	├── README.md                   # quickstart + results table (eval evidence)
	├── pyproject.toml              # project + dependency declarations (managed by uv)
	├── uv.lock                     # resolved, pinned dependency lock (committed)
	├── .python-version             # uv interpreter pin: 3.11 (committed)
	├── .env.example                # config template (no secrets committed)
	├── docs/
	│   ├── 01_requirements.md
	│   ├── 02_architecture.md
	│   ├── 03_data_and_extraction_spec.md
	│   ├── 04_project_setup.md
	│   └── 05_build_plan.md
	├── src/doc_agent/
	│   ├── __init__.py
	│   ├── config.py               # loads env/config; selects backend
	│   ├── core.py                 # process_document(): the reusable pipeline
	│   ├── schema/
	│   │   └── models.py           # Pydantic Document, LineItem
	│   ├── parsing/
	│   │   ├── detect.py           # modality detection
	│   │   ├── docling_parser.py   # native PDF → text/layout
	│   │   └── ocr.py              # image → text (optional path)
	│   ├── backends/
	│   │   ├── base.py             # ExtractionBackend protocol + factory
	│   │   ├── gemini.py           # free-tier multimodal adapter
	│   │   └── ollama.py           # local model adapter
	│   ├── validation/
	│   │   └── rules.py            # hard/soft rules → report
	│   ├── routing/
	│   │   └── score.py            # confidence + decision (pure)
	│   ├── store/
	│   │   ├── db.py               # SQLite writer
	│   │   └── export.py           # CSV export
	│   ├── ingest/
	│   │   └── watcher.py          # folder watcher / poll loop (batch entry)
	│   └── web/
	│       └── app.py              # Gradio demo (URL entry)
	├── eval/
	│   ├── run_eval.py             # metrics over labelled datasets
	│   └── datasets/               # download scripts / loaders (no data in git)
	├── data/                       # gitignored: inbox/ processed/ review/ exports/
	│   ├── inbox/
	│   ├── processed/
	│   ├── review/
	│   └── exports/
	└── tests/
	    ├── test_validation.py
	    ├── test_routing.py
	    ├── test_schema.py
	    └── test_core_smoke.py
	```
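The layout describes `backends/base.py` as an "ExtractionBackend protocol + factory". A minimal sketch of that pattern might look like the following. The `extract` method, the `EchoBackend` stand-in, and the `make_backend` name are hypothetical and chosen only for illustration; the real module may differ.

```python
from typing import Callable, Protocol


class ExtractionBackend(Protocol):
    """Anything that can turn document text into a dict of extracted fields."""

    name: str

    def extract(self, text: str) -> dict: ...


class EchoBackend:
    """Stand-in backend for illustration; a real adapter would call Gemini or Ollama."""

    name = "echo"

    def extract(self, text: str) -> dict:
        return {"raw_text": text}


# Registry mapping a configured backend name to a constructor (hypothetical).
_REGISTRY: dict[str, Callable[[], ExtractionBackend]] = {"echo": EchoBackend}


def make_backend(name: str) -> ExtractionBackend:
    """Factory: look up a backend by its configured name, failing loudly on typos."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {sorted(_REGISTRY)}"
        ) from None
```

Because `ExtractionBackend` is a structural protocol, adapters do not need to inherit from it; any class with a matching `name` and `extract` satisfies it.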

	## 2. Stack

	- Runtime: Python 3.11, pinned via `.python-version` (`uv python pin 3.11`).
	Chosen over 3.12 for broadest wheel coverage across the Torch-based
	Docling stack and PaddleOCR/PaddlePaddle, which tend to lag the newest
	Python releases. Declared range: `requires-python = ">=3.11"`.
	- Package manager: `uv` (manages the venv, resolves and locks deps via
	`uv.lock`; add deps with `uv add`, run with `uv run`).
	- Parsing: `docling` (native PDF/scan structure). Optional OCR:
	`paddleocr` or `pytesseract` + system Tesseract.
	- Modeling: `google-genai` (Gemini free tier) and a local `ollama` server
	(e.g. `qwen2.5:7b` or a 3B variant) reached over HTTP.
	- Contract/validation: `pydantic` v2.
	- Web demo: `gradio`.
	- Storage: stdlib `sqlite3` + `csv`.
	- Watcher: `watchdog` (or a stdlib poll loop for max portability).
	- Config: `pydantic-settings` / `python-dotenv`.
	- Testing: `pytest`.

	Dependencies are declared in `pyproject.toml` and pinned via the committed
	`uv.lock` (`uv sync` installs exactly that lock). Do not hardcode the model
	identifier; it is config (see the guardrails in `CLAUDE.md`).
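To illustrate the "stdlib `sqlite3` + `csv`" storage entry in the stack list, here is a minimal sketch of a writer and an exporter. The table name and the `vendor`/`total`/`decision` columns are assumptions made for the example, not the project's actual schema.

```python
import csv
import sqlite3
from pathlib import Path


def save_record(conn: sqlite3.Connection, record: dict) -> None:
    """Insert one extracted record; the table is created on first use."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, vendor TEXT, total REAL, decision TEXT)"
    )
    conn.execute(
        "INSERT INTO documents (vendor, total, decision) VALUES (?, ?, ?)",
        (record["vendor"], record["total"], record["decision"]),
    )
    conn.commit()


def export_csv(conn: sqlite3.Connection, out_path: Path) -> int:
    """Dump every stored record to CSV and return the number of data rows written."""
    cursor = conn.execute(
        "SELECT id, vendor, total, decision FROM documents ORDER BY id"
    )
    header = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Parameterized queries (`?` placeholders) keep extracted strings from being interpreted as SQL, which matters when field values come from untrusted documents.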

	## 3. Configuration (`.env.example`)

	```
	# Backend selection: "gemini" | "ollama"
	EXTRACTION_BACKEND=gemini

	# Gemini (free tier via Google AI Studio key; no card required)
	GEMINI_API_KEY=
	GEMINI_MODEL=gemini-flash-latest # identifier is config, not hardcoded

	# Ollama (local)
	OLLAMA_HOST=http://localhost:11434
	OLLAMA_MODEL=qwen2.5:7b

	# Image handling: "vision_direct" | "ocr_then_text"
	IMAGE_STRATEGY=vision_direct # vision_direct requires a multimodal backend

	# Routing
	CONFIDENCE_THRESHOLD=0.85 # tuned via eval

	# Paths (batch mode)
	INBOX_DIR=./data/inbox
	PROCESSED_DIR=./data/processed
	REVIEW_DIR=./data/review
	EXPORT_DIR=./data/exports
	DB_PATH=./data/agent.db
	```

	`config.py` validates these at startup and fails fast with a clear message if,
	e.g., `gemini` is selected with no key, or `vision_direct` is selected with a
	text-only backend.
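The fail-fast checks described above can be sketched with the standard library alone. This is an illustration, not the project's `config.py`: the stack lists `pydantic-settings`, which the real module may use instead, and the `Settings`, `ConfigError`, and `load_settings` names are hypothetical. It also assumes the Ollama backend is text-only.

```python
from dataclasses import dataclass

# Assumption: only the Gemini backend accepts images directly.
MULTIMODAL_BACKENDS = {"gemini"}
KNOWN_BACKENDS = {"gemini", "ollama"}
KNOWN_STRATEGIES = {"vision_direct", "ocr_then_text"}


class ConfigError(ValueError):
    """Raised at startup when the environment is inconsistent."""


@dataclass(frozen=True)
class Settings:
    backend: str
    image_strategy: str
    gemini_api_key: str
    confidence_threshold: float


def load_settings(env: dict[str, str]) -> Settings:
    """Build Settings from an environment mapping, failing fast on bad combinations."""
    backend = env.get("EXTRACTION_BACKEND", "gemini")
    strategy = env.get("IMAGE_STRATEGY", "vision_direct")
    key = env.get("GEMINI_API_KEY", "")
    threshold = float(env.get("CONFIDENCE_THRESHOLD", "0.85"))

    if backend not in KNOWN_BACKENDS:
        raise ConfigError(f"EXTRACTION_BACKEND must be one of {sorted(KNOWN_BACKENDS)}")
    if strategy not in KNOWN_STRATEGIES:
        raise ConfigError(f"IMAGE_STRATEGY must be one of {sorted(KNOWN_STRATEGIES)}")
    if backend == "gemini" and not key:
        raise ConfigError("EXTRACTION_BACKEND=gemini requires GEMINI_API_KEY to be set")
    if strategy == "vision_direct" and backend not in MULTIMODAL_BACKENDS:
        raise ConfigError(
            f"IMAGE_STRATEGY=vision_direct needs a multimodal backend, not {backend!r}"
        )
    if not 0.0 <= threshold <= 1.0:
        raise ConfigError("CONFIDENCE_THRESHOLD must be between 0 and 1")
    return Settings(backend, strategy, key, threshold)
```

At startup this would typically be called as `load_settings(dict(os.environ))`, so a misconfigured deployment stops before any document is processed.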

	## 4. Local setup

	```bash
	# 1. Pin the interpreter to 3.11 (writes .python-version; uv fetches it if absent)
	uv python pin 3.11

	# 2. Install (uv creates the venv on 3.11 and installs from pyproject.toml + uv.lock)
	uv sync

	# 3a. Gemini path: get a free AI Studio key, put it in .env
	# (free tier, no credit card; quota resets daily)

	# 3b. Ollama path (offline/private):
	# install Ollama, then:
	ollama pull qwen2.5:7b
	# set EXTRACTION_BACKEND=ollama and IMAGE_STRATEGY=ocr_then_text

	# 4. Create working dirs
	mkdir -p data/{inbox,processed,review,exports}
	```
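Once the Ollama model is pulled, a quick way to confirm the local server answers is a single call to its `/api/generate` endpoint. The sketch below uses only the standard library; the helper names are hypothetical, and the project's `ollama.py` adapter may be structured differently.

```python
import json
import urllib.request


def build_generate_request(host: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(host: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the model's text; needs a running Ollama server."""
    request = build_generate_request(host, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["response"]
```

For example, `generate("http://localhost:11434", "qwen2.5:7b", "Say OK")` should return a short reply when the server is up. Host and model are passed in rather than hardcoded, matching the rule that the model identifier lives in config.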

	## 5. Running

	Autonomous batch mode:

	```bash
	uv run python -m doc_agent.ingest.watcher
	# drop files into data/inbox/ — accepted records land in SQLite + data/exports/,
	# uncertain ones move to data/review/
	```
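The batch behaviour above (accept or send to review) can be sketched as a pure decision function plus one polling pass. The rule shown, the `score` callback, and the choice to move accepted files into the processed directory are assumptions for illustration; the actual `watcher.py` and `score.py` may work differently.

```python
import shutil
from pathlib import Path
from typing import Callable


def decide(confidence: float, hard_failures: int, threshold: float = 0.85) -> str:
    """Pure routing rule: accept only if no hard rule failed and confidence clears the threshold."""
    if hard_failures == 0 and confidence >= threshold:
        return "accept"
    return "review"


def poll_once(
    inbox: Path,
    processed: Path,
    review: Path,
    score: Callable[[Path], tuple[float, int]],
    threshold: float = 0.85,
) -> dict[str, str]:
    """Handle every file currently in the inbox and move it according to the decision."""
    outcomes: dict[str, str] = {}
    # Snapshot the listing first so moving files does not disturb iteration.
    for path in sorted(p for p in inbox.iterdir() if p.is_file()):
        confidence, hard_failures = score(path)
        decision = decide(confidence, hard_failures, threshold)
        target_dir = processed if decision == "accept" else review
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        outcomes[path.name] = decision
    return outcomes
```

A stdlib poll loop would call `poll_once` repeatedly with a `time.sleep` between passes. Keeping `decide` free of I/O makes the routing rule easy to unit-test on its own.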

	Web demo (local):

	```bash
	uv run python -m doc_agent.web.app
	# opens a Gradio URL; upload one document to see fields + confidence + decision
	```

	Evaluation:

	```bash
	uv run python eval/run_eval.py --dataset sroie --split holdout
	# prints per-field precision/recall/F1 and auto-accept precision on critical fields
	```
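As a rough illustration of the per-field metrics the eval prints, the sketch below computes precision, recall, and F1 under exact-match scoring. The scoring convention (a wrong value counts as both a false positive and a false negative; a missing prediction only as a false negative) is an assumption, and `eval/run_eval.py` may normalize or match values differently.

```python
from collections import Counter


def per_field_prf(
    predictions: list[dict], gold: list[dict], fields: list[str]
) -> dict[str, dict[str, float]]:
    """Exact-match precision/recall/F1 per field over aligned prediction/gold records."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, truth in zip(predictions, gold):
        for field in fields:
            p, t = pred.get(field), truth.get(field)
            if p is not None and p == t:
                tp[field] += 1
            else:
                if p is not None:
                    fp[field] += 1  # predicted a wrong or spurious value
                if t is not None:
                    fn[field] += 1  # gold value was missed or mis-extracted
    results: dict[str, dict[str, float]] = {}
    for field in fields:
        predicted = tp[field] + fp[field]
        actual = tp[field] + fn[field]
        precision = tp[field] / predicted if predicted else 0.0
        recall = tp[field] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        results[field] = {"precision": precision, "recall": recall, "f1": f1}
    return results
```

Auto-accept precision on critical fields could be computed the same way by restricting `predictions` and `gold` to the records the router accepted.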

	## 6. Deployment to Hugging Face Spaces (free public demo URL)

	1. Create a new Space → SDK: Gradio (free CPU tier). Set the Space's
	Python to 3.11 (the `python_version: "3.11"` field in the Space README
	metadata) so the deployed runtime matches the pinned local interpreter.
	2. Add `app.py` at the Space root that imports and launches
	`doc_agent.web.app` (or copy the web entry there), plus a `requirements.txt`
	the Gradio builder can read — generate it from the uv-managed project rather
	than hand-maintaining it: `uv export --no-hashes --no-dev -o requirements.txt`.
	3. In the Space settings, add `GEMINI_API_KEY` as a secret, and set
	`EXTRACTION_BACKEND=gemini`, `IMAGE_STRATEGY=vision_direct`, and
	`GEMINI_MODEL=gemini-flash-latest` as variables (or secrets; both reach
	the app as environment variables).
	4. Push; the Space builds and serves a public URL.

	Free-tier realities to design around (and to note in the UI):

	- CPU-only and the Space sleeps when idle → first request after idle has a
	cold start. This is why the cloud demo uses the Gemini API for inference
	rather than a local model, and why `vision_direct` (no heavy OCR in the
	Space) is the demo's image path.
	- Stateless: no persistent DB in the demo. Show the result; don't store it.
	- Privacy: the free Gemini tier may use inputs for training, so the demo
	must display a "synthetic/public documents only" notice and must not be used
	for real financial data.

	## 7. What stays free

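	- Inference: local Ollama (no quota, private) or Gemini free tier
	(~1,500 req/day when written, resets daily, no card; the quota varies by
	model and can change, so check the current AI Studio limits), which is far
	above dev volume.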
	- Hosting: Hugging Face Spaces free CPU tier for the public demo.
	- Storage: local SQLite/CSV; nothing paid.

	No component requires a credit card or paid plan for development or demo.