Spaces:

knzychw
/

document-extract-agent

Running

App Files Files Community

document-extract-agent / PROGRESS_FUTURE.md

kennethzychew

docs: update readme and add MIT license

e804596 2 days ago

preview code

Raw

History Blame Contribute Delete

6.73 kB

	# PROGRESS_FUTURE.md -- Post-Ship Backlog

	The project is shipped: full pipeline, autonomous watcher, measured eval
	(SROIE n=100), live demo on Hugging Face Spaces. Everything in this file is
	optional. It exists so any item can be picked up cold -- by you or a future
	Claude Code session -- with the same task format as the other ledgers.

	Priority rule: Tier 1 closes gaps in claims the README already makes.
	Tier 2 makes an existing claim demonstrably true. Tier 3 is genuine
	improvement with diminishing portfolio return. Work top-down; stopping after
	any tier leaves a coherent project.

	Protocol: same as PROGRESS_TOMORROW.md -- interactive, one task at a time,
	review the diff, verify on real documents where a model is touched, one commit
	per task, tick the box. Specs in `docs/` remain the source of truth; do not
	edit them casually. If an item needs design beyond what the docs cover, write
	the design when the item is next, not before.

	---

	## Tier 1 -- Close the measurement gap (do these first)

	The README claims three critical fields (`total`, `tax`, `invoice_number`) but
	only `total` is measured -- SROIE doesn't label the other two. These finish the
	evidence story. The harness, cache, and sweep already exist; each task is one
	adapter + one eval run.

	- [ ] F1 -- Wire the CORD adapter (measures `tax` + line items)
	Implement the scaffolded `eval/datasets/cord.py` against
	`naver-clova-ix/cord-v2`: parse the `ground_truth` JSON string, map
	`gt_parse` per the mapping documented in the scaffold (subtotal, tax, total,
	line_items, vendor where present). Labeled fields = only what CORD labels.
	Note: CORD receipts are Indonesian -- expect lower text-field accuracy; that
	is signal about multilingual behaviour, not a harness bug. Run predict on a
	SMALL slice (20) first, then 100. Costs Gemini quota -- run deliberately.
	Check: `uv run python -m eval.run_eval predict --dataset cord --limit 20`
	then `score --dataset cord` produces tables; tests still offline-green.
	Commit: `eval: wire CORD adapter (tax + line-item coverage)`

	- [ ] F2 -- Wire the invoice-JSON adapter (measures `invoice_number`)
	Implement `eval/datasets/invoice_json.py` against
	`mychen76/invoices-and-receipts_ocr_v1` (or `GokulRajaR/invoice-ocr-json`
	if the shape is cleaner -- probe both, pick one, document why). Map invoice
	number, dates, totals, tax per the scaffold. Same small-slice-first rule.
	Check: predict (20) + score produce tables including `invoice_number`.
	Commit: `eval: wire invoice-JSON adapter (invoice_number coverage)`

	- [ ] F3 -- Update the README results section
	Extend the results table to all three datasets; update the framing to state
	which critical fields are measured where; refresh the auto-accept precision
	claim if the numbers move it. Keep the honest caveats (confidence ceiling,
	slice sizes).
	Check: README table covers total/tax/invoice_number with dataset provenance.
	Commit: `docs: eval results across SROIE + CORD + invoices`

	## Tier 2 -- Make the offline claim true (T4 + T6, deferred from launch)

	The swappable-backend design currently has one real backend. These make
	"runs fully free, offline, and private" demonstrable rather than aspirational.

	- [ ] F4 -- OCR path (build plan 2.3; ledger T4)
	`src/doc_agent/parsing/ocr.py` behind the payload interface; wire into
	`acquire` for `IMAGE_STRATEGY=ocr_then_text`.
	DECISION: try `uv add paddleocr`; if it won't resolve on 3.11, fall back to
	`uv add pytesseract` + the Tesseract binary, and record the choice here.
	Check: a sample receipt image yields text; `process_document` runs in
	`ocr_then_text` mode with the stub backend (no model needed to test the path).
	Commit: `phase 2.3: OCR acquire path (ocr_then_text)`

	- [ ] F5 -- Ollama backend (build plan 2.6; ledger T6)
	Requires a local Ollama server + pulled model (e.g. `qwen2.5:7b`).
	`src/doc_agent/backends/ollama.py`: JSON-schema/grammar-constrained
	decoding, text-in (pairs with F4), registered in the factory, model id from
	config. Mocked unit tests + one manual smoke against the live server.
	Check: `EXTRACTION_BACKEND=ollama` + `IMAGE_STRATEGY=ocr_then_text` returns
	schema-valid data on a real receipt, fully offline.
	Commit: `phase 2.6: ollama backend (local/offline path)`

	- [ ] F6 -- Offline eval comparison (small, high-signal)
	Run the SROIE 20-slice through the Ollama path and add a one-row comparison
	to the README (Gemini vs local 7B on the same slice). This is the concrete
	payoff of the swappable design: same harness, two backends, honest numbers.
	Check: comparison row in README with slice size stated.
	Commit: `eval: gemini vs ollama comparison (SROIE-20)`

	## Tier 3 -- Genuine improvements, diminishing portfolio returns

	Defensible engineering; none changes how the project reads to a reviewer.
	Pick by interest, not obligation.

	- [ ] F7 -- Real confidence signal. Surface a usable model signal
	(logprobs where the API exposes them, or k-sample self-consistency voting)
	so `CONFIDENCE_THRESHOLD` becomes a live dial; re-run the sweep and update
	the README (this would retire the "confidence ceiling" caveat). Design
	needed before building: self-consistency multiplies per-document cost by k.
	- [ ] F8 -- Review-queue UI. A minimal local page over `review/`: show the
	document, the extraction, the validation failures; accept-with-edits writes
	to the store. Keeps the "not a product" scope -- single user, no auth.
	- [ ] F9 -- Watcher hardening. Bounded retries with backoff for transient
	backend failures, a dead-letter state distinct from review, and a startup
	reconciliation pass over files that arrived while the watcher was down.
	- [ ] F10 -- Second document domain. One new document type (e.g. utility
	bills or purchase orders): schema fields, validation rules, a small labeled
	eval slice. Proves the architecture generalizes beyond receipts/invoices.

	## Not doing, and why

	Explicit non-goals -- declining these is a design decision, not an omission:

	- Fine-tuning a model. The project's thesis is engineering around
	off-the-shelf models; fine-tuning is a different project and would compete
	on the one axis (benchmark F1) where purpose-built models win.
	- Multi-tenant / production deployment. Auth, queues, horizontal scale,
	SLAs -- out of scope per requirements section 4; the Space is a demo, not a service.
	- A full review application. F8 stays a single-user local page; workflow
	tooling, audit trails, and roles are product work, not portfolio work.
	- Chasing SROIE/CORD leaderboards. The datasets are the measuring
	instrument, not the objective (see README framing).