Spaces:
Running
Running
| # PROGRESS_FUTURE.md -- Post-Ship Backlog | |
| The project is shipped: full pipeline, autonomous watcher, measured eval | |
| (SROIE n=100), live demo on Hugging Face Spaces. Everything in this file is | |
| optional. It exists so any item can be picked up cold -- by you or a future | |
| Claude Code session -- with the same task format as the other ledgers. | |
| **Priority rule:** Tier 1 closes gaps in claims the README already makes. | |
| Tier 2 makes an existing claim demonstrably true. Tier 3 is genuine | |
| improvement with diminishing portfolio return. Work top-down; stopping after | |
| any tier leaves a coherent project. | |
| **Protocol:** same as PROGRESS_TOMORROW.md -- interactive, one task at a time, | |
| review the diff, verify on real documents where a model is touched, one commit | |
| per task, tick the box. Specs in `docs/` remain the source of truth; do not | |
| edit them casually. If an item needs design beyond what the docs cover, write | |
| the design *when the item is next*, not before. | |
| --- | |
| ## Tier 1 -- Close the measurement gap (do these first) | |
| The README claims three critical fields (`total`, `tax`, `invoice_number`) but | |
| only `total` is measured -- SROIE doesn't label the other two. These finish the | |
| evidence story. The harness, cache, and sweep already exist; each task is one | |
| adapter + one eval run. | |
| - [ ] **F1 -- Wire the CORD adapter** (measures `tax` + line items) | |
| Implement the scaffolded `eval/datasets/cord.py` against | |
| `naver-clova-ix/cord-v2`: parse the `ground_truth` JSON string, map | |
| `gt_parse` per the mapping documented in the scaffold (subtotal, tax, total, | |
| line_items, vendor where present). Labeled fields = only what CORD labels. | |
| Note: CORD receipts are Indonesian -- expect lower text-field accuracy; that | |
| is signal about multilingual behaviour, not a harness bug. Run predict on a | |
| SMALL slice (20) first, then 100. Costs Gemini quota -- run deliberately. | |
| Check: `uv run python -m eval.run_eval predict --dataset cord --limit 20` | |
| then `score --dataset cord` produces tables; tests still offline-green. | |
| Commit: `eval: wire CORD adapter (tax + line-item coverage)` | |
| - [ ] **F2 -- Wire the invoice-JSON adapter** (measures `invoice_number`) | |
| Implement `eval/datasets/invoice_json.py` against | |
| `mychen76/invoices-and-receipts_ocr_v1` (or `GokulRajaR/invoice-ocr-json` | |
| if the shape is cleaner -- probe both, pick one, document why). Map invoice | |
| number, dates, totals, tax per the scaffold. Same small-slice-first rule. | |
| Check: predict (20) + score produce tables including `invoice_number`. | |
| Commit: `eval: wire invoice-JSON adapter (invoice_number coverage)` | |
| - [ ] **F3 -- Update the README results section** | |
| Extend the results table to all three datasets; update the framing to state | |
| which critical fields are measured where; refresh the auto-accept precision | |
| claim if the numbers move it. Keep the honest caveats (confidence ceiling, | |
| slice sizes). | |
| Check: README table covers total/tax/invoice_number with dataset provenance. | |
| Commit: `docs: eval results across SROIE + CORD + invoices` | |
| ## Tier 2 -- Make the offline claim true (T4 + T6, deferred from launch) | |
| The swappable-backend design currently has one real backend. These make | |
| "runs fully free, offline, and private" demonstrable rather than aspirational. | |
| - [ ] **F4 -- OCR path** (build plan 2.3; ledger T4) | |
| `src/doc_agent/parsing/ocr.py` behind the payload interface; wire into | |
| `acquire` for `IMAGE_STRATEGY=ocr_then_text`. | |
| DECISION: try `uv add paddleocr`; if it won't resolve on 3.11, fall back to | |
| `uv add pytesseract` + the Tesseract binary, and record the choice here. | |
| Check: a sample receipt image yields text; `process_document` runs in | |
| `ocr_then_text` mode with the stub backend (no model needed to test the path). | |
| Commit: `phase 2.3: OCR acquire path (ocr_then_text)` | |
| - [ ] **F5 -- Ollama backend** (build plan 2.6; ledger T6) | |
| Requires a local Ollama server + pulled model (e.g. `qwen2.5:7b`). | |
| `src/doc_agent/backends/ollama.py`: JSON-schema/grammar-constrained | |
| decoding, text-in (pairs with F4), registered in the factory, model id from | |
| config. Mocked unit tests + one manual smoke against the live server. | |
| Check: `EXTRACTION_BACKEND=ollama` + `IMAGE_STRATEGY=ocr_then_text` returns | |
| schema-valid data on a real receipt, fully offline. | |
| Commit: `phase 2.6: ollama backend (local/offline path)` | |
| - [ ] **F6 -- Offline eval comparison** (small, high-signal) | |
| Run the SROIE 20-slice through the Ollama path and add a one-row comparison | |
| to the README (Gemini vs local 7B on the same slice). This is the concrete | |
| payoff of the swappable design: same harness, two backends, honest numbers. | |
| Check: comparison row in README with slice size stated. | |
| Commit: `eval: gemini vs ollama comparison (SROIE-20)` | |
| ## Tier 3 -- Genuine improvements, diminishing portfolio returns | |
| Defensible engineering; none changes how the project reads to a reviewer. | |
| Pick by interest, not obligation. | |
| - [ ] **F7 -- Real confidence signal.** Surface a usable model signal | |
| (logprobs where the API exposes them, or k-sample self-consistency voting) | |
| so `CONFIDENCE_THRESHOLD` becomes a live dial; re-run the sweep and update | |
| the README (this would retire the "confidence ceiling" caveat). Design | |
| needed before building: self-consistency multiplies per-document cost by k. | |
| - [ ] **F8 -- Review-queue UI.** A minimal local page over `review/`: show the | |
| document, the extraction, the validation failures; accept-with-edits writes | |
| to the store. Keeps the "not a product" scope -- single user, no auth. | |
| - [ ] **F9 -- Watcher hardening.** Bounded retries with backoff for transient | |
| backend failures, a dead-letter state distinct from review, and a startup | |
| reconciliation pass over files that arrived while the watcher was down. | |
| - [ ] **F10 -- Second document domain.** One new document type (e.g. utility | |
| bills or purchase orders): schema fields, validation rules, a small labeled | |
| eval slice. Proves the architecture generalizes beyond receipts/invoices. | |
| ## Not doing, and why | |
| Explicit non-goals -- declining these is a design decision, not an omission: | |
| - **Fine-tuning a model.** The project's thesis is engineering *around* | |
| off-the-shelf models; fine-tuning is a different project and would compete | |
| on the one axis (benchmark F1) where purpose-built models win. | |
| - **Multi-tenant / production deployment.** Auth, queues, horizontal scale, | |
| SLAs -- out of scope per requirements section 4; the Space is a demo, not a service. | |
| - **A full review application.** F8 stays a single-user local page; workflow | |
| tooling, audit trails, and roles are product work, not portfolio work. | |
| - **Chasing SROIE/CORD leaderboards.** The datasets are the measuring | |
| instrument, not the objective (see README framing). | |