| --- |
| title: Document-Audit RAG |
| emoji: π |
| colorFrom: blue |
| colorTo: indigo |
| sdk: streamlit |
| sdk_version: "1.39.0" |
| app_file: app.py |
| --- |
| |
| # DocuAudit AI |
|
|
| **DocuAudit AI** is a production-oriented FastAPI backend with an optional Streamlit UI for **multi-document RAG**: upload documents, build a Chroma vector index, ask grounded questions with citations, and retain a **SQLite audit trail** of every query. |
|
|
| ## Architecture |
|
|
| ```mermaid |
| flowchart LR |
| subgraph ingest [Ingestion] |
| A[PDF / TXT / MD] --> B[Loader] |
| B --> C[Chunker] |
| C --> D[Embedder] |
| D --> E[(ChromaDB)] |
| end |
| subgraph query [Query path] |
| Q[User question] --> R[Semantic search] |
| R --> E |
| R --> T[Top-K chunks] |
| T --> L[LLM] |
| L --> U[Answer + citations] |
| end |
| U --> V[(SQLite audit)] |
| ``` |
|
|
| ASCII equivalent: |
|
|
| ``` |
| PDF Upload → Parser → Chunker → Embedder → ChromaDB |
|                                               ↓ |
| User Query → Semantic Search → Top-K Chunks → LLM → Answer + Citations |
|                                                             ↓ |
|                                                     Audit Log (SQLite) |
| ``` |
|
|
| ## Use cases |
|
|
| - **Litigation document analysis**: trace claims to exact pages and filenames. |
| - **Corporate finance review**: compare disclosures and filings under a consistent audit log. |
| - **Investigation support**: bulk ingest, async jobs, and reproducible query history. |
|
|
| ## Deploying on Hugging Face Spaces |
|
|
| - Set **`LLM_PROVIDER=huggingface`**; use **`HUGGINGFACE_API_KEY`** and/or the Space secret **`HF_TOKEN`** (see [`.env.example`](.env.example)). |
| - Use root **`app.py`** as the Streamlit entry for the default Hub command. |
| - Hub UI, secrets, hardware, and Streamlit SDK details: [Streamlit Spaces](https://huggingface.co/docs/hub/spaces-sdks-streamlit), [Spaces overview](https://huggingface.co/docs/hub/spaces-overview). |
| - **Test locally before deploy:** `uv run python scripts/verify_huggingface_inference.py` (requires `LLM_PROVIDER=huggingface` in `.env`). |
|
|
| ## Quick start with Docker |
|
|
| Requires [Docker Engine](https://docs.docker.com/engine/) and Compose v2. The snippet below matches the shipped **`docker-compose.yml`**: API on **8000**, Streamlit on **8501**, with Chroma and SQLite under **`/data`** inside the API container. After **`docker compose up -d`**, expect **`curl http://localhost:8000/health`** to return JSON including **`"status":"ok"`**. |
|
|
| ```bash |
| git clone <repository-url> doc-Audi-ai |
| cd doc-Audi-ai |
| cp .env.example .env |
| # edit .env as needed; for compose Ollama: OLLAMA_BASE_URL=http://ollama:11434 |
| # (with host Ollama: run `ollama serve`; compose defaults to host.docker.internal:11434) |
| |
| docker compose build |
| docker compose up -d |
| curl -s http://localhost:8000/health |
| # Streamlit UI: http://localhost:8501 |
| docker compose down |
| ``` |
|
|
| Optional all-in-one Ollama in Compose: `docker compose --profile ollama up -d` (then set `OLLAMA_BASE_URL=http://ollama:11434` in `.env` and recreate containers). |
|
|
| ## How it works (user workflow) |
|
|
| Collections, ingestion vs querying, jobs vs audit, Streamlit tabs, and **per-button UI flows**: **[docs/USER_WORKFLOW.md](docs/USER_WORKFLOW.md)**. |
|
|
| ## Run and test (step-by-step) |
|
|
| For ingestion formats, URL rules, job polling, sample `sample.txt` walkthrough, curl/PowerShell examples, and troubleshooting, see **[docs/RUN_AND_TEST_GUIDE.md](docs/RUN_AND_TEST_GUIDE.md)**. |
|
|
| For SQLite vs Memcached, offline DB inspection, and the Cursor **SQLite Viewer** extension (`qwtel.sqlite-viewer`), see **[docs/SQLITE_AND_DB_INSPECTION.md](docs/SQLITE_AND_DB_INSPECTION.md)**. |
|
|
| ## Quick start (local, without Docker) |
|
|
| Run the API with **uv** (or your preferred tool): |
|
|
| ```bash |
| git clone <repository-url> doc-Audi-ai |
| cd doc-Audi-ai |
| cp .env.example .env |
| uv sync |
| ollama pull llama3.1:8b |
| ollama pull nomic-embed-text |
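| # Start the API with auto-reload |
| uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload |
| |
| # Alternative: limit the reloader to the api/ and storage/ directories |
| uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload --reload-dir api --reload-dir storage |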
| ``` |
|
|
| Optional UI: |
|
|
| ```bash |
| uv run streamlit run streamlit_app.py --server.port 8501 --server.address 0.0.0.0 |
| ``` |
|
|
| ## API overview |
|
|
| | Method | Path | Description | |
| |--------|------|-------------| |
| | GET | `/health` | Liveness; returns configured app name and version | |
| | POST | `/ingest/upload` | Multipart **`files`** (one or more); queues background ingest job | |
| | POST | `/ingest/url` | JSON **`urls`** array (1–100); download and queue ingest | |
| | GET | `/ingest/collections` | Lists collections with **`document_count`** and optional **`created_at`** | |
| | DELETE | `/ingest/collection/{collection_name}` | Drops a collection; returns **`documents_removed`** | |
| | GET | `/jobs` | Lists jobs with **`total`** count | |
| | GET | `/jobs/{job_id}` | Job status with **`progress_percent`**, file counters, timestamps, **`errors`** | |
| | POST | `/query/ask` | Grounded answer; request includes **`top_k`**, **`user_id`** | |
| | POST | `/query/summarise` | Collection summary; distinct response shape (`summary`, `document_count`, …) | |
| | POST | `/query` | Legacy alias of **`/query/ask`** | |
| | GET | `/audit/logs` | Filterable audit index (`user_id`, `from_date`, `to_date`, pagination) | |
| | GET | `/audit/logs/{query_id}` | Full stored answer and citations for one query | |
|
|
| Interactive docs: `http://localhost:8000/docs`. |
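
Because uploads only queue a background job, a client usually polls `GET /jobs/{job_id}` until ingestion finishes. The sketch below shows one way to do that with the standard library. The `status` field name and the terminal values `completed` and `failed` are assumptions for illustration (the table above only confirms `progress_percent`, file counters, timestamps, and `errors`), so check the interactive docs for the real values.

```python
import json
import time
import urllib.request
from typing import Callable

API_BASE = "http://localhost:8000"  # port from the Quick start sections

# ASSUMPTION: the job payload exposes a "status" field with these terminal values.
TERMINAL_STATUSES = {"completed", "failed"}


def fetch_job(job_id: str, base_url: str = API_BASE) -> dict:
    """GET /jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = fetch_job,
    interval_s: float = 2.0,
    max_polls: int = 150,
) -> dict:
    """Poll until the job reaches a terminal status or max_polls is exhausted."""
    for _ in range(max_polls):
        job = fetch(job_id)
        print(f"{job_id}: {job.get('progress_percent', 0)}%")
        if job.get("status") in TERMINAL_STATUSES:
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `fetch` parameter is injectable so the loop can be exercised without a running server; in normal use, `wait_for_job(job_id)` is enough.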
|
|
| ## Sample request and response (`POST /query/ask`) |
|
|
| Request: |
|
|
| ```json |
| { |
| "question": "What were the key risk factors identified in the Q3 2023 financial report?", |
| "collection_name": "default", |
| "top_k": 5, |
| "user_id": "analyst_001" |
| } |
| ``` |
|
|
| Response (shape; values depend on your documents and model): |
|
|
| ```json |
| { |
| "query_id": "uuid-string", |
| "question": "What were the key risk factors identified in the Q3 2023 financial report?", |
| "answer": "… grounded text with citations …", |
| "sources": [ |
| { |
| "document_name": "q3_financial_report.pdf", |
| "page_number": 12, |
| "chunk_text": "Key risk factors include …", |
| "relevance_score": 0.91 |
| } |
| ], |
| "model_used": "llama3.1:8b", |
| "tokens_used": 0, |
| "response_time_ms": 1820, |
| "timestamp": "2026-05-03T12:00:00Z" |
| } |
| ``` |
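
For scripted use, the request and response shown above can be handled from Python roughly as follows. This is a minimal sketch using only the standard library: `ask` and `format_citations` are hypothetical helper names, not part of the project, and the base URL assumes the local port from the Quick start sections.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # assumed local API address


def ask(question: str, collection_name: str = "default", top_k: int = 5,
        user_id: str = "analyst_001", base_url: str = API_BASE) -> dict:
    """POST the documented request body to /query/ask and return the JSON response."""
    body = json.dumps({
        "question": question,
        "collection_name": collection_name,
        "top_k": top_k,
        "user_id": user_id,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/query/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def format_citations(response: dict) -> list[str]:
    """Render each source as 'name, p. N (score 0.91)'; the page is omitted when absent."""
    lines = []
    for src in response.get("sources", []):
        page = src.get("page_number")
        where = f", p. {page}" if page is not None else ""
        lines.append(
            f"{src['document_name']}{where} (score {src['relevance_score']:.2f})"
        )
    return lines
```

With the API running, `format_citations(ask("What were the key risk factors?"))` would yield one line per entry in `sources`.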
|
|
| ## Design decisions |
|
|
| - **Source citations**: High-stakes review requires every substantive claim to be tied to **document name** and **page** (where available), not a free-floating model monologue. |
| - **Auditability**: Each ask/summarise persists **query id**, **user id**, timing, model id, token usage (when the provider exposes it), and serialized sources so regulators or counsel can reconstruct what the system returned. |
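
To make the auditability point concrete, here is a minimal sketch of how one query record could be persisted and read back with `sqlite3`. The table name, column names, and helper functions are assumptions chosen to mirror the fields listed above; they are not the project's actual schema.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Optional

# ASSUMPTION: table and column names are illustrative, not the project's real schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_log (
    query_id         TEXT PRIMARY KEY,
    user_id          TEXT NOT NULL,
    question         TEXT NOT NULL,
    answer           TEXT NOT NULL,
    sources_json     TEXT NOT NULL,
    model_used       TEXT,
    tokens_used      INTEGER,
    response_time_ms INTEGER,
    created_at       TEXT NOT NULL
)
"""


def record_query(conn: sqlite3.Connection, *, user_id: str, question: str,
                 answer: str, sources: list, model_used: str,
                 tokens_used: Optional[int], response_time_ms: int) -> str:
    """Persist one answered query and return its generated query id."""
    query_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (query_id, user_id, question, answer, json.dumps(sources),
         model_used, tokens_used, response_time_ms,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return query_id


def load_query(conn: sqlite3.Connection, query_id: str) -> Optional[dict]:
    """Return the stored record with sources decoded, or None if the id is unknown."""
    cur = conn.execute("SELECT * FROM audit_log WHERE query_id = ?", (query_id,))
    row = cur.fetchone()
    if row is None:
        return None
    record = dict(zip([col[0] for col in cur.description], row))
    record["sources"] = json.loads(record.pop("sources_json"))
    return record
```

Storing the sources as serialized JSON alongside the answer keeps each record self-contained, so a reviewer can reconstruct a response even after the underlying collection has changed. `tokens_used` is nullable because not every provider reports it.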
|
|
| ## Scale note |
|
|
| The architecture targets **high-volume document ingestion** through **async background jobs** (FastAPI `BackgroundTasks`) and persistent Chroma collections. The API tier is stateless, so it can be replicated once you add a shared vector store and job queue. |
|
|
| ## Tests |
|
|
| Automated API tests use **pytest** with isolated temp databases; they do **not** require a running server or Ollama. |
|
|
| ```bash |
| uv sync |
| uv run pytest tests/ -q |
| ``` |
|
|
| Full guide (commands, coverage by file, mocks vs manual smoke tests, troubleshooting): **[docs/TESTING.md](docs/TESTING.md)**. |
|
|
| ## Configuration |
|
|
| See **`.env.example`**. Common variables include `LLM_PROVIDER`, Ollama/OpenAI/Anthropic keys and models, `CHROMA_PERSIST_DIRECTORY`, `AUDIT_DB_PATH`, `JOBS_DB_PATH`, and upload limits (`MAX_FILE_SIZE_MB`; **`MAX_UPLOAD_SIZE_MB`** is accepted as an alias via settings normalization). |
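
The alias handling for the upload limit can be pictured with a small sketch. It assumes that `MAX_FILE_SIZE_MB` takes precedence when both variables are set and that a default of 50 MB applies when neither is; both are illustrative guesses, and `resolve_max_file_size_mb` is a hypothetical helper rather than the project's settings code.

```python
import os
from typing import Mapping

# ASSUMPTION: illustrative default; the real value lives in the project's settings.
DEFAULT_MAX_FILE_SIZE_MB = 50


def resolve_max_file_size_mb(env: Mapping[str, str] = os.environ) -> int:
    """Prefer MAX_FILE_SIZE_MB, then fall back to the MAX_UPLOAD_SIZE_MB alias."""
    for name in ("MAX_FILE_SIZE_MB", "MAX_UPLOAD_SIZE_MB"):
        raw = env.get(name)
        if raw is not None and raw.strip():
            return int(raw)
    return DEFAULT_MAX_FILE_SIZE_MB
```

Passing the environment as a mapping keeps the function easy to test, for example `resolve_max_file_size_mb({"MAX_UPLOAD_SIZE_MB": "25"})`.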
| |