---
title: Document-Audit RAG
emoji: π
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.39.0"
app_file: app.py
---
# DocuAudit AI
**DocuAudit AI** is a production-oriented FastAPI backend plus optional Streamlit UI for **multi-document RAG**: upload documents, build a Chroma vector index, ask grounded questions with citations, and retain a **SQLite audit trail** of every query.
## Architecture
```mermaid
flowchart LR
subgraph ingest [Ingestion]
A[PDF / TXT / MD] --> B[Loader]
B --> C[Chunker]
C --> D[Embedder]
D --> E[(ChromaDB)]
end
subgraph query [Query path]
Q[User question] --> R[Semantic search]
R --> E
R --> T[Top-K chunks]
T --> L[LLM]
L --> U[Answer + citations]
end
U --> V[(SQLite audit)]
```
ASCII equivalent:
```
PDF Upload → Parser → Chunker → Embedder → ChromaDB
                                              ↓
User Query → Semantic Search → Top-K Chunks → LLM → Answer + Citations
                                                            ↓
                                                   Audit Log (SQLite)
```
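To make the query path concrete, here is a minimal sketch of the chunk, embed, and top-K steps in plain Python. It is not the project's implementation: the real pipeline uses an embedding model and ChromaDB, whereas this toy substitutes bag-of-words counts and cosine similarity. The names `chunk_text`, `embed`, `cosine`, and `top_k_chunks` are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into overlapping windows of `size` words (hypothetical chunker)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(question: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the question and keep the k best."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In the actual service the retrieved chunks, with their document names and page numbers, would be passed to the LLM as grounding context.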
## Use cases
- **Litigation document analysis**: trace claims to exact pages and filenames.
- **Corporate finance review**: compare disclosures and filings under a consistent audit log.
- **Investigation support**: bulk ingest, async jobs, and reproducible query history.
## Deploying on Hugging Face Spaces
- Set **`LLM_PROVIDER=huggingface`**; use **`HUGGINGFACE_API_KEY`** and/or the Space secret **`HF_TOKEN`** (see [`.env.example`](.env.example)).
- Use root **`app.py`** as the Streamlit entry for the default Hub command.
- Hub UI, secrets, hardware, and Streamlit SDK details: [Streamlit Spaces](https://huggingface.co/docs/hub/spaces-sdks-streamlit), [Spaces overview](https://huggingface.co/docs/hub/spaces-overview).
- **Test locally before deploy:** `uv run python scripts/verify_huggingface_inference.py` (requires `LLM_PROVIDER=huggingface` in `.env`).
## Quick start with Docker
Requires [Docker Engine](https://docs.docker.com/engine/) and Compose v2. The snippet below matches the shipped **`docker-compose.yml`**: API on **8000**, Streamlit on **8501**, with Chroma and SQLite under **`/data`** inside the API container. After **`docker compose up -d`**, expect **`curl http://localhost:8000/health`** to return JSON including **`"status":"ok"`**.
```bash
git clone <repository-url> doc-Audi-ai
cd doc-Audi-ai
cp .env.example .env
# edit .env as needed; for compose Ollama: OLLAMA_BASE_URL=http://ollama:11434
# (with host Ollama: run `ollama serve`; compose defaults to host.docker.internal:11434)
docker compose build
docker compose up -d
curl -s http://localhost:8000/health
# Streamlit UI: http://localhost:8501
docker compose down
```
Optional all-in-one Ollama in Compose: `docker compose --profile ollama up -d` (then set `OLLAMA_BASE_URL=http://ollama:11434` in `.env` and recreate containers).
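If you prefer to script the health check rather than use `curl`, something like the following sketch may help. It assumes only what is stated above: the API listens on port 8000 and `/health` returns JSON that includes `"status":"ok"`. The helper names `is_healthy` and `check_health` are hypothetical.

```python
import json
import urllib.request


def is_healthy(body: bytes) -> bool:
    """Return True when the health payload is a JSON object with status "ok"."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> bool:
    """Fetch /health; return False on connection errors or non-2xx responses."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return is_healthy(resp.read())
    except OSError:
        return False


if __name__ == "__main__":
    print("API healthy:", check_health())
```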
## How it works (user workflow)
Collections, ingestion vs querying, jobs vs audit, Streamlit tabs, and **per-button UI flows**: **[docs/USER_WORKFLOW.md](docs/USER_WORKFLOW.md)**.
## Run and test (step-by-step)
For ingestion formats, URL rules, job polling, sample `sample.txt` walkthrough, curl/PowerShell examples, and troubleshooting, see **[docs/RUN_AND_TEST_GUIDE.md](docs/RUN_AND_TEST_GUIDE.md)**.
For SQLite vs Memcached, offline DB inspection, and the Cursor **SQLite Viewer** extension (`qwtel.sqlite-viewer`), see **[docs/SQLITE_AND_DB_INSPECTION.md](docs/SQLITE_AND_DB_INSPECTION.md)**.
## Quick start (local, without Docker)
Run the API with **uv** (or your preferred tool):
```bash
git clone <repository-url> doc-Audi-ai
cd doc-Audi-ai
cp .env.example .env
uv sync
ollama pull llama3.1:8b
ollama pull nomic-embed-text
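# Start the API with auto-reload:
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
# Alternative: restrict reload watching to the api and storage directories:
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload --reload-dir api --reload-dir storage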
```
Optional UI:
```bash
uv run streamlit run streamlit_app.py --server.port 8501 --server.address 0.0.0.0
```
## API overview
| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Liveness; returns configured app name and version |
| POST | `/ingest/upload` | Multipart **`files`** (one or more); queues background ingest job |
| POST | `/ingest/url` | JSON **`urls`** array (1–100); download and queue ingest |
| GET | `/ingest/collections` | Lists collections with **`document_count`** and optional **`created_at`** |
| DELETE | `/ingest/collection/{collection_name}` | Drops a collection; returns **`documents_removed`** |
| GET | `/jobs` | Lists jobs with **`total`** count |
| GET | `/jobs/{job_id}` | Job status with **`progress_percent`**, file counters, timestamps, **`errors`** |
| POST | `/query/ask` | Grounded answer; request includes **`top_k`**, **`user_id`** |
| POST | `/query/summarise` | Collection summary; distinct response shape (`summary`, `document_count`, …) |
| POST | `/query` | Legacy alias of **`/query/ask`** |
| GET | `/audit/logs` | Filterable audit index (`user_id`, `from_date`, `to_date`, pagination) |
| GET | `/audit/logs/{query_id}` | Full stored answer and citations for one query |
Interactive docs: `http://localhost:8000/docs`.
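As an illustration of the audit filters listed in the table, the sketch below builds a `GET /audit/logs` URL from the `user_id`, `from_date`, and `to_date` parameters. The ISO-style date strings are an assumption, the helper name `audit_logs_url` is hypothetical, and pagination is omitted because its parameter names are not given here; check the interactive docs for the exact contract.

```python
from urllib.parse import urlencode


def audit_logs_url(base_url, user_id=None, from_date=None, to_date=None):
    """Build the /audit/logs URL, including only the filters that were supplied."""
    params = {"user_id": user_id, "from_date": from_date, "to_date": to_date}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url.rstrip('/')}/audit/logs"
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    # Date format is an assumption; the API may expect something different.
    print(audit_logs_url("http://localhost:8000", user_id="analyst_001", from_date="2026-05-01"))
```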
## Sample request and response (`POST /query/ask`)
Request:
```json
{
"question": "What were the key risk factors identified in the Q3 2023 financial report?",
"collection_name": "default",
"top_k": 5,
"user_id": "analyst_001"
}
```
Response (shape; values depend on your documents and model):
```json
{
"query_id": "uuid-string",
"question": "What were the key risk factors identified in the Q3 2023 financial report?",
"answer": "… grounded text with citations …",
"sources": [
{
"document_name": "q3_financial_report.pdf",
"page_number": 12,
"chunk_text": "Key risk factors include …",
"relevance_score": 0.91
}
],
"model_used": "llama3.1:8b",
"tokens_used": 0,
"response_time_ms": 1820,
"timestamp": "2026-05-03T12:00:00Z"
}
```
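A client might build this request and turn the returned `sources` into readable citations roughly as follows. This is a sketch using only the Python standard library and the field names from the sample above; the helpers `build_ask_request` and `format_citations` are hypothetical and not part of the project.

```python
import json
import urllib.request


def build_ask_request(base_url, question, collection_name="default", top_k=5, user_id="analyst_001"):
    """Prepare (but do not send) a POST /query/ask request with a JSON body."""
    body = json.dumps(
        {
            "question": question,
            "collection_name": collection_name,
            "top_k": top_k,
            "user_id": user_id,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/query/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def format_citations(response: dict) -> list[str]:
    """Render each source as 'name (p. N, score S)'; page may be missing."""
    lines = []
    for src in response.get("sources", []):
        page = src.get("page_number")
        where = f"p. {page}" if page is not None else "page n/a"
        lines.append(f"{src['document_name']} ({where}, score {src['relevance_score']:.2f})")
    return lines
```

To send the request, pass the result of `build_ask_request` to `urllib.request.urlopen` and decode the body with `json.load` before calling `format_citations`.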
## Design decisions
- **Source citations**: High-stakes review requires every substantive claim to be tied to **document name** and **page** (where available), not a free-floating model monologue.
- **Auditability**: Each ask/summarise persists **query id**, **user id**, timing, model id, token usage (when the provider exposes it), and serialized sources so regulators or counsel can reconstruct what the system returned.
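The auditability point can be illustrated with a small SQLite sketch. The table name, column names, and helpers below (`audit_log`, `record_query`, `load_query`) are hypothetical and chosen to mirror the fields listed above; the project's actual schema may differ.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per answered query, sources stored as JSON text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_log (
    query_id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    question TEXT NOT NULL,
    answer TEXT NOT NULL,
    sources_json TEXT NOT NULL,
    model_used TEXT,
    tokens_used INTEGER,
    response_time_ms INTEGER,
    timestamp TEXT NOT NULL
)
"""


def record_query(conn: sqlite3.Connection, response: dict, user_id: str) -> None:
    """Persist one query response so it can be reconstructed later."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            response["query_id"],
            user_id,
            response["question"],
            response["answer"],
            json.dumps(response.get("sources", [])),
            response.get("model_used"),
            response.get("tokens_used"),
            response.get("response_time_ms"),
            response.get("timestamp") or datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()


def load_query(conn: sqlite3.Connection, query_id: str):
    """Return the stored user, answer, and sources for one query, or None."""
    row = conn.execute(
        "SELECT user_id, answer, sources_json FROM audit_log WHERE query_id = ?",
        (query_id,),
    ).fetchone()
    if row is None:
        return None
    return {"user_id": row[0], "answer": row[1], "sources": json.loads(row[2])}
```

Storing the sources as serialized JSON keeps each audit row self-contained, so a reviewer can see exactly which chunks backed an answer even if the collection is later deleted.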
## Scale note
The architecture supports **high-volume document ingestion** through **async background jobs** (FastAPI `BackgroundTasks`) and persistent Chroma collections. The API tier is stateless, so it can be replicated once you add a shared vector store and job queue.
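The background-job pattern can be sketched without FastAPI. The example below swaps in a standard-library `ThreadPoolExecutor` for `BackgroundTasks` and tracks progress on an in-memory job record; the class `IngestJob`, its `files_total` and `files_processed` fields, and the functions `ingest_files` and `submit_job` are hypothetical, loosely mirroring the `progress_percent` and `errors` fields of `GET /jobs/{job_id}`.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class IngestJob:
    """In-memory job record (hypothetical; the service persists jobs in SQLite)."""
    job_id: str
    files_total: int
    files_processed: int = 0
    errors: list = field(default_factory=list)

    @property
    def progress_percent(self) -> int:
        if self.files_total == 0:
            return 100
        return int(100 * self.files_processed / self.files_total)


def ingest_files(job: IngestJob, files: list, ingest_one) -> None:
    """Process files one by one; a failure is recorded and does not stop the job."""
    for name in files:
        try:
            ingest_one(name)
        except Exception as exc:
            job.errors.append(f"{name}: {exc}")
        finally:
            job.files_processed += 1


def submit_job(executor: ThreadPoolExecutor, files: list, ingest_one):
    """Queue the ingest work and return immediately with the job record."""
    job = IngestJob(job_id=str(uuid.uuid4()), files_total=len(files))
    future = executor.submit(ingest_files, job, files, ingest_one)
    return job, future
```

An in-memory record like this only works within a single process; replicating the API tier would require moving job state to shared storage, as the note above says.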
## Tests
Automated API tests use **pytest** with isolated temp databases; they do **not** require a running server or Ollama.
```bash
uv sync
uv run pytest tests/ -q
```
Full guide (commands, coverage by file, mocks vs manual smoke tests, troubleshooting): **[docs/TESTING.md](docs/TESTING.md)**.
## Configuration
See **`.env.example`**. Common variables include `LLM_PROVIDER`, Ollama/OpenAI/Anthropic keys and models, `CHROMA_PERSIST_DIRECTORY`, `AUDIT_DB_PATH`, `JOBS_DB_PATH`, and upload limits (`MAX_FILE_SIZE_MB`; **`MAX_UPLOAD_SIZE_MB`** is accepted as an alias via settings normalization).
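The alias handling for the upload limit could look something like the sketch below. It is an assumption-laden illustration, not the project's settings code: the function name `resolve_max_file_size_mb`, the default of 50 MB, and the rule that `MAX_FILE_SIZE_MB` wins when both variables are set are all hypothetical.

```python
import os


def resolve_max_file_size_mb(env=None, default: int = 50) -> int:
    """Read MAX_FILE_SIZE_MB, falling back to the MAX_UPLOAD_SIZE_MB alias."""
    env = os.environ if env is None else env
    # Assumed precedence: the canonical name wins over the alias.
    raw = env.get("MAX_FILE_SIZE_MB") or env.get("MAX_UPLOAD_SIZE_MB")
    if not raw:
        return default  # hypothetical default, not taken from the project
    value = int(raw)
    if value <= 0:
        raise ValueError("upload limit must be a positive number of megabytes")
    return value
```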