---
title: Document-Audit RAG
emoji: 📑
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.39.0"
app_file: app.py
---

# DocuAudit AI

**DocuAudit AI** is a production-oriented FastAPI backend plus optional Streamlit UI for **multi-document RAG**: upload documents, build a Chroma vector index, ask grounded questions with citations, and retain a **SQLite audit trail** of every query.

## Architecture

```mermaid
flowchart LR
  subgraph ingest [Ingestion]
    A[PDF / TXT / MD] --> B[Loader]
    B --> C[Chunker]
    C --> D[Embedder]
    D --> E[(ChromaDB)]
  end
  subgraph query [Query path]
    Q[User question] --> R[Semantic search]
    R --> E
    R --> T[Top-K chunks]
    T --> L[LLM]
    L --> U[Answer + citations]
  end
  U --> V[(SQLite audit)]
```

ASCII equivalent:

```
PDF Upload → Parser → Chunker → Embedder → ChromaDB
                                              ↓
User Query → Semantic Search → Top-K Chunks → LLM → Answer + Citations
                                              ↓
                                       Audit Log (SQLite)
```
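
To make the chunk, embed, and retrieve steps concrete, here is a minimal, dependency-free sketch. It is not the project's code: `chunk_text`, `embed`, `cosine`, and `top_k` are hypothetical names, the window sizes are arbitrary, and the bag-of-words `embed` is only a stand-in for a real embedding model and a Chroma query.

```python
import math
from collections import Counter


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative sizes)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumption standing in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Score every chunk against the question and keep the k best matches."""
    query_vec = embed(question)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

In the real pipeline the selected chunks, together with their document name and page, would be passed to the LLM as context so that the answer can cite them.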

## Use cases

- **Litigation document analysis**: trace claims to exact pages and filenames.
- **Corporate finance review**: compare disclosures and filings under a consistent audit log.
- **Investigation support**: bulk ingest, async jobs, and reproducible query history.

## Deploying on Hugging Face Spaces

- Set **`LLM_PROVIDER=huggingface`**; use **`HUGGINGFACE_API_KEY`** and/or the Space secret **`HF_TOKEN`** (see [`.env.example`](.env.example)).
- Use root **`app.py`** as the Streamlit entry for the default Hub command.
- Hub UI, secrets, hardware, and Streamlit SDK details: [Streamlit Spaces](https://huggingface.co/docs/hub/spaces-sdks-streamlit), [Spaces overview](https://huggingface.co/docs/hub/spaces-overview).
- **Test locally before deploy:** `uv run python scripts/verify_huggingface_inference.py` (requires `LLM_PROVIDER=huggingface` in `.env`).

## Quick start with Docker

Requires [Docker Engine](https://docs.docker.com/engine/) and Compose v2. The snippet below matches the shipped **`docker-compose.yml`**: API on **8000**, Streamlit on **8501**, with Chroma and SQLite under **`/data`** inside the API container. After **`docker compose up -d`**, expect **`curl http://localhost:8000/health`** to return JSON including **`"status":"ok"`**.

```bash
git clone <repository-url> doc-Audi-ai
cd doc-Audi-ai
cp .env.example .env
# edit .env as needed; for compose Ollama: OLLAMA_BASE_URL=http://ollama:11434
# (with host Ollama: run `ollama serve`; compose defaults to host.docker.internal:11434)

docker compose build
docker compose up -d
curl -s http://localhost:8000/health
# Streamlit UI: http://localhost:8501
docker compose down
```

Optional all-in-one Ollama in Compose: `docker compose --profile ollama up -d` (then set `OLLAMA_BASE_URL=http://ollama:11434` in `.env` and recreate containers).
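
Containers can take a few seconds to accept connections after `docker compose up -d`, so a script may want to wait for `/health` instead of calling it once. The following is a sketch using only the standard library; `fetch_json` and `wait_for_health` are hypothetical helper names, and the retry count and delay are arbitrary choices rather than project defaults.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_health(
    url: str = "http://localhost:8000/health",
    attempts: int = 30,
    delay_s: float = 1.0,
    fetch: Callable[[str], dict] = fetch_json,
) -> dict:
    """Poll the health endpoint until it reports status "ok" or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            body = fetch(url)
            if body.get("status") == "ok":
                return body
        except (OSError, ValueError) as exc:
            # Connection refused or a non-JSON body: the API is probably still starting.
            last_error = exc
        time.sleep(delay_s)
    raise TimeoutError(f"{url} not healthy after {attempts} attempts: {last_error}")
```

Calling `wait_for_health()` with the defaults retries for roughly 30 seconds before raising `TimeoutError`; the `fetch` parameter exists so the loop can be exercised without a running server.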

## How it works (user workflow)

Collections, ingestion vs querying, jobs vs audit, Streamlit tabs, and **per-button UI flows**: **[docs/USER_WORKFLOW.md](docs/USER_WORKFLOW.md)**.

## Run and test (step-by-step)

For ingestion formats, URL rules, job polling, sample `sample.txt` walkthrough, curl/PowerShell examples, and troubleshooting, see **[docs/RUN_AND_TEST_GUIDE.md](docs/RUN_AND_TEST_GUIDE.md)**.

For SQLite vs Memcached, offline DB inspection, and the Cursor **SQLite Viewer** extension (`qwtel.sqlite-viewer`), see **[docs/SQLITE_AND_DB_INSPECTION.md](docs/SQLITE_AND_DB_INSPECTION.md)**.

## Quick start (local, without Docker)

Run the API with **uv** (or your preferred tool):

```bash
git clone <repository-url> doc-Audi-ai
cd doc-Audi-ai
cp .env.example .env
uv sync
ollama pull llama3.1:8b
ollama pull nomic-embed-text
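# Start the API with auto-reload:
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

# Alternative: restrict the reload watcher to the api/ and storage/ directories
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload --reload-dir api --reload-dir storage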
```

Optional UI:

```bash
uv run streamlit run streamlit_app.py --server.port 8501 --server.address 0.0.0.0
```

## API overview

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Liveness; returns configured app name and version |
| POST | `/ingest/upload` | Multipart **`files`** (one or more); queues background ingest job |
| POST | `/ingest/url` | JSON **`urls`** array (1–100); download and queue ingest |
| GET | `/ingest/collections` | Lists collections with **`document_count`** and optional **`created_at`** |
| DELETE | `/ingest/collection/{collection_name}` | Drops a collection; returns **`documents_removed`** |
| GET | `/jobs` | Lists jobs with **`total`** count |
| GET | `/jobs/{job_id}` | Job status with **`progress_percent`**, file counters, timestamps, **`errors`** |
| POST | `/query/ask` | Grounded answer; request includes **`top_k`**, **`user_id`** |
| POST | `/query/summarise` | Collection summary; distinct response shape (`summary`, `document_count`, …) |
| POST | `/query` | Legacy alias of **`/query/ask`** |
| GET | `/audit/logs` | Filterable audit index (`user_id`, `from_date`, `to_date`, pagination) |
| GET | `/audit/logs/{query_id}` | Full stored answer and citations for one query |

Interactive docs: `http://localhost:8000/docs`.
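
As an illustration of calling two of the endpoints above from Python, the sketch below builds a `POST /query/ask` request and a filtered `GET /audit/logs` URL with the standard library. The helper names `ask_request` and `audit_logs_url` are hypothetical, the ISO date format for `from_date` and `to_date` is an assumption, and pagination parameters are omitted because their names are not listed here; check `/docs` for the exact schema.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"


def ask_request(
    question: str,
    collection_name: str = "default",
    top_k: int = 5,
    user_id: str = "analyst_001",
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for /query/ask."""
    payload = {
        "question": question,
        "collection_name": collection_name,
        "top_k": top_k,
        "user_id": user_id,
    }
    return urllib.request.Request(
        f"{base_url}/query/ask",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def audit_logs_url(
    user_id: Optional[str] = None,
    from_date: Optional[str] = None,  # assumed ISO date string, e.g. "2026-05-01"
    to_date: Optional[str] = None,
    base_url: str = BASE_URL,
) -> str:
    """Build the /audit/logs URL, including only the filters that were given."""
    filters = {"user_id": user_id, "from_date": from_date, "to_date": to_date}
    query = urllib.parse.urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{base_url}/audit/logs" + (f"?{query}" if query else "")
```

Sending the request is then a matter of passing the `Request` object to `urllib.request.urlopen` while the API is running.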

## Sample request and response (`POST /query/ask`)

Request:

```json
{
  "question": "What were the key risk factors identified in the Q3 2023 financial report?",
  "collection_name": "default",
  "top_k": 5,
  "user_id": "analyst_001"
}
```

Response (shape; values depend on your documents and model):

```json
{
  "query_id": "uuid-string",
  "question": "What were the key risk factors identified in the Q3 2023 financial report?",
  "answer": "… grounded text with citations …",
  "sources": [
    {
      "document_name": "q3_financial_report.pdf",
      "page_number": 12,
      "chunk_text": "Key risk factors include …",
      "relevance_score": 0.91
    }
  ],
  "model_used": "llama3.1:8b",
  "tokens_used": 0,
  "response_time_ms": 1820,
  "timestamp": "2026-05-03T12:00:00Z"
}
```
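
A client will usually want to turn the `sources` array into readable citations. The sketch below assumes the response shape shown above and treats `page_number` as optional; the `Source` dataclass and the `parse_sources` and `format_citations` helpers are hypothetical names, not part of the API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    document_name: str
    page_number: Optional[int]
    chunk_text: str
    relevance_score: float


def parse_sources(response: dict) -> list[Source]:
    """Convert the raw "sources" list of a /query/ask response into Source objects."""
    return [
        Source(
            document_name=item["document_name"],
            page_number=item.get("page_number"),
            chunk_text=item["chunk_text"],
            relevance_score=float(item["relevance_score"]),
        )
        for item in response.get("sources", [])
    ]


def format_citations(sources: list[Source]) -> list[str]:
    """Render numbered citation lines, most relevant source first."""
    ranked = sorted(sources, key=lambda s: s.relevance_score, reverse=True)
    lines = []
    for index, source in enumerate(ranked, start=1):
        page = f", p. {source.page_number}" if source.page_number is not None else ""
        lines.append(f"[{index}] {source.document_name}{page} (score {source.relevance_score:.2f})")
    return lines
```

For the sample response above, `format_citations(parse_sources(response))` yields the single line `[1] q3_financial_report.pdf, p. 12 (score 0.91)`.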

## Design decisions

- **Source citations**: High-stakes review requires every substantive claim to be tied to **document name** and **page** (where available), not a free-floating model monologue.
- **Auditability**: Each ask/summarise persists **query id**, **user id**, timing, model id, token usage (when the provider exposes it), and serialized sources so regulators or counsel can reconstruct what the system returned.
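
The auditability point can be pictured as one row per query in SQLite. The sketch below is an assumption-laden illustration, not the shipped schema: the `audit_log` table, its column names, and the `open_audit_db`, `record_query`, and `load_query` helpers are all hypothetical, chosen to mirror the fields listed above.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical schema mirroring the audited fields; the real table layout may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_log (
    query_id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    question TEXT NOT NULL,
    answer TEXT NOT NULL,
    sources_json TEXT NOT NULL,
    model_used TEXT,
    tokens_used INTEGER,
    response_time_ms INTEGER,
    timestamp TEXT NOT NULL
)
"""


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database, enable name-based row access, and create the table."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def record_query(conn, *, user_id, question, answer, sources,
                 model_used=None, tokens_used=None, response_time_ms=None) -> str:
    """Persist one query with its serialized sources and return the new query id."""
    query_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (query_id, user_id, question, answer, json.dumps(sources), model_used,
         tokens_used, response_time_ms, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return query_id


def load_query(conn, query_id: str):
    """Return the stored record with sources decoded, or None if the id is unknown."""
    row = conn.execute("SELECT * FROM audit_log WHERE query_id = ?", (query_id,)).fetchone()
    if row is None:
        return None
    record = dict(row)
    record["sources"] = json.loads(record.pop("sources_json"))
    return record
```

Storing the sources as serialized JSON next to the answer keeps each record self-contained, so a past response can be reconstructed even if the underlying collection is later re-indexed or dropped.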

## Scale note

The architecture targets **high-volume document ingestion**: ingestion runs as **async background jobs** (FastAPI `BackgroundTasks`), Chroma collections are persistent, and the API tier is stateless, so it can be replicated once you add a shared vector store and job queue.

## Tests

Automated API tests use **pytest** with isolated temp databases; they do **not** require a running server or Ollama.

```bash
uv sync
uv run pytest tests/ -q
```

Full guide (commands, coverage by file, mocks vs manual smoke tests, troubleshooting): **[docs/TESTING.md](docs/TESTING.md)**.

## Configuration

See **`.env.example`**. Common variables include `LLM_PROVIDER`, Ollama/OpenAI/Anthropic keys and models, `CHROMA_PERSIST_DIRECTORY`, `AUDIT_DB_PATH`, `JOBS_DB_PATH`, and upload limits (`MAX_FILE_SIZE_MB`; **`MAX_UPLOAD_SIZE_MB`** is accepted as an alias via settings normalization).
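
To illustrate what alias normalization for the upload limit could look like, here is a small sketch. It is not the project's settings code: `resolve_max_file_size_mb` is a hypothetical helper, the default of 25 MB is made up for the example, and giving `MAX_FILE_SIZE_MB` precedence when both variables are set is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_max_file_size_mb(env: Optional[Mapping[str, str]] = None, default: int = 25) -> int:
    """Read the upload limit, accepting MAX_UPLOAD_SIZE_MB as an alias.

    Assumptions: MAX_FILE_SIZE_MB wins when both are set, and `default`
    is an illustrative value rather than the project's real default.
    """
    if env is None:
        env = os.environ
    raw = env.get("MAX_FILE_SIZE_MB") or env.get("MAX_UPLOAD_SIZE_MB")
    if not raw:
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError("upload limit must be a positive number of megabytes")
    return value
```

Passing a plain dictionary, for example `resolve_max_file_size_mb({"MAX_UPLOAD_SIZE_MB": "10"})`, makes the behaviour easy to check without touching the real environment.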