---
title: Document-Audit RAG
emoji: 📑
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: "1.39.0"
app_file: app.py
---

# DocuAudit AI

**DocuAudit AI** is a production-oriented FastAPI backend plus optional Streamlit UI for **multi-document RAG**: upload documents, build a Chroma vector index, ask grounded questions with citations, and retain a **SQLite audit trail** of every query.

## Architecture

```mermaid
flowchart LR
  subgraph ingest [Ingestion]
    A[PDF / TXT / MD] --> B[Loader]
    B --> C[Chunker]
    C --> D[Embedder]
    D --> E[(ChromaDB)]
  end
  subgraph query [Query path]
    Q[User question] --> R[Semantic search]
    R --> E
    R --> T[Top-K chunks]
    T --> L[LLM]
    L --> U[Answer + citations]
  end
  U --> V[(SQLite audit)]
```

ASCII equivalent:

```
Ingestion:  PDF / TXT / MD → Loader → Chunker → Embedder → ChromaDB
Query:      User question → Semantic search (reads ChromaDB) → Top-K chunks → LLM → Answer + citations
Audit:      Answer + citations → Audit log (SQLite)
```
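
The two paths in the diagram can be illustrated with a toy, standard-library-only sketch. It is not the project's implementation: the bag-of-words `embed` function stands in for a real embedding model, a plain list stands in for ChromaDB, and every function name and chunk size here is a hypothetical chosen for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(pages: list[tuple[str, int, str]]) -> list[tuple[Counter, dict]]:
    """Ingestion path: chunk and embed (document_name, page_number, text) triples."""
    index = []
    for document_name, page_number, text in pages:
        for chunk in chunk_text(text):
            meta = {"document_name": document_name, "page_number": page_number, "chunk_text": chunk}
            index.append((embed(chunk), meta))
    return index


def top_k(question: str, index: list[tuple[Counter, dict]], k: int = 2) -> list[dict]:
    """Query path: return the k most similar chunks, each with a relevance_score."""
    query_vec = embed(question)
    scored = [{**meta, "relevance_score": cosine(query_vec, vec)} for vec, meta in index]
    scored.sort(key=lambda hit: hit["relevance_score"], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    demo_pages = [
        ("report.pdf", 1, "Revenue grew in the third quarter"),
        ("report.pdf", 2, "Key risk factors include currency exposure and supplier concentration"),
    ]
    for hit in top_k("What are the key risk factors?", build_index(demo_pages), k=1):
        print(hit["document_name"], hit["page_number"], round(hit["relevance_score"], 2))
```

Each hit carries its document name and page number alongside the text, which is what lets the LLM step attach citations to the answer.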

## Use cases

- **Litigation document analysis** — trace claims to exact pages and filenames.
- **Corporate finance review** — compare disclosures and filings under a consistent audit log.
- **Investigation support** — bulk ingest, async jobs, and reproducible query history.

## Deploying on Hugging Face Spaces

- Set **`LLM_PROVIDER=huggingface`**; use **`HUGGINGFACE_API_KEY`** and/or the Space secret **`HF_TOKEN`** (see [`.env.example`](.env.example)).
- Use root **`app.py`** as the Streamlit entry for the default Hub command.
- Hub UI, secrets, hardware, and Streamlit SDK details: [Streamlit Spaces](https://huggingface.co/docs/hub/spaces-sdks-streamlit), [Spaces overview](https://huggingface.co/docs/hub/spaces-overview).
- **Test locally before deploy:** `uv run python scripts/verify_huggingface_inference.py` (requires `LLM_PROVIDER=huggingface` in `.env`).

## Quick start with Docker

Requires [Docker Engine](https://docs.docker.com/engine/) and Compose v2. The snippet below matches the shipped **`docker-compose.yml`**: API on **8000**, Streamlit on **8501**, with Chroma and SQLite under **`/data`** inside the API container. After **`docker compose up -d`**, expect **`curl http://localhost:8000/health`** to return JSON including **`"status":"ok"`**.

```bash
git clone <repository-url> doc-Audi-ai
cd doc-Audi-ai
cp .env.example .env
# Edit .env as needed. Compose defaults to Ollama on the host
# (host.docker.internal:11434), so start it there with `ollama serve`.
# For Ollama running inside Compose, set OLLAMA_BASE_URL=http://ollama:11434 instead.

docker compose build
docker compose up -d
curl -s http://localhost:8000/health
# http://localhost:8501 — Streamlit
docker compose down
```

Optional all-in-one Ollama in Compose: `docker compose --profile ollama up -d` (then set `OLLAMA_BASE_URL=http://ollama:11434` in `.env` and recreate containers).

## How it works (user workflow)

For collections, ingestion versus querying, jobs versus the audit log, the Streamlit tabs, and **per-button UI flows**, see **[docs/USER_WORKFLOW.md](docs/USER_WORKFLOW.md)**.

## Run and test (step-by-step)

For ingestion formats, URL rules, job polling, sample `sample.txt` walkthrough, curl/PowerShell examples, and troubleshooting, see **[docs/RUN_AND_TEST_GUIDE.md](docs/RUN_AND_TEST_GUIDE.md)**.

For SQLite vs Memcached, offline DB inspection, and the Cursor **SQLite Viewer** extension (`qwtel.sqlite-viewer`), see **[docs/SQLITE_AND_DB_INSPECTION.md](docs/SQLITE_AND_DB_INSPECTION.md)**.

## Quick start (local, without Docker)

Run the API with **uv** (or your preferred tool):

```bash
git clone <repository-url> doc-Audi-ai
cd doc-Audi-ai
cp .env.example .env
uv sync
ollama pull llama3.1:8b
ollama pull nomic-embed-text
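# Start the API with auto-reload (run this command or the alternative below, not both):
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

# Alternative: limit the reload watcher to the api and storage directories
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload --reload-dir api --reload-dir storage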
```

Optional UI:

```bash
uv run streamlit run streamlit_app.py --server.port 8501 --server.address 0.0.0.0
```

## API overview

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Liveness; returns configured app name and version |
| POST | `/ingest/upload` | Multipart **`files`** (one or more); queues background ingest job |
| POST | `/ingest/url` | JSON **`urls`** array (1–100); download and queue ingest |
| GET | `/ingest/collections` | Lists collections with **`document_count`** and optional **`created_at`** |
| DELETE | `/ingest/collection/{collection_name}` | Drops a collection; returns **`documents_removed`** |
| GET | `/jobs` | Lists jobs with **`total`** count |
| GET | `/jobs/{job_id}` | Job status with **`progress_percent`**, file counters, timestamps, **`errors`** |
| POST | `/query/ask` | Grounded answer; request includes **`top_k`**, **`user_id`** |
| POST | `/query/summarise` | Collection summary; distinct response shape (`summary`, `document_count`, …) |
| POST | `/query` | Legacy alias of **`/query/ask`** |
| GET | `/audit/logs` | Filterable audit index (`user_id`, `from_date`, `to_date`, pagination) |
| GET | `/audit/logs/{query_id}` | Full stored answer and citations for one query |

Interactive docs: `http://localhost:8000/docs`.
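
Because ingestion is queued as a background job, a client usually polls `GET /jobs/{job_id}` until the job finishes. The sketch below shows one way to do that with the standard library. The `status` field name and the terminal values `"completed"` and `"failed"` are assumptions, not confirmed by this README; check the interactive docs for the actual schema. `wait_for_job` and `get_json` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # API address from the quick start

# ASSUMPTION: the job payload has a "status" field with these terminal values.
TERMINAL_STATES = {"completed", "failed"}


def get_json(path: str) -> dict:
    """GET a JSON document from the API."""
    with urllib.request.urlopen(f"{BASE_URL}{path}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = get_json,
    interval_s: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll /jobs/{job_id} until the job reaches a terminal state."""
    for _ in range(max_polls):
        job = fetch(f"/jobs/{job_id}")
        if job.get("status") in TERMINAL_STATES:
            return job
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing `fetch` and `sleep` as parameters keeps the polling loop testable without a running server.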

## Sample request and response (`POST /query/ask`)

Request:

```json
{
  "question": "What were the key risk factors identified in the Q3 2023 financial report?",
  "collection_name": "default",
  "top_k": 5,
  "user_id": "analyst_001"
}
```

Response (shape; values depend on your documents and model):

```json
{
  "query_id": "uuid-string",
  "question": "What were the key risk factors identified in the Q3 2023 financial report?",
  "answer": "… grounded text with citations …",
  "sources": [
    {
      "document_name": "q3_financial_report.pdf",
      "page_number": 12,
      "chunk_text": "Key risk factors include …",
      "relevance_score": 0.91
    }
  ],
  "model_used": "llama3.1:8b",
  "tokens_used": 0,
  "response_time_ms": 1820,
  "timestamp": "2026-05-03T12:00:00Z"
}
```
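
On the client side, the `sources` array can be turned into typed objects and readable citation lines. This is a minimal sketch that assumes the response shape shown above; `Source`, `parse_sources`, and `format_citations` are hypothetical names, and treating `page_number` as optional follows the "where available" wording in the design notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    document_name: str
    page_number: Optional[int]  # may be absent for formats without pages
    chunk_text: str
    relevance_score: float


def parse_sources(response: dict) -> list[Source]:
    """Convert the response's "sources" array into Source objects."""
    return [
        Source(
            document_name=item["document_name"],
            page_number=item.get("page_number"),
            chunk_text=item["chunk_text"],
            relevance_score=float(item["relevance_score"]),
        )
        for item in response.get("sources", [])
    ]


def format_citations(sources: list[Source], min_score: float = 0.0) -> list[str]:
    """Render one citation line per source, highest relevance first."""
    lines = []
    for src in sorted(sources, key=lambda s: s.relevance_score, reverse=True):
        if src.relevance_score < min_score:
            continue
        page = f"p. {src.page_number}" if src.page_number is not None else "page n/a"
        lines.append(f"{src.document_name}, {page} (score {src.relevance_score:.2f})")
    return lines
```

For the sample response above, `format_citations(parse_sources(response))` yields a single line: `q3_financial_report.pdf, p. 12 (score 0.91)`.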

## Design decisions

- **Source citations** — High-stakes review requires every substantive claim to be tied to **document name** and **page** (where available), not a free-floating model monologue.
- **Auditability** — Each ask/summarise persists **query id**, **user id**, timing, model id, token usage (when the provider exposes it), and serialized sources so regulators or counsel can reconstruct what the system returned.
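
To make the audit idea concrete, here is a minimal sketch of persisting and retrieving one query with Python's built-in `sqlite3` module. The table name, column names, and helper functions are hypothetical and chosen to mirror the response fields shown earlier; the project's real schema may differ.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema; the project's actual audit table may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_log (
    query_id         TEXT PRIMARY KEY,
    user_id          TEXT NOT NULL,
    question         TEXT NOT NULL,
    answer           TEXT NOT NULL,
    sources_json     TEXT NOT NULL,
    model_used       TEXT,
    tokens_used      INTEGER,
    response_time_ms INTEGER,
    timestamp        TEXT NOT NULL
)
"""


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the audit database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def record_query(conn: sqlite3.Connection, *, user_id: str, question: str, answer: str,
                 sources: list, model_used: str, tokens_used: int,
                 response_time_ms: int) -> str:
    """Persist one query and return its generated query_id."""
    query_id = str(uuid.uuid4())
    timestamp = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
            (query_id, user_id, question, answer, json.dumps(sources),
             model_used, tokens_used, response_time_ms, timestamp),
        )
    return query_id


def get_query(conn: sqlite3.Connection, query_id: str) -> Optional[dict]:
    """Load one audit record with its sources deserialized, or None."""
    row = conn.execute(
        "SELECT * FROM audit_log WHERE query_id = ?", (query_id,)
    ).fetchone()
    if row is None:
        return None
    record = dict(row)
    record["sources"] = json.loads(record.pop("sources_json"))
    return record
```

Storing the serialized sources next to the answer is what allows a reviewer to reconstruct exactly what was returned, even if the underlying collection changes later.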

## Scale note

The architecture is designed for **high-volume document ingestion**: uploads are processed by **async background jobs** (FastAPI `BackgroundTasks`) and stored in persistent Chroma collections. The API tier is stateless and can be replicated once you add a shared vector store and job queue.
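
The job-tracking pattern can be sketched without a web framework. The example below deliberately swaps in a standard-library `ThreadPoolExecutor` for FastAPI's `BackgroundTasks` so that it runs on its own; the in-memory `JOBS` dict, the field names other than `progress_percent` and `errors`, and the status values are all assumptions made for illustration.

```python
import threading
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

# In-memory job registry; a real deployment would persist this (hypothetical layout).
JOBS: dict[str, dict] = {}
_LOCK = threading.Lock()
_EXECUTOR = ThreadPoolExecutor(max_workers=2)


def _run_ingest(job_id: str, files: list[str], ingest_one: Callable[[str], None]) -> None:
    """Process files one by one, recording progress and per-file errors."""
    for name in files:
        error = None
        try:
            ingest_one(name)
        except Exception as exc:  # one bad file should not abort the whole job
            error = f"{name}: {exc}"
        with _LOCK:
            job = JOBS[job_id]
            job["files_processed"] += 1
            if error:
                job["errors"].append(error)
            job["progress_percent"] = round(100 * job["files_processed"] / job["files_total"])
    with _LOCK:
        JOBS[job_id]["status"] = "completed"  # assumed status value
        JOBS[job_id]["progress_percent"] = 100


def submit_ingest(files: list[str], ingest_one: Callable[[str], None]) -> tuple[str, Future]:
    """Register a job, start it in the background, and return immediately."""
    job_id = str(uuid.uuid4())
    with _LOCK:
        JOBS[job_id] = {
            "status": "running",
            "files_total": len(files),
            "files_processed": 0,
            "progress_percent": 0,
            "errors": [],
        }
    future = _EXECUTOR.submit(_run_ingest, job_id, list(files), ingest_one)
    return job_id, future
```

The caller gets a `job_id` back right away and reads progress from the registry later. Because this registry lives in one process, it also illustrates why replicating the API tier requires a shared job queue.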

## Tests

Automated API tests use **pytest** with isolated temp databases; they do **not** require a running server or Ollama.

```bash
uv sync
uv run pytest tests/ -q
```

Full guide (commands, coverage by file, mocks vs manual smoke tests, troubleshooting): **[docs/TESTING.md](docs/TESTING.md)**.

## Configuration

See **`.env.example`**. Common variables include `LLM_PROVIDER`, Ollama/OpenAI/Anthropic keys and models, `CHROMA_PERSIST_DIRECTORY`, `AUDIT_DB_PATH`, `JOBS_DB_PATH`, and upload limits (`MAX_FILE_SIZE_MB`; **`MAX_UPLOAD_SIZE_MB`** is accepted as an alias via settings normalization).
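
The alias handling for the upload limit could look roughly like the sketch below. It is not the project's settings code: the function name, the default of 50 MB, and the rule that `MAX_FILE_SIZE_MB` wins when both variables are set are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_FILE_SIZE_MB = 50  # hypothetical default, not taken from the project


def resolve_max_file_size_mb(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the upload limit, accepting MAX_UPLOAD_SIZE_MB as an alias.

    ASSUMPTION: MAX_FILE_SIZE_MB takes precedence when both are set.
    """
    env = os.environ if env is None else env
    raw = env.get("MAX_FILE_SIZE_MB") or env.get("MAX_UPLOAD_SIZE_MB")
    if not raw:
        return DEFAULT_MAX_FILE_SIZE_MB
    value = int(raw)
    if value <= 0:
        raise ValueError("upload size limit must be a positive number of megabytes")
    return value
```

Accepting a mapping instead of reading `os.environ` directly makes the precedence rule easy to check in a unit test.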