
Mayank Chugh

Deploy DocuAudit AI to Hugging Face Space (no binaries)

d44b33d 5 days ago


	---
	title: Document-Audit RAG
	emoji: 📑
	colorFrom: blue
	colorTo: indigo
	sdk: streamlit
	sdk_version: "1.39.0"
	app_file: app.py
	---

	# DocuAudit AI

	DocuAudit AI is a production-oriented FastAPI backend with an optional Streamlit UI for multi-document RAG. Upload documents, build a Chroma vector index, ask grounded questions with citations, and retain a SQLite audit trail of every query.

	## Architecture

	```mermaid
	flowchart LR
	subgraph ingest [Ingestion]
	A[PDF / TXT / MD] --> B[Loader]
	B --> C[Chunker]
	C --> D[Embedder]
	D --> E[(ChromaDB)]
	end
	subgraph query [Query path]
	Q[User question] --> R[Semantic search]
	R --> E
	R --> T[Top-K chunks]
	T --> L[LLM]
	L --> U[Answer + citations]
	end
	U --> V[(SQLite audit)]
	```

	ASCII equivalent:

	```
	PDF Upload → Parser → Chunker → Embedder → ChromaDB
	↓
	User Query → Semantic Search → Top-K Chunks → LLM → Answer + Citations
	↓
	Audit Log (SQLite)
	```
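
The query path above can be illustrated with a toy, dependency-free sketch. It is not the project's implementation: the real pipeline uses an embedding model and ChromaDB, whereas this stand-in uses word-count vectors and an in-memory list. The names `chunk_text`, `embed`, `cosine`, and `top_k_chunks` are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into overlapping word windows (stand-in for the Chunker)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts instead of a dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_chunks(question: str, index: list, k: int = 2) -> list:
    """Rank (document_name, chunk, vector) entries by similarity to the question."""
    q = embed(question)
    scored = [(cosine(q, vec), name, chunk) for name, chunk, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Ingestion: chunk and "embed" each document into an in-memory index.
docs = {
    "report.txt": "Revenue grew in the third quarter. Key risk factors include "
                  "currency exposure and supplier concentration.",
    "notes.txt": "The team offsite is scheduled for spring. Catering has been arranged.",
}
index = [(name, chunk, embed(chunk)) for name, text in docs.items() for chunk in chunk_text(text)]

# Query: retrieve the best-matching chunk, which would then be passed to the LLM.
hits = top_k_chunks("Which risk factors include currency exposure?", index, k=1)
for score, name, chunk in hits:
    print(f"{score:.2f} {name}: {chunk}")
```

Keeping the document name next to each chunk is what lets the answer step cite its sources.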

	## Use cases

	- Litigation document analysis — trace claims to exact pages and filenames.
	- Corporate finance review — compare disclosures and filings under a consistent audit log.
	- Investigation support — bulk ingest, async jobs, and reproducible query history.

	## Deploying on Hugging Face Spaces

	- Set `LLM_PROVIDER=huggingface`; use `HUGGINGFACE_API_KEY` and/or the Space secret `HF_TOKEN` (see [`.env.example`](.env.example)).
	- Use root `app.py` as the Streamlit entry for the default Hub command.
	- Hub UI, secrets, hardware, and Streamlit SDK details: [Streamlit Spaces](https://huggingface.co/docs/hub/spaces-sdks-streamlit), [Spaces overview](https://huggingface.co/docs/hub/spaces-overview).
	- Test locally before deploy: `uv run python scripts/verify_huggingface_inference.py` (requires `LLM_PROVIDER=huggingface` in `.env`).

	## Quick start with Docker

	Requires [Docker Engine](https://docs.docker.com/engine/) and Compose v2. The snippet below matches the shipped `docker-compose.yml`: API on 8000, Streamlit on 8501, with Chroma and SQLite under `/data` inside the API container. After `docker compose up -d`, expect `curl http://localhost:8000/health` to return JSON including `"status":"ok"`.

	```bash
	git clone <repository-url> doc-Audi-ai
	cd doc-Audi-ai
	cp .env.example .env
	# Edit .env as needed.
	# Ollama inside Compose (--profile ollama): set OLLAMA_BASE_URL=http://ollama:11434
	# Ollama on the host: run `ollama serve`; Compose defaults to host.docker.internal:11434

	docker compose build
	docker compose up -d
	curl -s http://localhost:8000/health
	# http://localhost:8501 — Streamlit
	docker compose down
	```

	Optional all-in-one Ollama in Compose: `docker compose --profile ollama up -d` (then set `OLLAMA_BASE_URL=http://ollama:11434` in `.env` and recreate containers).
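
The API container may take a few seconds to start accepting connections after `docker compose up -d`. As a minimal sketch, a script can poll `/health` until it reports `"status":"ok"`. The helper names `is_healthy` and `wait_for_health` and the timeout values are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(body: bytes) -> bool:
    """Return True when the body is a JSON object with "status": "ok"."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def wait_for_health(url: str = "http://localhost:8000/health",
                    timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll the health endpoint until it is healthy or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_healthy(resp.read()):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # the API is not accepting connections yet
        time.sleep(interval_s)
    return False
```

Calling `wait_for_health()` before running smoke tests avoids spurious connection errors during startup.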

	## How it works (user workflow)

	Collections, ingestion vs querying, jobs vs audit, Streamlit tabs, and per-button UI flows: [docs/USER_WORKFLOW.md](docs/USER_WORKFLOW.md).

	## Run and test (step-by-step)

	For ingestion formats, URL rules, job polling, a `sample.txt` walkthrough, curl/PowerShell examples, and troubleshooting, see [docs/RUN_AND_TEST_GUIDE.md](docs/RUN_AND_TEST_GUIDE.md).

	For SQLite vs Memcached, offline DB inspection, and the Cursor SQLite Viewer extension (`qwtel.sqlite-viewer`), see [docs/SQLITE_AND_DB_INSPECTION.md](docs/SQLITE_AND_DB_INSPECTION.md).

	## Quick start (local, without Docker)

	Run the API with uv (or your preferred tool):

	```bash
	git clone <repository-url> doc-Audi-ai
	cd doc-Audi-ai
	cp .env.example .env
	uv sync
	ollama pull llama3.1:8b
	ollama pull nomic-embed-text
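	# Start the API with auto-reload:
	uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

	# Alternative: watch only the api/ and storage/ directories for changes
	uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload --reload-dir api --reload-dir storage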
	```

	Optional UI:

	```bash
	uv run streamlit run streamlit_app.py --server.port 8501 --server.address 0.0.0.0
	```

	## API overview

	| Method | Path | Description |
	|--------|------|-------------|
	| GET | `/health` | Liveness; returns configured app name and version |
	| POST | `/ingest/upload` | Multipart `files` (one or more); queues background ingest job |
	| POST | `/ingest/url` | JSON `urls` array (1–100); download and queue ingest |
	| GET | `/ingest/collections` | Lists collections with `document_count` and optional `created_at` |
	| DELETE | `/ingest/collection/{collection_name}` | Drops a collection; returns `documents_removed` |
	| GET | `/jobs` | Lists jobs with `total` count |
	| GET | `/jobs/{job_id}` | Job status with `progress_percent`, file counters, timestamps, `errors` |
	| POST | `/query/ask` | Grounded answer; request includes `top_k`, `user_id` |
	| POST | `/query/summarise` | Collection summary; distinct response shape (`summary`, `document_count`, …) |
	| POST | `/query` | Legacy alias of `/query/ask` |
	| GET | `/audit/logs` | Filterable audit index (`user_id`, `from_date`, `to_date`, pagination) |
	| GET | `/audit/logs/{query_id}` | Full stored answer and citations for one query |

	Interactive docs: `http://localhost:8000/docs`.
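
Because ingestion runs as a background job, a client typically polls `GET /jobs/{job_id}` until the job finishes. The sketch below shows one way to do that. The `status` field and its terminal values (`completed`, `failed`) are assumptions not confirmed by this README, so check the interactive docs for the actual schema. `fetch_job` and `wait_for_job` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumed terminal values for an assumed "status" field; verify against /docs.
TERMINAL_STATES = {"completed", "failed"}


def fetch_job(base_url: str, job_id: str) -> dict:
    """GET /jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(job_id: str, get_job: Callable[[str], dict],
                 interval_s: float = 1.0, max_polls: int = 120) -> dict:
    """Poll until the job reaches a terminal state, printing progress as it goes."""
    for _ in range(max_polls):
        job = get_job(job_id)
        print(f"{job_id}: {job.get('progress_percent', 0)}%")
        if job.get("status") in TERMINAL_STATES:
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

A typical call would be `wait_for_job(job_id, lambda j: fetch_job("http://localhost:8000", j))`. Passing the getter as a parameter keeps the polling loop testable without a running server.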

	## Sample request and response (`POST /query/ask`)

	Request:

	```json
	{
	  "question": "What were the key risk factors identified in the Q3 2023 financial report?",
	  "collection_name": "default",
	  "top_k": 5,
	  "user_id": "analyst_001"
	}
	```

	Response (shape; values depend on your documents and model):

	```json
	{
	  "query_id": "uuid-string",
	  "question": "What were the key risk factors identified in the Q3 2023 financial report?",
	  "answer": "… grounded text with citations …",
	  "sources": [
	    {
	      "document_name": "q3_financial_report.pdf",
	      "page_number": 12,
	      "chunk_text": "Key risk factors include …",
	      "relevance_score": 0.91
	    }
	  ],
	  "model_used": "llama3.1:8b",
	  "tokens_used": 0,
	  "response_time_ms": 1820,
	  "timestamp": "2026-05-03T12:00:00Z"
	}
	```
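
The request and response shapes above can be exercised from Python with the standard library alone. This is a sketch, assuming the API runs at `http://localhost:8000`. The helper names `build_ask_request`, `format_citations`, and `ask` are hypothetical, and the field names are taken from the sample payloads.

```python
import json
import urllib.request


def build_ask_request(question: str, collection_name: str = "default", top_k: int = 5,
                      user_id: str = "analyst_001",
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /query/ask request with a JSON body."""
    body = json.dumps({
        "question": question,
        "collection_name": collection_name,
        "top_k": top_k,
        "user_id": user_id,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def format_citations(response: dict) -> list[str]:
    """Turn the `sources` array into one human-readable line per citation."""
    lines = []
    for src in response.get("sources", []):
        page = src.get("page_number")
        where = f"p. {page}" if page is not None else "page n/a"
        lines.append(f'{src["document_name"]} ({where}), score {src["relevance_score"]:.2f}')
    return lines


def ask(question: str, **kwargs) -> dict:
    """Send the question and return the decoded JSON response."""
    with urllib.request.urlopen(build_ask_request(question, **kwargs), timeout=120) as resp:
        return json.load(resp)
```

With the API running, `format_citations(ask("What were the key risk factors?"))` would list each cited document and page.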

	## Design decisions

	- Source citations — High-stakes review requires every substantive claim to be tied to document name and page (where available), not a free-floating model monologue.
	- Auditability — Each ask/summarise persists query id, user id, timing, model id, token usage (when the provider exposes it), and serialized sources so regulators or counsel can reconstruct what the system returned.
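
To make the auditability point concrete, here is a minimal sketch of an audit table using Python's built-in `sqlite3` module. The table name, column names, and helper functions are hypothetical and chosen to mirror the response fields shown earlier. The project's actual schema may differ.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical schema: one row per ask/summarise call.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_logs (
    query_id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    question TEXT NOT NULL,
    answer TEXT NOT NULL,
    sources_json TEXT NOT NULL,
    model_used TEXT,
    tokens_used INTEGER,
    response_time_ms INTEGER,
    timestamp TEXT NOT NULL
)
"""


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the audit database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def log_query(conn: sqlite3.Connection, *, user_id: str, question: str, answer: str,
              sources: list, model_used: str, tokens_used: int, response_time_ms: int) -> str:
    """Persist one query with its serialized sources and return the new query id."""
    query_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO audit_logs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (query_id, user_id, question, answer, json.dumps(sources), model_used,
         tokens_used, response_time_ms, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return query_id


def get_log(conn: sqlite3.Connection, query_id: str):
    """Reconstruct the stored answer and citations for one query, or None."""
    row = conn.execute("SELECT * FROM audit_logs WHERE query_id = ?", (query_id,)).fetchone()
    if row is None:
        return None
    record = dict(row)
    record["sources"] = json.loads(record.pop("sources_json"))
    return record
```

Storing the sources as serialized JSON alongside the answer means a later review sees exactly what was returned, even if the underlying collection has since changed.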

	## Scale note

	The architecture is designed for high-volume document ingestion via async background jobs (FastAPI `BackgroundTasks`) and persistent Chroma collections. The API tier is stateless, so it can be replicated once you add a shared vector store and job queue.

	## Tests

	Automated API tests use pytest with isolated temp databases; they do not require a running server or Ollama.

	```bash
	uv sync
	uv run pytest tests/ -q
	```

	Full guide (commands, coverage by file, mocks vs manual smoke tests, troubleshooting): [docs/TESTING.md](docs/TESTING.md).

	## Configuration

	See `.env.example`. Common variables include `LLM_PROVIDER`, Ollama/OpenAI/Anthropic keys and models, `CHROMA_PERSIST_DIRECTORY`, `AUDIT_DB_PATH`, `JOBS_DB_PATH`, and upload limits (`MAX_FILE_SIZE_MB`; `MAX_UPLOAD_SIZE_MB` is accepted as an alias via settings normalization).
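
The alias handling for the upload limit can be pictured with a short sketch. This is not the project's settings code: the function name, the default of 50 MB, and the rule that `MAX_FILE_SIZE_MB` wins when both variables are set are all assumptions made for illustration.

```python
from typing import Mapping

# Hypothetical default; the real value is defined by the project's settings.
DEFAULT_MAX_FILE_SIZE_MB = 50


def resolve_max_file_size_mb(env: Mapping[str, str]) -> int:
    """Read the upload limit, accepting MAX_UPLOAD_SIZE_MB as an alias.

    Assumed precedence: the canonical name is checked before the alias.
    """
    for name in ("MAX_FILE_SIZE_MB", "MAX_UPLOAD_SIZE_MB"):
        raw = env.get(name, "").strip()
        if raw:
            value = int(raw)
            if value <= 0:
                raise ValueError(f"{name} must be a positive integer, got {value}")
            return value
    return DEFAULT_MAX_FILE_SIZE_MB
```

In practice this would be called with `os.environ`. Passing a plain mapping keeps the function easy to test.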