Spaces:

DreamyDetective
/

agentic-rag

Paused

App Files Files Community

agentic-rag / docs /Database Design.md

vksepm

TruLens Azure judges, reranked context for metrics, feedback timeout; extend API proxy timeouts

1499b8f 3 months ago

preview code

Raw

History Blame Contribute Delete

8.54 kB

	# Entity-Relationship Diagram (ERD) & Table Definitions

	## 1. Entity-Relationship Diagram (Logical)

	```mermaid
	erDiagram
	USER \|\|--o{ USER_ROLE : "assigned"
	ROLE \|\|--o{ USER_ROLE : "assigned to"
	ROLE \|\|--o{ ROLE_DOCUMENT_ACCESS : "has access to"
	DOCUMENT \|\|--o{ ROLE_DOCUMENT_ACCESS : "is accessed by"
	DOCUMENT \|\|--o{ CHUNK : "contains"
	USER \|\|--o{ QUERY_LOG : "submits"
	QUERY_LOG \|\|--o\| EVALUATION_RESULT : "evaluated by"

	USER {
	uuid id PK
	string email
	string password_hash
	string display_name
	boolean is_active
	timestamp created_at
	timestamp last_login
	}
	ROLE {
	uuid id PK
	string name "e.g., 'researcher', 'guest'"
	}
	USER_ROLE {
	uuid user_id FK
	uuid role_id FK
	}
	DOCUMENT {
	uuid id PK
	string arxiv_id "UNIQUE, from dataset"
	string title
	string source_url "Link to ArXiv"
	string full_citation
	jsonb metadata
	timestamp created_at
	}
	CHUNK {
	uuid id PK
	uuid document_id FK
	text content
	vector embedding "vector(384)"
	int chunk_index
	jsonb metadata "page/section refs"
	}
	ROLE_DOCUMENT_ACCESS {
	uuid role_id FK
	uuid document_id FK
	}
	QUERY_LOG {
	uuid id PK
	uuid user_id FK
	text query_text
	text response_text
	string model_provider
	int agent_steps
	timestamp created_at
	}
	EVALUATION_RESULT {
	uuid id PK
	uuid query_log_id FK
	float relevance_score
	float groundedness_score
	float answer_relevance_score
	string trulens_record_id "Cross-ref to TruLens DB"
	timestamp evaluated_at
	}
	```

	## 2. Table Definitions (SQL)

	### 2.1 Core Tables

	```sql
	-- Enable pgvector extension
	CREATE EXTENSION IF NOT EXISTS vector;

	-- Users Table
	CREATE TABLE users (
	id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
	email TEXT UNIQUE NOT NULL,
	password_hash TEXT NOT NULL,
	display_name TEXT,
	is_active BOOLEAN NOT NULL DEFAULT TRUE,
	created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
	last_login TIMESTAMP WITH TIME ZONE
	);

	-- Roles Table
	CREATE TABLE roles (
	id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
	name TEXT UNIQUE NOT NULL -- e.g., 'researcher', 'guest'
	);

	-- User-Role Junction Table (role-level RBAC only — no per-user document grants)
	CREATE TABLE user_roles (
	user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
	role_id UUID NOT NULL REFERENCES roles(id) ON DELETE CASCADE,
	PRIMARY KEY (user_id, role_id)
	);

	-- Documents Table
	CREATE TABLE documents (
	id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
	arxiv_id TEXT UNIQUE NOT NULL,
	title TEXT,
	source_url TEXT, -- Direct link for user verification
	full_citation TEXT, -- Bibliographic reference
	metadata JSONB,
	created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
	);

	-- Chunks Table (each chunk belongs to exactly one document)
	CREATE TABLE chunks (
	id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
	document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
	content TEXT NOT NULL,
	embedding VECTOR(384),
	chunk_index INT NOT NULL,
	metadata JSONB -- Stores page numbers, section headers, etc.
	);

	-- Role-Document Access Control
	CREATE TABLE role_document_access (
	role_id UUID NOT NULL REFERENCES roles(id) ON DELETE CASCADE,
	document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
	PRIMARY KEY (role_id, document_id)
	);
	```

	### 2.2 Observability Tables

	```sql
	-- Query Log: persists every agentic query and its final response
	CREATE TABLE query_logs (
	id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
	user_id UUID REFERENCES users(id) ON DELETE SET NULL,
	query_text TEXT NOT NULL,
	response_text TEXT,
	model_provider TEXT, -- 'openai' \| 'azure_openai' \| 'gemini'
	agent_steps INT, -- SmolAgents steps taken (max 5)
	created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
	);

	-- Evaluation Results: RAG Triad scores, cross-referenced to TruLens
	CREATE TABLE evaluation_results (
	id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
	query_log_id UUID NOT NULL REFERENCES query_logs(id) ON DELETE CASCADE,
	relevance_score NUMERIC(4,3), -- Query vs. Reranked Context (0–1)
	groundedness_score NUMERIC(4,3), -- Answer vs. Reranked Context (0–1)
	answer_relevance_score NUMERIC(4,3), -- Query vs. Final Answer (0–1)
	trulens_record_id TEXT, -- Maps to TruLens Record.record_id
	evaluated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
	);
	```

	## 3. Indexing Recommendations

	### 3.1 Vector Index — HNSW (384-dim, Cosine Distance)

	HNSW is preferred over IVFFlat: better recall at query time, no need to pre-specify cluster counts, and no index rebuild required after inserts. Parameters are tuned for 384-dimensional sentence-transformer embeddings.

	```sql
	-- Cosine distance is standard for sentence-transformer embeddings
	CREATE INDEX idx_chunks_embedding_hnsw
	ON chunks USING hnsw (embedding vector_cosine_ops)
	WITH (m = 16, ef_construction = 64);

	-- Tune recall vs. latency at query time (default: 40):
	-- SET hnsw.ef_search = 100;
	```

	\| Parameter \| Value \| Rationale \|
	\|---\|---\|---\|
	\| `m` \| 16 \| Balanced graph connectivity; increase to 32 for higher recall at cost of memory \|
	\| `ef_construction` \| 64 \| Build-time quality; increase to 128 for finer index at cost of build time \|
	\| `ef_search` \| 40–100 \| Set per-session; 40 for <50 ms, 100 for maximum recall \|
	\| Distance op \| `vector_cosine_ops` \| Standard for normalized sentence-transformer output \|

	### 3.2 Access Control & Foreign Key Indexes

	```sql
	-- Speed up RBAC JOIN: "WHERE document_id IN (authorized_list)"
	CREATE INDEX idx_chunks_document_id ON chunks(document_id);

	-- Speed up role resolution for a given user
	CREATE INDEX idx_user_roles_user_id ON user_roles(user_id);

	-- Speed up document permission checks
	CREATE INDEX idx_role_doc_access_doc_id ON role_document_access(document_id);
	```

	### 3.3 Observability Indexes

	```sql
	-- Time-ordered query history per user (pagination, monitoring dashboards)
	CREATE INDEX idx_query_logs_user_id_created ON query_logs(user_id, created_at DESC);

	-- Evaluation lookup by query
	CREATE INDEX idx_eval_results_query_log_id ON evaluation_results(query_log_id);

	-- TruLens cross-reference (partial — only rows that have been evaluated)
	CREATE INDEX idx_eval_results_trulens_id
	ON evaluation_results(trulens_record_id)
	WHERE trulens_record_id IS NOT NULL;
	```

	## 4. Example Access-Aware Query with Citations

	When the Agent performs a search, it retrieves `source_url`, `title`, and `full_citation` so it can cite its sources in the final response.

	```sql
	SELECT
	c.content,
	d.title,
	d.source_url,
	d.full_citation,
	c.metadata->>'page' AS page_number,
	c.embedding <=> '[query_vector]'::vector AS distance
	FROM chunks c
	JOIN documents d ON c.document_id = d.id
	JOIN role_document_access rda ON d.id = rda.document_id
	JOIN user_roles ur ON rda.role_id = ur.role_id
	WHERE ur.user_id = 'current_user_id'
	ORDER BY distance
	LIMIT 20;
	```

	## 5. TruLens Integration Strategy

	TruLens maintains its own internal schema for detailed trace data. The `evaluation_results` table here serves as a lightweight application-side mirror for:

	* Querying aggregate RAG Triad scores without launching the TruLens dashboard.
	* Joining scores directly to your own `users` and `query_logs` data for reporting.
	* Setting up alerting when scores fall below SLA thresholds (relevance > 0.85, groundedness > 0.9).

	The `trulens_record_id` column maps to TruLens's `Record.record_id`, enabling a full round-trip lookup when deeper trace or span inspection is needed.

	Application evaluation code waits for feedback completion (`TRULENS_FEEDBACK_TIMEOUT`, `retrieve_feedback_results` in `backend/app/evaluation/trulens_eval.py`) before writing `evaluation_results`. The context embedded in the TruLens replay for Context Relevance and Groundedness is aligned with the last `pgvector_retriever` formatted output when available (`reranked_context`), not only LLM JSON.