# **Entity-Relationship Diagram (ERD) & Table Definitions** ## **1. Entity-Relationship Diagram (Logical)** ```mermaid erDiagram USER ||--o{ USER_ROLE : "assigned" ROLE ||--o{ USER_ROLE : "assigned to" ROLE ||--o{ ROLE_DOCUMENT_ACCESS : "has access to" DOCUMENT ||--o{ ROLE_DOCUMENT_ACCESS : "is accessed by" DOCUMENT ||--o{ CHUNK : "contains" USER ||--o{ QUERY_LOG : "submits" QUERY_LOG ||--o| EVALUATION_RESULT : "evaluated by" USER { uuid id PK string email string password_hash string display_name boolean is_active timestamp created_at timestamp last_login } ROLE { uuid id PK string name "e.g., 'researcher', 'guest'" } USER_ROLE { uuid user_id FK uuid role_id FK } DOCUMENT { uuid id PK string arxiv_id "UNIQUE, from dataset" string title string source_url "Link to ArXiv" string full_citation jsonb metadata timestamp created_at } CHUNK { uuid id PK uuid document_id FK text content vector embedding "vector(384)" int chunk_index jsonb metadata "page/section refs" } ROLE_DOCUMENT_ACCESS { uuid role_id FK uuid document_id FK } QUERY_LOG { uuid id PK uuid user_id FK text query_text text response_text string model_provider int agent_steps timestamp created_at } EVALUATION_RESULT { uuid id PK uuid query_log_id FK float relevance_score float groundedness_score float answer_relevance_score string trulens_record_id "Cross-ref to TruLens DB" timestamp evaluated_at } ``` ## **2. Table Definitions (SQL)** ### **2.1 Core Tables** ```sql -- Enable pgvector extension CREATE EXTENSION IF NOT EXISTS vector; -- Users Table CREATE TABLE users ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), email TEXT UNIQUE NOT NULL, password_hash TEXT NOT NULL, display_name TEXT, is_active BOOLEAN NOT NULL DEFAULT TRUE, created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), last_login TIMESTAMP WITH TIME ZONE ); -- Roles Table CREATE TABLE roles ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), name TEXT UNIQUE NOT NULL -- e.g., 'researcher', 'guest' ); -- User-Role Junction Table (role-level RBAC only — no per-user document grants) CREATE TABLE user_roles ( user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE, role_id UUID NOT NULL REFERENCES roles(id) ON DELETE CASCADE, PRIMARY KEY (user_id, role_id) ); -- Documents Table CREATE TABLE documents ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), arxiv_id TEXT UNIQUE NOT NULL, title TEXT, source_url TEXT, -- Direct link for user verification full_citation TEXT, -- Bibliographic reference metadata JSONB, created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() ); -- Chunks Table (each chunk belongs to exactly one document) CREATE TABLE chunks ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE, content TEXT NOT NULL, embedding VECTOR(384), chunk_index INT NOT NULL, metadata JSONB -- Stores page numbers, section headers, etc. ); -- Role-Document Access Control CREATE TABLE role_document_access ( role_id UUID NOT NULL REFERENCES roles(id) ON DELETE CASCADE, document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE, PRIMARY KEY (role_id, document_id) ); ``` ### **2.2 Observability Tables** ```sql -- Query Log: persists every agentic query and its final response CREATE TABLE query_logs ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), user_id UUID REFERENCES users(id) ON DELETE SET NULL, query_text TEXT NOT NULL, response_text TEXT, model_provider TEXT, -- 'openai' | 'azure_openai' | 'gemini' agent_steps INT, -- SmolAgents steps taken (max 5) created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() ); -- Evaluation Results: RAG Triad scores, cross-referenced to TruLens CREATE TABLE evaluation_results ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), query_log_id UUID NOT NULL REFERENCES query_logs(id) ON DELETE CASCADE, relevance_score NUMERIC(4,3), -- Query vs. Reranked Context (0–1) groundedness_score NUMERIC(4,3), -- Answer vs. Reranked Context (0–1) answer_relevance_score NUMERIC(4,3), -- Query vs. Final Answer (0–1) trulens_record_id TEXT, -- Maps to TruLens Record.record_id evaluated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() ); ``` ## **3. Indexing Recommendations** ### **3.1 Vector Index — HNSW (384-dim, Cosine Distance)** HNSW is preferred over IVFFlat: better recall at query time, no need to pre-specify cluster counts, and no index rebuild required after inserts. Parameters are tuned for 384-dimensional sentence-transformer embeddings. ```sql -- Cosine distance is standard for sentence-transformer embeddings CREATE INDEX idx_chunks_embedding_hnsw ON chunks USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64); -- Tune recall vs. latency at query time (default: 40): -- SET hnsw.ef_search = 100; ``` | Parameter | Value | Rationale | |---|---|---| | `m` | 16 | Balanced graph connectivity; increase to 32 for higher recall at cost of memory | | `ef_construction` | 64 | Build-time quality; increase to 128 for finer index at cost of build time | | `ef_search` | 40–100 | Set per-session; 40 for <50 ms, 100 for maximum recall | | Distance op | `vector_cosine_ops` | Standard for normalized sentence-transformer output | ### **3.2 Access Control & Foreign Key Indexes** ```sql -- Speed up RBAC JOIN: "WHERE document_id IN (authorized_list)" CREATE INDEX idx_chunks_document_id ON chunks(document_id); -- Speed up role resolution for a given user CREATE INDEX idx_user_roles_user_id ON user_roles(user_id); -- Speed up document permission checks CREATE INDEX idx_role_doc_access_doc_id ON role_document_access(document_id); ``` ### **3.3 Observability Indexes** ```sql -- Time-ordered query history per user (pagination, monitoring dashboards) CREATE INDEX idx_query_logs_user_id_created ON query_logs(user_id, created_at DESC); -- Evaluation lookup by query CREATE INDEX idx_eval_results_query_log_id ON evaluation_results(query_log_id); -- TruLens cross-reference (partial — only rows that have been evaluated) CREATE INDEX idx_eval_results_trulens_id ON evaluation_results(trulens_record_id) WHERE trulens_record_id IS NOT NULL; ``` ## **4. Example Access-Aware Query with Citations** When the Agent performs a search, it retrieves `source_url`, `title`, and `full_citation` so it can cite its sources in the final response. ```sql SELECT c.content, d.title, d.source_url, d.full_citation, c.metadata->>'page' AS page_number, c.embedding <=> '[query_vector]'::vector AS distance FROM chunks c JOIN documents d ON c.document_id = d.id JOIN role_document_access rda ON d.id = rda.document_id JOIN user_roles ur ON rda.role_id = ur.role_id WHERE ur.user_id = 'current_user_id' ORDER BY distance LIMIT 20; ``` ## **5. TruLens Integration Strategy** TruLens maintains its own internal schema for detailed trace data. The `evaluation_results` table here serves as a **lightweight application-side mirror** for: * Querying aggregate RAG Triad scores without launching the TruLens dashboard. * Joining scores directly to your own `users` and `query_logs` data for reporting. * Setting up alerting when scores fall below SLA thresholds (relevance > 0.85, groundedness > 0.9). The `trulens_record_id` column maps to TruLens's `Record.record_id`, enabling a full round-trip lookup when deeper trace or span inspection is needed. Application evaluation code waits for feedback completion (`TRULENS_FEEDBACK_TIMEOUT`, `retrieve_feedback_results` in `backend/app/evaluation/trulens_eval.py`) before writing `evaluation_results`. The **context** embedded in the TruLens replay for Context Relevance and Groundedness is aligned with the **last `pgvector_retriever` formatted output** when available (`reranked_context`), not only LLM JSON.