---
title: DocsQA Smart Research Assistant
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
---

# DocsQA Smart Research Assistant

This is my take-home submission for the ABSTRABIT AI/ML Engineer assignment: a RAG-powered assistant where users upload PDFs, ask questions, and get grounded answers with citations.

## Live Project

- Live app (Railway): `https://docsbot-web-production.up.railway.app`
- GitHub: `https://github.com/KBaba7/DocsBot`
- Loom walkthrough: _add your link here_

## What I Built

The app supports authentication, PDF upload (up to 5 files and 10 pages per file), document chunking + vector indexing, and a chat experience that answers from uploaded documents first.  
If the uploaded documents are not enough, the agent falls back to web search and cites those sources too.

## Stack

- FastAPI + SQLAlchemy
- LangGraph agent
- Groq chat model
- Jina embeddings + Jina reranker
- Supabase Postgres + `pgvector`
- Railway deployment

## How Retrieval Works

Uploaded PDFs are parsed page by page and split into chunks.  
Each chunk is stored with metadata (document, page number, chunk index), and its embedding is stored in `pgvector`.
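
A minimal sketch of what such a chunk row can look like, assuming SQLAlchemy and the `pgvector` Python package; the table and column names here are illustrative, not the project's actual schema:

```python
from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class DocumentChunk(Base):
    __tablename__ = "document_chunks"

    id = Column(Integer, primary_key=True)
    document_id = Column(Integer, index=True)  # which uploaded PDF this chunk came from
    page_number = Column(Integer)              # page the chunk was extracted from
    chunk_index = Column(Integer)              # position of the chunk within the document
    text = Column(Text)
    embedding = Column(Vector(1024))           # matches EMBEDDING_DIMENSIONS (default 1024)
```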

At question time:
1. LLM-based document filtering selects the relevant documents from the user's library
2. Vector search retrieves relevant chunks from those documents
3. Jina reranking reorders the retrieved chunks for better final relevance (sketched below)
4. The agent answers from those chunks when possible
5. If evidence is weak, the agent uses web search and cites external URLs
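
For the reranking step, a rough sketch of the call to Jina's rerank endpoint; the env var names match the configuration section below, but the exact request/response fields reflect my reading of Jina's public rerank API rather than the project's actual client code:

```python
import os
import httpx

def rerank_chunks(query: str, chunks: list[str], top_n: int) -> list[str]:
    """Ask the Jina reranker to reorder candidate chunks by relevance to the query."""
    resp = httpx.post(
        os.environ.get("JINA_RERANKER_API_BASE", "https://api.jina.ai/v1/rerank"),
        headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
        json={
            "model": os.environ.get("JINA_RERANKER_MODEL", "jina-reranker-v3"),
            "query": query,
            "documents": chunks,
            "top_n": top_n,
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Results come back ordered by relevance; map indices back to the original chunks.
    return [chunks[item["index"]] for item in resp.json()["results"]]
```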

## Chunking Strategy

- Splitter: LangChain `RecursiveCharacterTextSplitter`
- Chunk size: `1000`
- Overlap: `150`

Why this setup:
- It prefers breaking on paragraphs and sentence boundaries before falling back to smaller separators.
- It preserves more coherent chunks for contracts, specs, and structured PDFs.
- A smaller overlap keeps recall while reducing duplicated context in retrieval.
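
In code, the configuration is roughly the following (the import path assumes the standalone `langchain_text_splitters` package; older LangChain releases expose the same class from `langchain.text_splitter`):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk
    chunk_overlap=150,  # characters shared between neighboring chunks
)

def chunk_page(page_text: str) -> list[str]:
    """Split one parsed PDF page into overlapping chunks."""
    return splitter.split_text(page_text)
```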

## Retrieval Approach

I use cosine similarity search in `pgvector`, then apply Jina reranking for better final ordering.  
The system uses an LLM-based retrieval planner to choose:

- the final number of chunks to keep
- the candidate pool to rerank

Those values are clamped to safe bounds before retrieval runs.
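
A compact sketch of that clamping plus the cosine-distance search, reusing the chunk model sketched earlier; the bounds and helper names are illustrative, not the app's actual limits:

```python
from sqlalchemy import select

def clamp(value: int, lo: int, hi: int) -> int:
    """Keep a planner-suggested value inside safe retrieval bounds."""
    return max(lo, min(hi, value))

def candidate_query(query_embedding: list[float], doc_ids: list[int], planned_pool: int):
    """Cosine-distance search over the chunk table, limited to the selected documents."""
    pool = clamp(planned_pool, 12, 40)  # floor matches the RERANK_CANDIDATE_K default
    return (
        select(DocumentChunk)
        .where(DocumentChunk.document_id.in_(doc_ids))
        .order_by(DocumentChunk.embedding.cosine_distance(query_embedding))
        .limit(pool)
    )
```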

For each retrieved document source, the UI shows:

- document name
- page number
- chunk excerpt

## Agent Routing Logic

The agent is prompted to prefer document context first.

- If the retrieved document context is sufficient, the agent answers from the documents with citations.
- If it is not, the agent clearly states that the documents are insufficient and uses the web search tool.

This is implemented as tool-based behavior in LangGraph rather than a static fallback message.
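
A stripped-down sketch of that routing intent; the tool name, body, and prompt wording are illustrative, and the actual LangGraph wiring is omitted:

```python
from langchain_core.tools import tool

@tool
def web_search(query: str) -> str:
    """Search the web when the uploaded documents cannot answer the question."""
    ...  # dispatches to DuckDuckGo or Tavily depending on WEB_SEARCH_PROVIDER

SYSTEM_PROMPT = (
    "Answer from the provided document context whenever it is sufficient, "
    "citing document names and page numbers. If the context is not sufficient, "
    "say so explicitly, call the web_search tool, and cite the URLs you used."
)
```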

## Source Citations

Each turn stores and returns source metadata separately from the answer body.

- Vector source cards include:
  - document name
  - page number
  - snippet (a short excerpt from the retrieved chunk)
- Web source cards include:
  - title
  - URL
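
A rough sketch of how that per-turn payload can be shaped (Pydantic field names here are illustrative, not the actual response schema):

```python
from pydantic import BaseModel, Field

class VectorSource(BaseModel):
    document_name: str
    page_number: int
    snippet: str  # short excerpt from the retrieved chunk

class WebSource(BaseModel):
    title: str
    url: str

class AskResponse(BaseModel):
    answer: str  # answer body, kept separate from the source metadata
    vector_sources: list[VectorSource] = Field(default_factory=list)
    web_sources: list[WebSource] = Field(default_factory=list)
```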

## Conversation Memory

Conversation history is maintained within session scope, so follow-ups like “tell me more about that” work as expected.
The frontend also preserves the visible chat thread per session, so upload-triggered page refreshes do not wipe the current conversation view.

## Streaming UX

Answers are streamed into the chat UI progressively.

- the visible response is rendered chunk by chunk
- source cards are attached after the answer completes
- a slight pacing delay is added so the stream feels live to the user

The streaming route is separate from the standard JSON `/ask` response path.
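
A minimal sketch of the streaming shape with FastAPI's `StreamingResponse`; the real endpoint streams agent output and its request/response format may differ:

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_stream(question: str):
    # In the real app the tokens come from the agent; a canned answer stands in here.
    for token in "This is a streamed answer.".split():
        yield token + " "
        await asyncio.sleep(0.03)  # slight pacing delay so the stream feels live

@app.post("/ask/stream")
async def ask_stream(question: str):
    return StreamingResponse(token_stream(question), media_type="text/plain")
```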

## Bonus Feature

I added hash-based deduplicated ingestion:

- If the same PDF is uploaded again, processing/indexing is reused.
- Access control is still user-scoped via ownership mapping.

Why I chose this:
- saves compute and time
- avoids duplicate indexing
- keeps retrieval secure per user
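
A compact, self-contained sketch of the idea (in-memory stand-ins replace the real tables):

```python
import hashlib

_indexed_hashes: set[str] = set()     # stand-in for the indexed-documents table
_ownership: dict[int, set[str]] = {}  # stand-in for the user -> document mapping

def ingest(user_id: int, pdf_bytes: bytes) -> str:
    """Deduplicated ingestion: index each unique file once, but always record ownership."""
    file_hash = hashlib.sha256(pdf_bytes).hexdigest()
    if file_hash not in _indexed_hashes:
        # Parse, chunk, embed and index only the first time this exact file is seen.
        _indexed_hashes.add(file_hash)
    _ownership.setdefault(user_id, set()).add(file_hash)  # retrieval stays user-scoped
    return file_hash
```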

I also implemented LLM-based document filtering:

- The system sends all of the user's documents (filename, summary, preview) to the LLM
- The LLM semantically analyzes them and selects only the documents that are truly relevant to the query
- It returns a JSON array of the relevant file hashes
- It is not forced to return a capped number of documents
- Fallback returns all candidate document hashes if the LLM call fails
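
A small sketch of the parsing-with-fallback step; the function and variable names are illustrative:

```python
import json

def parse_document_filter(raw_llm_output: str, candidate_hashes: list[str]) -> list[str]:
    """Interpret the LLM's JSON array of relevant file hashes; fall back to all
    candidates if the output cannot be used."""
    try:
        selected = json.loads(raw_llm_output)
    except (TypeError, json.JSONDecodeError):
        return candidate_hashes
    if not isinstance(selected, list):
        return candidate_hashes
    # Keep only hashes that actually belong to the user's candidate documents.
    return [h for h in selected if h in candidate_hashes]
```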

## Challenges I Ran Into

1. Heavy embedding dependencies made deployment images too large.
   - I standardized on Jina API embeddings/reranking to keep the runtime lighter while preserving retrieval quality.
2. Source rendering got messy across multiple chat turns.
   - I separated answer text from source payloads and extracted sources per turn.
3. Intermittent DB DNS/pooler issues during deployment.
   - I improved connection handling and standardized Supabase transaction-pooler config.
4. UI state was getting lost after document uploads.
   - I persisted the active chat thread in session storage so the current conversation remains visible after refresh.

## If I Had More Time

- Add conversation history UI to display past chat sessions
- Add automated citation-faithfulness checks
- Add Alembic migrations for cleaner schema evolution
- Add stronger eval/observability for routing and retrieval quality

## Local Setup

```bash
cp .env.example .env
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
uvicorn app.main:app --reload
```

Open: `http://127.0.0.1:8000`

## Important Environment Variables

Required:
- `GROQ_API_KEY`
- `SECRET_KEY`
- `DATABASE_URL`
- `JINA_API_KEY`

Embeddings:
- `JINA_API_BASE` (default: `https://api.jina.ai/v1/embeddings`)
- `JINA_EMBEDDING_MODEL` (default: `jina-embeddings-v3`)
- `JINA_RERANKER_API_BASE` (default: `https://api.jina.ai/v1/rerank`)
- `JINA_RERANKER_MODEL` (default: `jina-reranker-v3`)
- `EMBEDDING_DIMENSIONS` (default: `1024`)
- `RETRIEVAL_K` (default minimum final context size: `4`)
- `RERANK_CANDIDATE_K` (default minimum rerank candidate pool: `12`)

Storage:
- `STORAGE_BACKEND=local|supabase`
- `SUPABASE_URL`
- `SUPABASE_SERVICE_ROLE_KEY`
- `SUPABASE_STORAGE_BUCKET`
- `SUPABASE_STORAGE_PREFIX`

Web search:
- `WEB_SEARCH_PROVIDER=duckduckgo|tavily`
- `TAVILY_API_KEY` (if using Tavily)

Auth:
- `ACCESS_TOKEN_EXPIRE_MINUTES` (default: `720`)
- For local development, lowering this can make login/logout testing easier

## API Endpoints

- `POST /register`
- `POST /login`
- `POST /logout`
- `POST /upload`
- `GET /documents`
- `DELETE /documents/{document_id}`
- `GET /documents/{document_id}/pdf`
- `POST /ask`
- `POST /ask/stream`

## Sample Documents

As requested in the assignment, sample PDFs are included in `test_documents/`.

## Railway Deployment

```bash
railway login
railway link
railway up
```

Set the same env vars in Railway service settings before deploying.