Spaces:

KunalShaw
/

RAG-Chatbot-Agentic-AI

Sleeping

App Files Files Community

RAG-Chatbot-Agentic-AI / architecture.md

KUNAL SHAW

Initial commit: RAG Chatbot for Agentic AI eBook with LangGraph, Pinecone, and Groq

f9c215a about 1 month ago

preview code

raw

history blame contribute delete

14 kB

	# Architecture Overview

	This document explains the architecture of the RAG (Retrieval-Augmented Generation) chatbot for the Agentic AI eBook.

	## System Overview

	The system follows a standard RAG pattern: documents are chunked and embedded into a vector database during ingestion, then at query time, relevant chunks are retrieved and used to generate grounded answers.

	### Key Components

	1. Ingestion Pipeline (`app/ingest.py`) - Processes the PDF, creates chunks, generates embeddings, and stores in Pinecone
	2. Vector Store (`app/vectorstore.py`) - Wrapper around Pinecone for storing and retrieving vectors
	3. RAG Pipeline (`app/rag_pipeline.py`) - LangGraph-based pipeline for query processing
	4. Streamlit UI (`streamlit_app/app.py`) - Web interface for user interactions

	---

	## Architecture Diagram

	```
	INGESTION FLOW
	┌─────────────────────────────────────────────────────────────────────────────┐
	│ │
	│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
	│ │ PDF │───▶│ Extract │───▶│ Clean │───▶│ Chunk │ │
	│ │ File │ │ Text │ │ Text │ │ (500 tokens, │ │
	│ │ │ │ by Page │ │ │ │ 50 overlap) │ │
	│ └──────────┘ └──────────┘ └──────────┘ └────────┬─────────┘ │
	│ │ │
	│ ▼ │
	│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
	│ │ Pinecone │◀───│ Upsert │◀───│ Embeddings │ │
	│ │ Vector Store │ │ Vectors │ │ (MiniLM-L6-v2) │ │
	│ │ │ │ │ │ 384 dims │ │
	│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │
	│ │
	└─────────────────────────────────────────────────────────────────────────────┘


	QUERY FLOW
	┌─────────────────────────────────────────────────────────────────────────────┐
	│ │
	│ ┌──────────┐ │
	│ │ User │ │
	│ │ Query │ │
	│ └────┬─────┘ │
	│ │ │
	│ ▼ │
	│ ┌──────────────────────────────────────────────────────────────────┐ │
	│ │ LANGGRAPH PIPELINE │ │
	│ │ │ │
	│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
	│ │ │ Embed │──▶│ Retrieve │──▶│ Calculate │ │ │
	│ │ │ Query │ │ Top-K │ │ Confidence │ │ │
	│ │ │ │ │ Chunks │ │ │ │ │
	│ │ └─────────────┘ └──────┬──────┘ └──────┬──────┘ │ │
	│ │ │ │ │ │
	│ │ ▼ ▼ │ │
	│ │ ┌─────────────────────────────┐ │ │
	│ │ │ Generate Answer │ │ │
	│ │ │ │ │ │
	│ │ │ ┌─────────────────────┐ │ │ │
	│ │ │ │ If OpenAI Key: │ │ │ │
	│ │ │ │ → LLM Generation │ │ │ │
	│ │ │ │ (grounded prompt) │ │ │ │
	│ │ │ ├─────────────────────┤ │ │ │
	│ │ │ │ Else: │ │ │ │
	│ │ │ │ → Extractive Mode │ │ │ │
	│ │ │ │ (return chunks) │ │ │ │
	│ │ │ └─────────────────────┘ │ │ │
	│ │ └─────────────┬───────────────┘ │ │
	│ │ │ │ │
	│ └──────────────────────────────────┼───────────────────────────────┘ │
	│ │ │
	│ ▼ │
	│ ┌──────────────────────────────────────────────────────────────────┐ │
	│ │ RESPONSE │ │
	│ │ { │ │
	│ │ "final_answer": "...", │ │
	│ │ "retrieved_chunks": [...], │ │
	│ │ "confidence": 0.92 │ │
	│ │ } │ │
	│ └──────────────────────────────────────────────────────────────────┘ │
	│ │ │
	│ ▼ │
	│ ┌──────────────────────────────────────────────────────────────────┐ │
	│ │ STREAMLIT UI │ │
	│ │ ┌──────────────────┐ ┌───────────────────────────────────┐ │ │
	│ │ │ Chat Interface │ │ Retrieved Chunks Panel │ │ │
	│ │ │ - Question box │ │ - Chunk text │ │ │
	│ │ │ - Answer card │ │ - Page numbers │ │ │
	│ │ │ - Confidence │ │ - Relevance scores │ │ │
	│ │ └──────────────────┘ └───────────────────────────────────┘ │ │
	│ └──────────────────────────────────────────────────────────────────┘ │
	│ │
	└─────────────────────────────────────────────────────────────────────────────┘
	```

	---

	## Design Decisions

	### 1. Chunking Strategy

	We use 500 tokens as the target chunk size with 50-100 token overlap. This provides:
	- Enough context for meaningful retrieval
	- Overlap ensures important information spanning chunk boundaries isn't lost
	- Token counting via tiktoken ensures consistent chunk sizes across different text densities

	Chunk ID Format: `pdfpage_{page}_chunk_{index}` - This makes it easy to trace retrieved content back to the source PDF page for verification.

	### 2. Embedding Model Choice

	We use sentence-transformers/all-MiniLM-L6-v2:
	- Open source and free (no API costs)
	- Small model (384 dimensions) = fast inference and lower storage costs
	- Good quality for semantic similarity tasks
	- Can run entirely on CPU

	Trade-off: Larger models like OpenAI's ada-002 (1536 dims) may provide better retrieval quality, but MiniLM offers excellent cost/performance ratio for this use case.

	### 3. LangGraph Pipeline

	The RAG pipeline uses LangGraph for orchestration because:
	- Clear separation of pipeline stages (embed → retrieve → generate)
	- Easy to add/modify nodes (e.g., reranking, query expansion)
	- Built-in state management
	- Aligns with modern LLM application patterns

	### 4. Dual-Mode Answer Generation

	The system supports two modes:

	LLM Generation Mode (with OpenAI key):
	- Uses GPT-3.5-turbo for natural language generation
	- System prompt strictly instructs the model to only use provided chunks
	- Produces more readable, synthesized answers

	Extractive Fallback Mode (no API key):
	- Returns relevant chunks directly with minimal formatting
	- Always works, even offline
	- Ensures the app is functional without paid APIs

	This design choice ensures the application is always functional regardless of API availability.

	### 5. Confidence Score Computation

	Confidence is computed from retrieval similarity scores:

	```python
	# Normalize cosine similarity from [-1, 1] to [0, 1]
	normalized = (score + 1) / 2

	# Use maximum normalized score as confidence
	confidence = max(normalized_scores)
	```

	This gives users an intuitive sense of how well the retrieved chunks match their query.

	---

	## File Structure

	```
	rag-eAgenticAI/
	├── app/
	│ ├── __init__.py # Package exports
	│ ├── ingest.py # PDF → chunks → embeddings → Pinecone
	│ ├── vectorstore.py # Pinecone wrapper (create, upsert, query)
	│ ├── rag_pipeline.py # LangGraph pipeline + answer generation
	│ └── utils.py # Chunking, cleaning, confidence calculation
	│
	├── streamlit_app/
	│ ├── app.py # Main Streamlit application
	│ └── assets/ # Static assets (images, CSS)
	│
	├── samples/
	│ ├── sample_queries.txt # Example questions to test
	│ └── expected_responses.md # Expected JSON response format
	│
	├── infra/
	│ └── hf_space_readme_template.md # Hugging Face Spaces config
	│
	├── data/ # PDF files and generated chunks (gitignored)
	│
	├── README.md # Main documentation
	├── architecture.md # This file
	├── requirements.txt # Python dependencies
	├── LICENSE # MIT License
	└── .gitignore # Git ignore rules
	```

	---

	## Data Flow Summary

	1. Ingestion (run once):
	- PDF → pdfplumber → raw text by page
	- Text → clean_text() → cleaned text
	- Cleaned text → chunk_text() → chunks with metadata
	- Chunks → SentenceTransformer → embeddings
	- Embeddings → Pinecone upsert → stored vectors

	2. Query (each user question):
	- Question → SentenceTransformer → query embedding
	- Query embedding → Pinecone query → top-k chunks
	- Chunks + scores → compute_confidence() → confidence score
	- Chunks + question → LLM/extractive → final answer
	- Answer + chunks + confidence → JSON response → Streamlit UI