Spaces:

KunalShaw
/

RAG-Chatbot-Agentic-AI

Sleeping

File size: 13,991 Bytes

f9c215a

# Architecture Overview

This document explains the architecture of the RAG (Retrieval-Augmented Generation) chatbot for the Agentic AI eBook.

## System Overview

The system follows a standard RAG pattern: documents are chunked and embedded into a vector database during ingestion, then at query time, relevant chunks are retrieved and used to generate grounded answers.

### Key Components

1. **Ingestion Pipeline** (`app/ingest.py`) - Processes the PDF, creates chunks, generates embeddings, and stores in Pinecone
2. **Vector Store** (`app/vectorstore.py`) - Wrapper around Pinecone for storing and retrieving vectors
3. **RAG Pipeline** (`app/rag_pipeline.py`) - LangGraph-based pipeline for query processing
4. **Streamlit UI** (`streamlit_app/app.py`) - Web interface for user interactions

---

## Architecture Diagram

```
                              INGESTION FLOW
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│    │   PDF    │───▶│  Extract │───▶│  Clean   │───▶│     Chunk        │   │
│    │  File    │    │   Text   │    │   Text   │    │  (500 tokens,    │   │
│    │          │    │ by Page  │    │          │    │   50 overlap)    │   │
│    └──────────┘    └──────────┘    └──────────┘    └────────┬─────────┘   │
│                                                              │             │
│                                                              ▼             │
│    ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐   │
│    │     Pinecone     │◀───│     Upsert       │◀───│   Embeddings     │   │
│    │   Vector Store   │    │    Vectors       │    │  (MiniLM-L6-v2)  │   │
│    │                  │    │                  │    │   384 dims       │   │
│    └──────────────────┘    └──────────────────┘    └──────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘


                               QUERY FLOW
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│    ┌──────────┐                                                             │
│    │   User   │                                                             │
│    │  Query   │                                                             │
│    └────┬─────┘                                                             │
│         │                                                                   │
│         ▼                                                                   │
│    ┌──────────────────────────────────────────────────────────────────┐    │
│    │                    LANGGRAPH PIPELINE                             │    │
│    │                                                                   │    │
│    │  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐            │    │
│    │  │   Embed     │──▶│  Retrieve   │──▶│  Calculate  │            │    │
│    │  │   Query     │   │   Top-K     │   │ Confidence  │            │    │
│    │  │             │   │  Chunks     │   │             │            │    │
│    │  └─────────────┘   └──────┬──────┘   └──────┬──────┘            │    │
│    │                           │                 │                    │    │
│    │                           ▼                 ▼                    │    │
│    │                    ┌─────────────────────────────┐               │    │
│    │                    │     Generate Answer         │               │    │
│    │                    │                             │               │    │
│    │                    │  ┌─────────────────────┐   │               │    │
│    │                    │  │ If OpenAI Key:      │   │               │    │
│    │                    │  │  → LLM Generation   │   │               │    │
│    │                    │  │  (grounded prompt)  │   │               │    │
│    │                    │  ├─────────────────────┤   │               │    │
│    │                    │  │ Else:               │   │               │    │
│    │                    │  │  → Extractive Mode  │   │               │    │
│    │                    │  │  (return chunks)    │   │               │    │
│    │                    │  └─────────────────────┘   │               │    │
│    │                    └─────────────┬───────────────┘               │    │
│    │                                  │                               │    │
│    └──────────────────────────────────┼───────────────────────────────┘    │
│                                       │                                     │
│                                       ▼                                     │
│    ┌──────────────────────────────────────────────────────────────────┐    │
│    │                         RESPONSE                                  │    │
│    │  {                                                                │    │
│    │    "final_answer": "...",                                        │    │
│    │    "retrieved_chunks": [...],                                    │    │
│    │    "confidence": 0.92                                            │    │
│    │  }                                                                │    │
│    └──────────────────────────────────────────────────────────────────┘    │
│                                       │                                     │
│                                       ▼                                     │
│    ┌──────────────────────────────────────────────────────────────────┐    │
│    │                      STREAMLIT UI                                 │    │
│    │  ┌──────────────────┐  ┌───────────────────────────────────┐    │    │
│    │  │  Chat Interface  │  │  Retrieved Chunks Panel           │    │    │
│    │  │  - Question box  │  │  - Chunk text                     │    │    │
│    │  │  - Answer card   │  │  - Page numbers                   │    │    │
│    │  │  - Confidence    │  │  - Relevance scores               │    │    │
│    │  └──────────────────┘  └───────────────────────────────────┘    │    │
│    └──────────────────────────────────────────────────────────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Design Decisions

### 1. Chunking Strategy

We use **500 tokens** as the target chunk size with **50-100 token overlap**. This provides:
- Enough context for meaningful retrieval
- Overlap ensures important information spanning chunk boundaries isn't lost
- Token counting via tiktoken ensures consistent chunk sizes across different text densities

**Chunk ID Format**: `pdfpage_{page}_chunk_{index}` - This makes it easy to trace retrieved content back to the source PDF page for verification.

### 2. Embedding Model Choice

We use **sentence-transformers/all-MiniLM-L6-v2**:
- Open source and free (no API costs)
- Small model (384 dimensions) = fast inference and lower storage costs
- Good quality for semantic similarity tasks
- Can run entirely on CPU

Trade-off: Larger models like OpenAI's ada-002 (1536 dims) may provide better retrieval quality, but MiniLM offers excellent cost/performance ratio for this use case.

### 3. LangGraph Pipeline

The RAG pipeline uses LangGraph for orchestration because:
- Clear separation of pipeline stages (embed → retrieve → generate)
- Easy to add/modify nodes (e.g., reranking, query expansion)
- Built-in state management
- Aligns with modern LLM application patterns

### 4. Dual-Mode Answer Generation

The system supports two modes:

**LLM Generation Mode** (with OpenAI key):
- Uses GPT-3.5-turbo for natural language generation
- System prompt strictly instructs the model to only use provided chunks
- Produces more readable, synthesized answers

**Extractive Fallback Mode** (no API key):
- Returns relevant chunks directly with minimal formatting
- Always works, even offline
- Ensures the app is functional without paid APIs

This design choice ensures the application is **always functional** regardless of API availability.

### 5. Confidence Score Computation

Confidence is computed from retrieval similarity scores:

```python
# Normalize cosine similarity from [-1, 1] to [0, 1]
normalized = (score + 1) / 2

# Use maximum normalized score as confidence
confidence = max(normalized_scores)
```

This gives users an intuitive sense of how well the retrieved chunks match their query.

---

## File Structure

```
rag-eAgenticAI/
├── app/
│   ├── __init__.py          # Package exports
│   ├── ingest.py            # PDF → chunks → embeddings → Pinecone
│   ├── vectorstore.py       # Pinecone wrapper (create, upsert, query)
│   ├── rag_pipeline.py      # LangGraph pipeline + answer generation
│   └── utils.py             # Chunking, cleaning, confidence calculation
│
├── streamlit_app/
│   ├── app.py               # Main Streamlit application
│   └── assets/              # Static assets (images, CSS)
│
├── samples/
│   ├── sample_queries.txt   # Example questions to test
│   └── expected_responses.md # Expected JSON response format
│
├── infra/
│   └── hf_space_readme_template.md  # Hugging Face Spaces config
│
├── data/                    # PDF files and generated chunks (gitignored)
│
├── README.md                # Main documentation
├── architecture.md          # This file
├── requirements.txt         # Python dependencies
├── LICENSE                  # MIT License
└── .gitignore              # Git ignore rules
```

---

## Data Flow Summary

1. **Ingestion** (run once):
   - PDF → pdfplumber → raw text by page
   - Text → clean_text() → cleaned text
   - Cleaned text → chunk_text() → chunks with metadata
   - Chunks → SentenceTransformer → embeddings
   - Embeddings → Pinecone upsert → stored vectors

2. **Query** (each user question):
   - Question → SentenceTransformer → query embedding
   - Query embedding → Pinecone query → top-k chunks
   - Chunks + scores → compute_confidence() → confidence score
   - Chunks + question → LLM/extractive → final answer
   - Answer + chunks + confidence → JSON response → Streamlit UI