# Architecture Overview
This document explains the architecture of the RAG (Retrieval-Augmented Generation) chatbot for the Agentic AI eBook.
## System Overview
The system follows a standard RAG pattern: documents are chunked and embedded into a vector database during ingestion, then at query time, relevant chunks are retrieved and used to generate grounded answers.
## Key Components

- **Ingestion Pipeline** (`app/ingest.py`) - Processes the PDF, creates chunks, generates embeddings, and stores them in Pinecone
- **Vector Store** (`app/vectorstore.py`) - Wrapper around Pinecone for storing and retrieving vectors
- **RAG Pipeline** (`app/rag_pipeline.py`) - LangGraph-based pipeline for query processing
- **Streamlit UI** (`streamlit_app/app.py`) - Web interface for user interactions
## Architecture Diagram

```
INGESTION FLOW

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐
│   PDF    │───▶│ Extract  │───▶│  Clean   │───▶│      Chunk       │
│   File   │    │   Text   │    │   Text   │    │  (500 tokens,    │
│          │    │ by Page  │    │          │    │   50 overlap)    │
└──────────┘    └──────────┘    └──────────┘    └────────┬─────────┘
                                                         │
                                                         ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│     Pinecone     │◀───│      Upsert      │◀───│    Embeddings    │
│   Vector Store   │    │     Vectors      │    │  (MiniLM-L6-v2)  │
│                  │    │                  │    │     384 dims     │
└──────────────────┘    └──────────────────┘    └──────────────────┘
```

```
QUERY FLOW

┌──────────┐
│   User   │
│  Query   │
└────┬─────┘
     │
     ▼
┌──────────────────────────────────────────────────────────┐
│                    LANGGRAPH PIPELINE                    │
│                                                          │
│  ┌──────────┐   ┌──────────┐   ┌────────────┐            │
│  │  Embed   │──▶│ Retrieve │──▶│ Calculate  │            │
│  │  Query   │   │  Top-K   │   │ Confidence │            │
│  │          │   │  Chunks  │   │            │            │
│  └──────────┘   └────┬─────┘   └─────┬──────┘            │
│                      │               │                   │
│                      ▼               ▼                   │
│               ┌───────────────────────────┐              │
│               │      Generate Answer      │              │
│               │                           │              │
│               │  If OpenAI key:           │              │
│               │    LLM generation         │              │
│               │    (grounded prompt)      │              │
│               │  Else:                    │              │
│               │    Extractive mode        │              │
│               │    (return chunks)        │              │
│               └─────────────┬─────────────┘              │
└─────────────────────────────┼────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│                         RESPONSE                         │
│  {                                                       │
│    "final_answer": "...",                                │
│    "retrieved_chunks": [...],                            │
│    "confidence": 0.92                                    │
│  }                                                       │
└──────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│                       STREAMLIT UI                       │
│  ┌──────────────────┐  ┌──────────────────────────────┐  │
│  │  Chat Interface  │  │    Retrieved Chunks Panel    │  │
│  │  - Question box  │  │    - Chunk text              │  │
│  │  - Answer card   │  │    - Page numbers            │  │
│  │  - Confidence    │  │    - Relevance scores        │  │
│  └──────────────────┘  └──────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
```
## Design Decisions

### 1. Chunking Strategy
We use 500 tokens as the target chunk size with a 50-100 token overlap. This provides:

- Enough context for meaningful retrieval
- Overlap that preserves information spanning chunk boundaries
- Consistent chunk sizes across different text densities, via token counting with `tiktoken`

**Chunk ID format:** `pdfpage_{page}_chunk_{index}` - this makes it easy to trace retrieved content back to the source PDF page for verification.
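The chunking scheme above can be sketched as follows. The real pipeline counts tokens with `tiktoken`; this illustrative version approximates tokens with whitespace-separated words, and the function shape and field names are assumptions, not the actual `app/utils.py` code:

```python
def chunk_text(text: str, page: int, chunk_size: int = 500, overlap: int = 50):
    """Split page text into overlapping chunks with traceable IDs.

    Illustrative sketch: words stand in for tokens; the real pipeline
    uses tiktoken for token counting.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by size minus overlap
    for index, start in enumerate(range(0, max(len(words), 1), step)):
        window = words[start:start + chunk_size]
        if not window:
            break
        chunks.append({
            "id": f"pdfpage_{page}_chunk_{index}",  # chunk ID format above
            "text": " ".join(window),
            "page": page,
        })
    return chunks
```

With a 500-word window and a step of 450, consecutive chunks share exactly 50 words, so sentences straddling a boundary appear in full in at least one chunk.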
### 2. Embedding Model Choice

We use `sentence-transformers/all-MiniLM-L6-v2`:
- Open source and free (no API costs)
- Small model (384 dimensions) = fast inference and lower storage costs
- Good quality for semantic similarity tasks
- Can run entirely on CPU
**Trade-off:** Larger models like OpenAI's `text-embedding-ada-002` (1536 dims) may provide better retrieval quality, but MiniLM offers an excellent cost/performance ratio for this use case.
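The storage side of this trade-off is easy to quantify. Assuming float32 vectors (4 bytes per dimension, the usual default for dense indexes):

```python
# Back-of-envelope storage comparison for the two embedding choices
# discussed above, assuming float32 vectors (4 bytes per dimension).
def vector_bytes(dims: int, dtype_bytes: int = 4) -> int:
    """Storage for a single dense vector."""
    return dims * dtype_bytes

minilm = vector_bytes(384)    # all-MiniLM-L6-v2
ada = vector_bytes(1536)      # text-embedding-ada-002

print(minilm, ada, ada // minilm)  # 1536 6144 4
```

So ada-002 vectors take 4x the storage per chunk before any metadata, which compounds across every chunk in the index.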
### 3. LangGraph Pipeline
The RAG pipeline uses LangGraph for orchestration because:
- Clear separation of pipeline stages (embed → retrieve → generate)
- Easy to add/modify nodes (e.g., reranking, query expansion)
- Built-in state management
- Aligns with modern LLM application patterns
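The actual implementation builds this with LangGraph's graph/state API; as a plain-Python sketch of the same idea (node names and state keys here are illustrative, not the actual `app/rag_pipeline.py` code), each node reads the shared state, adds its output, and passes the state on:

```python
# Illustrative staged pipeline: each node transforms a shared state dict,
# mirroring the embed -> retrieve -> generate flow described above.
def embed_query(state):
    state["query_embedding"] = [0.1] * 384  # stand-in for the MiniLM encoder
    return state

def retrieve(state):
    # Stand-in for a Pinecone top-k query
    state["retrieved_chunks"] = [{"id": "pdfpage_1_chunk_0", "score": 0.84}]
    return state

def generate(state):
    state["final_answer"] = f"Answer grounded in {len(state['retrieved_chunks'])} chunk(s)"
    return state

def run_pipeline(question):
    state = {"question": question}
    for node in (embed_query, retrieve, generate):
        state = node(state)
    return state
```

Because each stage only touches the state dict, inserting a new node (say, a reranker between retrieve and generate) is a local change, which is the property LangGraph formalizes.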
### 4. Dual-Mode Answer Generation

The system supports two modes:

**LLM Generation Mode** (with OpenAI key):
- Uses GPT-3.5-turbo for natural language generation
- System prompt strictly instructs the model to use only the provided chunks
- Produces more readable, synthesized answers

**Extractive Fallback Mode** (no API key):
- Returns relevant chunks directly with minimal formatting
- Always works, even offline
- Ensures the app is functional without paid APIs
This design choice ensures the application is always functional regardless of API availability.
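A minimal sketch of the mode switch, assuming a hypothetical `llm` callable standing in for the GPT-3.5-turbo request (the real prompt and call live in `app/rag_pipeline.py`):

```python
import os

def generate_answer(question, chunks, llm=None):
    """Dual-mode generation: LLM when an OpenAI key is configured,
    extractive fallback otherwise. `llm` is a hypothetical callable
    standing in for the GPT-3.5-turbo request."""
    if os.getenv("OPENAI_API_KEY") and llm is not None:
        # Grounded prompt: the model is given only the retrieved chunks
        context = "\n\n".join(c["text"] for c in chunks)
        return llm(question=question, context=context)
    # Extractive fallback: return the chunks with minimal formatting
    return "\n\n".join(f"[p.{c['page']}] {c['text']}" for c in chunks)
```

Without a key the function never touches the network, which is what keeps the app functional offline.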
### 5. Confidence Score Computation

Confidence is computed from retrieval similarity scores:

```python
# Normalize cosine similarity from [-1, 1] to [0, 1]
normalized_scores = [(score + 1) / 2 for score in scores]

# Use the maximum normalized score as the confidence
confidence = max(normalized_scores)
```
This gives users an intuitive sense of how well the retrieved chunks match their query.
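Wrapped as a function (mirroring what `compute_confidence()` in `app/utils.py` is described as doing; the exact signature is an assumption), a top similarity of 0.84 maps to the 0.92 confidence shown in the example response:

```python
def compute_confidence(scores):
    """Map cosine-similarity scores from [-1, 1] to [0, 1] and take
    the best match as the overall confidence."""
    normalized_scores = [(score + 1) / 2 for score in scores]
    return max(normalized_scores)

# A best similarity of 0.84 normalizes to (0.84 + 1) / 2 = 0.92
confidence = compute_confidence([0.84, 0.31, -0.02])
```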
## File Structure

```
rag-eAgenticAI/
├── app/
│   ├── __init__.py                  # Package exports
│   ├── ingest.py                    # PDF → chunks → embeddings → Pinecone
│   ├── vectorstore.py               # Pinecone wrapper (create, upsert, query)
│   ├── rag_pipeline.py              # LangGraph pipeline + answer generation
│   └── utils.py                     # Chunking, cleaning, confidence calculation
│
├── streamlit_app/
│   ├── app.py                       # Main Streamlit application
│   └── assets/                      # Static assets (images, CSS)
│
├── samples/
│   ├── sample_queries.txt           # Example questions to test
│   └── expected_responses.md        # Expected JSON response format
│
├── infra/
│   └── hf_space_readme_template.md  # Hugging Face Spaces config
│
├── data/                            # PDF files and generated chunks (gitignored)
│
├── README.md                        # Main documentation
├── architecture.md                  # This file
├── requirements.txt                 # Python dependencies
├── LICENSE                          # MIT License
└── .gitignore                       # Git ignore rules
```
## Data Flow Summary
**Ingestion** (run once):

- PDF → pdfplumber → raw text by page
- Text → `clean_text()` → cleaned text
- Cleaned text → `chunk_text()` → chunks with metadata
- Chunks → SentenceTransformer → embeddings
- Embeddings → Pinecone upsert → stored vectors
**Query** (each user question):

- Question → SentenceTransformer → query embedding
- Query embedding → Pinecone query → top-k chunks
- Chunks + scores → `compute_confidence()` → confidence score
- Chunks + question → LLM/extractive → final answer
- Answer + chunks + confidence → JSON response → Streamlit UI
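The last query step amounts to assembling a small dict and serializing it for the UI. A sketch with illustrative values, matching the RESPONSE shape in the architecture diagram:

```python
import json

# Final response assembly at the end of the query flow; the dict shape
# matches the RESPONSE block in the diagram, and the values here are
# illustrative.
def build_response(final_answer, retrieved_chunks, confidence):
    return {
        "final_answer": final_answer,
        "retrieved_chunks": retrieved_chunks,
        "confidence": confidence,
    }

response = build_response(
    final_answer="...",
    retrieved_chunks=[{"id": "pdfpage_1_chunk_0", "score": 0.84}],
    confidence=0.92,
)
print(json.dumps(response, indent=2))
```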