| # RAG Implementation Log |
|
|
| ## Progress Tracking |
|
|
| Implementation of the RAG system based on PLAN.md β COMPLETE β |
|
|
| ## 2026-03-26 β Full Implementation |
|
|
| ### Phase 1: Environment & Dependencies β |
| - Created `requirements.txt` with all necessary Python packages |
| - Created `config.py` with: |
| - Document source paths (lectures, datasheets, app notes, source code) |
| - Ollama configuration (base URL, model selection) |
| - Chunking parameters (size, overlap for code and prose) |
| - ChromaDB persistence settings |
|
|
| ### Phase 2: Document Ingestion β |
| - Implemented `ingest/code_loader.py` |
| - Loads instructor solution source files (.cpp, .h, .c) |
| - Skips student templates (noise reduction) |
| - Adds source headers for citation tracking |
| |
| - Implemented `ingest/pptx_extract.py` |
| - Extracts text from PowerPoint slides |
| - One document per slide for granular retrieval |
| - Preserves slide numbers for citations |
|
|
| - Implemented `ingest/pdf_ocr.py` |
| - Fast path: pdfplumber for native-text PDFs |
| - Fallback: PaddleOCR for image-heavy/scanned PDFs |
| - Sparsity detection to choose best extraction method |
| - Page-level granularity for citations |
|
|
| - Implemented `ingest/chunker.py` |
| - Overlapping text chunks (langchain RecursiveCharacterTextSplitter) |
| - Different strategies for code vs prose |
| - Code separators: function/class boundaries |
| - Chunk metadata includes source, page, assignment info |
|
|
| ### Phase 3: Embedding & Vector Store β |
| - Implemented `vectorstore/embedder.py` |
| - Calls Ollama `/api/embeddings` endpoint |
| - Wraps nomic-embed-text model (768-dim vectors) |
| - Includes error handling with zero-vector fallback |
|
|
| - Implemented `vectorstore/store.py` |
| - ChromaDB persistent client management |
| - Custom OllamaEmbeddingFunction class for integration |
| - `add_documents()` β store chunks with embeddings |
| - `query()` β retrieve top-k similar chunks |
| - Cosine similarity metric for document retrieval |
|
|
| ### Phase 4: Query Pipeline β |
| - Implemented `query/retriever.py` |
| - Simple wrapper around vector store queries |
| - Configurable top-k retrieval (default 5) |
|
|
| - Implemented `query/prompt_builder.py` |
| - System prompt guides LLM to use context only |
| - Formats retrieved chunks with source citations |
| - Builds structured messages for Ollama chat API |
|
|
| - Implemented `query/generator.py` |
| - Calls Ollama `/api/chat` endpoint |
| - Handles errors gracefully |
| - Returns response text directly |
|
|
| ### Phase 5: CLI Scripts β |
| - Implemented `scripts/ingest_all.py` |
| - Orchestrates full pipeline: extraction β chunking β embedding β storage |
| - Walks all document directories recursively |
| - Separates code vs prose for appropriate chunking |
| - `--dry-run` flag for OCR quality testing |
| - Prints summary statistics per category |
|
|
| - Implemented `scripts/query_cli.py` |
| - Interactive loop for asking questions |
| - Shows retrieved chunks on `--verbose` flag |
| - Displays source citations with each answer |
| - Clean formatting for terminal output |
|
|
| - Implemented `scripts/launch_ui.py` |
| - Gradio web interface on localhost:7860 |
| - Text input for questions |
| - Toggle to show/hide retrieved sources |
| - User-friendly markdown output for answers |
|
|
| ### Phase 6: Testing β |
| - Implemented `tests/test_ingest.py` |
| - Verifies code loader finds instructor files |
| - Checks that student directories are skipped |
| - Tests chunking respects size bounds |
| - Code chunks use appropriate larger sizes |
|
|
| - Implemented `tests/test_retrieval.py` |
| - Tests ChromaDB collection initialization |
| - Validates add_documents and query interface |
| - Checks retrieval respects top-k parameter |
| - Tests retrieve function structure |
| |
| - Implemented `tests/test_end_to_end.py` |
| - Full pipeline interface tests |
| - Prompt building with context validation |
| - Generation interface verification |
| - Graceful skipping when Ollama unavailable |
|
|
| ### Documentation & Configuration β |
| - Created `.gitignore` to exclude: |
| - Virtual environment |
| - ChromaDB persistent storage |
| - Cache and build artifacts |
| |
| - Created `README.md` with: |
| - Quick start guide |
| - Installation instructions |
| - Configuration options |
| - Example queries |
| - Architecture diagram |
| - Troubleshooting guide |
| - Known limitations |
|
|
| ## Implementation Statistics |
|
|
| - **Total Python files:** 17 |
| - **Total lines of code:** ~1400 |
| - **Phases completed:** 6/6 β |
|
|
| ## Directory Structure (Final) |
|
|
| ``` |
| rag/ |
| βββ README.md # User guide |
| βββ PLAN.md # Architecture plan |
| βββ LOG.md # This file |
| βββ requirements.txt # Python dependencies |
| βββ config.py # Centralized configuration |
| βββ .gitignore # Git exclusions |
| βββ ingest/ # Document extraction |
| β βββ __init__.py |
| β βββ code_loader.py |
| β βββ pptx_extract.py |
| β βββ pdf_ocr.py |
| β βββ chunker.py |
| βββ vectorstore/ # Vector storage |
| β βββ __init__.py |
| β βββ embedder.py |
| β βββ store.py |
| βββ query/ # Query pipeline |
| β βββ __init__.py |
| β βββ retriever.py |
| β βββ prompt_builder.py |
| β βββ generator.py |
| βββ scripts/ # CLI tools |
| β βββ ingest_all.py |
| β βββ query_cli.py |
| β βββ launch_ui.py |
| βββ tests/ # Test suite |
| β βββ __init__.py |
| β βββ test_ingest.py |
| β βββ test_retrieval.py |
| β βββ test_end_to_end.py |
| βββ chroma_db/ # Vector storage (gitignored) |
| βββ [ChromaDB data] |
| ``` |
|
|
| ## Next Steps for Usage |
|
|
| 1. Install dependencies: `pip install -r requirements.txt` |
| 2. Ensure Ollama is running: `ollama serve` |
| 3. Ingest documents: `python scripts/ingest_all.py` |
| 4. Query: |
| - CLI: `python scripts/query_cli.py` |
| - Web UI: `python scripts/launch_ui.py` |
|
|
| ## Key Design Decisions |
|
|
| 1. **Pdfplumber + PaddleOCR fallback** β Fast for native PDFs, handles scanned documents |
| 2. **ChromaDB** β Embedded vector store, no server needed, persistent on disk |
| 3. **Ollama local inference** β Privacy-respecting, no API costs, full control |
| 4. **Instructor-only code indexing** β Reduces noise, focuses on solutions |
| 5. **Page/slide-level granularity** β Precise citations, better UX |
| 6. **Separate code chunking strategy** β Respects function boundaries |
| 7. **Modular architecture** β Each component independently testable |
|
|
| --- |
|
|
| **Implementation Status: READY FOR TESTING** β |
|
|
| All core functionality implemented. System is ready for: |
| - Installing dependencies |
| - Running ingestion pipeline |
| - Testing with CLI and web UI |
| - Integration into course workflow |
|
|