445-bot / LOG.md
HokieBird's picture
Deploy RAG update β€” 2026-03-27 20:23
9b62bba

A newer version of the Gradio SDK is available: 6.17.3

Upgrade

RAG Implementation Log

Progress Tracking

Implementation of the RAG system based on PLAN.md β€” COMPLETE βœ“

2026-03-26 β€” Full Implementation

Phase 1: Environment & Dependencies βœ“

  • Created requirements.txt with all necessary Python packages
  • Created config.py with:
    • Document source paths (lectures, datasheets, app notes, source code)
    • Ollama configuration (base URL, model selection)
    • Chunking parameters (size, overlap for code and prose)
    • ChromaDB persistence settings

Phase 2: Document Ingestion βœ“

  • Implemented ingest/code_loader.py

    • Loads instructor solution source files (.cpp, .h, .c)
    • Skips student templates (noise reduction)
    • Adds source headers for citation tracking
  • Implemented ingest/pptx_extract.py

    • Extracts text from PowerPoint slides
    • One document per slide for granular retrieval
    • Preserves slide numbers for citations
  • Implemented ingest/pdf_ocr.py

    • Fast path: pdfplumber for native-text PDFs
    • Fallback: PaddleOCR for image-heavy/scanned PDFs
    • Sparsity detection to choose best extraction method
    • Page-level granularity for citations
  • Implemented ingest/chunker.py

    • Overlapping text chunks (langchain RecursiveCharacterTextSplitter)
    • Different strategies for code vs prose
    • Code separators: function/class boundaries
    • Chunk metadata includes source, page, assignment info

Phase 3: Embedding & Vector Store βœ“

  • Implemented vectorstore/embedder.py

    • Calls Ollama /api/embeddings endpoint
    • Wraps nomic-embed-text model (768-dim vectors)
    • Includes error handling with zero-vector fallback
  • Implemented vectorstore/store.py

    • ChromaDB persistent client management
    • Custom OllamaEmbeddingFunction class for integration
    • add_documents() β€” store chunks with embeddings
    • query() β€” retrieve top-k similar chunks
    • Cosine similarity metric for document retrieval

Phase 4: Query Pipeline βœ“

  • Implemented query/retriever.py

    • Simple wrapper around vector store queries
    • Configurable top-k retrieval (default 5)
  • Implemented query/prompt_builder.py

    • System prompt guides LLM to use context only
    • Formats retrieved chunks with source citations
    • Builds structured messages for Ollama chat API
  • Implemented query/generator.py

    • Calls Ollama /api/chat endpoint
    • Handles errors gracefully
    • Returns response text directly

Phase 5: CLI Scripts βœ“

  • Implemented scripts/ingest_all.py

    • Orchestrates full pipeline: extraction β†’ chunking β†’ embedding β†’ storage
    • Walks all document directories recursively
    • Separates code vs prose for appropriate chunking
    • --dry-run flag for OCR quality testing
    • Prints summary statistics per category
  • Implemented scripts/query_cli.py

    • Interactive loop for asking questions
    • Shows retrieved chunks on --verbose flag
    • Displays source citations with each answer
    • Clean formatting for terminal output
  • Implemented scripts/launch_ui.py

    • Gradio web interface on localhost:7860
    • Text input for questions
    • Toggle to show/hide retrieved sources
    • User-friendly markdown output for answers

Phase 6: Testing βœ“

  • Implemented tests/test_ingest.py

    • Verifies code loader finds instructor files
    • Checks that student directories are skipped
    • Tests chunking respects size bounds
    • Code chunks use appropriate larger sizes
  • Implemented tests/test_retrieval.py

    • Tests ChromaDB collection initialization
    • Validates add_documents and query interface
    • Checks retrieval respects top-k parameter
    • Tests retrieve function structure
  • Implemented tests/test_end_to_end.py

    • Full pipeline interface tests
    • Prompt building with context validation
    • Generation interface verification
    • Graceful skipping when Ollama unavailable

Documentation & Configuration βœ“

  • Created .gitignore to exclude:

    • Virtual environment
    • ChromaDB persistent storage
    • Cache and build artifacts
  • Created README.md with:

    • Quick start guide
    • Installation instructions
    • Configuration options
    • Example queries
    • Architecture diagram
    • Troubleshooting guide
    • Known limitations

Implementation Statistics

  • Total Python files: 17
  • Total lines of code: ~1400
  • Phases completed: 6/6 βœ“

Directory Structure (Final)

rag/
β”œβ”€β”€ README.md                    # User guide
β”œβ”€β”€ PLAN.md                      # Architecture plan
β”œβ”€β”€ LOG.md                       # This file
β”œβ”€β”€ requirements.txt             # Python dependencies
β”œβ”€β”€ config.py                    # Centralized configuration
β”œβ”€β”€ .gitignore                   # Git exclusions
β”œβ”€β”€ ingest/                      # Document extraction
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ code_loader.py
β”‚   β”œβ”€β”€ pptx_extract.py
β”‚   β”œβ”€β”€ pdf_ocr.py
β”‚   └── chunker.py
β”œβ”€β”€ vectorstore/                 # Vector storage
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ embedder.py
β”‚   └── store.py
β”œβ”€β”€ query/                       # Query pipeline
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ retriever.py
β”‚   β”œβ”€β”€ prompt_builder.py
β”‚   └── generator.py
β”œβ”€β”€ scripts/                     # CLI tools
β”‚   β”œβ”€β”€ ingest_all.py
β”‚   β”œβ”€β”€ query_cli.py
β”‚   └── launch_ui.py
β”œβ”€β”€ tests/                       # Test suite
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ test_ingest.py
β”‚   β”œβ”€β”€ test_retrieval.py
β”‚   └── test_end_to_end.py
└── chroma_db/                   # Vector storage (gitignored)
    └── [ChromaDB data]

Next Steps for Usage

  1. Install dependencies: pip install -r requirements.txt
  2. Ensure Ollama is running: ollama serve
  3. Ingest documents: python scripts/ingest_all.py
  4. Query:
    • CLI: python scripts/query_cli.py
    • Web UI: python scripts/launch_ui.py

Key Design Decisions

  1. Pdfplumber + PaddleOCR fallback β€” Fast for native PDFs, handles scanned documents
  2. ChromaDB β€” Embedded vector store, no server needed, persistent on disk
  3. Ollama local inference β€” Privacy-respecting, no API costs, full control
  4. Instructor-only code indexing β€” Reduces noise, focuses on solutions
  5. Page/slide-level granularity β€” Precise citations, better UX
  6. Separate code chunking strategy β€” Respects function boundaries
  7. Modular architecture β€” Each component independently testable

Implementation Status: READY FOR TESTING βœ“

All core functionality implemented. System is ready for:

  • Installing dependencies
  • Running ingestion pipeline
  • Testing with CLI and web UI
  • Integration into course workflow