Spaces:

HokieBird
/

445-bot

Sleeping

App Files Files Community

445-bot / LOG.md

HokieBird

Deploy RAG update — 2026-03-27 20:23

9b62bba 2 months ago

preview code

raw

history blame contribute delete

6.72 kB

A newer version of the Gradio SDK is available: 6.17.3

Upgrade

RAG Implementation Log

Progress Tracking

Implementation of the RAG system based on PLAN.md — COMPLETE ✓

2026-03-26 — Full Implementation

Phase 1: Environment & Dependencies ✓

Created requirements.txt with all necessary Python packages
Created config.py with:
- Document source paths (lectures, datasheets, app notes, source code)
- Ollama configuration (base URL, model selection)
- Chunking parameters (size, overlap for code and prose)
- ChromaDB persistence settings

Phase 2: Document Ingestion ✓

Implemented ingest/code_loader.py
- Loads instructor solution source files (.cpp, .h, .c)
- Skips student templates (noise reduction)
- Adds source headers for citation tracking
Implemented ingest/pptx_extract.py
- Extracts text from PowerPoint slides
- One document per slide for granular retrieval
- Preserves slide numbers for citations
Implemented ingest/pdf_ocr.py
- Fast path: pdfplumber for native-text PDFs
- Fallback: PaddleOCR for image-heavy/scanned PDFs
- Sparsity detection to choose best extraction method
- Page-level granularity for citations
Implemented ingest/chunker.py
- Overlapping text chunks (langchain RecursiveCharacterTextSplitter)
- Different strategies for code vs prose
- Code separators: function/class boundaries
- Chunk metadata includes source, page, assignment info

Phase 3: Embedding & Vector Store ✓

Implemented vectorstore/embedder.py
- Calls Ollama /api/embeddings endpoint
- Wraps nomic-embed-text model (768-dim vectors)
- Includes error handling with zero-vector fallback
Implemented vectorstore/store.py
- ChromaDB persistent client management
- Custom OllamaEmbeddingFunction class for integration
- add_documents() — store chunks with embeddings
- query() — retrieve top-k similar chunks
- Cosine similarity metric for document retrieval

Phase 4: Query Pipeline ✓

Implemented query/retriever.py
- Simple wrapper around vector store queries
- Configurable top-k retrieval (default 5)
Implemented query/prompt_builder.py
- System prompt guides LLM to use context only
- Formats retrieved chunks with source citations
- Builds structured messages for Ollama chat API
Implemented query/generator.py
- Calls Ollama /api/chat endpoint
- Handles errors gracefully
- Returns response text directly

Phase 5: CLI Scripts ✓

Implemented scripts/ingest_all.py
- Orchestrates full pipeline: extraction → chunking → embedding → storage
- Walks all document directories recursively
- Separates code vs prose for appropriate chunking
- --dry-run flag for OCR quality testing
- Prints summary statistics per category
Implemented scripts/query_cli.py
- Interactive loop for asking questions
- Shows retrieved chunks on --verbose flag
- Displays source citations with each answer
- Clean formatting for terminal output
Implemented scripts/launch_ui.py
- Gradio web interface on localhost:7860
- Text input for questions
- Toggle to show/hide retrieved sources
- User-friendly markdown output for answers

Phase 6: Testing ✓

Implemented tests/test_ingest.py
- Verifies code loader finds instructor files
- Checks that student directories are skipped
- Tests chunking respects size bounds
- Code chunks use appropriate larger sizes
Implemented tests/test_retrieval.py
- Tests ChromaDB collection initialization
- Validates add_documents and query interface
- Checks retrieval respects top-k parameter
- Tests retrieve function structure
Implemented tests/test_end_to_end.py
- Full pipeline interface tests
- Prompt building with context validation
- Generation interface verification
- Graceful skipping when Ollama unavailable

Documentation & Configuration ✓

Created .gitignore to exclude:
- Virtual environment
- ChromaDB persistent storage
- Cache and build artifacts
Created README.md with:
- Quick start guide
- Installation instructions
- Configuration options
- Example queries
- Architecture diagram
- Troubleshooting guide
- Known limitations

Implementation Statistics

Total Python files: 17
Total lines of code: ~1400
Phases completed: 6/6 ✓

Directory Structure (Final)

rag/
├── README.md                    # User guide
├── PLAN.md                      # Architecture plan
├── LOG.md                       # This file
├── requirements.txt             # Python dependencies
├── config.py                    # Centralized configuration
├── .gitignore                   # Git exclusions
├── ingest/                      # Document extraction
│   ├── __init__.py
│   ├── code_loader.py
│   ├── pptx_extract.py
│   ├── pdf_ocr.py
│   └── chunker.py
├── vectorstore/                 # Vector storage
│   ├── __init__.py
│   ├── embedder.py
│   └── store.py
├── query/                       # Query pipeline
│   ├── __init__.py
│   ├── retriever.py
│   ├── prompt_builder.py
│   └── generator.py
├── scripts/                     # CLI tools
│   ├── ingest_all.py
│   ├── query_cli.py
│   └── launch_ui.py
├── tests/                       # Test suite
│   ├── __init__.py
│   ├── test_ingest.py
│   ├── test_retrieval.py
│   └── test_end_to_end.py
└── chroma_db/                   # Vector storage (gitignored)
    └── [ChromaDB data]

Next Steps for Usage

Install dependencies: pip install -r requirements.txt
Ensure Ollama is running: ollama serve
Ingest documents: python scripts/ingest_all.py
Query:
- CLI: python scripts/query_cli.py
- Web UI: python scripts/launch_ui.py

Key Design Decisions

Pdfplumber + PaddleOCR fallback — Fast for native PDFs, handles scanned documents
ChromaDB — Embedded vector store, no server needed, persistent on disk
Ollama local inference — Privacy-respecting, no API costs, full control
Instructor-only code indexing — Reduces noise, focuses on solutions
Page/slide-level granularity — Precise citations, better UX
Separate code chunking strategy — Respects function boundaries
Modular architecture — Each component independently testable

Implementation Status: READY FOR TESTING ✓

All core functionality implemented. System is ready for:

Installing dependencies
Running ingestion pipeline
Testing with CLI and web UI
Integration into course workflow