# RAG Capstone Project - Code Organization Guide

## 📁 Directory Structure After Code Review

```
RAG Capstone Project/
│
├─ 🎯 CORE APPLICATION (Active Production Code)
│  ├─ run.py ............................ Quick start launcher
│  ├─ streamlit_app.py .................. Main web interface
│  ├─ config.py ......................... Settings & configuration
│  ├─ vector_store.py ................... ChromaDB vector database
│  ├─ llm_client.py ..................... Groq LLM integration
│  ├─ embedding_models.py ............... Embedding factory pattern
│  ├─ chunking_strategies.py ............ Document chunking strategies
│  ├─ dataset_loader.py ................. RAGBench dataset loader
│  ├─ trace_evaluator.py ................ TRACE metric evaluation
│  ├─ advanced_rag_evaluator.py ......... Advanced metrics (RMSE, AUC)
│  └─ evaluation_pipeline.py ............ Evaluation orchestration
│
├─ 🛠️ UTILITIES & RECOVERY (Maintenance Tools)
│  ├─ rebuild_chroma_index.py ........... Rebuild ChromaDB indices
│  ├─ rebuild_sqlite_direct.py .......... Direct SQLite rebuild
│  ├─ recover_chroma_advanced.py ........ Advanced recovery utility
│  ├─ recover_collections.py ............ Collection recovery
│  ├─ rename_collections.py ............. Collection renaming
│  └─ reset_sqlite_index.py ............. Index reset utility
│
├─ 🧪 TEST SCRIPTS
│  ├─ test_llm_audit_trail.py ........... Audit trail testing
│  └─ test_rmse_aggregation.py .......... RMSE metric testing
│
├─ 📦 ARCHIVED (Moved from Main Directory)
│  └─ archived_scripts/
│     ├─ api.py ......................... [UNUSED] FastAPI implementation
│     ├─ audit_collection_names.py ...... [UNUSED] SQLite audit tool
│     ├─ cleanup_chroma.py .............. [UNUSED] Cleanup utility
│     ├─ create_architecture_diagram.py . [UNUSED] Diagram generator
│     ├─ create_ppt_presentation.py ..... [UNUSED] PPT generator
│     ├─ create_trace_flow_diagrams.py .. [UNUSED] Flow diagram generator
│     ├─ example.py ..................... [UNUSED] Example usage
│     └─ README.md ...................... Archive documentation
│
├─ ⚙️ CONFIGURATION & DEPLOYMENT
│  ├─ .env .............................. Environment variables (local)
│  ├─ .env.example ...................... Example environment
│  ├─ requirements.txt .................. Python dependencies
│  ├─ docker-compose.yml ................ Docker orchestration
│  ├─ Dockerfile ........................ Container definition
│  └─ Procfile .......................... Heroku/deployment manifest
│
├─ 💾 DATA & STORAGE
│  ├─ chroma_db/ ........................ ChromaDB vector storage
│  ├─ data_cache/ ....................... Cached datasets
│  ├─ venv/ ............................. Python virtual environment
│  └─ __pycache__/ ...................... Python bytecode cache
│
├─ 📚 DOCUMENTATION
│  ├─ CODE_REVIEW_REPORT.md ............. [NEW] Comprehensive code review
│  ├─ README.md ......................... Project documentation
│  └─ docs/ ............................. Additional documentation
│
└─ 📊 GENERATED OUTPUT
   ├─ RAG_Architecture_Diagram.png ...... System architecture
   ├─ RAG_Data_Flow_Diagram.png ......... Data flow visualization
   ├─ RAG_Capstone_Project_Presentation.pptx ... Presentation slides
   └─ Sentence_Mapping_Example.png ...... Example output
```

---

## 🎯 Quick Reference by Task

### Running the Application

```bash
python run.py                    # Quick start launcher
streamlit run streamlit_app.py   # Direct web interface
```

### Core System Modules

- **Data Pipeline**: `dataset_loader.py` → `vector_store.py` → `embedding_models.py`
- **Query Pipeline**: `llm_client.py` → `trace_evaluator.py`
- **Orchestration**: `evaluation_pipeline.py` (coordinates everything)

### Database Maintenance

- **Corruption detected?** Run the recovery scripts:
  - `recover_chroma_advanced.py` (recommended first)
  - `rebuild_chroma_index.py` (full rebuild)
  - `recover_collections.py` (collection-specific)

### Development & Testing

- **Test evaluation**: `test_llm_audit_trail.py`
- **Test metrics**: `test_rmse_aggregation.py`
- **Example usage**: See `archived_scripts/example.py`

---

## 📊 File Statistics

| Category | Count | Status |
|----------|-------|--------|
| Core Production | 11 | ✅ Active |
| Recovery/Utilities | 6 | ✅ In Use |
| Test Scripts | 2 | ✅ In Use |
| Archived | 7 | 📦 Not Used |
| Configuration | 5 | ✅ In Use |
| **Total** | **31** | **Clean** |

---

## 🔄 Why Files Were Moved

### Archived Files (7 total)

No module in the active codebase imports these files:

- **api.py** - Alternative FastAPI backend (not used; the main app is Streamlit)
- **example.py** - Demo script; not part of the production pipeline
- **Diagram/PPT generators** - Documentation tools; run standalone only
- **Audit script** - Development debugging tool; not in the main flow
- **Cleanup script** - Maintenance utility; not in the main flow

### Preserved Files (20 total)

These files are actively imported or run directly as part of normal work:

- **Core modules** - Required by `streamlit_app.py` and `run.py`
- **Recovery tools** - Critical for database maintenance
- **Test scripts** - Part of the quality assurance process

---

## ✅ Code Review Highlights

### Strengths Found

- ✅ Well-structured modular architecture
- ✅ Excellent factory pattern implementation
- ✅ Intelligent rate limiting for API calls
- ✅ Type-safe configuration with Pydantic
- ✅ Clear separation of concerns

### Improvement Recommendations

- ⚠️ Add structured logging (replace `print` statements)
- ⚠️ Improve error handling (too many broad `except` clauses)
- ⚠️ Add comprehensive type hints
- ⚠️ Add input validation
- ⚠️ Add performance monitoring

**See `CODE_REVIEW_REPORT.md` for detailed analysis and recommendations.**

---

## 📝 Notes

- **Recovery Scripts**: These are NOT "unused". They are critical maintenance tools and stay in the main directory.
- **Test Scripts**: These are NOT "unused". They are part of the development workflow.
- **Archive**: `archived_scripts/` is safe to delete if its files are never needed again.
- **Git**: All files remain in git history; no data is lost.

---

**Last Updated**: January 1, 2026
**Status**: ✅ Code cleanup complete and documented
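
### Appendix: Checking for Unimported Files (Sketch)

The archiving criterion described under "Why Files Were Moved" (a file is archived when nothing in the active codebase imports it) could be checked with a small script along these lines. This is only a minimal sketch, not the tool used for the review: the function names `imported_modules` and `find_unimported` are hypothetical, and it assumes a flat layout where all candidate modules are top-level `.py` files in one directory.

```python
from __future__ import annotations

import ast
from pathlib import Path


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y"; relative imports are skipped here.
            names.add(node.module.split(".")[0])
    return names


def find_unimported(project_dir: str, entry_points: set[str]) -> list[str]:
    """List top-level .py files that no other top-level file imports.

    entry_points holds module names that are run directly (hypothetically
    "run" and "streamlit_app"), so they are excluded from the result even
    though nothing imports them.
    """
    files = {p.stem: p for p in Path(project_dir).glob("*.py")}
    used: set[str] = set()
    for stem, path in files.items():
        imports = imported_modules(path.read_text(encoding="utf-8"))
        used |= imports - {stem}  # ignore a file importing itself
    return sorted(
        f"{stem}.py"
        for stem in files
        if stem not in used and stem not in entry_points
    )


if __name__ == "__main__":
    for name in find_unimported(".", entry_points={"run", "streamlit_app"}):
        print(name)
```

Scripts that are run standalone, such as the recovery tools and test scripts, would also show up in this output, so the list is a starting point for manual review rather than a verdict on what is unused.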