RAG Capstone Project - Code Organization Guide

📁 Directory Structure After Code Review

RAG Capstone Project/
│
├─ 🎯 CORE APPLICATION (Active Production Code)
│  ├─ run.py ............................ Quick start launcher
│  ├─ streamlit_app.py .................. Main web interface
│  ├─ config.py ......................... Settings & configuration
│  ├─ vector_store.py ................... ChromaDB vector database
│  ├─ llm_client.py ..................... Groq LLM integration
│  ├─ embedding_models.py ............... Embedding factory pattern
│  ├─ chunking_strategies.py ............ Document chunking strategies
│  ├─ dataset_loader.py ................. RAGBench dataset loader
│  ├─ trace_evaluator.py ................ TRACE metric evaluation
│  ├─ advanced_rag_evaluator.py ......... Advanced metrics (RMSE, AUC)
│  └─ evaluation_pipeline.py ............ Evaluation orchestration
│
├─ 🛠️ UTILITIES & RECOVERY (Maintenance Tools)
│  ├─ rebuild_chroma_index.py ........... Rebuild ChromaDB indices
│  ├─ rebuild_sqlite_direct.py .......... Direct SQLite rebuild
│  ├─ recover_chroma_advanced.py ........ Advanced recovery utility
│  ├─ recover_collections.py ............ Collection recovery
│  ├─ rename_collections.py ............. Collection renaming
│  └─ reset_sqlite_index.py ............. Index reset utility
│
├─ 🧪 TEST SCRIPTS
│  ├─ test_llm_audit_trail.py ........... Audit trail testing
│  └─ test_rmse_aggregation.py .......... RMSE metric testing
│
├─ 📦 ARCHIVED (Moved from Main Directory)
│  └─ archived_scripts/
│     ├─ api.py ......................... [UNUSED] FastAPI implementation
│     ├─ audit_collection_names.py ...... [UNUSED] SQLite audit tool
│     ├─ cleanup_chroma.py .............. [UNUSED] Cleanup utility
│     ├─ create_architecture_diagram.py . [UNUSED] Diagram generator
│     ├─ create_ppt_presentation.py ..... [UNUSED] PPT generator
│     ├─ create_trace_flow_diagrams.py .. [UNUSED] Flow diagram generator
│     ├─ example.py ..................... [UNUSED] Example usage
│     └─ README.md ...................... Archive documentation
│
├─ ⚙️ CONFIGURATION & DEPLOYMENT
│  ├─ .env ............................... Environment variables (local)
│  ├─ .env.example ....................... Example environment
│  ├─ requirements.txt ................... Python dependencies
│  ├─ docker-compose.yml ................. Docker orchestration
│  ├─ Dockerfile ......................... Container definition
│  └─ Procfile ........................... Heroku/deployment manifest
│
├─ 💾 DATA & STORAGE
│  ├─ chroma_db/ ......................... ChromaDB vector storage
│  ├─ data_cache/ ........................ Cached datasets
│  ├─ venv/ ............................. Python virtual environment
│  └─ __pycache__/ ....................... Python bytecode cache
│
├─ 📚 DOCUMENTATION
│  ├─ CODE_REVIEW_REPORT.md ............. [NEW] Comprehensive code review
│  ├─ README.md .......................... Project documentation
│  └─ docs/ ............................. Additional documentation
│
└─ 📊 GENERATED OUTPUT
   ├─ RAG_Architecture_Diagram.png ....... System architecture
   ├─ RAG_Data_Flow_Diagram.png ......... Data flow visualization
   ├─ RAG_Capstone_Project_Presentation.pptx ... Presentation slides
   └─ Sentence_Mapping_Example.png ....... Example output

🎯 Quick Reference by Task

Running the Application

python run.py              # Quick start launcher
streamlit run streamlit_app.py  # Direct web interface
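
The contents of run.py are not shown in this guide. As a rough illustration only, a quick start launcher of this kind could wrap the `streamlit run` command as sketched below. The function names `build_streamlit_command` and `launch` and the optional `port` parameter are assumptions made for this example, not the project's actual code.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def build_streamlit_command(app_path: str = "streamlit_app.py",
                            port: Optional[int] = None) -> List[str]:
    """Build the argument list that starts the Streamlit app.

    Running `python -m streamlit` through sys.executable keeps the app in
    the same interpreter (and virtual environment) as the launcher.
    """
    command = [sys.executable, "-m", "streamlit", "run", app_path]
    if port is not None:
        # --server.port is Streamlit's option for choosing the listening port.
        command += ["--server.port", str(port)]
    return command


def launch(app_path: str = "streamlit_app.py",
           port: Optional[int] = None) -> int:
    """Start the app and return its exit code (1 if the file is missing)."""
    if not Path(app_path).is_file():
        print(f"Cannot find {app_path}; run this from the project root.",
              file=sys.stderr)
        return 1
    return subprocess.call(build_streamlit_command(app_path, port))
```

Calling `launch()` from the project root would then behave like the second command above, with a clearer error message when the script is started from the wrong directory.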

Core System Modules

  • Data Pipeline: dataset_loader.py → vector_store.py → embedding_models.py
  • Query Pipeline: llm_client.py → trace_evaluator.py
  • Orchestration: evaluation_pipeline.py (coordinates everything)

Database Maintenance

  • Corruption detected? Run recovery scripts:
    • recover_chroma_advanced.py (recommended first)
    • rebuild_chroma_index.py (full rebuild)
    • recover_collections.py (collection-specific)
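
The order of the list above could be automated with a small wrapper. The sketch below tries each script in turn and stops at the first one that succeeds. It assumes that each recovery script exits with code 0 on success and a non-zero code on failure, which this guide does not confirm. The names `RECOVERY_SCRIPTS`, `run_script`, and `recover` are hypothetical.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

# Order taken from the list above: advanced recovery first, then a full
# rebuild, then collection-specific recovery.
RECOVERY_SCRIPTS = (
    "recover_chroma_advanced.py",
    "rebuild_chroma_index.py",
    "recover_collections.py",
)


def run_script(script: str) -> int:
    """Run one script with the current interpreter and return its exit code."""
    return subprocess.call([sys.executable, script])


def recover(scripts: Sequence[str] = RECOVERY_SCRIPTS,
            runner: Callable[[str], int] = run_script) -> Optional[str]:
    """Try each script in order.

    Returns the name of the first script that exits with 0 (assumed to mean
    success), or None if every script fails.
    """
    for script in scripts:
        if runner(script) == 0:
            return script
    return None
```

Passing the runner in as a parameter makes it possible to dry-run the escalation logic with a fake runner before pointing it at a real database.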

Development & Testing

  • Test evaluation: test_llm_audit_trail.py
  • Test metrics: test_rmse_aggregation.py
  • Example usage: See archived_scripts/example.py

📊 File Statistics

| Category | Count | Status |
|---|---|---|
| Core Production | 11 | ✅ Active |
| Recovery/Utilities | 6 | ✅ In Use |
| Test Scripts | 2 | ✅ In Use |
| Archived | 7 | 📦 Not Used |
| Configuration | 5 | ✅ In Use |
| Total | 31 | Clean |

🔄 Why Files Were Moved

Archived Files (7 total)

No module in the active codebase imports these files:

  • api.py - Alternative FastAPI backend (not used; main app is Streamlit)
  • example.py - Demo script; not part of production pipeline
  • Diagram/PPT generators - Documentation tools; run standalone only
  • Audit script - Development debugging tool; not in main flow
  • Cleanup script - Maintenance utility; not in main flow
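
A check of this kind can be reproduced with Python's standard `ast` module. The sketch below is one possible approach, not necessarily the one used in the code review. It collects the top-level modules imported by the active files and reports which candidate scripts none of them import. The helper names `imported_modules` and `find_unimported` are hypothetical, and dynamic imports (for example through `importlib`) are not detected.

```python
import ast
from pathlib import Path
from typing import Iterable, List, Set


def imported_modules(source: str) -> Set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b" counts as an import of "a".
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                # "from config import settings" counts as an import of "config".
                names.add(node.module.split(".")[0])
            else:
                # "from . import api" style: the imported names are modules.
                names.update(alias.name for alias in node.names)
    return names


def find_unimported(candidates: Iterable[str],
                    active_files: Iterable[str]) -> List[str]:
    """Return the candidate scripts that none of the active files import."""
    used: Set[str] = set()
    for path in active_files:
        used |= imported_modules(Path(path).read_text(encoding="utf-8"))
    return sorted(str(c) for c in candidates if Path(c).stem not in used)
```

For example, `find_unimported(["api.py", "example.py"], ["streamlit_app.py", "run.py"])` would list whichever of the two candidates is never imported by the two entry points. A complete check would pass every active module, not only the entry points.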

Preserved Files (19 total)

These files are imported by the application or run as part of active workflows:

  • Core modules - Required by streamlit_app.py and run.py
  • Recovery tools - Critical for database maintenance
  • Test scripts - Part of quality assurance process

✅ Code Review Highlights

Strengths Found

✅ Well-structured modular architecture
✅ Excellent factory pattern implementation
✅ Intelligent rate limiting for API
✅ Type-safe configuration with Pydantic
✅ Clear separation of concerns

Improvement Recommendations

⚠️ Add structured logging (replace print statements)
⚠️ Improve error handling (too many broad exceptions)
⚠️ Add comprehensive type hints
⚠️ Add input validation
⚠️ Add performance monitoring

See CODE_REVIEW_REPORT.md for detailed analysis and recommendations
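
To make the first two recommendations concrete, the sketch below replaces print statements with the standard `logging` module and catches only the exception that is actually expected. It is a minimal illustration under stated assumptions: the logger name `rag_capstone` and the `parse_llm_response` helper are hypothetical and do not come from the project's code.

```python
import json
import logging

# Hypothetical logger name; the real modules may use a different one.
logger = logging.getLogger("rag_capstone")


def configure_logging(level: int = logging.INFO) -> None:
    """Emit timestamped records with a level instead of bare print() output."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def parse_llm_response(raw: str) -> dict:
    """Parse a JSON object from an LLM response, handling only expected errors."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Narrow handler: only malformed JSON lands here. Unrelated bugs
        # still raise and show a traceback instead of being swallowed.
        logger.warning("Could not parse LLM response: %s", exc)
        return {}
    if not isinstance(payload, dict):
        # Basic input validation: the caller expects a JSON object.
        logger.warning("Expected a JSON object, got %s", type(payload).__name__)
        return {}
    return payload
```

Compared with a bare `except Exception:` around the whole function, this keeps programming errors visible while still tolerating malformed model output.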


📝 Notes

  • Recovery Scripts: These are NOT "unused" - they're critical maintenance tools kept in the main directory
  • Test Scripts: These are NOT "unused" - they're part of the development workflow
  • Archive: It is safe to delete archived_scripts/ if the files are never needed again
  • Git: All files remain in git history; no data is lost

Last Updated: January 1, 2026
Status: ✅ Code cleanup complete and documented