RAG Capstone Project - Code Organization Guide
Directory Structure After Code Review
RAG Capstone Project/
│
├── CORE APPLICATION (Active Production Code)
│   ├── run.py ............................ Quick start launcher
│   ├── streamlit_app.py .................. Main web interface
│   ├── config.py ......................... Settings & configuration
│   ├── vector_store.py ................... ChromaDB vector database
│   ├── llm_client.py ..................... Groq LLM integration
│   ├── embedding_models.py ............... Embedding factory pattern
│   ├── chunking_strategies.py ............ Document chunking strategies
│   ├── dataset_loader.py ................. RAGBench dataset loader
│   ├── trace_evaluator.py ................ TRACE metric evaluation
│   ├── advanced_rag_evaluator.py ......... Advanced metrics (RMSE, AUC)
│   └── evaluation_pipeline.py ............ Evaluation orchestration
│
├── UTILITIES & RECOVERY (Maintenance Tools)
│   ├── rebuild_chroma_index.py ........... Rebuild ChromaDB indices
│   ├── rebuild_sqlite_direct.py .......... Direct SQLite rebuild
│   ├── recover_chroma_advanced.py ........ Advanced recovery utility
│   ├── recover_collections.py ............ Collection recovery
│   ├── rename_collections.py ............. Collection renaming
│   └── reset_sqlite_index.py ............. Index reset utility
│
├── TEST SCRIPTS
│   ├── test_llm_audit_trail.py ........... Audit trail testing
│   └── test_rmse_aggregation.py .......... RMSE metric testing
│
├── ARCHIVED (Moved from Main Directory)
│   └── archived_scripts/
│       ├── api.py ......................... [UNUSED] FastAPI implementation
│       ├── audit_collection_names.py ...... [UNUSED] SQLite audit tool
│       ├── cleanup_chroma.py .............. [UNUSED] Cleanup utility
│       ├── create_architecture_diagram.py . [UNUSED] Diagram generator
│       ├── create_ppt_presentation.py ..... [UNUSED] PPT generator
│       ├── create_trace_flow_diagrams.py .. [UNUSED] Flow diagram generator
│       ├── example.py ..................... [UNUSED] Example usage
│       └── README.md ...................... Archive documentation
│
├── CONFIGURATION & DEPLOYMENT
│   ├── .env ............................... Environment variables (local)
│   ├── .env.example ....................... Example environment
│   ├── requirements.txt ................... Python dependencies
│   ├── docker-compose.yml ................. Docker orchestration
│   ├── Dockerfile ......................... Container definition
│   └── Procfile ........................... Heroku/deployment manifest
│
├── DATA & STORAGE
│   ├── chroma_db/ ......................... ChromaDB vector storage
│   ├── data_cache/ ........................ Cached datasets
│   ├── venv/ ............................. Python virtual environment
│   └── __pycache__/ ....................... Python bytecode cache
│
├── DOCUMENTATION
│   ├── CODE_REVIEW_REPORT.md ............. [NEW] Comprehensive code review
│   ├── README.md .......................... Project documentation
│   └── docs/ ............................. Additional documentation
│
└── GENERATED OUTPUT
    ├── RAG_Architecture_Diagram.png ....... System architecture
    ├── RAG_Data_Flow_Diagram.png ......... Data flow visualization
    ├── RAG_Capstone_Project_Presentation.pptx ... Presentation slides
    └── Sentence_Mapping_Example.png ....... Example output
Quick Reference by Task
Running the Application
python run.py # Quick start launcher
streamlit run streamlit_app.py # Direct web interface
Core System Modules
- Data Pipeline: dataset_loader.py → vector_store.py → embedding_models.py
- Query Pipeline: llm_client.py → trace_evaluator.py
- Orchestration: evaluation_pipeline.py (coordinates everything)
Database Maintenance
- Corruption detected? Run recovery scripts:
  - recover_chroma_advanced.py (recommended first)
  - rebuild_chroma_index.py (full rebuild)
  - recover_collections.py (collection-specific)
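The recovery scripts can be tried one after another from a small driver. The following is a minimal sketch, not part of the project: it assumes the listed order is the intended escalation order, that each script runs without arguments, and that exit code 0 means success. `run_script` and `run_recovery` are hypothetical helpers.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

# Order as listed in this guide (assumed to be the escalation order).
RECOVERY_SCRIPTS = (
    "recover_chroma_advanced.py",
    "rebuild_chroma_index.py",
    "recover_collections.py",
)


def run_script(script: str) -> int:
    """Run one script with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, script], check=False).returncode


def run_recovery(
    scripts: Sequence[str] = RECOVERY_SCRIPTS,
    runner: Callable[[str], int] = run_script,
) -> Optional[str]:
    """Try each script in order; return the first one that exits with 0, else None."""
    for script in scripts:
        if runner(script) == 0:  # assumption: 0 signals a successful recovery
            return script
    return None
```

Calling `run_recovery()` from the project root returns the name of the first script that succeeded, or `None` if all of them failed. The `runner` parameter exists so the loop can be exercised without touching a real database.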
Development & Testing
- Test evaluation: test_llm_audit_trail.py
- Test metrics: test_rmse_aggregation.py
- Example usage: See archived_scripts/example.py
File Statistics
| Category | Count | Status |
|---|---|---|
| Core Production | 11 | Active |
| Recovery/Utilities | 6 | In Use |
| Test Scripts | 2 | In Use |
| Archived | 7 | Not Used |
| Configuration | 5 | In Use |
| Total | 31 | Clean |
Why Files Were Moved
Archived Files (7 total)
These files are NOT imported anywhere in the active codebase:
- api.py - Alternative FastAPI backend (not used; main app is Streamlit)
- example.py - Demo script; not part of production pipeline
- Diagram/PPT generators - Documentation tools; run standalone only
- Audit script - Development debugging tool; not in main flow
- Cleanup script - Maintenance utility; not in main flow
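One way to check that a file is not imported is to parse the active modules and collect the top-level module names they import. The following is a minimal sketch using only the standard library; `imported_modules` and `unreferenced` are hypothetical helpers, not project code, and the check only sees static `import` and `from ... import` statements, so dynamic imports would be missed.

```python
import ast
from pathlib import Path
from typing import Iterable, Set


def imported_modules(source: str) -> Set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y"; relative imports are skipped.
            names.add(node.module.split(".")[0])
    return names


def unreferenced(candidates: Iterable[str], active_files: Iterable[Path]) -> Set[str]:
    """Return the candidate module names that none of the active files import."""
    used: Set[str] = set()
    for path in active_files:
        used |= imported_modules(path.read_text(encoding="utf-8"))
    return set(candidates) - used
```

For example, `unreferenced(["api", "example"], Path(".").glob("*.py"))` would return both names if no top-level `.py` file imports them.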
Preserved Files (19 total)
These files ARE actively imported or run:
- Core modules - Required by streamlit_app.py and run.py
- Recovery tools - Critical for database maintenance
- Test scripts - Part of quality assurance process
Code Review Highlights
Strengths Found
- Well-structured modular architecture
- Excellent factory pattern implementation
- Intelligent rate limiting for API
- Type-safe configuration with Pydantic
- Clear separation of concerns
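For readers unfamiliar with the factory pattern named among the strengths, here is a generic sketch of the idea. It is not the code in embedding_models.py: the `Embedder` protocol, the two toy embedder classes, and `create_embedder` are all hypothetical names invented for illustration.

```python
from typing import Callable, Dict, List, Protocol


class Embedder(Protocol):
    """Anything with an embed() method that turns text into a vector."""

    def embed(self, text: str) -> List[float]: ...


class HashEmbedder:
    """Toy embedder: sums character codes into a fixed number of buckets."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def embed(self, text: str) -> List[float]:
        vec = [0.0] * self.dim
        for i, ch in enumerate(text):
            vec[i % self.dim] += ord(ch)
        return vec


class LengthEmbedder:
    """Toy embedder: a one-dimensional vector holding the text length."""

    def embed(self, text: str) -> List[float]:
        return [float(len(text))]


# Registry mapping a configuration name to a constructor.
_REGISTRY: Dict[str, Callable[[], Embedder]] = {
    "hash": HashEmbedder,
    "length": LengthEmbedder,
}


def create_embedder(name: str) -> Embedder:
    """Build the embedder registered under `name`, or raise ValueError."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(
            f"Unknown embedder {name!r}; choose from {sorted(_REGISTRY)}"
        ) from None
```

Callers depend only on the name and the `embed` method, so adding a new model means adding one registry entry rather than editing every call site.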
Improvement Recommendations
- Add structured logging (replace print statements)
- Improve error handling (too many broad exceptions)
- Add comprehensive type hints
- Add input validation
- Add performance monitoring
See CODE_REVIEW_REPORT.md for detailed analysis and recommendations
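Several of these recommendations can be illustrated in one small function. This is a generic sketch, not code from the project: `load_json_config` is a hypothetical function and the logger name `rag_capstone` is an assumption. It uses the `logging` module instead of `print`, catches specific exceptions instead of a broad `except Exception`, carries type hints, and validates its input.

```python
import json
import logging
from pathlib import Path
from typing import Any, Dict, Optional

logger = logging.getLogger("rag_capstone")  # hypothetical logger name


def load_json_config(path: Path) -> Optional[Dict[str, Any]]:
    """Load a JSON object from `path`; log the specific failure and return None on error."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        logger.warning("Config file not found: %s", path)
        return None
    except json.JSONDecodeError as exc:
        logger.error("Invalid JSON in %s: %s", path, exc)
        return None
    if not isinstance(data, dict):  # basic input validation
        logger.error("Expected a JSON object in %s, got %s", path, type(data).__name__)
        return None
    logger.info("Loaded %d settings from %s", len(data), path)
    return data
```

Calling `logging.basicConfig(level=logging.INFO)` once at startup is enough to make these messages visible on the console.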
Notes
- Recovery Scripts: These are NOT "unused" - they're critical maintenance tools kept in the main directory
- Test Scripts: These are NOT "unused" - they're part of the development workflow
- Archive: Safe to delete archived_scripts/ if files are never needed again
- Git: All files remain in git history; no data is lost
Last Updated: January 1, 2026
Status: Code cleanup complete and documented