# RAG Capstone Project - Code Organization Guide
## 📁 Directory Structure After Code Review
```
RAG Capstone Project/
│
├─ 🎯 CORE APPLICATION (Active Production Code)
│  ├─ run.py ............................ Quick start launcher
│  ├─ streamlit_app.py .................. Main web interface
│  ├─ config.py ......................... Settings & configuration
│  ├─ vector_store.py ................... ChromaDB vector database
│  ├─ llm_client.py ..................... Groq LLM integration
│  ├─ embedding_models.py ............... Embedding factory pattern
│  ├─ chunking_strategies.py ............ Document chunking strategies
│  ├─ dataset_loader.py ................. RAGBench dataset loader
│  ├─ trace_evaluator.py ................ TRACE metric evaluation
│  ├─ advanced_rag_evaluator.py ......... Advanced metrics (RMSE, AUC)
│  └─ evaluation_pipeline.py ............ Evaluation orchestration
│
├─ 🛠️ UTILITIES & RECOVERY (Maintenance Tools)
│  ├─ rebuild_chroma_index.py ........... Rebuild ChromaDB indices
│  ├─ rebuild_sqlite_direct.py .......... Direct SQLite rebuild
│  ├─ recover_chroma_advanced.py ........ Advanced recovery utility
│  ├─ recover_collections.py ............ Collection recovery
│  ├─ rename_collections.py ............. Collection renaming
│  └─ reset_sqlite_index.py ............. Index reset utility
│
├─ 🧪 TEST SCRIPTS
│  ├─ test_llm_audit_trail.py ........... Audit trail testing
│  └─ test_rmse_aggregation.py .......... RMSE metric testing
│
├─ 📦 ARCHIVED (Moved from Main Directory)
│  └─ archived_scripts/
│     ├─ api.py ......................... [UNUSED] FastAPI implementation
│     ├─ audit_collection_names.py ...... [UNUSED] SQLite audit tool
│     ├─ cleanup_chroma.py .............. [UNUSED] Cleanup utility
│     ├─ create_architecture_diagram.py . [UNUSED] Diagram generator
│     ├─ create_ppt_presentation.py ..... [UNUSED] PPT generator
│     ├─ create_trace_flow_diagrams.py .. [UNUSED] Flow diagram generator
│     ├─ example.py ..................... [UNUSED] Example usage
│     └─ README.md ...................... Archive documentation
│
├─ ⚙️ CONFIGURATION & DEPLOYMENT
│  ├─ .env ............................... Environment variables (local)
│  ├─ .env.example ....................... Example environment
│  ├─ requirements.txt ................... Python dependencies
│  ├─ docker-compose.yml ................. Docker orchestration
│  ├─ Dockerfile ......................... Container definition
│  └─ Procfile ........................... Heroku/deployment manifest
│
├─ 💾 DATA & STORAGE
│  ├─ chroma_db/ ......................... ChromaDB vector storage
│  ├─ data_cache/ ........................ Cached datasets
│  ├─ venv/ ............................. Python virtual environment
│  └─ __pycache__/ ....................... Python bytecode cache
│
├─ 📚 DOCUMENTATION
│  ├─ CODE_REVIEW_REPORT.md ............. [NEW] Comprehensive code review
│  ├─ README.md .......................... Project documentation
│  └─ docs/ ............................. Additional documentation
│
└─ 📊 GENERATED OUTPUT
   ├─ RAG_Architecture_Diagram.png ....... System architecture
   ├─ RAG_Data_Flow_Diagram.png ......... Data flow visualization
   ├─ RAG_Capstone_Project_Presentation.pptx ... Presentation slides
   └─ Sentence_Mapping_Example.png ....... Example output
```
---
## 🎯 Quick Reference by Task
### Running the Application
```bash
python run.py # Quick start launcher
streamlit run streamlit_app.py # Direct web interface
```
### Core System Modules
- **Data Pipeline**: `dataset_loader.py` → `vector_store.py` → `embedding_models.py`
- **Query Pipeline**: `llm_client.py` → `trace_evaluator.py`
- **Orchestration**: `evaluation_pipeline.py` (coordinates everything)
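The hand-offs between these modules can be illustrated with a toy sketch. Every function and class below is a hypothetical stand-in named after the module it imitates, not the project's actual API: the real code uses embedding models, ChromaDB, and the Groq client, whereas this sketch uses a bag-of-words vector, an in-memory list, and an echo response so that it runs with the standard library alone.

```python
import math
from collections import Counter


def load_documents() -> list[str]:
    # Stand-in for dataset_loader.py (hypothetical; the real loader reads RAGBench).
    return ["ChromaDB stores document vectors.",
            "Streamlit renders the web interface."]


def embed(text: str) -> Counter:
    # Stand-in for embedding_models.py: a bag-of-words count vector.
    return Counter(text.lower().replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class InMemoryStore:
    # Stand-in for vector_store.py: keeps (vector, text) pairs in a list.
    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def answer(query: str, context: list[str]) -> str:
    # Stand-in for llm_client.py: echoes the top context instead of calling an LLM.
    return f"Based on: {context[0]}"


def evaluate(response: str, context: list[str]) -> float:
    # Stand-in for trace_evaluator.py: share of response tokens found in the
    # context. This is NOT the TRACE metric, only a placeholder score.
    resp, ctx = embed(response), embed(" ".join(context))
    return sum(1 for t in resp if t in ctx) / len(resp) if resp else 0.0


def run_pipeline(query: str) -> tuple[str, float]:
    # Stand-in for evaluation_pipeline.py: index, retrieve, generate, score.
    store = InMemoryStore()
    for doc in load_documents():
        store.add(doc)
    context = store.search(query)
    response = answer(query, context)
    return response, evaluate(response, context)
```

Calling `run_pipeline("web interface")` walks the data pipeline (load, embed, store) and then the query pipeline (retrieve, generate, score) in the same order the module list describes.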
### Database Maintenance
- **Corruption detected?** Run recovery scripts:
  - `recover_chroma_advanced.py` (recommended first)
  - `rebuild_chroma_index.py` (full rebuild)
  - `recover_collections.py` (collection-specific)
### Development & Testing
- **Test evaluation**: `test_llm_audit_trail.py`
- **Test metrics**: `test_rmse_aggregation.py`
- **Example usage**: See `archived_scripts/example.py`
---
## 📊 File Statistics
| Category | Count | Status |
|----------|-------|--------|
| Core Production | 11 | ✅ Active |
| Recovery/Utilities | 6 | ✅ In Use |
| Test Scripts | 2 | ✅ In Use |
| Archived | 7 | 📦 Not Used |
| Configuration | 5 | ✅ In Use |
| **Total** | **31** | **Clean** |
---
## 🔄 Why Files Were Moved
### Archived Files (7 total)
None of these files is imported anywhere in the active codebase:
- **api.py** - Alternative FastAPI backend (not used; main app is Streamlit)
- **example.py** - Demo script; not part of production pipeline
- **Diagram/PPT generators** - Documentation tools; run standalone only
- **Audit script** - Development debugging tool; not in main flow
- **Cleanup script** - Maintenance utility; not in main flow
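A claim like "not imported by the active codebase" can be checked mechanically. The following is a minimal sketch using the standard-library `ast` module. The helper names `imported_modules` and `find_importers` are hypothetical, and the sketch assumes a flat project layout with absolute imports; relative imports and dynamic imports (for example via `importlib`) are not detected.

```python
import ast
from pathlib import Path


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y"; relative imports are skipped.
            names.add(node.module.split(".")[0])
    return names


def find_importers(root: Path, module: str,
                   skip_dirs=("archived_scripts", "venv", "__pycache__")) -> list[str]:
    """List .py files under root (outside skip_dirs) that import `module`."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        relative = path.relative_to(root)
        if any(part in skip_dirs for part in relative.parts):
            continue
        if path.stem == module:
            continue  # a module does not count as importing itself
        try:
            source = path.read_text(encoding="utf-8")
            found = imported_modules(source)
        except (SyntaxError, UnicodeDecodeError, OSError):
            continue  # unreadable or unparsable files are ignored
        if module in found:
            hits.append(str(relative))
    return hits
```

Run from the project root, `find_importers(Path("."), "api")` returning an empty list would be consistent with `api.py` being safe to archive.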
### Preserved Files (19 total)
These files are actively used, either imported by the application or run directly:
- **Core modules** - Required by streamlit_app.py and run.py
- **Recovery tools** - Critical for database maintenance
- **Test scripts** - Part of quality assurance process
---
## ✅ Code Review Highlights
### Strengths Found
- ✅ Well-structured modular architecture
- ✅ Excellent factory pattern implementation
- ✅ Intelligent rate limiting for API
- ✅ Type-safe configuration with Pydantic
- ✅ Clear separation of concerns
### Improvement Recommendations
- ⚠️ Add structured logging (replace print statements)
- ⚠️ Improve error handling (too many broad exceptions)
- ⚠️ Add comprehensive type hints
- ⚠️ Add input validation
- ⚠️ Add performance monitoring
**See CODE_REVIEW_REPORT.md for detailed analysis and recommendations**
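As an illustration of the logging and error-handling recommendations, here is a minimal sketch using only the standard library. The logger name `rag_capstone`, the `configure_logging` and `load_settings` helpers, and the JSON settings file are all hypothetical and not taken from the project, which configures itself through Pydantic and `.env`.

```python
import json
import logging

logger = logging.getLogger("rag_capstone")  # hypothetical logger name


def configure_logging(level: int = logging.INFO) -> None:
    """Send timestamped, levelled records to stderr instead of using print()."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def load_settings(path: str) -> dict:
    """Load a JSON settings file, handling each failure mode explicitly."""
    try:
        with open(path, encoding="utf-8") as fh:
            settings = json.load(fh)
    except FileNotFoundError:
        # Expected case: fall back to defaults and record why.
        logger.warning("Settings file %s not found; using defaults", path)
        return {}
    except json.JSONDecodeError as exc:
        # Unexpected case: log with context, then let the caller decide.
        logger.error("Settings file %s is not valid JSON: %s", path, exc)
        raise
    logger.info("Loaded %d settings from %s", len(settings), path)
    return settings
```

Catching `FileNotFoundError` and `json.JSONDecodeError` separately, rather than a bare `except Exception`, keeps unrelated bugs from being silently swallowed.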
---
## 📝 Notes
- **Recovery Scripts**: These are NOT "unused"; they are critical maintenance tools and stay in the main directory
- **Test Scripts**: These are NOT "unused"; they are part of the development workflow
- **Archive**: `archived_scripts/` is safe to delete if its files will never be needed again
- **Git**: All files remain in git history; no data is lost
---
**Last Updated**: January 1, 2026
**Status**: ✅ Code cleanup complete and documented