# RAG Capstone Project - Code Organization Guide
## Directory Structure After Code Review
```
RAG Capstone Project/
│
├── CORE APPLICATION (Active Production Code)
│   ├── run.py ............................ Quick start launcher
│   ├── streamlit_app.py .................. Main web interface
│   ├── config.py ......................... Settings & configuration
│   ├── vector_store.py ................... ChromaDB vector database
│   ├── llm_client.py ..................... Groq LLM integration
│   ├── embedding_models.py ............... Embedding factory pattern
│   ├── chunking_strategies.py ............ Document chunking strategies
│   ├── dataset_loader.py ................. RAGBench dataset loader
│   ├── trace_evaluator.py ................ TRACE metric evaluation
│   ├── advanced_rag_evaluator.py ......... Advanced metrics (RMSE, AUC)
│   └── evaluation_pipeline.py ............ Evaluation orchestration
│
├── UTILITIES & RECOVERY (Maintenance Tools)
│   ├── rebuild_chroma_index.py ........... Rebuild ChromaDB indices
│   ├── rebuild_sqlite_direct.py .......... Direct SQLite rebuild
│   ├── recover_chroma_advanced.py ........ Advanced recovery utility
│   ├── recover_collections.py ............ Collection recovery
│   ├── rename_collections.py ............. Collection renaming
│   └── reset_sqlite_index.py ............. Index reset utility
│
├── TEST SCRIPTS
│   ├── test_llm_audit_trail.py ........... Audit trail testing
│   └── test_rmse_aggregation.py .......... RMSE metric testing
│
├── ARCHIVED (Moved from Main Directory)
│   └── archived_scripts/
│       ├── api.py ......................... [UNUSED] FastAPI implementation
│       ├── audit_collection_names.py ...... [UNUSED] SQLite audit tool
│       ├── cleanup_chroma.py .............. [UNUSED] Cleanup utility
│       ├── create_architecture_diagram.py . [UNUSED] Diagram generator
│       ├── create_ppt_presentation.py ..... [UNUSED] PPT generator
│       ├── create_trace_flow_diagrams.py .. [UNUSED] Flow diagram generator
│       ├── example.py ..................... [UNUSED] Example usage
│       └── README.md ...................... Archive documentation
│
├── CONFIGURATION & DEPLOYMENT
│   ├── .env ............................... Environment variables (local)
│   ├── .env.example ....................... Example environment
│   ├── requirements.txt ................... Python dependencies
│   ├── docker-compose.yml ................. Docker orchestration
│   ├── Dockerfile ......................... Container definition
│   └── Procfile ........................... Heroku/deployment manifest
│
├── DATA & STORAGE
│   ├── chroma_db/ ......................... ChromaDB vector storage
│   ├── data_cache/ ........................ Cached datasets
│   ├── venv/ ............................. Python virtual environment
│   └── __pycache__/ ....................... Python bytecode cache
│
├── DOCUMENTATION
│   ├── CODE_REVIEW_REPORT.md ............. [NEW] Comprehensive code review
│   ├── README.md .......................... Project documentation
│   └── docs/ ............................. Additional documentation
│
└── GENERATED OUTPUT
    ├── RAG_Architecture_Diagram.png ....... System architecture
    ├── RAG_Data_Flow_Diagram.png ......... Data flow visualization
    ├── RAG_Capstone_Project_Presentation.pptx ... Presentation slides
    └── Sentence_Mapping_Example.png ....... Example output
```
---
## Quick Reference by Task
### Running the Application
```bash
python run.py # Quick start launcher
streamlit run streamlit_app.py # Direct web interface
```
### Core System Modules
- **Data Pipeline**: `dataset_loader.py` → `vector_store.py` → `embedding_models.py`
- **Query Pipeline**: `llm_client.py` → `trace_evaluator.py`
- **Orchestration**: `evaluation_pipeline.py` (coordinates everything)
### Database Maintenance
- **Corruption detected?** Run the recovery scripts:
  - `recover_chroma_advanced.py` (recommended first)
  - `rebuild_chroma_index.py` (full rebuild)
  - `recover_collections.py` (collection-specific)
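
The list above can be read as an escalation order. The following is a minimal sketch of a wrapper that tries each script in turn and stops at the first one that succeeds. Only the script names come from this guide; the `recover` and `run_script` helpers, the order after the first entry, and the rule that exit code 0 means success are assumptions made for illustration.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

# Script names are taken from the list above. Treating the list as an
# escalation order, and exit code 0 as "recovered", are assumptions.
RECOVERY_SCRIPTS = (
    "recover_chroma_advanced.py",
    "rebuild_chroma_index.py",
    "recover_collections.py",
)


def run_script(script: str) -> int:
    """Run one script with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, script]).returncode


def recover(
    scripts: Sequence[str] = RECOVERY_SCRIPTS,
    runner: Callable[[str], int] = run_script,
) -> Optional[str]:
    """Try each script in order; return the first that exits with 0, else None."""
    for script in scripts:
        if runner(script) == 0:
            return script
    return None
```

Calling `recover()` from the project root would run the real scripts; passing a custom `runner` makes the wrapper easy to dry-run without touching `chroma_db/`.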
### Development & Testing
- **Test evaluation**: `test_llm_audit_trail.py`
- **Test metrics**: `test_rmse_aggregation.py`
- **Example usage**: See `archived_scripts/example.py`
---
## File Statistics
| Category | Count | Status |
|----------|-------|--------|
| Core Production | 11 | ✅ Active |
| Recovery/Utilities | 6 | ✅ In Use |
| Test Scripts | 2 | ✅ In Use |
| Archived | 7 | 📦 Not Used |
| Configuration | 5 | ✅ In Use |
| **Total** | **31** | **Clean** |
---
## Why Files Were Moved
### Archived Files (7 total)
These files are NOT imported anywhere in the active codebase:
- **api.py** - Alternative FastAPI backend (not used; main app is Streamlit)
- **example.py** - Demo script; not part of production pipeline
- **Diagram/PPT generators** - Documentation tools; run standalone only
- **Audit script** - Development debugging tool; not in main flow
- **Cleanup script** - Maintenance utility; not in main flow
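
One way to check a "not imported anywhere" claim is to parse every top-level module and collect its imports. The sketch below does this with the standard-library `ast` module; the function names `imported_modules` and `unused_modules` are hypothetical and are not part of the project.

```python
import ast
from pathlib import Path
from typing import List, Set


def imported_modules(source: str) -> Set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y"; relative imports are skipped.
            names.add(node.module.split(".")[0])
    return names


def unused_modules(project_dir: str) -> List[str]:
    """List top-level .py files in project_dir that no sibling file imports."""
    files = sorted(Path(project_dir).glob("*.py"))
    imported: Set[str] = set()
    for path in files:
        imported |= imported_modules(path.read_text(encoding="utf-8"))
    return [path.name for path in files if path.stem not in imported]
```

Entry points such as `run.py` and standalone tools also show up in the result, because nothing imports them, so the output is a list of candidates to review rather than a list of files to archive.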
### Preserved Files (19 total)
These files ARE in active use (imported by the application or run directly):
- **Core modules** - Required by streamlit_app.py and run.py
- **Recovery tools** - Critical for database maintenance
- **Test scripts** - Part of quality assurance process
---
## Code Review Highlights
### Strengths Found
- ✅ Well-structured modular architecture
- ✅ Excellent factory pattern implementation
- ✅ Intelligent rate limiting for API
- ✅ Type-safe configuration with Pydantic
- ✅ Clear separation of concerns
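
For readers unfamiliar with the factory pattern mentioned above, here is a minimal registry-based sketch. It is not the code in `embedding_models.py`: `HashEmbedding`, `register`, and `create_embedding_model` are hypothetical names, and the toy model only stands in for a real embedding backend.

```python
from typing import Callable, Dict, List


class HashEmbedding:
    """Toy stand-in for a real embedding model (hypothetical)."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def embed(self, text: str) -> List[float]:
        # Deterministic toy vector: add each character code to a slot
        # chosen by the character's position modulo the dimension.
        vec = [0.0] * self.dim
        for i, ch in enumerate(text):
            vec[i % self.dim] += ord(ch)
        return vec


# Maps a model name to a callable that builds the model.
_REGISTRY: Dict[str, Callable[..., object]] = {}


def register(name: str, builder: Callable[..., object]) -> None:
    """Make a model constructor available under a short name."""
    _REGISTRY[name] = builder


def create_embedding_model(name: str, **kwargs):
    """Build the model registered under name, or raise ValueError if unknown."""
    try:
        builder = _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"Unknown embedding model {name!r}; known: {known}") from None
    return builder(**kwargs)


register("hash", HashEmbedding)
```

Callers ask for a model by name, for example `create_embedding_model("hash", dim=8)`, so adding a new backend only requires one more `register` call and no changes at the call sites.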
### Improvement Recommendations
- ⚠️ Add structured logging (replace print statements)
- ⚠️ Improve error handling (too many broad exceptions)
- ⚠️ Add comprehensive type hints
- ⚠️ Add input validation
- ⚠️ Add performance monitoring
**See CODE_REVIEW_REPORT.md for detailed analysis and recommendations**
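
As a rough illustration of the first four recommendations, the sketch below replaces `print` with the standard `logging` module, catches one specific exception instead of a broad one, validates its input, and carries type hints. The `parse_metrics` function and the `rag_capstone` logger name are hypothetical examples, not code from the project.

```python
import json
import logging
from typing import Any, Dict

# The logger name is an assumption; any project-wide name would work.
logger = logging.getLogger("rag_capstone")


def parse_metrics(raw: str) -> Dict[str, Any]:
    """Parse a JSON metrics payload, logging failures instead of printing."""
    try:
        metrics = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Catch only the error we expect, not a bare "except Exception".
        logger.error("Could not parse metrics payload: %s", exc)
        return {}
    if not isinstance(metrics, dict):
        # Input validation: the payload must be a JSON object.
        logger.warning("Expected a JSON object, got %s", type(metrics).__name__)
        return {}
    logger.info("Parsed %d metric(s)", len(metrics))
    return metrics
```

Calling `logging.basicConfig(level=logging.INFO)` once at startup (for example in `run.py`) would send these messages to the console with levels attached.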
---
## Notes
- **Recovery Scripts**: These are NOT "unused"; they are critical maintenance tools kept in the main directory
- **Test Scripts**: These are NOT "unused"; they are part of the development workflow
- **Archive**: Safe to delete archived_scripts/ if files are never needed again
- **Git**: All files remain in git history; no data is lost
---
**Last Updated**: January 1, 2026
**Status**: ✅ Code cleanup complete and documented