initial commit

This view is limited to 50 files because it contains too many changes. See raw diff.
- MIGRATION_GUIDE.md +0 -81
- NEXT_STEPS.md +0 -174
- diagrams/1-indexing-flow.mmd +0 -28
- diagrams/1-indexing-flow.svg +99 -1
- diagrams/2-query-flow-medium.mmd +0 -25
- diagrams/2-query-flow-medium.svg +106 -1
- diagrams/2-query-flow-simple.mmd +0 -19
- diagrams/2-query-flow-simple.svg +105 -1
- diagrams/2-query-flow.mmd +0 -39
- diagrams/2-query-flow.svg +138 -1
- diagrams/3-web-endpoint-flow.mmd +0 -26
- diagrams/3-web-endpoint-flow.svg +94 -1
- diagrams/4-container-lifecycle.mmd +0 -31
- diagrams/4-container-lifecycle.svg +118 -1
- diagrams/finetuning.svg +101 -166
- docs/NEXT_STEPS.md +181 -0
- QUICK_START.md → docs/QUICK_START.md +0 -0
- docs/QUICK_START_API.md +75 -0
- docs/README.md +58 -0
- README_RAG.md → docs/README_RAG.md +0 -0
- STRUCTURE.md → docs/STRUCTURE.md +0 -0
- TESTING.md → docs/TESTING.md +0 -0
- VLLM_MIGRATION.md → docs/VLLM_MIGRATION.md +0 -0
- docs/api/RAG_API.md +244 -0
- docs/deployment/ADD_GUIDES_TO_RAG.md +0 -146
- docs/guides/HOW_TO_RUN.md +0 -215
- docs/guides/SETUP_SUCCESS.md +0 -63
- docs/guides/SUMMARY.md +0 -114
- docs/guides/modal-rag-optimization.md +0 -370
- docs/guides/modal-rag-sequence.md +0 -168
- docs/guides/next_steps_rag_recommendation.md +0 -77
- {scripts → src}/__init__.py +0 -0
- {docs → src/data}/clean_sample.py +0 -0
- {scripts → src}/data/cleanup_data.py +0 -0
- {scripts → src}/data/clear_census_volume.py +0 -0
- {scripts → src}/data/convert_census_to_csv.py +0 -0
- {scripts → src}/data/convert_economy_labor_to_csv.py +0 -0
- {scripts → src}/data/convert_to_word.py +0 -0
- {scripts → src}/data/create_custom_qa.py +0 -0
- {docs → src/data}/debug_parser.py +0 -0
- {scripts → src}/data/delete_census_csvs.py +0 -0
- {scripts → src}/data/download_census_api.py +0 -0
- {scripts → src}/data/download_census_csv_modal.py +0 -0
- {scripts → src}/data/download_census_data.py +0 -0
- {scripts → src}/data/download_census_modal.py +0 -0
- {scripts → src}/data/download_economy_labor_modal.py +0 -0
- {scripts → src}/data/fix_csv_filenames.py +0 -0
- {scripts → src}/data/prepare_economy_data.py +0 -0
- {scripts → src}/data/prepare_finetune_data.py +0 -0
- {scripts → src}/data/remove_duplicate_csvs.py +0 -0
MIGRATION_GUIDE.md
DELETED

# Repository Restructure Migration Guide

## What Changed

The repository has been reorganized for better structure and maintainability.

## File Moves

### RAG System
- `src/modal-rag.py` → `src/rag/modal-rag.py`
- `src/modal-rag-product-design.py` → `src/rag/modal-rag-product-design.py`

### Web Application
- `web_app.py` → `src/web/web_app.py`
- `query_product_design.py` → `src/web/query_product_design.py`
- `templates/` → `src/web/templates/`
- `static/` → `src/web/static/`

### Scripts
- Data processing scripts → `scripts/data/`
- Setup scripts → `scripts/setup/`
- Utility scripts → `scripts/tools/`

### Documentation
- All `.md` files → `docs/guides/`
- Product design docs → `docs/product-design/`

### Tests
- `test_*.py` → `tests/`

## Updated Commands

### Old Commands (no longer work)
```bash
python web_app.py
modal run src/modal-rag-product-design.py::query_product_design
```

### New Commands
```bash
# Web app
python src/web/web_app.py
# Or use the helper script
./scripts/setup/start_web.sh

# Modal RAG
modal run src/rag/modal-rag-product-design.py::query_product_design --question "your question"

# Indexing
modal run src/rag/modal-rag-product-design.py::index_product_design
```

## Import Path Updates

If you have custom scripts that import from these modules, update the imports:

```python
# Old
from query_product_design import query_rag

# New
import sys
sys.path.insert(0, 'src/web')
from query_product_design import query_rag
```

## Next Steps

1. Update any custom scripts with new import paths
2. Update CI/CD pipelines if applicable
3. Update documentation references
4. Test all functionality

## Rollback

If you need to roll back, all files are still in git history. You can:

```bash
git log --oneline --all -- "old/path/to/file"
git checkout <commit-hash> -- "old/path/to/file"
```
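The move table above can also be captured as a small lookup for updating references in bulk. The helper below is illustrative only, not part of the repository:

```python
# Hypothetical helper mapping old repo paths to their post-restructure
# locations, mirroring the "File Moves" table above.
MOVES = {
    "src/modal-rag.py": "src/rag/modal-rag.py",
    "src/modal-rag-product-design.py": "src/rag/modal-rag-product-design.py",
    "web_app.py": "src/web/web_app.py",
    "query_product_design.py": "src/web/query_product_design.py",
    "templates/": "src/web/templates/",
    "static/": "src/web/static/",
}

def new_path(old: str) -> str:
    """Return the post-restructure location of a file, or the input if unmoved."""
    return MOVES.get(old, old)
```

A script fixing stale references in CI configs or docs could call `new_path` on every path it finds.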
NEXT_STEPS.md
DELETED

# Next Steps

## Current Status

✅ **Completed:**
- Repository restructured and organized
- RAG system configured (Word, PDF, Excel only - no markdown)
- Web interface functional
- Nebius deployment guide created
- Documentation updated

## Immediate Next Steps

### 1. Test the Updated RAG System

**Upload Product Design Documents:**
```bash
# Upload Word document (if you have it)
modal volume put mcp-hack-ins-products \
  docs/product-design/tokyo_auto_insurance_product_design.docx \
  docs/product-design/tokyo_auto_insurance_product_design.docx

# Upload PDF (if you have one)
modal volume put mcp-hack-ins-products \
  docs/product-design/tokyo_auto_insurance_product_design.pdf \
  docs/product-design/tokyo_auto_insurance_product_design.pdf

# Upload Excel (if you have one)
modal volume put mcp-hack-ins-products \
  docs/product-design/tokyo_auto_insurance_product_design.xlsx \
  docs/product-design/tokyo_auto_insurance_product_design.xlsx
```

**Re-index Documents:**
```bash
# Using the CLI
python src/web/query_product_design.py --index

# Or the direct Modal command
modal run src/rag/modal-rag-product-design.py::index_product_design
```

**Test Queries:**
```bash
# Test via CLI
python src/web/query_product_design.py --query "What are the three product tiers?"

# Or start the web interface
python src/web/web_app.py
# Then open http://127.0.0.1:5000 in a browser
```

### 2. Verify File Processing

Check that the system correctly:
- ✅ Loads Word documents
- ✅ Loads PDF documents (if uploaded)
- ✅ Loads Excel files (if uploaded)
- ❌ Ignores markdown files
- ❌ Ignores other file types

### 3. Production Readiness

**Option A: Continue with Modal (current setup)**
- ✅ Already working
- ✅ No changes needed
- Just ensure documents are uploaded and indexed

**Option B: Deploy to Nebius**
- Review: `docs/deployment/NEBIUS_DEPLOYMENT.md`
- Set up a Nebius account
- Deploy the RAG service and web app
- Migrate from Modal to Nebius

## Recommended Path Forward

### Short Term (This Week)
1. **Upload and index documents**
   - Ensure Word/PDF/Excel files are in the Modal volume
   - Run indexing
   - Test queries

2. **Validate RAG quality**
   - Ask various product questions
   - Verify answer quality and accuracy
   - Check source citations

3. **Test the web interface**
   - Start the web app
   - Test from a browser
   - Verify all features work

### Medium Term (Next 2 Weeks)
1. **Optimize RAG performance**
   - Monitor query times
   - Adjust chunk sizes if needed
   - Fine-tune retrieval parameters

2. **Add more documents** (if needed)
   - Upload additional product design files
   - Re-index as needed

3. **User testing**
   - Share with team/stakeholders
   - Gather feedback
   - Iterate on improvements

### Long Term (Next Month)
1. **Deploy to production**
   - Choose: Modal or Nebius
   - Set up monitoring
   - Configure auto-scaling (if needed)

2. **Enhance features**
   - Add authentication (if needed)
   - Add query history
   - Add export functionality
   - Add analytics

3. **Scale and optimize**
   - Monitor costs
   - Optimize for performance
   - Add caching if needed

## Quick Commands Reference

```bash
# Index documents
python src/web/query_product_design.py --index

# Query via CLI
python src/web/query_product_design.py --query "your question"

# Start the web interface
python src/web/web_app.py
# Or use the helper script:
./scripts/setup/start_web.sh

# Check Modal volume contents
modal volume list mcp-hack-ins-products
```

## Decision Points

1. **Deployment Platform:**
   - [ ] Stay with Modal (current)
   - [ ] Migrate to Nebius
   - [ ] Use both (hybrid)

2. **Document Management:**
   - [ ] Keep documents in the Modal volume
   - [ ] Move to object storage (S3, etc.)
   - [ ] Use version control

3. **Access Control:**
   - [ ] Public access (current)
   - [ ] Add authentication
   - [ ] Add role-based access

## Questions to Consider

- Do you have Word/PDF/Excel versions of your product design documents?
- Do you need to convert markdown files to Word/PDF format?
- Are you ready to deploy to production?
- Do you need authentication/access control?
- What's your target user base?

## Getting Help

- **Documentation:** See the `docs/` directory
- **Troubleshooting:** See `docs/guides/TROUBLESHOOTING.md`
- **Deployment:** See `docs/deployment/NEBIUS_DEPLOYMENT.md`
- **Quick Start:** See `QUICK_START.md`
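The "Verify File Processing" checklist above implies a simple extension filter. A minimal sketch of the assumed behavior (index `.docx`/`.pdf`/`.xlsx`, skip markdown and everything else); this is not the repository's actual loader:

```python
# Sketch of the file filter described in the checklist above: only Word, PDF,
# and Excel files are indexed; markdown and other extensions are skipped.
from pathlib import Path

INDEXABLE = {".docx", ".pdf", ".xlsx"}

def files_to_index(root: str) -> list[Path]:
    """Return all files under root whose extension marks them as indexable."""
    return [p for p in Path(root).rglob("*") if p.suffix.lower() in INDEXABLE]
```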
diagrams/1-indexing-flow.mmd
DELETED

```mermaid
sequenceDiagram
    participant User
    participant Modal
    participant CreateVectorDB as create_vector_db()
    participant PDFLoader
    participant TextSplitter
    participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
    participant ChromaDB as Remote ChromaDB

    User->>Modal: modal run modal-rag.py::index
    Modal->>CreateVectorDB: Execute function

    CreateVectorDB->>PDFLoader: Load PDFs from /insurance-data
    PDFLoader-->>CreateVectorDB: Return documents

    CreateVectorDB->>TextSplitter: Split documents (chunk_size=1000)
    TextSplitter-->>CreateVectorDB: Return chunks

    CreateVectorDB->>Embeddings: Initialize (device='cuda')
    CreateVectorDB->>Embeddings: Generate embeddings for chunks
    Embeddings-->>CreateVectorDB: Return embeddings

    CreateVectorDB->>ChromaDB: Connect to remote service
    CreateVectorDB->>ChromaDB: Upsert chunks + embeddings
    ChromaDB-->>CreateVectorDB: Confirm storage

    CreateVectorDB-->>Modal: Complete
    Modal-->>User: Success message
```
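The split step in the diagram (chunk_size=1000) can be illustrated without LangChain. A dependency-free sketch of fixed-size chunking with overlap; the real splitter and its overlap value may differ:

```python
# Minimal stand-in for the text-splitter step above: slice a document into
# fixed-size chunks with a small overlap so sentences spanning a boundary
# appear in both neighboring chunks.
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks
```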
diagrams/1-indexing-flow.svg
CHANGED
diagrams/2-query-flow-medium.mmd
DELETED

```mermaid
sequenceDiagram
    participant User
    participant Modal
    participant RAGModel
    participant Embeddings
    participant ChromaDB
    participant LLM

    User->>Modal: modal run query --question "..."

    Note over Modal,RAGModel: Container Startup (if cold)
    Modal->>RAGModel: Initialize
    RAGModel->>Embeddings: Load embedding model (GPU)
    RAGModel->>LLM: Load Mistral-7B (GPU)

    Note over Modal,LLM: Query Processing
    Modal->>RAGModel: Process question
    RAGModel->>Embeddings: Convert question to vector
    RAGModel->>ChromaDB: Search similar documents
    ChromaDB-->>RAGModel: Top 3 matching docs

    RAGModel->>LLM: Generate answer + context
    LLM-->>RAGModel: Answer

    RAGModel-->>User: Display answer + sources
```
diagrams/2-query-flow-medium.svg
CHANGED
diagrams/2-query-flow-simple.mmd
DELETED

```mermaid
sequenceDiagram
    participant User
    participant Modal
    participant RAGModel
    participant ChromaDB
    participant LLM as Mistral-7B

    User->>Modal: Ask question
    Modal->>RAGModel: Initialize (warm container)

    Note over RAGModel: Load models on GPU

    RAGModel->>ChromaDB: Search for relevant docs
    ChromaDB-->>RAGModel: Return top 3 documents

    RAGModel->>LLM: Generate answer with context
    LLM-->>RAGModel: Generated answer

    RAGModel-->>User: Answer + Sources
```
diagrams/2-query-flow-simple.svg
CHANGED
diagrams/2-query-flow.mmd
DELETED

```mermaid
sequenceDiagram
    participant User
    participant Modal
    participant QueryEntrypoint as query()
    participant RAGModel
    participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
    participant ChromaRetriever as RemoteChromaRetriever
    participant ChromaDB as Remote ChromaDB
    participant LLM as Mistral-7B<br/>(A10G GPU)
    participant RAGChain as LangChain RAG

    User->>Modal: modal run modal-rag.py::query --question "..."
    Modal->>QueryEntrypoint: Execute local entrypoint
    QueryEntrypoint->>RAGModel: Instantiate RAGModel()

    Note over RAGModel: @modal.enter() lifecycle
    RAGModel->>Embeddings: Load embedding model (CUDA)
    RAGModel->>ChromaDB: Connect to remote service
    RAGModel->>LLM: Load Mistral-7B (A10G GPU)
    RAGModel->>RAGModel: Initialize RemoteChromaRetriever

    QueryEntrypoint->>RAGModel: query.remote(question)

    RAGModel->>ChromaRetriever: Create retriever instance
    RAGModel->>RAGChain: Build RAG chain

    RAGChain->>ChromaRetriever: Retrieve relevant docs
    ChromaRetriever->>Embeddings: embed_query(question)
    Embeddings-->>ChromaRetriever: Query embedding
    ChromaRetriever->>ChromaDB: query(embedding, k=3)
    ChromaDB-->>ChromaRetriever: Top-k documents
    ChromaRetriever-->>RAGChain: Return documents

    RAGChain->>LLM: Generate answer with context
    LLM-->>RAGChain: Generated answer
    RAGChain-->>RAGModel: Return result

    RAGModel-->>QueryEntrypoint: Return {answer, sources}
    QueryEntrypoint-->>User: Display answer + sources
```
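The retrieval step in the diagram (embed the question, fetch the top-k nearest chunks) reduces to a similarity search. A dependency-free sketch using cosine similarity; ChromaDB's actual index is far more sophisticated:

```python
# Stand-in for the ChromaDB query(embedding, k=3) step above: rank stored
# document vectors by cosine similarity to the query vector.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k most similar document vectors."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]
```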
diagrams/2-query-flow.svg
CHANGED
diagrams/3-web-endpoint-flow.mmd
DELETED

```mermaid
sequenceDiagram
    participant User
    participant Browser
    participant Modal as Modal Platform
    participant WebEndpoint as RAGModel.web_query
    participant QueryMethod as RAGModel.query
    participant RAGChain
    participant ChromaDB
    participant LLM

    User->>Browser: GET https://.../web_query?question=...
    Browser->>Modal: HTTP GET request
    Modal->>WebEndpoint: Route to @modal.fastapi_endpoint

    WebEndpoint->>QueryMethod: Call query.local(question)

    Note over QueryMethod,LLM: Same flow as Query diagram
    QueryMethod->>RAGChain: Build chain
    RAGChain->>ChromaDB: Retrieve docs
    RAGChain->>LLM: Generate answer
    LLM-->>QueryMethod: Return result

    QueryMethod-->>WebEndpoint: Return {answer, sources}
    WebEndpoint-->>Modal: JSON response
    Modal-->>Browser: HTTP 200 + JSON
    Browser-->>User: Display result
```
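Per the diagram, the web endpoint is a thin JSON wrapper around `RAGModel.query`. A minimal sketch of that contract; the function names and response shape here are illustrative:

```python
# Stand-in for the web_query endpoint above: delegate to the query function
# and serialize its {answer, sources} result as the JSON response body.
import json

def web_query(question: str, query_fn) -> str:
    result = query_fn(question)  # expected shape: {"answer": ..., "sources": [...]}
    return json.dumps({"answer": result["answer"], "sources": result["sources"]})
```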
diagrams/3-web-endpoint-flow.svg
CHANGED
diagrams/4-container-lifecycle.mmd
DELETED

```mermaid
sequenceDiagram
    participant Modal
    participant Container
    participant RAGModel
    participant GPU as A10G GPU
    participant Volume as Modal Volume
    participant ChromaDB

    Modal->>Container: Start container (min_containers=1)
    Container->>GPU: Allocate GPU
    Container->>Volume: Mount /insurance-data

    Container->>RAGModel: Call @modal.enter()

    Note over RAGModel: Initialization phase
    RAGModel->>RAGModel: Load HuggingFaceEmbeddings (CUDA)
    RAGModel->>ChromaDB: Connect to remote service
    RAGModel->>RAGModel: Load Mistral-7B (GPU)
    RAGModel->>RAGModel: Create RemoteChromaRetriever class

    RAGModel-->>Container: Ready
    Container-->>Modal: Container warm and ready

    Note over Modal,Container: Container stays warm (min_containers=1)

    loop Handle requests
        Modal->>RAGModel: Invoke query() method
        RAGModel-->>Modal: Return result
    end

    Note over Modal,Container: Container persists until scaled down
```
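The lifecycle in the diagram follows Modal's warm-container pattern: expensive setup runs once at container start, then every request reuses the loaded state. A dependency-free sketch of that shape; the real class loads actual models inside `@modal.enter()`:

```python
# Sketch of the warm-container lifecycle above: enter() does the one-time
# expensive initialization; query() assumes that state is already loaded.
class RAGModelSketch:
    def __init__(self):
        self.ready = False

    def enter(self):
        """Stands in for @modal.enter(): load embeddings, LLM, DB connection once."""
        self.models = {"embeddings": "loaded", "llm": "loaded"}
        self.ready = True

    def query(self, question: str) -> str:
        assert self.ready, "enter() must run before queries"
        return f"answer to: {question}"
```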
diagrams/4-container-lifecycle.svg
CHANGED
diagrams/finetuning.svg
CHANGED
docs/NEXT_STEPS.md
ADDED

# Next Steps & Roadmap

## Current Status

✅ **Completed:**
- Fine-tuning pipeline with vLLM optimization
- RAG system with local ChromaDB
- High-performance inference (<3s latency)
- Model merging for production deployment
- Comprehensive documentation

## Immediate Next Steps

### 1. Test Fine-Tuned Model Performance

```bash
# Test the vLLM-optimized endpoint
curl -X POST https://mcp-hack--phi3-inference-vllm-model-ask.modal.run \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the population of Tokyo?", "context": "Japan Census data"}'
```

### 2. Test RAG System

```bash
# Test the RAG endpoint
curl -X POST https://mcp-hack--rag-vllm-optimized-ragmodel-query.modal.run \
  -H "Content-Type: application/json" \
  -d '{"question": "What insurance products are available?"}'
```

### 3. Monitor Performance

- Check latency metrics in responses
- Verify <3s response times
- Monitor GPU utilization on the Modal dashboard

## Short Term (This Week)

### Fine-Tuning Improvements
- [ ] Run the evaluation script to assess model quality
- [ ] Collect more training data if needed
- [ ] Experiment with different LoRA parameters
- [ ] Test on diverse queries

### RAG Enhancements
- [ ] Add more insurance documents to the volume
- [ ] Re-index with updated documents
- [ ] Test retrieval quality
- [ ] Optimize chunk sizes if needed

### Documentation
- [ ] Add API usage examples
- [ ] Create a deployment guide
- [ ] Document troubleshooting steps

## Medium Term (Next 2 Weeks)

### Model Optimization
1. **Fine-tuning iterations**
   - Analyze evaluation results
   - Adjust training parameters
   - Re-train if needed

2. **RAG improvements**
   - Experiment with different embedding models
   - Optimize retrieval parameters (top-k, similarity threshold)
   - Add query rewriting

3. **Performance monitoring**
   - Set up logging
   - Track latency trends
   - Monitor costs

### Feature Additions
- [ ] Add streaming responses
- [ ] Implement a caching layer
- [ ] Add query history
- [ ] Create an admin dashboard

## Long Term (Next Month)

### Production Readiness
1. **Deployment**
   - Set up a CI/CD pipeline
   - Configure monitoring and alerts
   - Implement rate limiting
   - Add authentication if needed

2. **Scaling**
   - Optimize container scaling
   - Implement load balancing
   - Add caching (Redis)
   - Set up a CDN for static assets

3. **Advanced Features**
   - Multi-modal support (images, tables)
   - Batch processing
   - A/B testing framework
   - Analytics dashboard

## Technical Debt

- [ ] Remove the `bkp/` directory (old backup files)
- [ ] Clean up unused dependencies
- [ ] Add comprehensive tests
- [ ] Improve error handling
- [ ] Add input validation

## Metrics to Track

**Performance:**
- Inference latency (target: <3s)
- Retrieval accuracy
- GPU utilization
- Cost per query

**Quality:**
- Model accuracy on the evaluation set
- RAG relevance scores
- User satisfaction (if applicable)

## Decision Points

1. **Model Selection:**
   - [ ] Continue with Phi-3-mini
   - [ ] Experiment with larger models
   - [ ] Try different base models

2. **Infrastructure:**
   - [ ] Stay with Modal (current)
   - [ ] Migrate to another platform
   - [ ] Self-hosted deployment

3. **Data Strategy:**
   - [ ] Expand the training dataset
   - [ ] Add domain-specific data
   - [ ] Implement data versioning

## Quick Reference

### Key Commands
```bash
# Fine-tuning
./venv/bin/modal run src/finetune/finetune_modal.py

# Model merging
./venv/bin/modal run src/finetune/merge_model.py

# Deploy vLLM endpoint (fine-tuned)
./venv/bin/modal deploy src/finetune/api_endpoint_vllm.py

# Deploy RAG endpoint
./venv/bin/modal deploy src/rag/rag_vllm.py

# Evaluation
./venv/bin/modal run src/finetune/eval_finetuned.py
```

### Documentation
- **Main Guide:** `docs/HOW_TO_RUN.md`
- **Architecture:** `diagrams/` folder
- **Testing:** `docs/TESTING.md`
- **Agent Design:** `docs/agentdesign.md`

## Success Criteria

**Phase 1 (Current):**
- ✅ <3s inference latency
- ✅ vLLM optimization working
- ✅ RAG retrieval functional

**Phase 2 (Next):**
- [ ] >90% accuracy on the evaluation set
- [ ] <2s average latency
- [ ] Production deployment complete

**Phase 3 (Future):**
- [ ] Multi-user support
- [ ] Advanced analytics
- [ ] Cost optimization (<$X per 1K queries)
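The latency targets above can be checked from recorded per-query timings. An illustrative helper, not part of the repository, that reports p50/p95 against the <3s target:

```python
# Summarize per-query latencies (seconds) against the latency target above.
import math
import statistics

def latency_report(latencies, target=3.0):
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "p50": statistics.median(ordered),
        "p95": ordered[p95_index],
        "meets_target": ordered[p95_index] < target,
    }
```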
QUICK_START.md → docs/QUICK_START.md
RENAMED
File without changes
docs/QUICK_START_API.md
ADDED

# Quick Start: RAG API

Fast API endpoint for querying product design documents with <3 second response times.

## Deploy the API

```bash
# Deploy to Modal
modal deploy src/rag/rag_api.py

# Get the API URL
modal app show insurance-rag-api
```

## Use the API

### Python Client

```python
from src.rag.api_client import RAGAPIClient

# Initialize the client
client = RAGAPIClient(base_url="https://your-api-url.modal.run")

# Query
result = client.query("What are the three product tiers?")
print(result['answer'])
print(f"Response time: {result['total_time']:.2f}s")
```

### cURL

```bash
curl -X POST https://your-api-url.modal.run/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the three product tiers?"}'
```

### JavaScript

```javascript
const response = await fetch('https://your-api-url.modal.run/query', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ question: 'What are the three product tiers?' })
});

const data = await response.json();
console.log(data.answer);
```

## Test Performance

```bash
# Test with the default URL
python tests/test_api.py

# Test with a custom URL
python tests/test_api.py --url https://your-api-url.modal.run
```

## Performance Target

- **Target**: <3 seconds per query
- **Typical**: 1.5-2.5 seconds
- **Optimizations**: Warm containers, reduced tokens, limited context

## API Endpoints

- `GET /health` - Health check
- `POST /query` - Query the RAG system
- `GET /` - API information

See `docs/api/RAG_API.md` for full documentation.
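The `RAGAPIClient` used in the Python example is referenced but its implementation is not part of this diff. A stdlib-only sketch of what it presumably does (POST `/query` with a JSON body, return the parsed JSON reply); the real client may differ:

```python
# Hypothetical sketch of src/rag/api_client.RAGAPIClient using only the
# standard library: build a JSON POST to the /query endpoint and decode
# the JSON response into a dict.
import json
import urllib.request

class RAGAPIClient:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def query(self, question: str, timeout: float = 30.0) -> dict:
        req = urllib.request.Request(
            self.base_url + "/query",
            data=json.dumps({"question": question}).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())
```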
docs/README.md
ADDED
@@ -0,0 +1,58 @@
# Documentation Index

This directory contains all project documentation.

## Main Guides

### Getting Started
- **[HOW_TO_RUN.md](HOW_TO_RUN.md)** - Complete guide to running the fine-tuning pipeline
- **[QUICK_START.md](QUICK_START.md)** - Quick start guide for the project
- **[QUICK_START_API.md](QUICK_START_API.md)** - API quick start guide

### Fine-Tuning
- **[finetune/](../finetune/)** - Fine-tuning documentation and guides
  - Data preparation
  - Dataset generation
  - Model training
  - Evaluation

### RAG System
- **[README_RAG.md](README_RAG.md)** - RAG system overview
- **[guides/QUICK_START_RAG.md](guides/QUICK_START_RAG.md)** - RAG quick start
- **[guides/RAG_SETUP_COMPLETE.md](guides/RAG_SETUP_COMPLETE.md)** - Complete RAG setup guide
- **[api/RAG_API.md](api/RAG_API.md)** - RAG API documentation

### Deployment
- **[deployment/](deployment/)** - Deployment guides
  - **[README.md](deployment/README.md)** - Deployment overview
  - **[NEBIUS_DEPLOYMENT.md](deployment/NEBIUS_DEPLOYMENT.md)** - Nebius deployment guide

### Reference
- **[STRUCTURE.md](STRUCTURE.md)** - Project structure overview
- **[TESTING.md](TESTING.md)** - Testing guide
- **[MIGRATION_GUIDE.md](MIGRATION_GUIDE.md)** - Migration guide
- **[VLLM_MIGRATION.md](VLLM_MIGRATION.md)** - vLLM migration guide
- **[NEXT_STEPS.md](NEXT_STEPS.md)** - Next steps and roadmap

### Agent Design
- **[agentdesign.md](agentdesign.md)** - AI agent design for automated development workflow

### Product Design
- **[product-design/](product-design/)** - Product design guides and examples
  - Product decision guide
  - RAG setup for product design
  - Example: Tokyo auto insurance product design

## Additional Resources

### Data Sources
- **[guides/estat_api_guide.md](guides/estat_api_guide.md)** - e-Stat API guide
- **[guides/source_data.md](guides/source_data.md)** - Data source documentation
- **[guides/ft_process.md](guides/ft_process.md)** - Fine-tuning process details

### Troubleshooting
- **[guides/TROUBLESHOOTING.md](guides/TROUBLESHOOTING.md)** - General troubleshooting
- **[guides/WEB_TROUBLESHOOTING.md](guides/WEB_TROUBLESHOOTING.md)** - Web interface troubleshooting

### Web Interface
- **[guides/WEB_INTERFACE.md](guides/WEB_INTERFACE.md)** - Web interface documentation
README_RAG.md → docs/README_RAG.md
RENAMED
File without changes

STRUCTURE.md → docs/STRUCTURE.md
RENAMED
File without changes

TESTING.md → docs/TESTING.md
RENAMED
File without changes

VLLM_MIGRATION.md → docs/VLLM_MIGRATION.md
RENAMED
File without changes
docs/api/RAG_API.md
ADDED
@@ -0,0 +1,244 @@
# RAG API Documentation

Fast API endpoint for querying the product design RAG system with <3 second response times.

## Quick Start

### Deploy the API

```bash
# Deploy to Modal
modal deploy src/rag/rag_api.py

# Get the URL
modal app list
```

### Use the API

```python
from src.rag.api_client import RAGAPIClient

client = RAGAPIClient(base_url="https://your-modal-url.modal.run")
result = client.query("What are the three product tiers?")
print(result['answer'])
```

## API Endpoints

### Health Check

```http
GET /health
```

**Response:**
```json
{
  "status": "healthy",
  "service": "rag-api"
}
```

### Query

```http
POST /query
Content-Type: application/json

{
  "question": "What are the three product tiers?",
  "top_k": 5,
  "max_tokens": 1024
}
```

**Response:**
```json
{
  "answer": "The three product tiers are...",
  "retrieval_time": 0.45,
  "generation_time": 1.23,
  "total_time": 1.68,
  "sources": [
    {
      "content": "...",
      "metadata": {...}
    }
  ],
  "success": true
}
```

## Performance Optimization

### Target: <3 Second Responses

The API is optimized for fast responses:

1. **Warm Containers**: `min_containers=1` keeps a container ready
2. **Optimized LLM**: Reduced max_tokens (1024 vs 1536)
3. **Limited Context**: Top 3 documents, 800 chars each
4. **Prefix Caching**: Enabled for faster generation
5. **Concurrent Requests**: Up to 10 concurrent requests

### Response Time Breakdown

- **Retrieval**: 0.3-0.8 seconds
- **Generation**: 1.0-2.0 seconds
- **Total**: 1.5-3.0 seconds (target: <3s)

## Usage Examples

### Python Client

```python
from src.rag.api_client import RAGAPIClient

# Initialize
client = RAGAPIClient(base_url="https://your-api-url.modal.run")

# Health check
health = client.health_check()
print(health)

# Query
result = client.query("What are the premium ranges?")
print(result['answer'])

# Fast query (optimized for speed)
result = client.query_fast("What are the three tiers?")
print(result['answer'])
```

### cURL

```bash
# Health check
curl https://your-api-url.modal.run/health

# Query
curl -X POST https://your-api-url.modal.run/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the three product tiers?",
    "top_k": 5,
    "max_tokens": 1024
  }'
```

### JavaScript/TypeScript

```javascript
const response = await fetch('https://your-api-url.modal.run/query', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    question: 'What are the three product tiers?',
    top_k: 5,
    max_tokens: 1024
  })
});

const data = await response.json();
console.log(data.answer);
```

## Configuration

### Environment Variables

- `MODAL_APP_NAME`: App name (default: "insurance-rag-api")
- `MODAL_VOLUME_NAME`: Volume name (default: "mcp-hack-ins-products")

### API Parameters

- `question` (required): The question to ask
- `top_k` (optional, default: 5): Number of documents to retrieve
- `max_tokens` (optional, default: 1024): Maximum response length
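Callers constructing raw requests can apply the documented defaults client-side. A hedged sketch (`build_query_payload` is a hypothetical helper shown for illustration, not part of the shipped client):

```python
def build_query_payload(question, top_k=5, max_tokens=1024):
    """Build a /query request body, applying the documented defaults."""
    if not question or not question.strip():
        raise ValueError("question is required")
    return {
        "question": question.strip(),
        "top_k": int(top_k),
        "max_tokens": int(max_tokens),
    }

payload = build_query_payload("What are the three product tiers?")
```

Serializing `payload` with `json.dumps` yields a body equivalent to the cURL example above.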
## Performance Tips

1. **Use Fast Query**: For speed-critical applications, use the `query_fast()` method
2. **Reduce top_k**: Lower `top_k` (e.g., 3) for faster retrieval
3. **Reduce max_tokens**: Lower `max_tokens` (e.g., 512) for faster generation
4. **Cache Results**: Cache common queries client-side
5. **Batch Requests**: If possible, batch multiple queries

## Error Handling

```python
result = client.query("your question")

if result.get("success"):
    print(result['answer'])
else:
    print(f"Error: {result.get('error', 'Unknown error')}")
```
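Transient failures (cold starts, network blips) can be retried with exponential backoff. A minimal sketch, assuming the `success` field shown above; the retry helper itself is illustrative, not part of the API:

```python
import time

def query_with_retry(query_fn, question, retries=3, base_delay=0.5):
    """Retry a query on failure with exponential backoff."""
    result = {"success": False}
    for attempt in range(retries):
        result = query_fn(question)
        if result.get("success"):
            return result
        if attempt < retries - 1:
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return result

# usage: query_with_retry(client.query, "What are the premium ranges?")
```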
## Monitoring

### Response Times

Monitor the `total_time` field in responses:
- < 2s: Excellent
- 2-3s: Good (target)
- > 3s: May need optimization

### Health Monitoring

```python
health = client.health_check()
if health.get("status") != "healthy":
    # Handle unhealthy state
    pass
```

## Deployment

### Modal Deployment

```bash
# Deploy
modal deploy src/rag/rag_api.py

# Get URL
modal app show insurance-rag-api
```

### Local Testing

```bash
# Run locally (for development)
modal serve src/rag/rag_api.py
```

## Rate Limiting

The API supports up to 10 concurrent requests. For higher throughput:
- Deploy multiple instances
- Use a load balancer
- Implement client-side rate limiting
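The client-side option can be implemented as a small sliding-window limiter. This is a sketch under stated assumptions (the class name and the exact limit/window values are illustrative, not part of the API):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Client-side limiter: allow at most `limit` calls per `window` seconds."""
    def __init__(self, limit=10, window=1.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock          # injectable for testing
        self.calls = deque()        # timestamps of recently allowed calls

    def allow(self):
        now = self.clock()
        # drop timestamps that have left the window
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False
```

Call `limiter.allow()` before each request and back off briefly when it returns `False`.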
## Security

- Add authentication if needed
- Use HTTPS in production
- Implement rate limiting
- Validate input questions
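The last bullet can be made concrete with a small validator run before forwarding a question to the API. A sketch under stated assumptions — the 500-character cap is arbitrary and the helper is hypothetical:

```python
import unicodedata

def validate_question(raw, max_chars=500):
    """Basic input validation before forwarding a question to the API."""
    if not isinstance(raw, str):
        raise ValueError("question must be a string")
    # drop non-whitespace control characters, then collapse whitespace
    cleaned = "".join(
        ch for ch in raw
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    cleaned = " ".join(cleaned.split())
    if not cleaned:
        raise ValueError("question is empty")
    if len(cleaned) > max_chars:
        raise ValueError("question too long")
    return cleaned
```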
## Troubleshooting

### Slow Responses (>3s)
- Check if container is warm (`min_containers=1`)
- Reduce `max_tokens`
- Reduce `top_k`
- Check network latency

### Errors
- Verify documents are indexed
- Check Modal app status
- Review error messages in response
docs/deployment/ADD_GUIDES_TO_RAG.md
DELETED
@@ -1,146 +0,0 @@
# RAG Indexing Configuration

## Overview

The RAG system indexes **only Word, PDF, and Excel files** containing product design information. **All markdown files are excluded** from indexing to keep the RAG focused on structured product documents.

## Currently Indexed Files

The system automatically indexes files that match these patterns:

1. **Word Documents (.docx):**
   - Files with `tokyo_auto_insurance` or `product_design` in the filename
   - Example: `tokyo_auto_insurance_product_design.docx`

2. **PDF Documents (.pdf):**
   - Files with `tokyo_auto_insurance` or `product_design` in the filename
   - Example: `tokyo_auto_insurance_product_design.pdf`

3. **Excel Spreadsheets (.xlsx, .xls):**
   - Files with `tokyo_auto_insurance` or `product_design` in the filename
   - Example: `tokyo_auto_insurance_product_design.xlsx`

## Excluded Files

The following files are **NOT indexed**:

- ❌ **All markdown files** (`.md`, `.markdown`) - completely excluded
- ❌ Guide files (e.g., `QUICK_START_RAG.md`, `PRODUCT_DECISION_GUIDE.md`)
- ❌ Setup guides (e.g., `setup_product_design_rag.md`)
- ❌ Troubleshooting guides
- ❌ Web interface guides
- ❌ Any other file types (`.txt`, `.csv`, `.json`, etc.)

## Files That Will Be Indexed

Based on the current repository structure:

✅ **Will be indexed (if uploaded to Modal volume):**
- `tokyo_auto_insurance_product_design.docx` (Word document)
- `tokyo_auto_insurance_product_design.pdf` (PDF document)
- `tokyo_auto_insurance_product_design.xlsx` (Excel spreadsheet)
- `tokyo_auto_insurance_product_design.xls` (Excel 97-2003)

❌ **Will NOT be indexed (all excluded):**
- `tokyo_auto_insurance_product_design.md` (markdown - excluded)
- `tokyo_auto_insurance_product_design_filled.md` (markdown - excluded)
- `QUICK_START_RAG.md` (markdown - excluded)
- `PRODUCT_DECISION_GUIDE.md` (markdown - excluded)
- `setup_product_design_rag.md` (markdown - excluded)
- `TROUBLESHOOTING.md` (markdown - excluded)
- `WEB_INTERFACE.md` (markdown - excluded)
- All other markdown and non-supported file types

## How to Add More Product Design Files

### Option 1: Use Supported File Formats
Convert your files to one of the supported formats:
- **Word**: `.docx` format
- **PDF**: `.pdf` format
- **Excel**: `.xlsx` or `.xls` format

**Important:**
- The file must contain `tokyo_auto_insurance` **OR** `product_design` in the filename
- Markdown files (`.md`) are **not supported** and will be ignored

### Option 2: Update the Loader
Edit `src/rag/modal-rag-product-design.py` and modify the pattern matching:

```python
# Current pattern for PDF files (line ~81):
if 'tokyo_auto_insurance' in file_lower or 'product_design' in file_lower:
    pdf_files.append(full_path)

# To add more patterns, modify to:
if ('tokyo_auto_insurance' in file_lower or
        'product_design' in file_lower or
        'your_custom_pattern' in file_lower):
    pdf_files.append(full_path)
```
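The full rule (supported extension AND matching filename pattern) can be captured in one predicate. A sketch for illustration — the `should_index` helper is hypothetical; the real loader lives in `src/rag/modal-rag-product-design.py`:

```python
import os

PATTERNS = ("tokyo_auto_insurance", "product_design")
SUPPORTED_EXTS = (".docx", ".pdf", ".xlsx", ".xls")

def should_index(path):
    """True only for supported Word/PDF/Excel files matching the naming patterns."""
    name = os.path.basename(path).lower()
    _, ext = os.path.splitext(name)
    if ext not in SUPPORTED_EXTS:   # markdown and everything else excluded
        return False
    return any(p in name for p in PATTERNS)
```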
**Note:** All markdown files are intentionally excluded. Only Word, PDF, and Excel files are processed.

## Uploading to Modal Volume

To index product design documents, upload **only Word, PDF, or Excel files** to the Modal volume:

```bash
# Upload Word document
modal volume put mcp-hack-ins-products \
  docs/product-design/tokyo_auto_insurance_product_design.docx \
  docs/product-design/tokyo_auto_insurance_product_design.docx

# Upload PDF document (if you have one)
modal volume put mcp-hack-ins-products \
  docs/product-design/tokyo_auto_insurance_product_design.pdf \
  docs/product-design/tokyo_auto_insurance_product_design.pdf

# Upload Excel spreadsheet (if you have one)
modal volume put mcp-hack-ins-products \
  docs/product-design/tokyo_auto_insurance_product_design.xlsx \
  docs/product-design/tokyo_auto_insurance_product_design.xlsx
```

**Important Notes:**
- ❌ **Do NOT upload markdown files** (`.md`) - they will be ignored
- ✅ Only `.docx`, `.pdf`, `.xlsx`, and `.xls` files are processed
- ✅ Files must contain `tokyo_auto_insurance` or `product_design` in the filename

## Re-indexing

After uploading new files, re-index:

```bash
# Using CLI
python src/web/query_product_design.py --index

# Or direct Modal command
modal run src/rag/modal-rag-product-design.py::index_product_design
```

## Benefits of Current Approach

By focusing only on Word, PDF, and Excel files:
- ✅ RAG answers are focused on structured product documents
- ✅ No confusion from markdown guide/instruction content
- ✅ Faster retrieval (smaller, more focused document set)
- ✅ More accurate product-related answers from official documents
- ✅ Better handling of tables and structured data (Excel, Word tables)
- ✅ Cleaner source citations
- ✅ Support for professional document formats

## Example Queries

With product design documents indexed, you can ask:

```
"What are the three product tiers and their premium ranges?"
"What is the Year 3 premium volume projection?"
"What are the FSA licensing requirements?"
"What coverage does the Standard tier include?"
"What is the target market size in Tokyo?"
"Who are the main competitors?"
```

The RAG system will retrieve relevant sections from the product design documents only, ensuring answers are focused on product information.
docs/guides/HOW_TO_RUN.md
DELETED
@@ -1,215 +0,0 @@
# How to Run the Fine-Tuning Pipeline

This guide walks you through the complete pipeline from data generation to model deployment.

---

## Dataset Generation Results

### Final Statistics
- **Training Samples**: 201,651
- **Validation Samples**: 22,407
- **Total Dataset**: 224,058 high-quality QA pairs
- **Improvement**: 150x more data than previous approach

### Batch Performance
| Batch | Files | Data Points | Status |
|-------|-------|-------------|--------|
| 1 | 1,000 | 100,611 | ✅ Excellent |
| 2 | 1,000 | 39,960 | ✅ Good |
| 3 | 1,000 | 0 | ⚠️ Complex files |
| 4 | 1,000 | 600 | ⚠️ Runner issue |
| 5 | 1,000 | 54,627 | ✅ Excellent |
| 6 | 1,000 | 5,400 | ✅ Good |
| 7 | 888 | 22,860 | ✅ Good |

---

## Step-by-Step Instructions

### Step 1: Fine-Tune the Model

Run the fine-tuning job on Modal with H200 GPU:

```bash
cd /Users/veeru/agents/mcp-hack

# Start fine-tuning in detached mode
./venv/bin/modal run --detach docs/finetune_modal.py
```

**What happens:**
- Loads 201,651 training samples from `finetune-dataset` volume
- Trains Phi-3-mini-4k-instruct with LoRA on H200 GPU
- Runs for ~90-120 minutes
- Saves model to `model-checkpoints` volume

**Monitor progress:**
```bash
# View live logs
modal app logs mcp-hack::finetune-phi3-modal
```

---

### Step 2: Evaluate the Model

After training completes, test the model:

```bash
./venv/bin/modal run docs/eval_finetuned.py
```

This will run sample questions and show the model's answers.

---

### Step 3: Deploy API Endpoint

Deploy the inference API:

**Option A: GPU Endpoint (A10G)**
```bash
./venv/bin/modal deploy docs/api_endpoint.py
```

**Option B: CPU Endpoint**
```bash
./venv/bin/modal deploy docs/api_endpoint_cpu.py
```

**Get the endpoint URL:**
```bash
modal app list
```

---

### Step 4: Test the API

```bash
# Example API call
curl -X POST https://YOUR-MODAL-URL/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the population of Tokyo?",
    "context": "Japan Census data"
  }'
```
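The same call can be made from Python's standard library. A sketch (the URL placeholder is as in the curl example; `build_ask_request` is a hypothetical helper shown for illustration):

```python
import json
from urllib import request

def build_ask_request(url, question, context):
    """Assemble the POST request for the /ask endpoint."""
    body = json.dumps({"question": question, "context": context}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_ask_request(
    "https://YOUR-MODAL-URL/ask",
    "What is the population of Tokyo?",
    "Japan Census data",
)
# answer = json.load(request.urlopen(req))  # uncomment with a real endpoint URL
```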
---

## Key Files

### Data Processing
- `docs/prepare_finetune_data.py` - Generates dataset from CSV files
- `docs/clean_sample.py` - Local testing script for data cleaning

### Model Training
- `docs/finetune_modal.py` - Fine-tuning script (H200 GPU)
- `docs/eval_finetuned.py` - Evaluation script

### API Deployment
- `docs/api_endpoint.py` - GPU inference endpoint (A10G)
- `docs/api_endpoint_cpu.py` - CPU inference endpoint

### Documentation
- `diagrams/finetuning.svg` - Visual pipeline diagram
- `finetune/04-evaluation.md` - Evaluation results

---

## Modal Volumes

The pipeline uses these Modal volumes:

| Volume | Purpose | Size |
|--------|---------|------|
| `census-data` | Raw census CSV files | 6,838 files |
| `economy-labor-data` | Raw economy CSV files | 50 files |
| `finetune-dataset` | Generated JSONL training data | 224K samples |
| `model-checkpoints` | Fine-tuned model weights | ~7GB |

---

## Tips

### If Training Fails
```bash
# Check logs for errors
modal app logs mcp-hack::finetune-phi3-modal

# Restart training
./venv/bin/modal run --detach docs/finetune_modal.py
```

### If You Need to Regenerate Data
```bash
# Clear existing dataset
./venv/bin/modal run docs/clear_dataset.py

# Regenerate with new logic
./venv/bin/modal run --detach docs/prepare_finetune_data.py
```

### View Volume Contents
```bash
# List files in a volume
modal volume ls finetune-dataset

# Download a file
modal volume get finetune-dataset train.jsonl finetune/train.jsonl
```

---

## Expected Timeline

| Step | Duration | Notes |
|------|----------|-------|
| Data Generation | ✅ Complete | 224K samples ready |
| Fine-Tuning | ~90-120 min | H200 GPU |
| Evaluation | ~5 min | Quick tests |
| API Deployment | ~2 min | Instant after deploy |

---

## Next Steps

1. **Run fine-tuning** (see Step 1 above)
2. **Wait for completion** (~2 hours)
3. **Evaluate results** (see Step 2)
4. **Deploy API** (see Step 3)
5. **Test with real queries** (see Step 4)

---

## Troubleshooting

**Issue**: "Volume not found"
```bash
# List all volumes
modal volume list
```

**Issue**: "Out of memory during training"
- Reduce `per_device_train_batch_size` in `finetune_modal.py`
- Current: 2 (already optimized for H200)

**Issue**: "Model not loading in API"
- Ensure fine-tuning completed successfully
- Check `model-checkpoints` volume has files

---

## ✅ Success Criteria

After completing all steps, you should have:
- ✅ Fine-tuned Phi-3-mini model
- ✅ Deployed API endpoint
- ✅ Model answering questions about Japanese census/economy data
- ✅ Improved accuracy over base model

---

**Ready to start?** Run the fine-tuning command from Step 1!
docs/guides/SETUP_SUCCESS.md
DELETED
@@ -1,63 +0,0 @@
# ✅ RAG Setup Successful!

## Status: Working

The product design RAG system is now fully operational!

### What Was Fixed

1. **File Detection**: Updated to find files in both the root and the `docs/` subdirectory
2. **GPU Fallback**: Added a CPU fallback for embeddings (works without a GPU)
3. **Word Document**: The markdown file works perfectly (the Word file has a python-docx issue, but the markdown contains all the same content)
4. **Modal Command**: Auto-detects Modal in the venv

### Current Status

✅ **Indexed**: 1 document (markdown), 56 chunks
✅ **Vector DB**: Created in ChromaDB collection `product_design`
✅ **Queries**: Working! Tested successfully

### Test Results

```bash
$ python3 query_product_design.py --query "What are the three product tiers?"
```

**Result**: ✅ Successfully retrieved and answered!

## Usage

### Query the Document

```bash
# Single query
python3 query_product_design.py --query "What are the three product tiers?"

# Interactive mode
python3 query_product_design.py --interactive
```

### Example Questions

- "What are the three product tiers and their premium ranges?"
- "What is the Year 3 premium volume projection?"
- "What coverage does the Standard tier include?"
- "What are the FSA licensing requirements?"

## Known Issues

1. **Word Document**: The `.docx` file has a python-docx compatibility issue with Modal volumes, but the markdown file contains all the same content and works perfectly.

2. **Answer Truncation**: Some answers may be truncated. This is normal - the system retrieves the most relevant chunks and generates concise answers.

## Next Steps

1. ✅ **Indexing**: Complete
2. ✅ **Query System**: Working
3. **Ready to Use**: You can now query the product design document!

Try it:
```bash
python3 query_product_design.py --interactive
```
docs/guides/SUMMARY.md
DELETED
@@ -1,114 +0,0 @@
# ✅ Complete Setup Summary

## What Was Accomplished

### 1. Product Design Document ✅
- **Created**: Comprehensive 1,600-line product design document
- **Filled**: All sections with realistic fictional data for "TokyoDrive Insurance"
- **Formats**:
  - Markdown: `docs/tokyo_auto_insurance_product_design_filled.md`
  - Word: `docs/tokyo_auto_insurance_product_design.docx`
- **Content**: 12 comprehensive sections covering all aspects of product design

### 2. RAG System Extension ✅
- **Created**: `src/modal-rag-product-design.py`
- **Features**:
  - Supports Markdown and Word documents
  - Separate ChromaDB collection (doesn't interfere with existing RAG)
  - GPU-accelerated with Phi-3 model
  - Integrated with existing Modal infrastructure

### 3. Query Interface ✅
- **Created**: `query_product_design.py` - Simple CLI tool
- **Features**:
  - Interactive mode for continuous queries
  - Single query mode
  - Index command
  - Clean, formatted output

### 4. Documentation ✅
- `docs/QUICK_START_RAG.md` - Quick start guide
- `docs/setup_product_design_rag.md` - Detailed setup
- `docs/next_steps_rag_recommendation.md` - Decision guide
- `docs/RAG_SETUP_COMPLETE.md` - Complete setup info
- `README_RAG.md` - Quick reference
## File Structure

```
mcp-hack/
├── src/
│   └── modal-rag-product-design.py    # Extended RAG system
├── query_product_design.py            # CLI query interface
├── docs/
│   ├── tokyo_auto_insurance_product_design_filled.md
│   ├── tokyo_auto_insurance_product_design.docx
│   ├── QUICK_START_RAG.md
│   ├── setup_product_design_rag.md
│   ├── next_steps_rag_recommendation.md
│   ├── RAG_SETUP_COMPLETE.md
│   └── SUMMARY.md (this file)
└── README_RAG.md                      # Quick reference
```
## Next Steps to Use

### Step 1: Index Documents (One-Time)
```bash
python query_product_design.py --index
```
⏱️ Takes 2-5 minutes

### Step 2: Query the Document
```bash
# Single query
python query_product_design.py --query "What are the three product tiers?"

# Interactive mode
python query_product_design.py --interactive
```
## Example Use Cases

### For Development
- Extract technical requirements
- Get API specifications
- Understand system architecture

### For Sales/Marketing
- Get pricing information
- Understand product features
- Compare tiers

### For Compliance
- Check regulatory requirements
- Get licensing info
- Understand data privacy rules

### For Financial Planning
- Get projections
- Understand cost structure
- Check break-even analysis
## Key Features

✅ **Comprehensive Document**: 12 sections, 1,600 lines, fully filled with realistic data
✅ **RAG System**: Semantic search + LLM for intelligent Q&A
✅ **Easy Interface**: Simple CLI tool, no complex setup
✅ **Fast Queries**: 3-5 seconds after initial warm-up
✅ **Separate Collection**: Doesn't interfere with existing insurance products RAG

## Status

🎉 **Everything is ready!**

1. ✅ Product design document created and filled
2. ✅ Documents uploaded to Modal volume
3. ✅ RAG system extended
4. ✅ Query interface created
5. ✅ Documentation complete

**Ready to index and query!**

Run: `python query_product_design.py --index`
docs/guides/modal-rag-optimization.md
DELETED
|
@@ -1,370 +0,0 @@
# Modal RAG Performance Optimization Guide

**Current Performance**: >1 minute per query
**Target Performance**: <5 seconds per query

## Performance Bottleneck Analysis

### Current Architecture Issues

1. **Model Loading Time** (~30-45 seconds)
   - Mistral-7B (13GB) loads on every cold start
   - Embedding model loads separately
   - No model caching between requests

2. **LLM Inference Time** (~15-30 seconds)
   - Mistral-7B is slow for inference
   - Running on A10G GPU (good, but model is large)
   - No inference optimization (quantization, etc.)

3. **Network Latency** (~2-5 seconds)
   - Remote ChromaDB calls
   - Modal container communication overhead

---

## Optimization Strategies (Ranked by Impact)

### 1. **Keep Containers Warm** ⭐⭐⭐⭐⭐
**Impact**: Eliminates 30-45s cold start time

**Current**:
```python
min_containers=1  # Already doing this ✅
```

**Why it helps**: Your container stays loaded with models in memory. First query after deployment is slow, but subsequent queries are fast.

**Cost**: ~$0.50-1.00/hour for warm A10G container

---
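As a reminder of where this setting lives, here is a sketch of the Modal class configuration (a config fragment, not runnable on its own; `app`, the image, and the class body come from the actual `modal-rag.py`):

```python
# Sketch only: keeps one GPU container resident so models stay loaded.
@app.cls(
    gpu="A10G",
    min_containers=1,  # always keep one warm container -> no cold start per query
    max_containers=1,  # and never scale beyond it
)
class RAGModel:
    ...
```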
### 2. **Switch to Smaller/Faster LLM** ⭐⭐⭐⭐⭐
**Impact**: Reduces inference from 15-30s to 2-5s

**Options**:

#### Option A: Mistral-7B-Instruct-v0.2 (Quantized)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

self.model = AutoModelForCausalLM.from_pretrained(
    LLM_MODEL,
    quantization_config=quantization_config,
    device_map="auto"
)
```
- **Speed**: 3-5x faster (5-10s → 1-3s)
- **Quality**: Minimal degradation
- **Memory**: 13GB → 3.5GB

#### Option B: Switch to Phi-3-mini (3.8B)
```python
LLM_MODEL = "microsoft/Phi-3-mini-4k-instruct"
```
- **Speed**: 5-10x faster than Mistral-7B
- **Quality**: Good for RAG tasks
- **Memory**: ~8GB → 4GB
- **Inference**: 2-4 seconds

#### Option C: Use TinyLlama-1.1B
```python
LLM_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
- **Speed**: 10-20x faster
- **Quality**: Lower, but acceptable for simple queries
- **Memory**: ~2GB
- **Inference**: <1 second

---
### 3. **Use vLLM for Inference** ⭐⭐⭐⭐
**Impact**: 2-5x faster inference

```python
# Install vLLM
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "vllm==0.6.0",
    # ... other packages
)

# In RAGModel.enter()
from vllm import LLM, SamplingParams

self.llm_engine = LLM(
    model=LLM_MODEL,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    max_model_len=2048  # Shorter context for speed
)

# In query method
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
    top_p=0.9
)
outputs = self.llm_engine.generate([prompt], sampling_params)
```

**Benefits**:
- Continuous batching
- PagedAttention (efficient memory)
- Optimized CUDA kernels
- 2-5x faster than HuggingFace pipeline

---
### 4. **Optimize Embedding Generation** ⭐⭐⭐
**Impact**: Reduces query embedding time from 1-2s to 0.2-0.5s

#### Option A: Use Smaller Embedding Model
```python
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
# Also 384 dimensions - bge-small-en-v1.5 is already compact, so gains here are modest
```

#### Option B: Use ONNX Runtime
```python
from optimum.onnxruntime import ORTModelForFeatureExtraction

self.embeddings = ORTModelForFeatureExtraction.from_pretrained(
    EMBEDDING_MODEL,
    export=True,
    provider="CUDAExecutionProvider"
)
```
- **Speed**: 2-3x faster
- **Quality**: Identical

---
### 5. **Reduce Context Window** ⭐⭐⭐
**Impact**: Faster LLM processing

```python
# In query method
sampling_params = SamplingParams(
    max_tokens=128,  # Instead of 256 or 512
    temperature=0.7
)

# Reduce retrieved documents
top_k = 2  # Instead of 3
```

**Why**: Fewer tokens to process = faster inference

---
### 6. **Cache ChromaDB Queries** ⭐⭐
**Impact**: Saves 1-2s on repeated queries

```python
from functools import lru_cache

# `retriever` is assumed to be in scope (e.g. the RAGModel's retriever)
@lru_cache(maxsize=100)
def get_cached_docs(question: str):
    # lru_cache keys on the question string itself, so no manual hashing is needed
    return retriever.get_relevant_documents(question)

# In query method
docs = get_cached_docs(question)
```

---
### 7. **Use Faster GPU** ⭐⭐
**Impact**: 1.5-2x faster inference

```python
@app.cls(
    gpu="A100",  # Instead of A10G
    # or
    gpu="H100",  # Even faster
)
```

**Cost**: A100 is 2-3x more expensive than A10G

---
### 8. **Parallel Processing** ⭐⭐
**Impact**: Overlap embedding + retrieval

```python
import asyncio

async def query_async(self, question: str):
    # Run embedding and LLM prep in parallel
    embedding_task = asyncio.create_task(
        self.get_query_embedding(question)
    )

    # ... rest of async pipeline
```

---
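The overlap idea can be shown with a self-contained sketch. The names `embed_query` and `build_prompt` below are hypothetical stand-ins for the real embedding and prompt-preparation steps; `asyncio.gather` runs both concurrently, so the total wait is roughly the slower of the two rather than their sum:

```python
import asyncio
import time

# Hypothetical stand-ins for the real embedding / prompt-preparation steps;
# the actual methods live on RAGModel in modal-rag.py.
async def embed_query(question: str) -> list:
    await asyncio.sleep(0.1)  # simulates embedding latency
    return [0.0] * 384  # bge-small-en-v1.5 produces 384-dim vectors

async def build_prompt(question: str) -> str:
    await asyncio.sleep(0.1)  # simulates context/template assembly
    return f"Question: {question}\nAnswer:"

async def query_async(question: str):
    start = time.perf_counter()
    # gather() runs both coroutines concurrently: ~0.1s total, not ~0.2s
    embedding, prompt = await asyncio.gather(
        embed_query(question), build_prompt(question)
    )
    return embedding, prompt, time.perf_counter() - start

embedding, prompt, elapsed = asyncio.run(query_async("What tiers exist?"))
```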
## 🎯 Recommended Implementation Plan

### Phase 1: Quick Wins (Get to <10s)
1. ✅ **Keep containers warm** (already done)
2. **Add 4-bit quantization** to Mistral-7B
3. **Reduce max_tokens** to 128
4. **Use top_k=2** instead of 3

**Expected**: 60s → 8-12s

---

### Phase 2: Major Speedup (Get to <5s)
1. **Switch to vLLM** for inference
2. **Use Phi-3-mini** instead of Mistral-7B
3. **Optimize embeddings** with ONNX

**Expected**: 8-12s → 3-5s

---

### Phase 3: Ultra-Fast (Get to <2s)
1. **Use TinyLlama** for simple queries
2. **Implement query caching**
3. **Upgrade to A100 GPU**

**Expected**: 3-5s → 1-2s

---
## Performance Comparison Table

| Configuration | Cold Start | Warm Query | Cost/Hour | Quality |
|--------------|------------|------------|-----------|---------|
| **Current** (Mistral-7B, A10G) | 45s | 15-30s | $0.50 | ⭐⭐⭐⭐⭐ |
| **Phase 1** (Quantized, warm) | 30s | 8-12s | $0.50 | ⭐⭐⭐⭐ |
| **Phase 2** (vLLM + Phi-3) | 20s | 3-5s | $0.50 | ⭐⭐⭐⭐ |
| **Phase 3** (TinyLlama, A100) | 10s | 1-2s | $1.50 | ⭐⭐⭐ |

---
## Code Changes for Phase 2 (Recommended)

### 1. Update model configuration
```python
LLM_MODEL = "microsoft/Phi-3-mini-4k-instruct"
EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"  # Keep same
```

### 2. Add vLLM to dependencies
```python
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "vllm==0.6.0",
    "langchain==0.3.7",
    # ... rest
)
```

### 3. Update RAGModel.enter()
```python
from vllm import LLM, SamplingParams

self.llm_engine = LLM(
    model=LLM_MODEL,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.85,
    max_model_len=2048
)

self.sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=128,
    top_p=0.9
)
```

### 4. Update query method
```python
# Build prompt
prompt = f"""Use the following context to answer the question.

Context: {context}

Question: {question}

Answer:"""

# Generate with vLLM
outputs = self.llm_engine.generate([prompt], self.sampling_params)
answer = outputs[0].outputs[0].text
```

---
## Cost vs Performance Trade-offs

| Approach | Speed Gain | Cost Change | Implementation |
|----------|-----------|-------------|----------------|
| Quantization | 3-5x | $0 | Easy |
| vLLM | 2-5x | $0 | Medium |
| Smaller model | 5-10x | $0 | Easy |
| A100 GPU | 1.5-2x | +200% | Easy |
| Caching | Variable | $0 | Medium |

---
## Next Steps

1. **Measure current performance** with logging
2. **Implement Phase 1** (quantization + reduce tokens)
3. **Test and measure** improvement
4. **Implement Phase 2** if needed (vLLM + Phi-3)
5. **Monitor** and iterate

---
## Performance Monitoring Code

Add this to track performance:

```python
import time

@modal.method()
def query(self, question: str, top_k: int = 2):
    start = time.time()

    # Embedding time
    embed_start = time.time()
    retriever = self.RemoteChromaRetriever(...)
    embed_time = time.time() - embed_start

    # Retrieval time
    retrieval_start = time.time()
    docs = retriever.get_relevant_documents(question)
    retrieval_time = time.time() - retrieval_start

    # LLM time
    llm_start = time.time()
    result = chain.invoke({"question": question})
    llm_time = time.time() - llm_start

    total_time = time.time() - start

    print(f"⏱️ Performance:")
    print(f"  Embedding: {embed_time:.2f}s")
    print(f"  Retrieval: {retrieval_time:.2f}s")
    print(f"  LLM: {llm_time:.2f}s")
    print(f"  Total: {total_time:.2f}s")

    return result
```

This will help you identify the exact bottleneck!
docs/guides/modal-rag-sequence.md
DELETED
|
@@ -1,168 +0,0 @@
# Modal RAG System - Sequence Diagrams

This document provides sequence diagrams for the Modal RAG (Retrieval Augmented Generation) application.

## 1. Indexing Flow (create_vector_db)

```mermaid
sequenceDiagram
    participant User
    participant Modal
    participant CreateVectorDB as create_vector_db()
    participant PDFLoader
    participant TextSplitter
    participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
    participant ChromaDB as Remote ChromaDB

    User->>Modal: modal run modal-rag.py::index
    Modal->>CreateVectorDB: Execute function

    CreateVectorDB->>PDFLoader: Load PDFs from /insurance-data
    PDFLoader-->>CreateVectorDB: Return documents

    CreateVectorDB->>TextSplitter: Split documents (chunk_size=1000)
    TextSplitter-->>CreateVectorDB: Return chunks

    CreateVectorDB->>Embeddings: Initialize (device='cuda')
    CreateVectorDB->>Embeddings: Generate embeddings for chunks
    Embeddings-->>CreateVectorDB: Return embeddings

    CreateVectorDB->>ChromaDB: Connect to remote service
    CreateVectorDB->>ChromaDB: Upsert chunks + embeddings
    ChromaDB-->>CreateVectorDB: Confirm storage

    CreateVectorDB-->>Modal: Complete
    Modal-->>User: Success message
```
## 2. Query Flow (RAGModel.query)

```mermaid
sequenceDiagram
    participant User
    participant Modal
    participant QueryEntrypoint as query()
    participant RAGModel
    participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
    participant ChromaRetriever as RemoteChromaRetriever
    participant ChromaDB as Remote ChromaDB
    participant LLM as Mistral-7B<br/>(A10G GPU)
    participant RAGChain as LangChain RAG

    User->>Modal: modal run modal-rag.py::query --question "..."
    Modal->>QueryEntrypoint: Execute local entrypoint
    QueryEntrypoint->>RAGModel: Instantiate RAGModel()

    Note over RAGModel: @modal.enter() lifecycle
    RAGModel->>Embeddings: Load embedding model (CUDA)
    RAGModel->>ChromaDB: Connect to remote service
    RAGModel->>LLM: Load Mistral-7B (A10G GPU)
    RAGModel->>RAGModel: Initialize RemoteChromaRetriever

    QueryEntrypoint->>RAGModel: query.remote(question)

    RAGModel->>ChromaRetriever: Create retriever instance
    RAGModel->>RAGChain: Build RAG chain

    RAGChain->>ChromaRetriever: Retrieve relevant docs
    ChromaRetriever->>Embeddings: embed_query(question)
    Embeddings-->>ChromaRetriever: Query embedding
    ChromaRetriever->>ChromaDB: query(embedding, k=3)
    ChromaDB-->>ChromaRetriever: Top-k documents
    ChromaRetriever-->>RAGChain: Return documents

    RAGChain->>LLM: Generate answer with context
    LLM-->>RAGChain: Generated answer
    RAGChain-->>RAGModel: Return result

    RAGModel-->>QueryEntrypoint: Return {answer, sources}
    QueryEntrypoint-->>User: Display answer + sources
```
## 3. Web Endpoint Flow (RAGModel.web_query)

```mermaid
sequenceDiagram
    participant User
    participant Browser
    participant Modal as Modal Platform
    participant WebEndpoint as RAGModel.web_query
    participant QueryMethod as RAGModel.query
    participant RAGChain
    participant ChromaDB
    participant LLM

    User->>Browser: GET https://.../web_query?question=...
    Browser->>Modal: HTTP GET request
    Modal->>WebEndpoint: Route to @modal.fastapi_endpoint

    WebEndpoint->>QueryMethod: Call query.local(question)

    Note over QueryMethod,LLM: Same flow as Query diagram
    QueryMethod->>RAGChain: Build chain
    RAGChain->>ChromaDB: Retrieve docs
    RAGChain->>LLM: Generate answer
    LLM-->>QueryMethod: Return result

    QueryMethod-->>WebEndpoint: Return {answer, sources}
    WebEndpoint-->>Modal: JSON response
    Modal-->>Browser: HTTP 200 + JSON
    Browser-->>User: Display result
```
## 4. Container Lifecycle (RAGModel)

```mermaid
sequenceDiagram
    participant Modal
    participant Container
    participant RAGModel
    participant GPU as A10G GPU
    participant Volume as Modal Volume
    participant ChromaDB

    Modal->>Container: Start container (min_containers=1)
    Container->>GPU: Allocate GPU
    Container->>Volume: Mount /insurance-data

    Container->>RAGModel: Call @modal.enter()

    Note over RAGModel: Initialization phase
    RAGModel->>RAGModel: Load HuggingFaceEmbeddings (CUDA)
    RAGModel->>ChromaDB: Connect to remote service
    RAGModel->>RAGModel: Load Mistral-7B (GPU)
    RAGModel->>RAGModel: Create RemoteChromaRetriever class

    RAGModel-->>Container: Ready
    Container-->>Modal: Container warm and ready

    Note over Modal,Container: Container stays warm (min_containers=1)

    loop Handle requests
        Modal->>RAGModel: Invoke query() method
        RAGModel-->>Modal: Return result
    end

    Note over Modal,Container: Container persists until scaled down
```
## Key Components

### Modal Configuration
- **App Name**: `insurance-rag`
- **Volume**: `mcp-hack-ins-products` mounted at `/insurance-data`
- **GPU**: A10G for RAGModel class
- **Autoscaling**: `min_containers=1`, `max_containers=1` (always warm)

### Models
- **LLM**: `mistralai/Mistral-7B-Instruct-v0.3` (GPU, float16)
- **Embeddings**: `BAAI/bge-small-en-v1.5` (GPU, CUDA)

### Storage
- **Vector DB**: Remote ChromaDB service (`chroma-server-v2`)
- **Collection**: `insurance_products`
- **Chunk Size**: 1000 characters with 200 overlap
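The 1000-character chunks with 200-character overlap can be pictured with a minimal sliding-window splitter (a simplified sketch of the semantics only; the real indexing code uses LangChain's text splitter, which also prefers natural break points):

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Naive fixed-window splitter illustrating chunk_size/overlap semantics."""
    step = chunk_size - overlap  # each window starts 800 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reaches the end of the text
    return chunks

chunks = split_text("x" * 2500)  # windows start at 0, 800, 1600
```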
### Endpoints
- **Local Entrypoints**: `list`, `index`, `query`
- **Web Endpoint**: `RAGModel.web_query` (FastAPI GET endpoint)
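Once deployed, the web endpoint is called with a plain HTTP GET. A sketch of constructing the request URL (the host below is a placeholder; the real one is printed by `modal deploy`):

```python
from urllib.parse import urlencode

# Placeholder host: substitute the URL that `modal deploy` prints.
base = "https://example--insurance-rag-ragmodel-web-query.modal.run"
params = {"question": "What coverage does the Standard tier include?"}
url = f"{base}?{urlencode(params)}"
# A GET to this URL returns JSON with the answer and source documents.
```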
docs/guides/next_steps_rag_recommendation.md
DELETED
|
@@ -1,77 +0,0 @@
# Next Steps: RAG for Product Design Document

## Should You Add RAG?

**Recommendation: YES, but with specific use cases in mind**

### Benefits of Adding RAG:

1. **Requirements Extraction**: Quickly find specific requirements from the 1,600-line document
2. **Stakeholder Q&A**: Answer questions like "What's the premium for a 28-year-old in Shibuya?"
3. **Design Validation**: Query coverage details, pricing tiers, compliance requirements
4. **Development Planning**: Extract technical requirements, API specs, integration needs
5. **Competitive Analysis**: Compare your product features vs competitors mentioned in the doc

### When RAG is NOT Needed:

- If you just need to read/search the document manually
- If the document is small enough to navigate easily
- If you don't need to answer complex questions across multiple sections

## Implementation Options

### Option 1: Extend Existing Modal RAG (Recommended)
- Your existing `modal-rag.py` already handles PDFs
- Can easily add support for markdown/Word documents
- Leverages existing ChromaDB infrastructure
- **Effort**: Low (30-60 minutes)

### Option 2: Simple Document Search
- Use grep/search tools for simple queries
- **Effort**: None (already available)

### Option 3: Full RAG with Fine-Tuning
- Fine-tune model on insurance domain + your product spec
- **Effort**: High (days/weeks)
- **Benefit**: Best accuracy for insurance-specific queries
## Recommended Next Steps

1. **Add Product Design Doc to RAG** (30 min)
   - Extend `modal-rag.py` to load markdown/Word docs
   - Index the filled product design document
   - Test with sample queries

2. **Create Query Interface** (1-2 hours)
   - Simple CLI or web interface
   - Example queries:
     - "What are the three product tiers and their premium ranges?"
     - "What coverage does the Standard tier include?"
     - "What are the Year 3 financial projections?"

3. **Use Cases to Test**:
   - Requirements extraction for development
   - Pricing questions for sales team
   - Compliance checklist generation
   - Feature comparison queries
## Quick Decision Matrix

| Use Case | RAG Needed? | Alternative |
|----------|-------------|-------------|
| Find specific section | ❌ No | Use table of contents |
| Answer "What's the premium for X?" | ✅ Yes | Manual search |
| Extract all requirements | ✅ Yes | Manual extraction |
| Compare product tiers | ✅ Yes | Manual comparison |
| Generate compliance checklist | ✅ Yes | Manual review |
| Simple fact lookup | ⚠️ Maybe | Grep/search |
## Recommendation

**Start with Option 1**: Extend your existing RAG to include the product design document. It's low effort, leverages existing infrastructure, and gives you the ability to query the spec as you develop the product.

Would you like me to:
1. Extend `modal-rag.py` to support the product design document?
2. Create a simple query interface?
3. Both?
{scripts → src}/__init__.py
RENAMED
File without changes
{docs → src/data}/clean_sample.py
RENAMED
File without changes
{scripts → src}/data/cleanup_data.py
RENAMED
File without changes
{scripts → src}/data/clear_census_volume.py
RENAMED
File without changes
{scripts → src}/data/convert_census_to_csv.py
RENAMED
File without changes
{scripts → src}/data/convert_economy_labor_to_csv.py
RENAMED
File without changes
{scripts → src}/data/convert_to_word.py
RENAMED
File without changes
{scripts → src}/data/create_custom_qa.py
RENAMED
File without changes
{docs → src/data}/debug_parser.py
RENAMED
File without changes
{scripts → src}/data/delete_census_csvs.py
RENAMED
File without changes
{scripts → src}/data/download_census_api.py
RENAMED
File without changes
{scripts → src}/data/download_census_csv_modal.py
RENAMED
File without changes
{scripts → src}/data/download_census_data.py
RENAMED
File without changes
{scripts → src}/data/download_census_modal.py
RENAMED
File without changes
{scripts → src}/data/download_economy_labor_modal.py
RENAMED
File without changes
{scripts → src}/data/fix_csv_filenames.py
RENAMED
File without changes
{scripts → src}/data/prepare_economy_data.py
RENAMED
File without changes
{scripts → src}/data/prepare_finetune_data.py
RENAMED
File without changes
{scripts → src}/data/remove_duplicate_csvs.py
RENAMED
File without changes