Veeru-c committed on
Commit
06bd253
·
1 Parent(s): a428e27

initial commit

This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50)
  1. MIGRATION_GUIDE.md +0 -81
  2. NEXT_STEPS.md +0 -174
  3. diagrams/1-indexing-flow.mmd +0 -28
  4. diagrams/1-indexing-flow.svg +99 -1
  5. diagrams/2-query-flow-medium.mmd +0 -25
  6. diagrams/2-query-flow-medium.svg +106 -1
  7. diagrams/2-query-flow-simple.mmd +0 -19
  8. diagrams/2-query-flow-simple.svg +105 -1
  9. diagrams/2-query-flow.mmd +0 -39
  10. diagrams/2-query-flow.svg +138 -1
  11. diagrams/3-web-endpoint-flow.mmd +0 -26
  12. diagrams/3-web-endpoint-flow.svg +94 -1
  13. diagrams/4-container-lifecycle.mmd +0 -31
  14. diagrams/4-container-lifecycle.svg +118 -1
  15. diagrams/finetuning.svg +101 -166
  16. docs/NEXT_STEPS.md +181 -0
  17. QUICK_START.md → docs/QUICK_START.md +0 -0
  18. docs/QUICK_START_API.md +75 -0
  19. docs/README.md +58 -0
  20. README_RAG.md → docs/README_RAG.md +0 -0
  21. STRUCTURE.md → docs/STRUCTURE.md +0 -0
  22. TESTING.md → docs/TESTING.md +0 -0
  23. VLLM_MIGRATION.md → docs/VLLM_MIGRATION.md +0 -0
  24. docs/api/RAG_API.md +244 -0
  25. docs/deployment/ADD_GUIDES_TO_RAG.md +0 -146
  26. docs/guides/HOW_TO_RUN.md +0 -215
  27. docs/guides/SETUP_SUCCESS.md +0 -63
  28. docs/guides/SUMMARY.md +0 -114
  29. docs/guides/modal-rag-optimization.md +0 -370
  30. docs/guides/modal-rag-sequence.md +0 -168
  31. docs/guides/next_steps_rag_recommendation.md +0 -77
  32. {scripts → src}/__init__.py +0 -0
  33. {docs → src/data}/clean_sample.py +0 -0
  34. {scripts → src}/data/cleanup_data.py +0 -0
  35. {scripts → src}/data/clear_census_volume.py +0 -0
  36. {scripts → src}/data/convert_census_to_csv.py +0 -0
  37. {scripts → src}/data/convert_economy_labor_to_csv.py +0 -0
  38. {scripts → src}/data/convert_to_word.py +0 -0
  39. {scripts → src}/data/create_custom_qa.py +0 -0
  40. {docs → src/data}/debug_parser.py +0 -0
  41. {scripts → src}/data/delete_census_csvs.py +0 -0
  42. {scripts → src}/data/download_census_api.py +0 -0
  43. {scripts → src}/data/download_census_csv_modal.py +0 -0
  44. {scripts → src}/data/download_census_data.py +0 -0
  45. {scripts → src}/data/download_census_modal.py +0 -0
  46. {scripts → src}/data/download_economy_labor_modal.py +0 -0
  47. {scripts → src}/data/fix_csv_filenames.py +0 -0
  48. {scripts → src}/data/prepare_economy_data.py +0 -0
  49. {scripts → src}/data/prepare_finetune_data.py +0 -0
  50. {scripts → src}/data/remove_duplicate_csvs.py +0 -0
MIGRATION_GUIDE.md DELETED
@@ -1,81 +0,0 @@
- # Repository Restructure Migration Guide
-
- ## What Changed
-
- The repository has been reorganized for better structure and maintainability.
-
- ## File Moves
-
- ### RAG System
- `src/modal-rag.py` → `src/rag/modal-rag.py`
- `src/modal-rag-product-design.py` → `src/rag/modal-rag-product-design.py`
-
- ### Web Application
- `web_app.py` → `src/web/web_app.py`
- `query_product_design.py` → `src/web/query_product_design.py`
- `templates/` → `src/web/templates/`
- `static/` → `src/web/static/`
-
- ### Scripts
- Data processing scripts → `scripts/data/`
- Setup scripts → `scripts/setup/`
- Utility scripts → `scripts/tools/`
-
- ### Documentation
- All `.md` files → `docs/guides/`
- Product design docs → `docs/product-design/`
-
- ### Tests
- `test_*.py` → `tests/`
-
- ## Updated Commands
-
- ### Old Commands (No longer work)
- ```bash
- python web_app.py
- modal run src/modal-rag-product-design.py::query_product_design
- ```
-
- ### New Commands
- ```bash
- # Web app
- python src/web/web_app.py
- # Or use helper script
- ./scripts/setup/start_web.sh
-
- # Modal RAG
- modal run src/rag/modal-rag-product-design.py::query_product_design --question "your question"
-
- # Indexing
- modal run src/rag/modal-rag-product-design.py::index_product_design
- ```
-
- ## Import Path Updates
-
- If you have custom scripts that import from these modules, update the imports:
-
- ```python
- # Old
- from query_product_design import query_rag
-
- # New
- import sys
- sys.path.insert(0, 'src/web')
- from query_product_design import query_rag
- ```
-
- ## Next Steps
-
- 1. Update any custom scripts with new import paths
- 2. Update CI/CD pipelines if applicable
- 3. Update documentation references
- 4. Test all functionality
-
- ## Rollback
-
- If you need to roll back, all files remain in git history. You can:
- ```bash
- git log --oneline --all -- "old/path/to/file"
- git checkout <commit-hash> -- "old/path/to/file"
- ```
-
NEXT_STEPS.md DELETED
@@ -1,174 +0,0 @@
- # Next Steps
-
- ## Current Status
-
- ✅ **Completed:**
- - Repository restructured and organized
- - RAG system configured (Word, PDF, Excel only - no markdown)
- - Web interface functional
- - Nebius deployment guide created
- - Documentation updated
-
- ## Immediate Next Steps
-
- ### 1. Test the Updated RAG System
-
- **Upload Product Design Documents:**
- ```bash
- # Upload Word document (if you have it)
- modal volume put mcp-hack-ins-products \
-   docs/product-design/tokyo_auto_insurance_product_design.docx \
-   docs/product-design/tokyo_auto_insurance_product_design.docx
-
- # Upload PDF (if you have one)
- modal volume put mcp-hack-ins-products \
-   docs/product-design/tokyo_auto_insurance_product_design.pdf \
-   docs/product-design/tokyo_auto_insurance_product_design.pdf
-
- # Upload Excel (if you have one)
- modal volume put mcp-hack-ins-products \
-   docs/product-design/tokyo_auto_insurance_product_design.xlsx \
-   docs/product-design/tokyo_auto_insurance_product_design.xlsx
- ```
-
- **Re-index Documents:**
- ```bash
- # Using CLI
- python src/web/query_product_design.py --index
-
- # Or direct Modal command
- modal run src/rag/modal-rag-product-design.py::index_product_design
- ```
-
- **Test Queries:**
- ```bash
- # Test via CLI
- python src/web/query_product_design.py --query "What are the three product tiers?"
-
- # Or start web interface
- python src/web/web_app.py
- # Then open http://127.0.0.1:5000 in browser
- ```
-
- ### 2. Verify File Processing
-
- Check that the system correctly:
- - ✅ Loads Word documents
- - ✅ Loads PDF documents (if uploaded)
- - ✅ Loads Excel files (if uploaded)
- - ❌ Ignores markdown files
- - ❌ Ignores other file types
-
- ### 3. Production Readiness
-
- **Option A: Continue with Modal (Current Setup)**
- - ✅ Already working
- - ✅ No changes needed
- - Just ensure documents are uploaded and indexed
-
- **Option B: Deploy to Nebius**
- - Review: `docs/deployment/NEBIUS_DEPLOYMENT.md`
- - Set up Nebius account
- - Deploy RAG service and web app
- - Migrate from Modal to Nebius
-
- ## Recommended Path Forward
-
- ### Short Term (This Week)
- 1. **Upload and index documents**
-    - Ensure Word/PDF/Excel files are in Modal volume
-    - Run indexing
-    - Test queries
-
- 2. **Validate RAG quality**
-    - Ask various product questions
-    - Verify answer quality and accuracy
-    - Check source citations
-
- 3. **Test web interface**
-    - Start web app
-    - Test from browser
-    - Verify all features work
-
- ### Medium Term (Next 2 Weeks)
- 1. **Optimize RAG performance**
-    - Monitor query times
-    - Adjust chunk sizes if needed
-    - Fine-tune retrieval parameters
-
- 2. **Add more documents** (if needed)
-    - Upload additional product design files
-    - Re-index as needed
-
- 3. **User testing**
-    - Share with team/stakeholders
-    - Gather feedback
-    - Iterate on improvements
-
- ### Long Term (Next Month)
- 1. **Deploy to production**
-    - Choose: Modal or Nebius
-    - Set up monitoring
-    - Configure auto-scaling (if needed)
-
- 2. **Enhance features**
-    - Add authentication (if needed)
-    - Add query history
-    - Add export functionality
-    - Add analytics
-
- 3. **Scale and optimize**
-    - Monitor costs
-    - Optimize for performance
-    - Add caching if needed
-
- ## Quick Commands Reference
-
- ```bash
- # Index documents
- python src/web/query_product_design.py --index
-
- # Query via CLI
- python src/web/query_product_design.py --query "your question"
-
- # Start web interface
- python src/web/web_app.py
- # Or use helper script:
- ./scripts/setup/start_web.sh
-
- # Check Modal volume contents
- modal volume list mcp-hack-ins-products
- ```
-
- ## Decision Points
-
- 1. **Deployment Platform:**
-    - [ ] Stay with Modal (current)
-    - [ ] Migrate to Nebius
-    - [ ] Use both (hybrid)
-
- 2. **Document Management:**
-    - [ ] Keep documents in Modal volume
-    - [ ] Move to object storage (S3, etc.)
-    - [ ] Use version control
-
- 3. **Access Control:**
-    - [ ] Public access (current)
-    - [ ] Add authentication
-    - [ ] Add role-based access
-
- ## Questions to Consider
-
- - Do you have Word/PDF/Excel versions of your product design documents?
- - Do you need to convert markdown files to Word/PDF format?
- - Are you ready to deploy to production?
- - Do you need authentication/access control?
- - What's your target user base?
-
- ## Getting Help
-
- - **Documentation:** See `docs/` directory
- - **Troubleshooting:** See `docs/guides/TROUBLESHOOTING.md`
- - **Deployment:** See `docs/deployment/NEBIUS_DEPLOYMENT.md`
- - **Quick Start:** See `QUICK_START.md`
-
diagrams/1-indexing-flow.mmd DELETED
@@ -1,28 +0,0 @@
- sequenceDiagram
- participant User
- participant Modal
- participant CreateVectorDB as create_vector_db()
- participant PDFLoader
- participant TextSplitter
- participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
- participant ChromaDB as Remote ChromaDB
-
- User->>Modal: modal run modal-rag.py::index
- Modal->>CreateVectorDB: Execute function
-
- CreateVectorDB->>PDFLoader: Load PDFs from /insurance-data
- PDFLoader-->>CreateVectorDB: Return documents
-
- CreateVectorDB->>TextSplitter: Split documents (chunk_size=1000)
- TextSplitter-->>CreateVectorDB: Return chunks
-
- CreateVectorDB->>Embeddings: Initialize (device='cuda')
- CreateVectorDB->>Embeddings: Generate embeddings for chunks
- Embeddings-->>CreateVectorDB: Return embeddings
-
- CreateVectorDB->>ChromaDB: Connect to remote service
- CreateVectorDB->>ChromaDB: Upsert chunks + embeddings
- ChromaDB-->>CreateVectorDB: Confirm storage
-
- CreateVectorDB-->>Modal: Complete
- Modal-->>User: Success message
diagrams/1-indexing-flow.svg CHANGED
diagrams/2-query-flow-medium.mmd DELETED
@@ -1,25 +0,0 @@
- sequenceDiagram
- participant User
- participant Modal
- participant RAGModel
- participant Embeddings
- participant ChromaDB
- participant LLM
-
- User->>Modal: modal run query --question "..."
-
- Note over Modal,RAGModel: Container Startup (if cold)
- Modal->>RAGModel: Initialize
- RAGModel->>Embeddings: Load embedding model (GPU)
- RAGModel->>LLM: Load Mistral-7B (GPU)
-
- Note over Modal,LLM: Query Processing
- Modal->>RAGModel: Process question
- RAGModel->>Embeddings: Convert question to vector
- RAGModel->>ChromaDB: Search similar documents
- ChromaDB-->>RAGModel: Top 3 matching docs
-
- RAGModel->>LLM: Generate answer + context
- LLM-->>RAGModel: Answer
-
- RAGModel-->>User: Display answer + sources
diagrams/2-query-flow-medium.svg CHANGED
diagrams/2-query-flow-simple.mmd DELETED
@@ -1,19 +0,0 @@
- sequenceDiagram
- participant User
- participant Modal
- participant RAGModel
- participant ChromaDB
- participant LLM as Mistral-7B
-
- User->>Modal: Ask question
- Modal->>RAGModel: Initialize (warm container)
-
- Note over RAGModel: Load models on GPU
-
- RAGModel->>ChromaDB: Search for relevant docs
- ChromaDB-->>RAGModel: Return top 3 documents
-
- RAGModel->>LLM: Generate answer with context
- LLM-->>RAGModel: Generated answer
-
- RAGModel-->>User: Answer + Sources
diagrams/2-query-flow-simple.svg CHANGED
diagrams/2-query-flow.mmd DELETED
@@ -1,39 +0,0 @@
- sequenceDiagram
- participant User
- participant Modal
- participant QueryEntrypoint as query()
- participant RAGModel
- participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
- participant ChromaRetriever as RemoteChromaRetriever
- participant ChromaDB as Remote ChromaDB
- participant LLM as Mistral-7B<br/>(A10G GPU)
- participant RAGChain as LangChain RAG
-
- User->>Modal: modal run modal-rag.py::query --question "..."
- Modal->>QueryEntrypoint: Execute local entrypoint
- QueryEntrypoint->>RAGModel: Instantiate RAGModel()
-
- Note over RAGModel: @modal.enter() lifecycle
- RAGModel->>Embeddings: Load embedding model (CUDA)
- RAGModel->>ChromaDB: Connect to remote service
- RAGModel->>LLM: Load Mistral-7B (A10G GPU)
- RAGModel->>RAGModel: Initialize RemoteChromaRetriever
-
- QueryEntrypoint->>RAGModel: query.remote(question)
-
- RAGModel->>ChromaRetriever: Create retriever instance
- RAGModel->>RAGChain: Build RAG chain
-
- RAGChain->>ChromaRetriever: Retrieve relevant docs
- ChromaRetriever->>Embeddings: embed_query(question)
- Embeddings-->>ChromaRetriever: Query embedding
- ChromaRetriever->>ChromaDB: query(embedding, k=3)
- ChromaDB-->>ChromaRetriever: Top-k documents
- ChromaRetriever-->>RAGChain: Return documents
-
- RAGChain->>LLM: Generate answer with context
- LLM-->>RAGChain: Generated answer
- RAGChain-->>RAGModel: Return result
-
- RAGModel-->>QueryEntrypoint: Return {answer, sources}
- QueryEntrypoint-->>User: Display answer + sources
diagrams/2-query-flow.svg CHANGED
diagrams/3-web-endpoint-flow.mmd DELETED
@@ -1,26 +0,0 @@
- sequenceDiagram
- participant User
- participant Browser
- participant Modal as Modal Platform
- participant WebEndpoint as RAGModel.web_query
- participant QueryMethod as RAGModel.query
- participant RAGChain
- participant ChromaDB
- participant LLM
-
- User->>Browser: GET https://.../web_query?question=...
- Browser->>Modal: HTTP GET request
- Modal->>WebEndpoint: Route to @modal.fastapi_endpoint
-
- WebEndpoint->>QueryMethod: Call query.local(question)
-
- Note over QueryMethod,LLM: Same flow as Query diagram
- QueryMethod->>RAGChain: Build chain
- RAGChain->>ChromaDB: Retrieve docs
- RAGChain->>LLM: Generate answer
- LLM-->>QueryMethod: Return result
-
- QueryMethod-->>WebEndpoint: Return {answer, sources}
- WebEndpoint-->>Modal: JSON response
- Modal-->>Browser: HTTP 200 + JSON
- Browser-->>User: Display result
diagrams/3-web-endpoint-flow.svg CHANGED
diagrams/4-container-lifecycle.mmd DELETED
@@ -1,31 +0,0 @@
- sequenceDiagram
- participant Modal
- participant Container
- participant RAGModel
- participant GPU as A10G GPU
- participant Volume as Modal Volume
- participant ChromaDB
-
- Modal->>Container: Start container (min_containers=1)
- Container->>GPU: Allocate GPU
- Container->>Volume: Mount /insurance-data
-
- Container->>RAGModel: Call @modal.enter()
-
- Note over RAGModel: Initialization phase
- RAGModel->>RAGModel: Load HuggingFaceEmbeddings (CUDA)
- RAGModel->>ChromaDB: Connect to remote service
- RAGModel->>RAGModel: Load Mistral-7B (GPU)
- RAGModel->>RAGModel: Create RemoteChromaRetriever class
-
- RAGModel-->>Container: Ready
- Container-->>Modal: Container warm and ready
-
- Note over Modal,Container: Container stays warm (min_containers=1)
-
- loop Handle requests
-   Modal->>RAGModel: Invoke query() method
-   RAGModel-->>Modal: Return result
- end
-
- Note over Modal,Container: Container persists until scaled down
diagrams/4-container-lifecycle.svg CHANGED
diagrams/finetuning.svg CHANGED
docs/NEXT_STEPS.md ADDED
@@ -0,0 +1,181 @@
+ # Next Steps & Roadmap
+
+ ## ✅ Current Status
+
+ **Completed:**
+ - Fine-tuning pipeline with vLLM optimization
+ - RAG system with local ChromaDB
+ - High-performance inference (<3s latency)
+ - Model merging for production deployment
+ - Comprehensive documentation
+
+ ## 🎯 Immediate Next Steps
+
+ ### 1. Test Fine-Tuned Model Performance
+
+ ```bash
+ # Test the vLLM-optimized endpoint
+ curl -X POST https://mcp-hack--phi3-inference-vllm-model-ask.modal.run \
+   -H "Content-Type: application/json" \
+   -d '{"question": "What is the population of Tokyo?", "context": "Japan Census data"}'
+ ```
+
+ ### 2. Test RAG System
+
+ ```bash
+ # Test the RAG endpoint
+ curl -X POST https://mcp-hack--rag-vllm-optimized-ragmodel-query.modal.run \
+   -H "Content-Type: application/json" \
+   -d '{"question": "What insurance products are available?"}'
+ ```
+
+ ### 3. Monitor Performance
+
+ - Check latency metrics in responses
+ - Verify <3s response times
+ - Monitor GPU utilization on the Modal dashboard
+
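The latency checks above can be scripted with the standard library alone. This is a sketch, not project code: `rate_latency` and `timed_query` are hypothetical helper names, and the thresholds simply mirror the <2s/<3s targets stated in this roadmap.

```python
import json
import time
import urllib.request

def rate_latency(seconds: float) -> str:
    """Map a measured latency onto this roadmap's targets (hypothetical helper)."""
    if seconds < 2.0:
        return "excellent"
    if seconds <= 3.0:
        return "good"
    return "needs optimization"

def timed_query(url: str, payload: dict) -> tuple[dict, float]:
    """POST a JSON question to an endpoint and return (parsed body, wall-clock seconds)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body, time.perf_counter() - start
```

Pointing `timed_query` at the RAG endpoint above and printing `rate_latency(elapsed)` gives a quick pass/fail reading against the <3s goal.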
+ ## 🚀 Short Term (This Week)
+
+ ### Fine-Tuning Improvements
+ - [ ] Run evaluation script to assess model quality
+ - [ ] Collect more training data if needed
+ - [ ] Experiment with different LoRA parameters
+ - [ ] Test on diverse queries
+
+ ### RAG Enhancements
+ - [ ] Add more insurance documents to volume
+ - [ ] Re-index with updated documents
+ - [ ] Test retrieval quality
+ - [ ] Optimize chunk sizes if needed
+
+ ### Documentation
+ - [ ] Add API usage examples
+ - [ ] Create deployment guide
+ - [ ] Document troubleshooting steps
+
+ ## 📊 Medium Term (Next 2 Weeks)
+
+ ### Model Optimization
+ 1. **Fine-tuning iterations**
+    - Analyze evaluation results
+    - Adjust training parameters
+    - Re-train if needed
+
+ 2. **RAG improvements**
+    - Experiment with different embedding models
+    - Optimize retrieval parameters (top-k, similarity threshold)
+    - Add query rewriting
+
+ 3. **Performance monitoring**
+    - Set up logging
+    - Track latency trends
+    - Monitor costs
+
+ ### Feature Additions
+ - [ ] Add streaming responses
+ - [ ] Implement caching layer
+ - [ ] Add query history
+ - [ ] Create admin dashboard
+
+ ## 🎨 Long Term (Next Month)
+
+ ### Production Readiness
+ 1. **Deployment**
+    - Set up CI/CD pipeline
+    - Configure monitoring and alerts
+    - Implement rate limiting
+    - Add authentication if needed
+
+ 2. **Scaling**
+    - Optimize container scaling
+    - Implement load balancing
+    - Add caching (Redis)
+    - Set up CDN for static assets
+
+ 3. **Advanced Features**
+    - Multi-modal support (images, tables)
+    - Batch processing
+    - A/B testing framework
+    - Analytics dashboard
+
+ ## 🔧 Technical Debt
+
+ - [ ] Remove `bkp/` directory (old backup files)
+ - [ ] Clean up unused dependencies
+ - [ ] Add comprehensive tests
+ - [ ] Improve error handling
+ - [ ] Add input validation
+
+ ## 📈 Metrics to Track
+
+ **Performance:**
+ - Inference latency (target: <3s)
+ - Retrieval accuracy
+ - GPU utilization
+ - Cost per query
+
+ **Quality:**
+ - Model accuracy on evaluation set
+ - RAG relevance scores
+ - User satisfaction (if applicable)
+
+ ## 🤔 Decision Points
+
+ 1. **Model Selection:**
+    - [ ] Continue with Phi-3-mini
+    - [ ] Experiment with larger models
+    - [ ] Try different base models
+
+ 2. **Infrastructure:**
+    - [ ] Stay with Modal (current)
+    - [ ] Migrate to another platform
+    - [ ] Self-hosted deployment
+
+ 3. **Data Strategy:**
+    - [ ] Expand training dataset
+    - [ ] Add domain-specific data
+    - [ ] Implement data versioning
+
+ ## 📚 Quick Reference
+
+ ### Key Commands
+ ```bash
+ # Fine-tuning
+ ./venv/bin/modal run src/finetune/finetune_modal.py
+
+ # Model merging
+ ./venv/bin/modal run src/finetune/merge_model.py
+
+ # Deploy vLLM endpoint (fine-tuned)
+ ./venv/bin/modal deploy src/finetune/api_endpoint_vllm.py
+
+ # Deploy RAG endpoint
+ ./venv/bin/modal deploy src/rag/rag_vllm.py
+
+ # Evaluation
+ ./venv/bin/modal run src/finetune/eval_finetuned.py
+ ```
+
+ ### Documentation
+ - **Main Guide:** `docs/HOW_TO_RUN.md`
+ - **Architecture:** `diagrams/` folder
+ - **Testing:** `docs/TESTING.md`
+ - **Agent Design:** `docs/agentdesign.md`
+
+ ## 🎯 Success Criteria
+
+ **Phase 1 (Current):**
+ - ✅ <3s inference latency
+ - ✅ vLLM optimization working
+ - ✅ RAG retrieval functional
+
+ **Phase 2 (Next):**
+ - [ ] >90% accuracy on evaluation set
+ - [ ] <2s average latency
+ - [ ] Production deployment complete
+
+ **Phase 3 (Future):**
+ - [ ] Multi-user support
+ - [ ] Advanced analytics
+ - [ ] Cost optimization (<$X per 1K queries)
QUICK_START.md → docs/QUICK_START.md RENAMED
File without changes
docs/QUICK_START_API.md ADDED
@@ -0,0 +1,75 @@
+ # Quick Start: RAG API
+
+ A fast API endpoint for querying product design documents with <3-second response times.
+
+ ## Deploy the API
+
+ ```bash
+ # Deploy to Modal
+ modal deploy src/rag/rag_api.py
+
+ # Get the API URL
+ modal app show insurance-rag-api
+ ```
+
+ ## Use the API
+
+ ### Python Client
+
+ ```python
+ from src.rag.api_client import RAGAPIClient
+
+ # Initialize client
+ client = RAGAPIClient(base_url="https://your-api-url.modal.run")
+
+ # Query
+ result = client.query("What are the three product tiers?")
+ print(result['answer'])
+ print(f"Response time: {result['total_time']:.2f}s")
+ ```
+
+ ### cURL
+
+ ```bash
+ curl -X POST https://your-api-url.modal.run/query \
+   -H "Content-Type: application/json" \
+   -d '{"question": "What are the three product tiers?"}'
+ ```
+
+ ### JavaScript
+
+ ```javascript
+ const response = await fetch('https://your-api-url.modal.run/query', {
+   method: 'POST',
+   headers: { 'Content-Type': 'application/json' },
+   body: JSON.stringify({ question: 'What are the three product tiers?' })
+ });
+
+ const data = await response.json();
+ console.log(data.answer);
+ ```
+
+ ## Test Performance
+
+ ```bash
+ # Test with default URL
+ python tests/test_api.py
+
+ # Test with custom URL
+ python tests/test_api.py --url https://your-api-url.modal.run
+ ```
+
+ ## Performance Target
+
+ - **Target**: <3 seconds per query
+ - **Typical**: 1.5-2.5 seconds
+ - **Optimizations**: Warm containers, reduced tokens, limited context
+
+ ## API Endpoints
+
+ - `GET /health` - Health check
+ - `POST /query` - Query the RAG system
+ - `GET /` - API information
+
+ See `docs/api/RAG_API.md` for full documentation.
+
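Where the project client above isn't installed, a stdlib-only call against the same `/query` endpoint is enough for a one-off check. This is a sketch under assumptions: `ask` and `query_url` are illustrative names, and the base URL is a placeholder for your deployed endpoint.

```python
import json
import urllib.request

def query_url(base_url: str) -> str:
    """Normalize the base URL and append the /query route."""
    return base_url.rstrip("/") + "/query"

def ask(base_url: str, question: str) -> dict:
    """POST a question to the /query endpoint and return the parsed JSON response."""
    req = urllib.request.Request(
        query_url(base_url),
        data=json.dumps({"question": question}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

The returned dict carries the same fields the client example reads, e.g. `result['answer']` and `result['total_time']`.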
docs/README.md ADDED
@@ -0,0 +1,58 @@
+ # Documentation Index
+
+ This directory contains all project documentation.
+
+ ## 📚 Main Guides
+
+ ### Getting Started
+ - **[HOW_TO_RUN.md](HOW_TO_RUN.md)** - Complete guide to running the fine-tuning pipeline
+ - **[QUICK_START.md](QUICK_START.md)** - Quick start guide for the project
+ - **[QUICK_START_API.md](QUICK_START_API.md)** - API quick start guide
+
+ ### Fine-Tuning
+ - **[finetune/](../finetune/)** - Fine-tuning documentation and guides
+   - Data preparation
+   - Dataset generation
+   - Model training
+   - Evaluation
+
+ ### RAG System
+ - **[README_RAG.md](README_RAG.md)** - RAG system overview
+ - **[guides/QUICK_START_RAG.md](guides/QUICK_START_RAG.md)** - RAG quick start
+ - **[guides/RAG_SETUP_COMPLETE.md](guides/RAG_SETUP_COMPLETE.md)** - Complete RAG setup guide
+ - **[api/RAG_API.md](api/RAG_API.md)** - RAG API documentation
+
+ ### Deployment
+ - **[deployment/](deployment/)** - Deployment guides
+   - **[README.md](deployment/README.md)** - Deployment overview
+   - **[NEBIUS_DEPLOYMENT.md](deployment/NEBIUS_DEPLOYMENT.md)** - Nebius deployment guide
+
+ ### Reference
+ - **[STRUCTURE.md](STRUCTURE.md)** - Project structure overview
+ - **[TESTING.md](TESTING.md)** - Testing guide
+ - **[MIGRATION_GUIDE.md](MIGRATION_GUIDE.md)** - Migration guide
+ - **[VLLM_MIGRATION.md](VLLM_MIGRATION.md)** - vLLM migration guide
+ - **[NEXT_STEPS.md](NEXT_STEPS.md)** - Next steps and roadmap
+
+ ### Agent Design
+ - **[agentdesign.md](agentdesign.md)** - AI agent design for automated development workflow
+
+ ### Product Design
+ - **[product-design/](product-design/)** - Product design guides and examples
+   - Product decision guide
+   - RAG setup for product design
+   - Example: Tokyo auto insurance product design
+
+ ## 🔧 Additional Resources
+
+ ### Data Sources
+ - **[guides/estat_api_guide.md](guides/estat_api_guide.md)** - e-Stat API guide
+ - **[guides/source_data.md](guides/source_data.md)** - Data source documentation
+ - **[guides/ft_process.md](guides/ft_process.md)** - Fine-tuning process details
+
+ ### Troubleshooting
+ - **[guides/TROUBLESHOOTING.md](guides/TROUBLESHOOTING.md)** - General troubleshooting
+ - **[guides/WEB_TROUBLESHOOTING.md](guides/WEB_TROUBLESHOOTING.md)** - Web interface troubleshooting
+
+ ### Web Interface
+ - **[guides/WEB_INTERFACE.md](guides/WEB_INTERFACE.md)** - Web interface documentation
README_RAG.md → docs/README_RAG.md RENAMED
File without changes
STRUCTURE.md → docs/STRUCTURE.md RENAMED
File without changes
TESTING.md → docs/TESTING.md RENAMED
File without changes
VLLM_MIGRATION.md → docs/VLLM_MIGRATION.md RENAMED
File without changes
docs/api/RAG_API.md ADDED
@@ -0,0 +1,244 @@
1
+ # RAG API Documentation
2
+
3
+ Fast API endpoint for querying the product design RAG system with <3 second response times.
4
+
5
+ ## Quick Start
6
+
7
+ ### Deploy the API
8
+
9
+ ```bash
10
+ # Deploy to Modal
11
+ modal deploy src/rag/rag_api.py
12
+
13
+ # Get the URL
14
+ modal app list
15
+ ```
16
+
17
+ ### Use the API
18
+
19
+ ```python
20
+ from src.rag.api_client import RAGAPIClient
21
+
22
+ client = RAGAPIClient(base_url="https://your-modal-url.modal.run")
23
+ result = client.query("What are the three product tiers?")
24
+ print(result['answer'])
25
+ ```
26
+
27
+ ## API Endpoints
28
+
29
+ ### Health Check
30
+
31
+ ```http
32
+ GET /health
33
+ ```
34
+
35
+ **Response:**
36
+ ```json
37
+ {
38
+ "status": "healthy",
39
+ "service": "rag-api"
40
+ }
41
+ ```
42
+
43
+ ### Query
44
+
45
+ ```http
46
+ POST /query
47
+ Content-Type: application/json
48
+
49
+ {
50
+ "question": "What are the three product tiers?",
51
+ "top_k": 5,
52
+ "max_tokens": 1024
53
+ }
54
+ ```
55
+
56
+ **Response:**
57
+ ```json
58
+ {
59
+ "answer": "The three product tiers are...",
60
+ "retrieval_time": 0.45,
61
+ "generation_time": 1.23,
62
+ "total_time": 1.68,
63
+ "sources": [
64
+ {
65
+ "content": "...",
66
+ "metadata": {...}
67
+ }
68
+ ],
69
+ "success": true
70
+ }
71
+ ```
72
+
73
+ ## Performance Optimization
74
+
75
+ ### Target: <3 Second Responses
76
+
77
+ The API is optimized for fast responses:
78
+
79
+ 1. **Warm Containers**: `min_containers=1` keeps a container ready
80
+ 2. **Optimized LLM**: Reduced max_tokens (1024 vs 1536)
81
+ 3. **Limited Context**: Top 3 documents, 800 chars each
82
+ 4. **Prefix Caching**: Enabled for faster generation
83
+ 5. **Concurrent Requests**: Up to 10 concurrent requests
84
+
85
+ ### Response Time Breakdown
86
+
87
+ - **Retrieval**: 0.3-0.8 seconds
88
+ - **Generation**: 1.0-2.0 seconds
89
+ - **Total**: 1.5-3.0 seconds (target: <3s)
90
+
91
+ ## Usage Examples
92
+
93
+ ### Python Client
94
+
95
+ ```python
96
+ from src.rag.api_client import RAGAPIClient
97
+
98
+ # Initialize
99
+ client = RAGAPIClient(base_url="https://your-api-url.modal.run")
100
+
101
+ # Health check
102
+ health = client.health_check()
103
+ print(health)
104
+
105
+ # Query
106
+ result = client.query("What are the premium ranges?")
107
+ print(result['answer'])
108
+
109
+ # Fast query (optimized for speed)
110
+ result = client.query_fast("What are the three tiers?")
111
+ print(result['answer'])
112
+ ```
113
+
114
+ ### cURL
115
+
116
+ ```bash
117
+ # Health check
118
+ curl https://your-api-url.modal.run/health
119
+
120
+ # Query
121
+ curl -X POST https://your-api-url.modal.run/query \
122
+ -H "Content-Type: application/json" \
123
+ -d '{
124
+ "question": "What are the three product tiers?",
125
+ "top_k": 5,
126
+ "max_tokens": 1024
127
+ }'
128
+ ```
129
+
130
+ ### JavaScript/TypeScript
131
+
132
+ ```javascript
133
+ const response = await fetch('https://your-api-url.modal.run/query', {
134
+ method: 'POST',
135
+ headers: {
136
+ 'Content-Type': 'application/json',
137
+ },
138
+ body: JSON.stringify({
139
+ question: 'What are the three product tiers?',
140
+ top_k: 5,
141
+ max_tokens: 1024
142
+ })
143
+ });
144
+
145
+ const data = await response.json();
146
+ console.log(data.answer);
147
+ ```
148
+
149
+ ## Configuration
150
+
151
+ ### Environment Variables
152
+
153
+ - `MODAL_APP_NAME`: App name (default: "insurance-rag-api")
154
+ - `MODAL_VOLUME_NAME`: Volume name (default: "mcp-hack-ins-products")
155
+
156
+ ### API Parameters
157
+
158
+ - `question` (required): The question to ask
159
+ - `top_k` (optional, default: 5): Number of documents to retrieve
160
+ - `max_tokens` (optional, default: 1024): Maximum response length
161
+
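The defaults above can be applied up front with a small validation helper. A minimal sketch: the function name and the upper bounds (20 and 2048) are assumptions for illustration, not limits enforced by the API.

```python
def validate_query_params(payload: dict) -> dict:
    """Apply the documented defaults and reject malformed requests."""
    question = payload.get("question")
    if not isinstance(question, str) or not question.strip():
        raise ValueError("'question' is required and must be a non-empty string")
    top_k = int(payload.get("top_k", 5))          # documented default: 5
    max_tokens = int(payload.get("max_tokens", 1024))  # documented default: 1024
    if not 1 <= top_k <= 20:                      # assumed bound for this sketch
        raise ValueError("'top_k' must be between 1 and 20")
    if not 1 <= max_tokens <= 2048:               # assumed bound for this sketch
        raise ValueError("'max_tokens' must be between 1 and 2048")
    return {"question": question.strip(), "top_k": top_k, "max_tokens": max_tokens}

print(validate_query_params({"question": "What are the premium ranges?"}))
# → {'question': 'What are the premium ranges?', 'top_k': 5, 'max_tokens': 1024}
```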
162
+ ## Performance Tips
163
+
164
+ 1. **Use Fast Query**: For speed-critical applications, use the `query_fast()` method
165
+ 2. **Reduce top_k**: Lower `top_k` (e.g., 3) for faster retrieval
166
+ 3. **Reduce max_tokens**: Lower `max_tokens` (e.g., 512) for faster generation
167
+ 4. **Cache Results**: Cache common queries client-side
168
+ 5. **Batch Requests**: If possible, batch multiple queries
169
+
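Tip 4 (client-side caching) can be as simple as memoizing the query function. A sketch using the standard library; `fake_query` stands in for the real API call:

```python
from functools import lru_cache

def make_cached_query(query_fn, maxsize=128):
    """Wrap a query function so repeated identical questions skip the network."""
    @lru_cache(maxsize=maxsize)
    def cached(question: str):
        return query_fn(question)
    return cached

calls = []
def fake_query(question):          # stand-in for the real HTTP call
    calls.append(question)
    return f"answer to {question}"

cached_query = make_cached_query(fake_query)
cached_query("What are the three tiers?")
cached_query("What are the three tiers?")  # served from the cache
print(len(calls))  # → 1: the backend was hit only once
```

Note that `lru_cache` requires hashable arguments, so this works for plain question strings but not for dict payloads.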
170
+ ## Error Handling
171
+
172
+ ```python
173
+ result = client.query("your question")
174
+
175
+ if result.get("success"):
176
+ print(result['answer'])
177
+ else:
178
+ print(f"Error: {result.get('error', 'Unknown error')}")
179
+ ```
180
+
181
+ ## Monitoring
182
+
183
+ ### Response Times
184
+
185
+ Monitor the `total_time` field in responses:
186
+ - Under 2s: Excellent
187
+ - 2-3s: Good (target)
188
+ - Over 3s: May need optimization
189
+
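Those thresholds translate directly into a small classifier you can run on each response's `total_time` (a sketch; the function name is illustrative):

```python
def classify_latency(total_time: float) -> str:
    """Bucket a response's total_time per the thresholds above."""
    if total_time < 2.0:
        return "excellent"
    if total_time <= 3.0:
        return "good"
    return "needs optimization"

print(classify_latency(1.4), classify_latency(2.5), classify_latency(4.2))
# → excellent good needs optimization
```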
190
+ ### Health Monitoring
191
+
192
+ ```python
193
+ health = client.health_check()
194
+ if health.get("status") != "healthy":
195
+ # Handle unhealthy state
196
+ pass
197
+ ```
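When waiting for a cold deployment to come up, it can help to poll the health check with exponential backoff rather than failing on the first unhealthy response. A sketch; `wait_until_healthy` and its parameters are illustrative, and `check` is any callable returning the `/health` payload:

```python
import time

def wait_until_healthy(check, attempts=5, base_delay=0.5):
    """Poll `check` with exponential backoff; return True once healthy."""
    for attempt in range(attempts):
        if check().get("status") == "healthy":
            return True
        time.sleep(base_delay * (2 ** attempt))
    return False

# Simulated health endpoint that recovers on the third call
states = iter([{"status": "starting"}, {"status": "starting"}, {"status": "healthy"}])
print(wait_until_healthy(lambda: next(states), base_delay=0.01))  # → True
```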
198
+
199
+ ## Deployment
200
+
201
+ ### Modal Deployment
202
+
203
+ ```bash
204
+ # Deploy
205
+ modal deploy src/rag/rag_api.py
206
+
207
+ # Get URL
208
+ modal app show insurance-rag-api
209
+ ```
210
+
211
+ ### Local Testing
212
+
213
+ ```bash
214
+ # Run locally (for development)
215
+ modal serve src/rag/rag_api.py
216
+ ```
217
+
218
+ ## Rate Limiting
219
+
220
+ The API supports up to 10 concurrent requests. For higher throughput:
221
+ - Deploy multiple instances
222
+ - Use a load balancer
223
+ - Implement client-side rate limiting
224
+
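Client-side rate limiting can be sketched as a sliding-window limiter that blocks until a request slot frees up. The class name and parameters below are illustrative, not part of the API:

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most `max_calls` within any `period`-second window."""
    def __init__(self, max_calls=10, period=1.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # timestamps of recent calls

    def acquire(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.calls and now - self.calls[0] > self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call ages out, then retire it
            time.sleep(self.period - (now - self.calls[0]))
            self.calls.popleft()
        self.calls.append(time.monotonic())

limiter = RateLimiter(max_calls=10, period=1.0)
limiter.acquire()  # call before each request to stay under the limit
```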
225
+ ## Security
226
+
227
+ - Add authentication (e.g., an API key header) before exposing the endpoint publicly
228
+ - Use HTTPS in production
229
+ - Implement rate limiting
230
+ - Validate input questions
231
+
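For the last point, validating questions before forwarding them is cheap. A minimal sketch; the length cap and function name are assumptions for illustration:

```python
MAX_QUESTION_LEN = 500  # assumed cap chosen for this sketch

def sanitize_question(raw: str) -> str:
    """Basic input validation before forwarding a question to the API."""
    if not isinstance(raw, str):
        raise TypeError("question must be a string")
    question = " ".join(raw.split())  # collapse runs of whitespace
    if not question:
        raise ValueError("question must not be empty")
    if len(question) > MAX_QUESTION_LEN:
        raise ValueError(f"question longer than {MAX_QUESTION_LEN} characters")
    return question

print(sanitize_question("  What are   the premium\nranges? "))
# → What are the premium ranges?
```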
232
+ ## Troubleshooting
233
+
234
+ ### Slow Responses (>3s)
235
+ - Check if container is warm (`min_containers=1`)
236
+ - Reduce `max_tokens`
237
+ - Reduce `top_k`
238
+ - Check network latency
239
+
240
+ ### Errors
241
+ - Verify documents are indexed
242
+ - Check Modal app status
243
+ - Review error messages in response
244
+
docs/deployment/ADD_GUIDES_TO_RAG.md DELETED
@@ -1,146 +0,0 @@
1
- # RAG Indexing Configuration
2
-
3
- ## Overview
4
-
5
- The RAG system indexes **only Word, PDF, and Excel files** containing product design information. **All markdown files are excluded** from indexing to keep the RAG focused on structured product documents.
6
-
7
- ## Currently Indexed Files
8
-
9
- The system automatically indexes files that match these patterns:
10
-
11
- 1. **Word Documents (.docx):**
12
- - Files with `tokyo_auto_insurance` or `product_design` in the filename
13
- - Example: `tokyo_auto_insurance_product_design.docx`
14
-
15
- 2. **PDF Documents (.pdf):**
16
- - Files with `tokyo_auto_insurance` or `product_design` in the filename
17
- - Example: `tokyo_auto_insurance_product_design.pdf`
18
-
19
- 3. **Excel Spreadsheets (.xlsx, .xls):**
20
- - Files with `tokyo_auto_insurance` or `product_design` in the filename
21
- - Example: `tokyo_auto_insurance_product_design.xlsx`
22
-
23
- ## Excluded Files
24
-
25
- The following files are **NOT indexed**:
26
-
27
- - ❌ **All markdown files** (`.md`, `.markdown`) - completely excluded
28
- - ❌ Guide files (e.g., `QUICK_START_RAG.md`, `PRODUCT_DECISION_GUIDE.md`)
29
- - ❌ Setup guides (e.g., `setup_product_design_rag.md`)
30
- - ❌ Troubleshooting guides
31
- - ❌ Web interface guides
32
- - ❌ Any other file types (`.txt`, `.csv`, `.json`, etc.)
33
-
34
- ## Files That Will Be Indexed
35
-
36
- Based on the current repository structure:
37
-
38
- βœ… **Will be indexed (if uploaded to Modal volume):**
39
- - `tokyo_auto_insurance_product_design.docx` (Word document)
40
- - `tokyo_auto_insurance_product_design.pdf` (PDF document)
41
- - `tokyo_auto_insurance_product_design.xlsx` (Excel spreadsheet)
42
- - `tokyo_auto_insurance_product_design.xls` (Excel 97-2003)
43
-
44
- ❌ **Will NOT be indexed (all excluded):**
45
- - `tokyo_auto_insurance_product_design.md` (markdown - excluded)
46
- - `tokyo_auto_insurance_product_design_filled.md` (markdown - excluded)
47
- - `QUICK_START_RAG.md` (markdown - excluded)
48
- - `PRODUCT_DECISION_GUIDE.md` (markdown - excluded)
49
- - `setup_product_design_rag.md` (markdown - excluded)
50
- - `TROUBLESHOOTING.md` (markdown - excluded)
51
- - `WEB_INTERFACE.md` (markdown - excluded)
52
- - All other markdown and non-supported file types
53
-
54
- ## How to Add More Product Design Files
55
-
56
- ### Option 1: Use Supported File Formats
57
- Convert your files to one of the supported formats:
58
- - **Word**: `.docx` format
59
- - **PDF**: `.pdf` format
60
- - **Excel**: `.xlsx` or `.xls` format
61
-
62
- **Important:**
63
- - The file must contain `tokyo_auto_insurance` **OR** `product_design` in the filename
64
- - Markdown files (`.md`) are **not supported** and will be ignored
65
-
66
- ### Option 2: Update the Loader
67
- Edit `src/rag/modal-rag-product-design.py` and modify the pattern matching:
68
-
69
- ```python
70
- # Current pattern for PDF files (line ~81):
71
- if 'tokyo_auto_insurance' in file_lower or 'product_design' in file_lower:
72
- pdf_files.append(full_path)
73
-
74
- # To add more patterns, modify to:
75
- if ('tokyo_auto_insurance' in file_lower or
76
- 'product_design' in file_lower or
77
- 'your_custom_pattern' in file_lower):
78
- pdf_files.append(full_path)
79
- ```
80
-
81
- **Note:** All markdown files are intentionally excluded. Only Word, PDF, and Excel files are processed.
82
-
83
- ## Uploading to Modal Volume
84
-
85
- To index product design documents, upload **only Word, PDF, or Excel files** to the Modal volume:
86
-
87
- ```bash
88
- # Upload Word document
89
- modal volume put mcp-hack-ins-products \
90
- docs/product-design/tokyo_auto_insurance_product_design.docx \
91
- docs/product-design/tokyo_auto_insurance_product_design.docx
92
-
93
- # Upload PDF document (if you have one)
94
- modal volume put mcp-hack-ins-products \
95
- docs/product-design/tokyo_auto_insurance_product_design.pdf \
96
- docs/product-design/tokyo_auto_insurance_product_design.pdf
97
-
98
- # Upload Excel spreadsheet (if you have one)
99
- modal volume put mcp-hack-ins-products \
100
- docs/product-design/tokyo_auto_insurance_product_design.xlsx \
101
- docs/product-design/tokyo_auto_insurance_product_design.xlsx
102
- ```
103
-
104
- **Important Notes:**
105
- - ❌ **Do NOT upload markdown files** (`.md`) - they will be ignored
106
- - βœ… Only `.docx`, `.pdf`, `.xlsx`, and `.xls` files are processed
107
- - βœ… Files must contain `tokyo_auto_insurance` or `product_design` in the filename
108
-
109
- ## Re-indexing
110
-
111
- After uploading new files, re-index:
112
-
113
- ```bash
114
- # Using CLI
115
- python src/web/query_product_design.py --index
116
-
117
- # Or direct Modal command
118
- modal run src/rag/modal-rag-product-design.py::index_product_design
119
- ```
120
-
121
- ## Benefits of Current Approach
122
-
123
- By focusing only on Word, PDF, and Excel files:
124
- - βœ… RAG answers are focused on structured product documents
125
- - βœ… No confusion from markdown guide/instruction content
126
- - βœ… Faster retrieval (smaller, more focused document set)
127
- - βœ… More accurate product-related answers from official documents
128
- - βœ… Better handling of tables and structured data (Excel, Word tables)
129
- - βœ… Cleaner source citations
130
- - βœ… Support for professional document formats
131
-
132
- ## Example Queries
133
-
134
- With product design documents indexed, you can ask:
135
-
136
- ```
137
- "What are the three product tiers and their premium ranges?"
138
- "What is the Year 3 premium volume projection?"
139
- "What are the FSA licensing requirements?"
140
- "What coverage does the Standard tier include?"
141
- "What is the target market size in Tokyo?"
142
- "Who are the main competitors?"
143
- ```
144
-
145
- The RAG system will retrieve relevant sections from the product design documents only, ensuring answers are focused on product information.
146
-
 
docs/guides/HOW_TO_RUN.md DELETED
@@ -1,215 +0,0 @@
1
- # How to Run the Fine-Tuning Pipeline
2
-
3
- This guide walks you through the complete pipeline from data generation to model deployment.
4
-
5
- ---
6
-
7
- ## πŸ“Š Dataset Generation Results
8
-
9
- ### Final Statistics
10
- - **Training Samples**: 201,651
11
- - **Validation Samples**: 22,407
12
- - **Total Dataset**: 224,058 high-quality QA pairs
13
- - **Improvement**: 150x more data than previous approach
14
-
15
- ### Batch Performance
16
- | Batch | Files | Data Points | Status |
17
- |-------|-------|-------------|--------|
18
- | 1 | 1,000 | 100,611 | βœ… Excellent |
19
- | 2 | 1,000 | 39,960 | βœ… Good |
20
- | 3 | 1,000 | 0 | ⚠️ Complex files |
21
- | 4 | 1,000 | 600 | ⚠️ Runner issue |
22
- | 5 | 1,000 | 54,627 | βœ… Excellent |
23
- | 6 | 1,000 | 5,400 | βœ… Good |
24
- | 7 | 888 | 22,860 | βœ… Good |
25
-
26
- ---
27
-
28
- ## πŸš€ Step-by-Step Instructions
29
-
30
- ### Step 1: Fine-Tune the Model
31
-
32
- Run the fine-tuning job on Modal with H200 GPU:
33
-
34
- ```bash
35
- cd /Users/veeru/agents/mcp-hack
36
-
37
- # Start fine-tuning in detached mode
38
- ./venv/bin/modal run --detach docs/finetune_modal.py
39
- ```
40
-
41
- **What happens:**
42
- - Loads 201,651 training samples from `finetune-dataset` volume
43
- - Trains Phi-3-mini-4k-instruct with LoRA on H200 GPU
44
- - Runs for ~90-120 minutes
45
- - Saves model to `model-checkpoints` volume
46
-
47
- **Monitor progress:**
48
- ```bash
49
- # View live logs
50
- modal app logs mcp-hack::finetune-phi3-modal
51
- ```
52
-
53
- ---
54
-
55
- ### Step 2: Evaluate the Model
56
-
57
- After training completes, test the model:
58
-
59
- ```bash
60
- ./venv/bin/modal run docs/eval_finetuned.py
61
- ```
62
-
63
- This will run sample questions and show the model's answers.
64
-
65
- ---
66
-
67
- ### Step 3: Deploy API Endpoint
68
-
69
- Deploy the inference API:
70
-
71
- **Option A: GPU Endpoint (A10G)**
72
- ```bash
73
- ./venv/bin/modal deploy docs/api_endpoint.py
74
- ```
75
-
76
- **Option B: CPU Endpoint**
77
- ```bash
78
- ./venv/bin/modal deploy docs/api_endpoint_cpu.py
79
- ```
80
-
81
- **Get the endpoint URL:**
82
- ```bash
83
- modal app list
84
- ```
85
-
86
- ---
87
-
88
- ### Step 4: Test the API
89
-
90
- ```bash
91
- # Example API call
92
- curl -X POST https://YOUR-MODAL-URL/ask \
93
- -H "Content-Type: application/json" \
94
- -d '{
95
- "question": "What is the population of Tokyo?",
96
- "context": "Japan Census data"
97
- }'
98
- ```
99
-
100
- ---
101
-
102
- ## πŸ“ Key Files
103
-
104
- ### Data Processing
105
- - `docs/prepare_finetune_data.py` - Generates dataset from CSV files
106
- - `docs/clean_sample.py` - Local testing script for data cleaning
107
-
108
- ### Model Training
109
- - `docs/finetune_modal.py` - Fine-tuning script (H200 GPU)
110
- - `docs/eval_finetuned.py` - Evaluation script
111
-
112
- ### API Deployment
113
- - `docs/api_endpoint.py` - GPU inference endpoint (A10G)
114
- - `docs/api_endpoint_cpu.py` - CPU inference endpoint
115
-
116
- ### Documentation
117
- - `diagrams/finetuning.svg` - Visual pipeline diagram
118
- - `finetune/04-evaluation.md` - Evaluation results
119
-
120
- ---
121
-
122
- ## πŸ”§ Modal Volumes
123
-
124
- The pipeline uses these Modal volumes:
125
-
126
- | Volume | Purpose | Size |
127
- |--------|---------|------|
128
- | `census-data` | Raw census CSV files | 6,838 files |
129
- | `economy-labor-data` | Raw economy CSV files | 50 files |
130
- | `finetune-dataset` | Generated JSONL training data | 224K samples |
131
- | `model-checkpoints` | Fine-tuned model weights | ~7GB |
132
-
133
- ---
134
-
135
- ## πŸ’‘ Tips
136
-
137
- ### If Training Fails
138
- ```bash
139
- # Check logs for errors
140
- modal app logs mcp-hack::finetune-phi3-modal
141
-
142
- # Restart training
143
- ./venv/bin/modal run --detach docs/finetune_modal.py
144
- ```
145
-
146
- ### If You Need to Regenerate Data
147
- ```bash
148
- # Clear existing dataset
149
- ./venv/bin/modal run docs/clear_dataset.py
150
-
151
- # Regenerate with new logic
152
- ./venv/bin/modal run --detach docs/prepare_finetune_data.py
153
- ```
154
-
155
- ### View Volume Contents
156
- ```bash
157
- # List files in a volume
158
- modal volume ls finetune-dataset
159
-
160
- # Download a file
161
- modal volume get finetune-dataset train.jsonl finetune/train.jsonl
162
- ```
163
-
164
- ---
165
-
166
- ## πŸ“ˆ Expected Timeline
167
-
168
- | Step | Duration | Notes |
169
- |------|----------|-------|
170
- | Data Generation | βœ… Complete | 224K samples ready |
171
- | Fine-Tuning | ~90-120 min | H200 GPU |
172
- | Evaluation | ~5 min | Quick tests |
173
- | API Deployment | ~2 min | Instant after deploy |
174
-
175
- ---
176
-
177
- ## 🎯 Next Steps
178
-
179
- 1. **Run fine-tuning** (see Step 1 above)
180
- 2. **Wait for completion** (~2 hours)
181
- 3. **Evaluate results** (see Step 2)
182
- 4. **Deploy API** (see Step 3)
183
- 5. **Test with real queries** (see Step 4)
184
-
185
- ---
186
-
187
- ## πŸ“ž Troubleshooting
188
-
189
- **Issue**: "Volume not found"
190
- ```bash
191
- # List all volumes
192
- modal volume list
193
- ```
194
-
195
- **Issue**: "Out of memory during training"
196
- - Reduce `per_device_train_batch_size` in `finetune_modal.py`
197
- - Current: 2 (already optimized for H200)
198
-
199
- **Issue**: "Model not loading in API"
200
- - Ensure fine-tuning completed successfully
201
- - Check `model-checkpoints` volume has files
202
-
203
- ---
204
-
205
- ## βœ… Success Criteria
206
-
207
- After completing all steps, you should have:
208
- - βœ… Fine-tuned Phi-3-mini model
209
- - βœ… Deployed API endpoint
210
- - βœ… Model answering questions about Japanese census/economy data
211
- - βœ… Improved accuracy over base model
212
-
213
- ---
214
-
215
- **Ready to start?** Run the fine-tuning command from Step 1!
 
docs/guides/SETUP_SUCCESS.md DELETED
@@ -1,63 +0,0 @@
1
- # βœ… RAG Setup Successful!
2
-
3
- ## Status: Working
4
-
5
- The product design RAG system is now fully operational!
6
-
7
- ### What Was Fixed
8
-
9
- 1. **File Detection**: Updated to find files in both root and `docs/` subdirectory
10
- 2. **GPU Fallback**: Added CPU fallback for embeddings (works without GPU)
11
- 3. **Word Document**: Markdown file works perfectly (Word file has python-docx issue but markdown has all content)
12
- 4. **Modal Command**: Auto-detects Modal in venv
13
-
14
- ### Current Status
15
-
16
- βœ… **Indexed**: 1 document (markdown), 56 chunks
17
- βœ… **Vector DB**: Created in ChromaDB collection `product_design`
18
- βœ… **Queries**: Working! Tested successfully
19
-
20
- ### Test Results
21
-
22
- ```bash
23
- $ python3 query_product_design.py --query "What are the three product tiers?"
24
- ```
25
-
26
- **Result**: βœ… Successfully retrieved and answered!
27
-
28
- ## Usage
29
-
30
- ### Query the Document
31
-
32
- ```bash
33
- # Single query
34
- python3 query_product_design.py --query "What are the three product tiers?"
35
-
36
- # Interactive mode
37
- python3 query_product_design.py --interactive
38
- ```
39
-
40
- ### Example Questions
41
-
42
- - "What are the three product tiers and their premium ranges?"
43
- - "What is the Year 3 premium volume projection?"
44
- - "What coverage does the Standard tier include?"
45
- - "What are the FSA licensing requirements?"
46
-
47
- ## Known Issues
48
-
49
- 1. **Word Document**: The `.docx` file has a python-docx compatibility issue with Modal volumes, but the markdown file contains all the same content and works perfectly.
50
-
51
- 2. **Answer Truncation**: Some answers may be truncated. This is normal - the system retrieves the most relevant chunks and generates concise answers.
52
-
53
- ## Next Steps
54
-
55
- 1. βœ… **Indexing**: Complete
56
- 2. βœ… **Query System**: Working
57
- 3. 🎯 **Ready to Use**: You can now query the product design document!
58
-
59
- Try it:
60
- ```bash
61
- python3 query_product_design.py --interactive
62
- ```
63
-
 
docs/guides/SUMMARY.md DELETED
@@ -1,114 +0,0 @@
1
- # βœ… Complete Setup Summary
2
-
3
- ## What Was Accomplished
4
-
5
- ### 1. Product Design Document βœ…
6
- - **Created**: Comprehensive 1,600-line product design document
7
- - **Filled**: All sections with realistic fictional data for "TokyoDrive Insurance"
8
- - **Formats**:
9
- - Markdown: `docs/tokyo_auto_insurance_product_design_filled.md`
10
- - Word: `docs/tokyo_auto_insurance_product_design.docx`
11
- - **Content**: 12 comprehensive sections covering all aspects of product design
12
-
13
- ### 2. RAG System Extension βœ…
14
- - **Created**: `src/modal-rag-product-design.py`
15
- - **Features**:
16
- - Supports Markdown and Word documents
17
- - Separate ChromaDB collection (doesn't interfere with existing RAG)
18
- - GPU-accelerated with Phi-3 model
19
- - Integrated with existing Modal infrastructure
20
-
21
- ### 3. Query Interface βœ…
22
- - **Created**: `query_product_design.py` - Simple CLI tool
23
- - **Features**:
24
- - Interactive mode for continuous queries
25
- - Single query mode
26
- - Index command
27
- - Clean, formatted output
28
-
29
- ### 4. Documentation βœ…
30
- - `docs/QUICK_START_RAG.md` - Quick start guide
31
- - `docs/setup_product_design_rag.md` - Detailed setup
32
- - `docs/next_steps_rag_recommendation.md` - Decision guide
33
- - `docs/RAG_SETUP_COMPLETE.md` - Complete setup info
34
- - `README_RAG.md` - Quick reference
35
-
36
- ## File Structure
37
-
38
- ```
39
- mcp-hack/
40
- β”œβ”€β”€ src/
41
- β”‚ └── modal-rag-product-design.py # Extended RAG system
42
- β”œβ”€β”€ query_product_design.py # CLI query interface
43
- β”œβ”€β”€ docs/
44
- β”‚ β”œβ”€β”€ tokyo_auto_insurance_product_design_filled.md
45
- β”‚ β”œβ”€β”€ tokyo_auto_insurance_product_design.docx
46
- β”‚ β”œβ”€β”€ QUICK_START_RAG.md
47
- β”‚ β”œβ”€β”€ setup_product_design_rag.md
48
- β”‚ β”œβ”€β”€ next_steps_rag_recommendation.md
49
- β”‚ β”œβ”€β”€ RAG_SETUP_COMPLETE.md
50
- β”‚ └── SUMMARY.md (this file)
51
- └── README_RAG.md # Quick reference
52
- ```
53
-
54
- ## Next Steps to Use
55
-
56
- ### Step 1: Index Documents (One-Time)
57
- ```bash
58
- python query_product_design.py --index
59
- ```
60
- ⏱️ Takes 2-5 minutes
61
-
62
- ### Step 2: Query the Document
63
- ```bash
64
- # Single query
65
- python query_product_design.py --query "What are the three product tiers?"
66
-
67
- # Interactive mode
68
- python query_product_design.py --interactive
69
- ```
70
-
71
- ## Example Use Cases
72
-
73
- ### For Development
74
- - Extract technical requirements
75
- - Get API specifications
76
- - Understand system architecture
77
-
78
- ### For Sales/Marketing
79
- - Get pricing information
80
- - Understand product features
81
- - Compare tiers
82
-
83
- ### For Compliance
84
- - Check regulatory requirements
85
- - Get licensing info
86
- - Understand data privacy rules
87
-
88
- ### For Financial Planning
89
- - Get projections
90
- - Understand cost structure
91
- - Check break-even analysis
92
-
93
- ## Key Features
94
-
95
- βœ… **Comprehensive Document**: 12 sections, 1,600 lines, fully filled with realistic data
96
- βœ… **RAG System**: Semantic search + LLM for intelligent Q&A
97
- βœ… **Easy Interface**: Simple CLI tool, no complex setup
98
- βœ… **Fast Queries**: 3-5 seconds after initial warm-up
99
- βœ… **Separate Collection**: Doesn't interfere with existing insurance products RAG
100
-
101
- ## Status
102
-
103
- πŸŽ‰ **Everything is ready!**
104
-
105
- 1. βœ… Product design document created and filled
106
- 2. βœ… Documents uploaded to Modal volume
107
- 3. βœ… RAG system extended
108
- 4. βœ… Query interface created
109
- 5. βœ… Documentation complete
110
-
111
- **Ready to index and query!**
112
-
113
- Run: `python query_product_design.py --index`
114
-
 
docs/guides/modal-rag-optimization.md DELETED
@@ -1,370 +0,0 @@
1
- # Modal RAG Performance Optimization Guide
2
-
3
- **Current Performance**: >1 minute per query
4
- **Target Performance**: <5 seconds per query
5
-
6
- ## πŸ” Performance Bottleneck Analysis
7
-
8
- ### Current Architecture Issues
9
-
10
- 1. **Model Loading Time** (~30-45 seconds)
11
- - Mistral-7B (13GB) loads on every cold start
12
- - Embedding model loads separately
13
- - No model caching between requests
14
-
15
- 2. **LLM Inference Time** (~15-30 seconds)
16
- - Mistral-7B is slow for inference
17
- - Running on A10G GPU (good, but model is large)
18
- - No inference optimization (quantization, etc.)
19
-
20
- 3. **Network Latency** (~2-5 seconds)
21
- - Remote ChromaDB calls
22
- - Modal container communication overhead
23
-
24
- ---
25
-
26
- ## πŸš€ Optimization Strategies (Ranked by Impact)
27
-
28
- ### 1. **Keep Containers Warm** ⭐⭐⭐⭐⭐
29
- **Impact**: Eliminates 30-45s cold start time
30
-
31
- **Current**:
32
- ```python
33
- min_containers=1 # Already doing this βœ…
34
- ```
35
-
36
- **Why it helps**: Your container stays loaded with models in memory. First query after deployment is slow, but subsequent queries are fast.
37
-
38
- **Cost**: ~$0.50-1.00/hour for warm A10G container
39
-
40
- ---
41
-
42
- ### 2. **Switch to Smaller/Faster LLM** ⭐⭐⭐⭐⭐
43
- **Impact**: Reduces inference from 15-30s to 2-5s
44
-
45
- **Options**:
46
-
47
- #### Option A: Mistral-7B-Instruct-v0.2 (Quantized)
48
- ```python
49
- from transformers import AutoModelForCausalLM, BitsAndBytesConfig
50
-
51
- quantization_config = BitsAndBytesConfig(
52
- load_in_4bit=True,
53
- bnb_4bit_compute_dtype=torch.float16,
54
- bnb_4bit_use_double_quant=True,
55
- bnb_4bit_quant_type="nf4"
56
- )
57
-
58
- self.model = AutoModelForCausalLM.from_pretrained(
59
- LLM_MODEL,
60
- quantization_config=quantization_config,
61
- device_map="auto"
62
- )
63
- ```
64
- - **Speed**: 3-5x faster (5-10s β†’ 1-3s)
65
- - **Quality**: Minimal degradation
66
- - **Memory**: 13GB β†’ 3.5GB
67
-
68
- #### Option B: Switch to Phi-3-mini (3.8B)
69
- ```python
70
- LLM_MODEL = "microsoft/Phi-3-mini-4k-instruct"
71
- ```
72
- - **Speed**: 5-10x faster than Mistral-7B
73
- - **Quality**: Good for RAG tasks
74
- - **Memory**: ~8GB β†’ 4GB
75
- - **Inference**: 2-4 seconds
76
-
77
- #### Option C: Use TinyLlama-1.1B
78
- ```python
79
- LLM_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
80
- ```
81
- - **Speed**: 10-20x faster
82
- - **Quality**: Lower, but acceptable for simple queries
83
- - **Memory**: ~2GB
84
- - **Inference**: <1 second
85
-
86
- ---
87
-
88
- ### 3. **Use vLLM for Inference** ⭐⭐⭐⭐
89
- **Impact**: 2-5x faster inference
90
-
91
- ```python
92
- # Install vLLM
93
- image = modal.Image.debian_slim(python_version="3.11").pip_install(
94
- "vllm==0.6.0",
95
- # ... other packages
96
- )
97
-
98
- # In RAGModel.enter()
99
- from vllm import LLM, SamplingParams
100
-
101
- self.llm_engine = LLM(
102
- model=LLM_MODEL,
103
- tensor_parallel_size=1,
104
- gpu_memory_utilization=0.9,
105
- max_model_len=2048 # Shorter context for speed
106
- )
107
-
108
- # In query method
109
- sampling_params = SamplingParams(
110
- temperature=0.7,
111
- max_tokens=256,
112
- top_p=0.9
113
- )
114
- outputs = self.llm_engine.generate([prompt], sampling_params)
115
- ```
116
-
117
- **Benefits**:
118
- - Continuous batching
119
- - PagedAttention (efficient memory)
120
- - Optimized CUDA kernels
121
- - 2-5x faster than HuggingFace pipeline
122
-
123
- ---
124
-
125
- ### 4. **Optimize Embedding Generation** ⭐⭐⭐
126
- **Impact**: Reduces query embedding time from 1-2s to 0.2-0.5s
127
-
128
- #### Option A: Use Smaller Embedding Model
129
- ```python
130
- EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
131
- # 384 dimensions vs 384 (bge-small is already good)
132
- ```
133
-
134
- #### Option B: Use ONNX Runtime
135
- ```python
136
- from optimum.onnxruntime import ORTModelForFeatureExtraction
137
-
138
- self.embeddings = ORTModelForFeatureExtraction.from_pretrained(
139
- EMBEDDING_MODEL,
140
- export=True,
141
- provider="CUDAExecutionProvider"
142
- )
143
- ```
144
- - **Speed**: 2-3x faster
145
- - **Quality**: Identical
146
-
147
- ---
148
-
149
- ### 5. **Reduce Context Window** ⭐⭐⭐
150
- **Impact**: Faster LLM processing
151
-
152
- ```python
153
- # In query method
154
- sampling_params = SamplingParams(
155
- max_tokens=128, # Instead of 256 or 512
156
- temperature=0.7
157
- )
158
-
159
- # Reduce retrieved documents
160
- top_k = 2 # Instead of 3
161
- ```
162
-
163
- **Why**: Less tokens to process = faster inference
164
-
165
- ---
166
-
167
- ### 6. **Cache ChromaDB Queries** ⭐⭐
168
- **Impact**: Saves 1-2s on repeated queries
169
-
170
- ```python
171
- from functools import lru_cache
172
- import hashlib
173
-
174
- @lru_cache(maxsize=100)
175
- def get_cached_docs(query_hash):
176
- return self.retriever.get_relevant_documents(query)
177
-
178
- # In query method
179
- query_hash = hashlib.md5(question.encode()).hexdigest()
180
- docs = get_cached_docs(query_hash)
181
- ```
182
-
183
- ---
184
-
185
- ### 7. **Use Faster GPU** ⭐⭐
186
- **Impact**: 1.5-2x faster inference
187
-
188
- ```python
189
- @app.cls(
190
- gpu="A100", # Instead of A10G
191
- # or
192
- gpu="H100", # Even faster
193
- )
194
- ```
195
-
196
- **Cost**: A100 is 2-3x more expensive than A10G
197
-
198
- ---
199
-
200
- ### 8. **Parallel Processing** ⭐⭐
201
- **Impact**: Overlap embedding + retrieval
202
-
203
- ```python
204
- import asyncio
205
-
206
- async def query_async(self, question: str):
207
- # Run embedding and LLM prep in parallel
208
- embedding_task = asyncio.create_task(
209
- self.get_query_embedding(question)
210
- )
211
-
212
- # ... rest of async pipeline
213
- ```
214
-
215
- ---
216
-
217
- ## 🎯 Recommended Implementation Plan
218
-
219
- ### Phase 1: Quick Wins (Get to <10s)
220
- 1. βœ… **Keep containers warm** (already done)
221
- 2. **Add 4-bit quantization** to Mistral-7B
222
- 3. **Reduce max_tokens** to 128
223
- 4. **Use top_k=2** instead of 3
224
-
225
- **Expected**: 60s β†’ 8-12s
226
-
227
- ---
228
-
229
- ### Phase 2: Major Speedup (Get to <5s)
230
- 1. **Switch to vLLM** for inference
231
- 2. **Use Phi-3-mini** instead of Mistral-7B
232
- 3. **Optimize embeddings** with ONNX
233
-
234
- **Expected**: 8-12s β†’ 3-5s
235
-
236
- ---
237
-
238
- ### Phase 3: Ultra-Fast (Get to <2s)
239
- 1. **Use TinyLlama** for simple queries
240
- 2. **Implement query caching**
241
- 3. **Upgrade to A100 GPU**
242
-
243
- **Expected**: 3-5s β†’ 1-2s
244
-
245
- ---
246
-
247
- ## πŸ“Š Performance Comparison Table
248
-
249
- | Configuration | Cold Start | Warm Query | Cost/Hour | Quality |
250
- |--------------|------------|------------|-----------|---------|
251
- | **Current** (Mistral-7B, A10G) | 45s | 15-30s | $0.50 | ⭐⭐⭐⭐⭐ |
252
- | **Phase 1** (Quantized, warm) | 30s | 8-12s | $0.50 | ⭐⭐⭐⭐ |
253
- | **Phase 2** (vLLM + Phi-3) | 20s | 3-5s | $0.50 | ⭐⭐⭐⭐ |
254
- | **Phase 3** (TinyLlama, A100) | 10s | 1-2s | $1.50 | ⭐⭐⭐ |
255
-
256
- ---
257
-
258
- ## πŸ”§ Code Changes for Phase 2 (Recommended)
259
-
260
- ### 1. Update model configuration
261
- ```python
262
- LLM_MODEL = "microsoft/Phi-3-mini-4k-instruct"
263
- EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5" # Keep same
264
- ```
265
-
266
- ### 2. Add vLLM to dependencies
267
- ```python
268
- image = modal.Image.debian_slim(python_version="3.11").pip_install(
269
- "vllm==0.6.0",
270
- "langchain==0.3.7",
271
- # ... rest
272
- )
273
- ```
274
-
275
- ### 3. Update RAGModel.enter()
276
- ```python
277
- from vllm import LLM, SamplingParams
278
-
279
- self.llm_engine = LLM(
280
- model=LLM_MODEL,
281
- tensor_parallel_size=1,
282
- gpu_memory_utilization=0.85,
283
- max_model_len=2048
284
- )
285
-
286
- self.sampling_params = SamplingParams(
287
- temperature=0.7,
288
- max_tokens=128,
289
- top_p=0.9
290
- )
291
- ```
292
-
293
- ### 4. Update query method
294
- ```python
295
- # Build prompt
296
- prompt = f"""Use the following context to answer the question.
297
-
298
- Context: {context}
299
-
300
- Question: {question}
301
-
302
- Answer:"""
303
-
304
- # Generate with vLLM
305
- outputs = self.llm_engine.generate([prompt], self.sampling_params)
306
- answer = outputs[0].outputs[0].text
307
- ```
308
-
309
- ---
310
-
311
- ## πŸ’° Cost vs Performance Trade-offs
312
-
313
- | Approach | Speed Gain | Cost Change | Implementation |
314
- |----------|-----------|-------------|----------------|
315
- | Quantization | 3-5x | $0 | Easy |
316
- | vLLM | 2-5x | $0 | Medium |
317
- | Smaller model | 5-10x | $0 | Easy |
318
- | A100 GPU | 1.5-2x | +200% | Easy |
319
- | Caching | Variable | $0 | Medium |
320
-
321
- ---
322
-
323
- ## 🎬 Next Steps
324
-
325
- 1. **Measure current performance** with logging
326
- 2. **Implement Phase 1** (quantization + reduce tokens)
327
- 3. **Test and measure** improvement
328
- 4. **Implement Phase 2** if needed (vLLM + Phi-3)
329
- 5. **Monitor** and iterate
330
-
331
- ---
332
-
333
- ## πŸ“ Performance Monitoring Code
334
-
335
- Add this to track performance:
336
-
337
- ```python
338
- import time
339
-
340
- @modal.method()
341
- def query(self, question: str, top_k: int = 2):
342
- start = time.time()
343
-
344
- # Embedding time
345
- embed_start = time.time()
346
- retriever = self.RemoteChromaRetriever(...)
347
- embed_time = time.time() - embed_start
348
-
349
- # Retrieval time
350
- retrieval_start = time.time()
351
- docs = retriever.get_relevant_documents(question)
352
- retrieval_time = time.time() - retrieval_start
353
-
354
- # LLM time
355
- llm_start = time.time()
356
- result = chain.invoke({"question": question})
357
- llm_time = time.time() - llm_start
358
-
359
- total_time = time.time() - start
360
-
361
- print(f"⏱️ Performance:")
362
- print(f" Embedding: {embed_time:.2f}s")
363
- print(f" Retrieval: {retrieval_time:.2f}s")
364
- print(f" LLM: {llm_time:.2f}s")
365
- print(f" Total: {total_time:.2f}s")
366
-
367
- return result
368
- ```
369
-
370
- This will help you identify the exact bottleneck!
 
docs/guides/modal-rag-sequence.md DELETED
@@ -1,168 +0,0 @@
- # Modal RAG System - Sequence Diagrams
-
- This document provides sequence diagrams for the Modal RAG (Retrieval-Augmented Generation) application.
-
- ## 1. Indexing Flow (create_vector_db)
-
- ```mermaid
- sequenceDiagram
-     participant User
-     participant Modal
-     participant CreateVectorDB as create_vector_db()
-     participant PDFLoader
-     participant TextSplitter
-     participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
-     participant ChromaDB as Remote ChromaDB
-
-     User->>Modal: modal run modal-rag.py::index
-     Modal->>CreateVectorDB: Execute function
-
-     CreateVectorDB->>PDFLoader: Load PDFs from /insurance-data
-     PDFLoader-->>CreateVectorDB: Return documents
-
-     CreateVectorDB->>TextSplitter: Split documents (chunk_size=1000)
-     TextSplitter-->>CreateVectorDB: Return chunks
-
-     CreateVectorDB->>Embeddings: Initialize (device='cuda')
-     CreateVectorDB->>Embeddings: Generate embeddings for chunks
-     Embeddings-->>CreateVectorDB: Return embeddings
-
-     CreateVectorDB->>ChromaDB: Connect to remote service
-     CreateVectorDB->>ChromaDB: Upsert chunks + embeddings
-     ChromaDB-->>CreateVectorDB: Confirm storage
-
-     CreateVectorDB-->>Modal: Complete
-     Modal-->>User: Success message
- ```
-
- ## 2. Query Flow (RAGModel.query)
-
- ```mermaid
- sequenceDiagram
-     participant User
-     participant Modal
-     participant QueryEntrypoint as query()
-     participant RAGModel
-     participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
-     participant ChromaRetriever as RemoteChromaRetriever
-     participant ChromaDB as Remote ChromaDB
-     participant LLM as Mistral-7B<br/>(A10G GPU)
-     participant RAGChain as LangChain RAG
-
-     User->>Modal: modal run modal-rag.py::query --question "..."
-     Modal->>QueryEntrypoint: Execute local entrypoint
-     QueryEntrypoint->>RAGModel: Instantiate RAGModel()
-
-     Note over RAGModel: @modal.enter() lifecycle
-     RAGModel->>Embeddings: Load embedding model (CUDA)
-     RAGModel->>ChromaDB: Connect to remote service
-     RAGModel->>LLM: Load Mistral-7B (A10G GPU)
-     RAGModel->>RAGModel: Initialize RemoteChromaRetriever
-
-     QueryEntrypoint->>RAGModel: query.remote(question)
-
-     RAGModel->>ChromaRetriever: Create retriever instance
-     RAGModel->>RAGChain: Build RAG chain
-
-     RAGChain->>ChromaRetriever: Retrieve relevant docs
-     ChromaRetriever->>Embeddings: embed_query(question)
-     Embeddings-->>ChromaRetriever: Query embedding
-     ChromaRetriever->>ChromaDB: query(embedding, k=3)
-     ChromaDB-->>ChromaRetriever: Top-k documents
-     ChromaRetriever-->>RAGChain: Return documents
-
-     RAGChain->>LLM: Generate answer with context
-     LLM-->>RAGChain: Generated answer
-     RAGChain-->>RAGModel: Return result
-
-     RAGModel-->>QueryEntrypoint: Return {answer, sources}
-     QueryEntrypoint-->>User: Display answer + sources
- ```
-
- ## 3. Web Endpoint Flow (RAGModel.web_query)
-
- ```mermaid
- sequenceDiagram
-     participant User
-     participant Browser
-     participant Modal as Modal Platform
-     participant WebEndpoint as RAGModel.web_query
-     participant QueryMethod as RAGModel.query
-     participant RAGChain
-     participant ChromaDB
-     participant LLM
-
-     User->>Browser: GET https://.../web_query?question=...
-     Browser->>Modal: HTTP GET request
-     Modal->>WebEndpoint: Route to @modal.fastapi_endpoint
-
-     WebEndpoint->>QueryMethod: Call query.local(question)
-
-     Note over QueryMethod,LLM: Same flow as Query diagram
-     QueryMethod->>RAGChain: Build chain
-     RAGChain->>ChromaDB: Retrieve docs
-     RAGChain->>LLM: Generate answer
-     LLM-->>QueryMethod: Return result
-
-     QueryMethod-->>WebEndpoint: Return {answer, sources}
-     WebEndpoint-->>Modal: JSON response
-     Modal-->>Browser: HTTP 200 + JSON
-     Browser-->>User: Display result
- ```
-
- ## 4. Container Lifecycle (RAGModel)
-
- ```mermaid
- sequenceDiagram
-     participant Modal
-     participant Container
-     participant RAGModel
-     participant GPU as A10G GPU
-     participant Volume as Modal Volume
-     participant ChromaDB
-
-     Modal->>Container: Start container (min_containers=1)
-     Container->>GPU: Allocate GPU
-     Container->>Volume: Mount /insurance-data
-
-     Container->>RAGModel: Call @modal.enter()
-
-     Note over RAGModel: Initialization phase
-     RAGModel->>RAGModel: Load HuggingFaceEmbeddings (CUDA)
-     RAGModel->>ChromaDB: Connect to remote service
-     RAGModel->>RAGModel: Load Mistral-7B (GPU)
-     RAGModel->>RAGModel: Create RemoteChromaRetriever class
-
-     RAGModel-->>Container: Ready
-     Container-->>Modal: Container warm and ready
-
-     Note over Modal,Container: Container stays warm (min_containers=1)
-
-     loop Handle requests
-         Modal->>RAGModel: Invoke query() method
-         RAGModel-->>Modal: Return result
-     end
-
-     Note over Modal,Container: Container persists until scaled down
- ```
-
- ## Key Components
-
- ### Modal Configuration
- - **App Name**: `insurance-rag`
- - **Volume**: `mcp-hack-ins-products` mounted at `/insurance-data`
- - **GPU**: A10G for the RAGModel class
- - **Autoscaling**: `min_containers=1`, `max_containers=1` (always warm)
-
- ### Models
- - **LLM**: `mistralai/Mistral-7B-Instruct-v0.3` (GPU, float16)
- - **Embeddings**: `BAAI/bge-small-en-v1.5` (GPU, CUDA)
-
- ### Storage
- - **Vector DB**: Remote ChromaDB service (`chroma-server-v2`)
- - **Collection**: `insurance_products`
- - **Chunk Size**: 1000 characters with 200 overlap
-
- ### Endpoints
- - **Local Entrypoints**: `list`, `index`, `query`
- - **Web Endpoint**: `RAGModel.web_query` (FastAPI GET endpoint)
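The configuration listed under Key Components maps onto a Modal app skeleton roughly like the following. This is a sketch only: names are taken from the doc, but the decorator arguments follow recent Modal releases and may need adjusting for your installed version.

```python
import modal

app = modal.App("insurance-rag")
volume = modal.Volume.from_name("mcp-hack-ins-products")

@app.cls(
    gpu="A10G",
    min_containers=1,
    max_containers=1,
    volumes={"/insurance-data": volume},
)
class RAGModel:
    @modal.enter()
    def setup(self):
        # Load BAAI/bge-small-en-v1.5 embeddings on CUDA, connect to the
        # remote ChromaDB service, and load Mistral-7B in float16 on the GPU.
        ...

    @modal.method()
    def query(self, question: str):
        # Retrieve top-k chunks, then generate an answer with the LLM.
        ...

    @modal.fastapi_endpoint()
    def web_query(self, question: str):
        # Same flow as query(), exposed over HTTP GET.
        return self.query.local(question)
```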
 
docs/guides/next_steps_rag_recommendation.md DELETED
@@ -1,77 +0,0 @@
- # Next Steps: RAG for Product Design Document
-
- ## Should You Add RAG?
-
- **Recommendation: YES, but with specific use cases in mind**
-
- ### Benefits of Adding RAG:
-
- 1. **Requirements Extraction**: Quickly find specific requirements in the 1,600-line document
- 2. **Stakeholder Q&A**: Answer questions like "What's the premium for a 28-year-old in Shibuya?"
- 3. **Design Validation**: Query coverage details, pricing tiers, compliance requirements
- 4. **Development Planning**: Extract technical requirements, API specs, integration needs
- 5. **Competitive Analysis**: Compare your product features vs competitors mentioned in the doc
-
- ### When RAG is NOT Needed:
-
- - If you just need to read/search the document manually
- - If the document is small enough to navigate easily
- - If you don't need to answer complex questions across multiple sections
-
- ## Implementation Options
-
- ### Option 1: Extend Existing Modal RAG (Recommended)
- - Your existing `modal-rag.py` already handles PDFs
- - Can easily add support for markdown/Word documents
- - Leverages existing ChromaDB infrastructure
- - **Effort**: Low (30-60 minutes)
-
- ### Option 2: Simple Document Search
- - Use grep/search tools for simple queries
- - **Effort**: None (already available)
-
- ### Option 3: Full RAG with Fine-Tuning
- - Fine-tune a model on the insurance domain + your product spec
- - **Effort**: High (days/weeks)
- - **Benefit**: Best accuracy for insurance-specific queries
-
- ## Recommended Next Steps
-
- 1. **Add Product Design Doc to RAG** (30 min)
-    - Extend `modal-rag.py` to load markdown/Word docs
-    - Index the filled product design document
-    - Test with sample queries
-
- 2. **Create Query Interface** (1-2 hours)
-    - Simple CLI or web interface
-    - Example queries:
-      - "What are the three product tiers and their premium ranges?"
-      - "What coverage does the Standard tier include?"
-      - "What are the Year 3 financial projections?"
-
- 3. **Use Cases to Test**:
-    - Requirements extraction for development
-    - Pricing questions for the sales team
-    - Compliance checklist generation
-    - Feature comparison queries
-
- ## Quick Decision Matrix
-
- | Use Case | RAG Needed? | Alternative |
- |----------|-------------|-------------|
- | Find specific section | ❌ No | Use table of contents |
- | Answer "What's the premium for X?" | βœ… Yes | Manual search |
- | Extract all requirements | βœ… Yes | Manual extraction |
- | Compare product tiers | βœ… Yes | Manual comparison |
- | Generate compliance checklist | βœ… Yes | Manual review |
- | Simple fact lookup | ⚠️ Maybe | Grep/search |
-
- ## Recommendation
-
- **Start with Option 1**: Extend your existing RAG to include the product design document. It's low effort, leverages existing infrastructure, and lets you query the spec as you develop the product.
-
- Would you like me to:
- 1. Extend `modal-rag.py` to support the product design document?
- 2. Create a simple query interface?
- 3. Both?
-
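The "simple query interface" from step 2 can start as a thin CLI. A sketch only: the `answer` function here is a placeholder, not the real call to the deployed Modal endpoint.

```python
import argparse

def answer(question: str) -> str:
    # Placeholder: the real version would issue an HTTP GET to the deployed
    # RAG web endpoint (e.g. .../web_query?question=...) and return its answer.
    return f"[stub] you asked: {question}"

def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Query the product design document")
    parser.add_argument("question", help="natural-language question about the spec")
    args = parser.parse_args(argv)
    print(answer(args.question))

if __name__ == "__main__":
    main()
```

Run as, for example, `python query_cli.py "What are the three product tiers and their premium ranges?"`.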
 
{scripts β†’ src}/__init__.py RENAMED
File without changes
{docs β†’ src/data}/clean_sample.py RENAMED
File without changes
{scripts β†’ src}/data/cleanup_data.py RENAMED
File without changes
{scripts β†’ src}/data/clear_census_volume.py RENAMED
File without changes
{scripts β†’ src}/data/convert_census_to_csv.py RENAMED
File without changes
{scripts β†’ src}/data/convert_economy_labor_to_csv.py RENAMED
File without changes
{scripts β†’ src}/data/convert_to_word.py RENAMED
File without changes
{scripts β†’ src}/data/create_custom_qa.py RENAMED
File without changes
{docs β†’ src/data}/debug_parser.py RENAMED
File without changes
{scripts β†’ src}/data/delete_census_csvs.py RENAMED
File without changes
{scripts β†’ src}/data/download_census_api.py RENAMED
File without changes
{scripts β†’ src}/data/download_census_csv_modal.py RENAMED
File without changes
{scripts β†’ src}/data/download_census_data.py RENAMED
File without changes
{scripts β†’ src}/data/download_census_modal.py RENAMED
File without changes
{scripts β†’ src}/data/download_economy_labor_modal.py RENAMED
File without changes
{scripts β†’ src}/data/fix_csv_filenames.py RENAMED
File without changes
{scripts β†’ src}/data/prepare_economy_data.py RENAMED
File without changes
{scripts β†’ src}/data/prepare_finetune_data.py RENAMED
File without changes
{scripts β†’ src}/data/remove_duplicate_csvs.py RENAMED
File without changes