Veeru-c committed on
Commit
06bd253
·
1 Parent(s): a428e27

initial commit

This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50)
  1. MIGRATION_GUIDE.md +0 -81
  2. NEXT_STEPS.md +0 -174
  3. diagrams/1-indexing-flow.mmd +0 -28
  4. diagrams/1-indexing-flow.svg +99 -1
  5. diagrams/2-query-flow-medium.mmd +0 -25
  6. diagrams/2-query-flow-medium.svg +106 -1
  7. diagrams/2-query-flow-simple.mmd +0 -19
  8. diagrams/2-query-flow-simple.svg +105 -1
  9. diagrams/2-query-flow.mmd +0 -39
  10. diagrams/2-query-flow.svg +138 -1
  11. diagrams/3-web-endpoint-flow.mmd +0 -26
  12. diagrams/3-web-endpoint-flow.svg +94 -1
  13. diagrams/4-container-lifecycle.mmd +0 -31
  14. diagrams/4-container-lifecycle.svg +118 -1
  15. diagrams/finetuning.svg +101 -166
  16. docs/NEXT_STEPS.md +181 -0
  17. QUICK_START.md → docs/QUICK_START.md +0 -0
  18. docs/QUICK_START_API.md +75 -0
  19. docs/README.md +58 -0
  20. README_RAG.md → docs/README_RAG.md +0 -0
  21. STRUCTURE.md → docs/STRUCTURE.md +0 -0
  22. TESTING.md → docs/TESTING.md +0 -0
  23. VLLM_MIGRATION.md → docs/VLLM_MIGRATION.md +0 -0
  24. docs/api/RAG_API.md +244 -0
  25. docs/deployment/ADD_GUIDES_TO_RAG.md +0 -146
  26. docs/guides/HOW_TO_RUN.md +0 -215
  27. docs/guides/SETUP_SUCCESS.md +0 -63
  28. docs/guides/SUMMARY.md +0 -114
  29. docs/guides/modal-rag-optimization.md +0 -370
  30. docs/guides/modal-rag-sequence.md +0 -168
  31. docs/guides/next_steps_rag_recommendation.md +0 -77
  32. {scripts → src}/__init__.py +0 -0
  33. {docs → src/data}/clean_sample.py +0 -0
  34. {scripts → src}/data/cleanup_data.py +0 -0
  35. {scripts → src}/data/clear_census_volume.py +0 -0
  36. {scripts → src}/data/convert_census_to_csv.py +0 -0
  37. {scripts → src}/data/convert_economy_labor_to_csv.py +0 -0
  38. {scripts → src}/data/convert_to_word.py +0 -0
  39. {scripts → src}/data/create_custom_qa.py +0 -0
  40. {docs → src/data}/debug_parser.py +0 -0
  41. {scripts → src}/data/delete_census_csvs.py +0 -0
  42. {scripts → src}/data/download_census_api.py +0 -0
  43. {scripts → src}/data/download_census_csv_modal.py +0 -0
  44. {scripts → src}/data/download_census_data.py +0 -0
  45. {scripts → src}/data/download_census_modal.py +0 -0
  46. {scripts → src}/data/download_economy_labor_modal.py +0 -0
  47. {scripts → src}/data/fix_csv_filenames.py +0 -0
  48. {scripts → src}/data/prepare_economy_data.py +0 -0
  49. {scripts → src}/data/prepare_finetune_data.py +0 -0
  50. {scripts → src}/data/remove_duplicate_csvs.py +0 -0
MIGRATION_GUIDE.md DELETED
@@ -1,81 +0,0 @@
- # Repository Restructure Migration Guide
-
- ## What Changed
-
- The repository has been reorganized for better structure and maintainability.
-
- ## File Moves
-
- ### RAG System
- `src/modal-rag.py` → `src/rag/modal-rag.py`
- `src/modal-rag-product-design.py` → `src/rag/modal-rag-product-design.py`
-
- ### Web Application
- `web_app.py` → `src/web/web_app.py`
- `query_product_design.py` → `src/web/query_product_design.py`
- `templates/` → `src/web/templates/`
- `static/` → `src/web/static/`
-
- ### Scripts
- Data processing scripts → `scripts/data/`
- Setup scripts → `scripts/setup/`
- Utility scripts → `scripts/tools/`
-
- ### Documentation
- All `.md` files → `docs/guides/`
- Product design docs → `docs/product-design/`
-
- ### Tests
- `test_*.py` → `tests/`
-
- ## Updated Commands
-
- ### Old Commands (No longer work)
- ```bash
- python web_app.py
- modal run src/modal-rag-product-design.py::query_product_design
- ```
-
- ### New Commands
- ```bash
- # Web app
- python src/web/web_app.py
- # Or use helper script
- ./scripts/setup/start_web.sh
-
- # Modal RAG
- modal run src/rag/modal-rag-product-design.py::query_product_design --question "your question"
-
- # Indexing
- modal run src/rag/modal-rag-product-design.py::index_product_design
- ```
-
- ## Import Path Updates
-
- If you have custom scripts that import from these modules, update the imports:
-
- ```python
- # Old
- from query_product_design import query_rag
-
- # New
- import sys
- sys.path.insert(0, 'src/web')
- from query_product_design import query_rag
- ```
-
- ## Next Steps
-
- 1. Update any custom scripts with new import paths
- 2. Update CI/CD pipelines if applicable
- 3. Update documentation references
- 4. Test all functionality
-
- ## Rollback
-
- If you need to roll back, all files remain in git history. You can:
- ```bash
- git log --oneline --all -- "old/path/to/file"
- git checkout <commit-hash> -- "old/path/to/file"
- ```
-
NEXT_STEPS.md DELETED
@@ -1,174 +0,0 @@
- # Next Steps
-
- ## Current Status
-
- ✅ **Completed:**
- - Repository restructured and organized
- - RAG system configured (Word, PDF, Excel only - no markdown)
- - Web interface functional
- - Nebius deployment guide created
- - Documentation updated
-
- ## Immediate Next Steps
-
- ### 1. Test the Updated RAG System
-
- **Upload Product Design Documents:**
- ```bash
- # Upload Word document (if you have it)
- modal volume put mcp-hack-ins-products \
-   docs/product-design/tokyo_auto_insurance_product_design.docx \
-   docs/product-design/tokyo_auto_insurance_product_design.docx
-
- # Upload PDF (if you have one)
- modal volume put mcp-hack-ins-products \
-   docs/product-design/tokyo_auto_insurance_product_design.pdf \
-   docs/product-design/tokyo_auto_insurance_product_design.pdf
-
- # Upload Excel (if you have one)
- modal volume put mcp-hack-ins-products \
-   docs/product-design/tokyo_auto_insurance_product_design.xlsx \
-   docs/product-design/tokyo_auto_insurance_product_design.xlsx
- ```
-
- **Re-index Documents:**
- ```bash
- # Using CLI
- python src/web/query_product_design.py --index
-
- # Or direct Modal command
- modal run src/rag/modal-rag-product-design.py::index_product_design
- ```
-
- **Test Queries:**
- ```bash
- # Test via CLI
- python src/web/query_product_design.py --query "What are the three product tiers?"
-
- # Or start web interface
- python src/web/web_app.py
- # Then open http://127.0.0.1:5000 in browser
- ```
-
- ### 2. Verify File Processing
-
- Check that the system correctly:
- - ✅ Loads Word documents
- - ✅ Loads PDF documents (if uploaded)
- - ✅ Loads Excel files (if uploaded)
- - ❌ Ignores markdown files
- - ❌ Ignores other file types
-
- ### 3. Production Readiness
-
- **Option A: Continue with Modal (Current Setup)**
- - ✅ Already working
- - ✅ No changes needed
- - Just ensure documents are uploaded and indexed
-
- **Option B: Deploy to Nebius**
- - Review: `docs/deployment/NEBIUS_DEPLOYMENT.md`
- - Set up Nebius account
- - Deploy RAG service and web app
- - Migrate from Modal to Nebius
-
- ## Recommended Path Forward
-
- ### Short Term (This Week)
- 1. **Upload and index documents**
-    - Ensure Word/PDF/Excel files are in Modal volume
-    - Run indexing
-    - Test queries
-
- 2. **Validate RAG quality**
-    - Ask various product questions
-    - Verify answer quality and accuracy
-    - Check source citations
-
- 3. **Test web interface**
-    - Start web app
-    - Test from browser
-    - Verify all features work
-
- ### Medium Term (Next 2 Weeks)
- 1. **Optimize RAG performance**
-    - Monitor query times
-    - Adjust chunk sizes if needed
-    - Fine-tune retrieval parameters
-
- 2. **Add more documents** (if needed)
-    - Upload additional product design files
-    - Re-index as needed
-
- 3. **User testing**
-    - Share with team/stakeholders
-    - Gather feedback
-    - Iterate on improvements
-
- ### Long Term (Next Month)
- 1. **Deploy to production**
-    - Choose: Modal or Nebius
-    - Set up monitoring
-    - Configure auto-scaling (if needed)
-
- 2. **Enhance features**
-    - Add authentication (if needed)
-    - Add query history
-    - Add export functionality
-    - Add analytics
-
- 3. **Scale and optimize**
-    - Monitor costs
-    - Optimize for performance
-    - Add caching if needed
-
- ## Quick Commands Reference
-
- ```bash
- # Index documents
- python src/web/query_product_design.py --index
-
- # Query via CLI
- python src/web/query_product_design.py --query "your question"
-
- # Start web interface
- python src/web/web_app.py
- # Or use helper script:
- ./scripts/setup/start_web.sh
-
- # Check Modal volume contents
- modal volume list mcp-hack-ins-products
- ```
-
- ## Decision Points
-
- 1. **Deployment Platform:**
-    - [ ] Stay with Modal (current)
-    - [ ] Migrate to Nebius
-    - [ ] Use both (hybrid)
-
- 2. **Document Management:**
-    - [ ] Keep documents in Modal volume
-    - [ ] Move to object storage (S3, etc.)
-    - [ ] Use version control
-
- 3. **Access Control:**
-    - [ ] Public access (current)
-    - [ ] Add authentication
-    - [ ] Add role-based access
-
- ## Questions to Consider
-
- - Do you have Word/PDF/Excel versions of your product design documents?
- - Do you need to convert markdown files to Word/PDF format?
- - Are you ready to deploy to production?
- - Do you need authentication/access control?
- - What's your target user base?
-
- ## Getting Help
-
- - **Documentation:** See `docs/` directory
- - **Troubleshooting:** See `docs/guides/TROUBLESHOOTING.md`
- - **Deployment:** See `docs/deployment/NEBIUS_DEPLOYMENT.md`
- - **Quick Start:** See `QUICK_START.md`
-
diagrams/1-indexing-flow.mmd DELETED
@@ -1,28 +0,0 @@
- sequenceDiagram
- participant User
- participant Modal
- participant CreateVectorDB as create_vector_db()
- participant PDFLoader
- participant TextSplitter
- participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
- participant ChromaDB as Remote ChromaDB
-
- User->>Modal: modal run modal-rag.py::index
- Modal->>CreateVectorDB: Execute function
-
- CreateVectorDB->>PDFLoader: Load PDFs from /insurance-data
- PDFLoader-->>CreateVectorDB: Return documents
-
- CreateVectorDB->>TextSplitter: Split documents (chunk_size=1000)
- TextSplitter-->>CreateVectorDB: Return chunks
-
- CreateVectorDB->>Embeddings: Initialize (device='cuda')
- CreateVectorDB->>Embeddings: Generate embeddings for chunks
- Embeddings-->>CreateVectorDB: Return embeddings
-
- CreateVectorDB->>ChromaDB: Connect to remote service
- CreateVectorDB->>ChromaDB: Upsert chunks + embeddings
- ChromaDB-->>CreateVectorDB: Confirm storage
-
- CreateVectorDB-->>Modal: Complete
- Modal-->>User: Success message
diagrams/1-indexing-flow.svg CHANGED
diagrams/2-query-flow-medium.mmd DELETED
@@ -1,25 +0,0 @@
- sequenceDiagram
- participant User
- participant Modal
- participant RAGModel
- participant Embeddings
- participant ChromaDB
- participant LLM
-
- User->>Modal: modal run query --question "..."
-
- Note over Modal,RAGModel: Container Startup (if cold)
- Modal->>RAGModel: Initialize
- RAGModel->>Embeddings: Load embedding model (GPU)
- RAGModel->>LLM: Load Mistral-7B (GPU)
-
- Note over Modal,LLM: Query Processing
- Modal->>RAGModel: Process question
- RAGModel->>Embeddings: Convert question to vector
- RAGModel->>ChromaDB: Search similar documents
- ChromaDB-->>RAGModel: Top 3 matching docs
-
- RAGModel->>LLM: Generate answer + context
- LLM-->>RAGModel: Answer
-
- RAGModel-->>User: Display answer + sources
diagrams/2-query-flow-medium.svg CHANGED
diagrams/2-query-flow-simple.mmd DELETED
@@ -1,19 +0,0 @@
- sequenceDiagram
- participant User
- participant Modal
- participant RAGModel
- participant ChromaDB
- participant LLM as Mistral-7B
-
- User->>Modal: Ask question
- Modal->>RAGModel: Initialize (warm container)
-
- Note over RAGModel: Load models on GPU
-
- RAGModel->>ChromaDB: Search for relevant docs
- ChromaDB-->>RAGModel: Return top 3 documents
-
- RAGModel->>LLM: Generate answer with context
- LLM-->>RAGModel: Generated answer
-
- RAGModel-->>User: Answer + Sources
diagrams/2-query-flow-simple.svg CHANGED
diagrams/2-query-flow.mmd DELETED
@@ -1,39 +0,0 @@
- sequenceDiagram
- participant User
- participant Modal
- participant QueryEntrypoint as query()
- participant RAGModel
- participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
- participant ChromaRetriever as RemoteChromaRetriever
- participant ChromaDB as Remote ChromaDB
- participant LLM as Mistral-7B<br/>(A10G GPU)
- participant RAGChain as LangChain RAG
-
- User->>Modal: modal run modal-rag.py::query --question "..."
- Modal->>QueryEntrypoint: Execute local entrypoint
- QueryEntrypoint->>RAGModel: Instantiate RAGModel()
-
- Note over RAGModel: @modal.enter() lifecycle
- RAGModel->>Embeddings: Load embedding model (CUDA)
- RAGModel->>ChromaDB: Connect to remote service
- RAGModel->>LLM: Load Mistral-7B (A10G GPU)
- RAGModel->>RAGModel: Initialize RemoteChromaRetriever
-
- QueryEntrypoint->>RAGModel: query.remote(question)
-
- RAGModel->>ChromaRetriever: Create retriever instance
- RAGModel->>RAGChain: Build RAG chain
-
- RAGChain->>ChromaRetriever: Retrieve relevant docs
- ChromaRetriever->>Embeddings: embed_query(question)
- Embeddings-->>ChromaRetriever: Query embedding
- ChromaRetriever->>ChromaDB: query(embedding, k=3)
- ChromaDB-->>ChromaRetriever: Top-k documents
- ChromaRetriever-->>RAGChain: Return documents
-
- RAGChain->>LLM: Generate answer with context
- LLM-->>RAGChain: Generated answer
- RAGChain-->>RAGModel: Return result
-
- RAGModel-->>QueryEntrypoint: Return {answer, sources}
- QueryEntrypoint-->>User: Display answer + sources
diagrams/2-query-flow.svg CHANGED
diagrams/3-web-endpoint-flow.mmd DELETED
@@ -1,26 +0,0 @@
- sequenceDiagram
- participant User
- participant Browser
- participant Modal as Modal Platform
- participant WebEndpoint as RAGModel.web_query
- participant QueryMethod as RAGModel.query
- participant RAGChain
- participant ChromaDB
- participant LLM
-
- User->>Browser: GET https://.../web_query?question=...
- Browser->>Modal: HTTP GET request
- Modal->>WebEndpoint: Route to @modal.fastapi_endpoint
-
- WebEndpoint->>QueryMethod: Call query.local(question)
-
- Note over QueryMethod,LLM: Same flow as Query diagram
- QueryMethod->>RAGChain: Build chain
- RAGChain->>ChromaDB: Retrieve docs
- RAGChain->>LLM: Generate answer
- LLM-->>QueryMethod: Return result
-
- QueryMethod-->>WebEndpoint: Return {answer, sources}
- WebEndpoint-->>Modal: JSON response
- Modal-->>Browser: HTTP 200 + JSON
- Browser-->>User: Display result
diagrams/3-web-endpoint-flow.svg CHANGED
diagrams/4-container-lifecycle.mmd DELETED
@@ -1,31 +0,0 @@
- sequenceDiagram
- participant Modal
- participant Container
- participant RAGModel
- participant GPU as A10G GPU
- participant Volume as Modal Volume
- participant ChromaDB
-
- Modal->>Container: Start container (min_containers=1)
- Container->>GPU: Allocate GPU
- Container->>Volume: Mount /insurance-data
-
- Container->>RAGModel: Call @modal.enter()
-
- Note over RAGModel: Initialization phase
- RAGModel->>RAGModel: Load HuggingFaceEmbeddings (CUDA)
- RAGModel->>ChromaDB: Connect to remote service
- RAGModel->>RAGModel: Load Mistral-7B (GPU)
- RAGModel->>RAGModel: Create RemoteChromaRetriever class
-
- RAGModel-->>Container: Ready
- Container-->>Modal: Container warm and ready
-
- Note over Modal,Container: Container stays warm (min_containers=1)
-
- loop Handle requests
-   Modal->>RAGModel: Invoke query() method
-   RAGModel-->>Modal: Return result
- end
-
- Note over Modal,Container: Container persists until scaled down
diagrams/4-container-lifecycle.svg CHANGED
diagrams/finetuning.svg CHANGED
docs/NEXT_STEPS.md ADDED
@@ -0,0 +1,181 @@
+ # Next Steps & Roadmap
+
+ ## ✅ Current Status
+
+ **Completed:**
+ - Fine-tuning pipeline with vLLM optimization
+ - RAG system with local ChromaDB
+ - High-performance inference (<3s latency)
+ - Model merging for production deployment
+ - Comprehensive documentation
+
+ ## 🎯 Immediate Next Steps
+
+ ### 1. Test Fine-Tuned Model Performance
+
+ ```bash
+ # Test the vLLM-optimized endpoint
+ curl -X POST https://mcp-hack--phi3-inference-vllm-model-ask.modal.run \
+   -H "Content-Type: application/json" \
+   -d '{"question": "What is the population of Tokyo?", "context": "Japan Census data"}'
+ ```
+
+ ### 2. Test RAG System
+
+ ```bash
+ # Test the RAG endpoint
+ curl -X POST https://mcp-hack--rag-vllm-optimized-ragmodel-query.modal.run \
+   -H "Content-Type: application/json" \
+   -d '{"question": "What insurance products are available?"}'
+ ```
+
+ ### 3. Monitor Performance
+
+ - Check latency metrics in responses
+ - Verify <3s response times
+ - Monitor GPU utilization on the Modal dashboard
+
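The latency checks above can be scripted with the standard library alone. This is a sketch, not project code: `rate_latency` and `timed_query` are hypothetical helper names, and the thresholds simply mirror the <2s/<3s targets stated in this roadmap.

```python
import json
import time
import urllib.request

def rate_latency(seconds: float) -> str:
    """Map a measured latency onto this roadmap's targets (hypothetical helper)."""
    if seconds < 2.0:
        return "excellent"
    if seconds <= 3.0:
        return "good"
    return "needs optimization"

def timed_query(url: str, payload: dict) -> tuple[dict, float]:
    """POST a JSON question to an endpoint and return (parsed body, wall-clock seconds)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body, time.perf_counter() - start
```

Pointing `timed_query` at the RAG endpoint above and printing `rate_latency(elapsed)` gives a quick pass/fail reading against the <3s goal.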
+ ## 🚀 Short Term (This Week)
+
+ ### Fine-Tuning Improvements
+ - [ ] Run evaluation script to assess model quality
+ - [ ] Collect more training data if needed
+ - [ ] Experiment with different LoRA parameters
+ - [ ] Test on diverse queries
+
+ ### RAG Enhancements
+ - [ ] Add more insurance documents to volume
+ - [ ] Re-index with updated documents
+ - [ ] Test retrieval quality
+ - [ ] Optimize chunk sizes if needed
+
+ ### Documentation
+ - [ ] Add API usage examples
+ - [ ] Create deployment guide
+ - [ ] Document troubleshooting steps
+
+ ## 📊 Medium Term (Next 2 Weeks)
+
+ ### Model Optimization
+ 1. **Fine-tuning iterations**
+    - Analyze evaluation results
+    - Adjust training parameters
+    - Re-train if needed
+
+ 2. **RAG improvements**
+    - Experiment with different embedding models
+    - Optimize retrieval parameters (top-k, similarity threshold)
+    - Add query rewriting
+
+ 3. **Performance monitoring**
+    - Set up logging
+    - Track latency trends
+    - Monitor costs
+
+ ### Feature Additions
+ - [ ] Add streaming responses
+ - [ ] Implement caching layer
+ - [ ] Add query history
+ - [ ] Create admin dashboard
+
+ ## 🎨 Long Term (Next Month)
+
+ ### Production Readiness
+ 1. **Deployment**
+    - Set up CI/CD pipeline
+    - Configure monitoring and alerts
+    - Implement rate limiting
+    - Add authentication if needed
+
+ 2. **Scaling**
+    - Optimize container scaling
+    - Implement load balancing
+    - Add caching (Redis)
+    - Set up CDN for static assets
+
+ 3. **Advanced Features**
+    - Multi-modal support (images, tables)
+    - Batch processing
+    - A/B testing framework
+    - Analytics dashboard
+
+ ## 🔧 Technical Debt
+
+ - [ ] Remove `bkp/` directory (old backup files)
+ - [ ] Clean up unused dependencies
+ - [ ] Add comprehensive tests
+ - [ ] Improve error handling
+ - [ ] Add input validation
+
+ ## 📈 Metrics to Track
+
+ **Performance:**
+ - Inference latency (target: <3s)
+ - Retrieval accuracy
+ - GPU utilization
+ - Cost per query
+
+ **Quality:**
+ - Model accuracy on evaluation set
+ - RAG relevance scores
+ - User satisfaction (if applicable)
+
+ ## 🤔 Decision Points
+
+ 1. **Model Selection:**
+    - [ ] Continue with Phi-3-mini
+    - [ ] Experiment with larger models
+    - [ ] Try different base models
+
+ 2. **Infrastructure:**
+    - [ ] Stay with Modal (current)
+    - [ ] Migrate to another platform
+    - [ ] Self-hosted deployment
+
+ 3. **Data Strategy:**
+    - [ ] Expand training dataset
+    - [ ] Add domain-specific data
+    - [ ] Implement data versioning
+
+ ## 📚 Quick Reference
+
+ ### Key Commands
+ ```bash
+ # Fine-tuning
+ ./venv/bin/modal run src/finetune/finetune_modal.py
+
+ # Model merging
+ ./venv/bin/modal run src/finetune/merge_model.py
+
+ # Deploy vLLM endpoint (fine-tuned)
+ ./venv/bin/modal deploy src/finetune/api_endpoint_vllm.py
+
+ # Deploy RAG endpoint
+ ./venv/bin/modal deploy src/rag/rag_vllm.py
+
+ # Evaluation
+ ./venv/bin/modal run src/finetune/eval_finetuned.py
+ ```
+
+ ### Documentation
+ - **Main Guide:** `docs/HOW_TO_RUN.md`
+ - **Architecture:** `diagrams/` folder
+ - **Testing:** `docs/TESTING.md`
+ - **Agent Design:** `docs/agentdesign.md`
+
+ ## 🎯 Success Criteria
+
+ **Phase 1 (Current):**
+ - ✅ <3s inference latency
+ - ✅ vLLM optimization working
+ - ✅ RAG retrieval functional
+
+ **Phase 2 (Next):**
+ - [ ] >90% accuracy on evaluation set
+ - [ ] <2s average latency
+ - [ ] Production deployment complete
+
+ **Phase 3 (Future):**
+ - [ ] Multi-user support
+ - [ ] Advanced analytics
+ - [ ] Cost optimization (<$X per 1K queries)
QUICK_START.md → docs/QUICK_START.md RENAMED
File without changes
docs/QUICK_START_API.md ADDED
@@ -0,0 +1,75 @@
+ # Quick Start: RAG API
+
+ A fast API endpoint for querying product design documents with <3-second response times.
+
+ ## Deploy the API
+
+ ```bash
+ # Deploy to Modal
+ modal deploy src/rag/rag_api.py
+
+ # Get the API URL
+ modal app show insurance-rag-api
+ ```
+
+ ## Use the API
+
+ ### Python Client
+
+ ```python
+ from src.rag.api_client import RAGAPIClient
+
+ # Initialize client
+ client = RAGAPIClient(base_url="https://your-api-url.modal.run")
+
+ # Query
+ result = client.query("What are the three product tiers?")
+ print(result['answer'])
+ print(f"Response time: {result['total_time']:.2f}s")
+ ```
+
+ ### cURL
+
+ ```bash
+ curl -X POST https://your-api-url.modal.run/query \
+   -H "Content-Type: application/json" \
+   -d '{"question": "What are the three product tiers?"}'
+ ```
+
+ ### JavaScript
+
+ ```javascript
+ const response = await fetch('https://your-api-url.modal.run/query', {
+   method: 'POST',
+   headers: { 'Content-Type': 'application/json' },
+   body: JSON.stringify({ question: 'What are the three product tiers?' })
+ });
+
+ const data = await response.json();
+ console.log(data.answer);
+ ```
+
+ ## Test Performance
+
+ ```bash
+ # Test with default URL
+ python tests/test_api.py
+
+ # Test with custom URL
+ python tests/test_api.py --url https://your-api-url.modal.run
+ ```
+
+ ## Performance Target
+
+ - **Target**: <3 seconds per query
+ - **Typical**: 1.5-2.5 seconds
+ - **Optimizations**: Warm containers, reduced tokens, limited context
+
+ ## API Endpoints
+
+ - `GET /health` - Health check
+ - `POST /query` - Query the RAG system
+ - `GET /` - API information
+
+ See `docs/api/RAG_API.md` for full documentation.
+
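Where the project client above isn't installed, a stdlib-only call against the same `/query` endpoint is enough for a one-off check. This is a sketch under assumptions: `ask` and `query_url` are illustrative names, and the base URL is a placeholder for your deployed endpoint.

```python
import json
import urllib.request

def query_url(base_url: str) -> str:
    """Normalize the base URL and append the /query route."""
    return base_url.rstrip("/") + "/query"

def ask(base_url: str, question: str) -> dict:
    """POST a question to the /query endpoint and return the parsed JSON response."""
    req = urllib.request.Request(
        query_url(base_url),
        data=json.dumps({"question": question}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

The returned dict carries the same fields the client example reads, e.g. `result['answer']` and `result['total_time']`.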
docs/README.md ADDED
@@ -0,0 +1,58 @@
+ # Documentation Index
+
+ This directory contains all project documentation.
+
+ ## 📚 Main Guides
+
+ ### Getting Started
+ - **[HOW_TO_RUN.md](HOW_TO_RUN.md)** - Complete guide to running the fine-tuning pipeline
+ - **[QUICK_START.md](QUICK_START.md)** - Quick start guide for the project
+ - **[QUICK_START_API.md](QUICK_START_API.md)** - API quick start guide
+
+ ### Fine-Tuning
+ - **[finetune/](../finetune/)** - Fine-tuning documentation and guides
+   - Data preparation
+   - Dataset generation
+   - Model training
+   - Evaluation
+
+ ### RAG System
+ - **[README_RAG.md](README_RAG.md)** - RAG system overview
+ - **[guides/QUICK_START_RAG.md](guides/QUICK_START_RAG.md)** - RAG quick start
+ - **[guides/RAG_SETUP_COMPLETE.md](guides/RAG_SETUP_COMPLETE.md)** - Complete RAG setup guide
+ - **[api/RAG_API.md](api/RAG_API.md)** - RAG API documentation
+
+ ### Deployment
+ - **[deployment/](deployment/)** - Deployment guides
+   - **[README.md](deployment/README.md)** - Deployment overview
+   - **[NEBIUS_DEPLOYMENT.md](deployment/NEBIUS_DEPLOYMENT.md)** - Nebius deployment guide
+
+ ### Reference
+ - **[STRUCTURE.md](STRUCTURE.md)** - Project structure overview
+ - **[TESTING.md](TESTING.md)** - Testing guide
+ - **[MIGRATION_GUIDE.md](MIGRATION_GUIDE.md)** - Migration guide
+ - **[VLLM_MIGRATION.md](VLLM_MIGRATION.md)** - vLLM migration guide
+ - **[NEXT_STEPS.md](NEXT_STEPS.md)** - Next steps and roadmap
+
+ ### Agent Design
+ - **[agentdesign.md](agentdesign.md)** - AI agent design for automated development workflow
+
+ ### Product Design
+ - **[product-design/](product-design/)** - Product design guides and examples
+   - Product decision guide
+   - RAG setup for product design
+   - Example: Tokyo auto insurance product design
+
+ ## 🔧 Additional Resources
+
+ ### Data Sources
+ - **[guides/estat_api_guide.md](guides/estat_api_guide.md)** - e-Stat API guide
+ - **[guides/source_data.md](guides/source_data.md)** - Data source documentation
+ - **[guides/ft_process.md](guides/ft_process.md)** - Fine-tuning process details
+
+ ### Troubleshooting
+ - **[guides/TROUBLESHOOTING.md](guides/TROUBLESHOOTING.md)** - General troubleshooting
+ - **[guides/WEB_TROUBLESHOOTING.md](guides/WEB_TROUBLESHOOTING.md)** - Web interface troubleshooting
+
+ ### Web Interface
+ - **[guides/WEB_INTERFACE.md](guides/WEB_INTERFACE.md)** - Web interface documentation
README_RAG.md → docs/README_RAG.md RENAMED
File without changes
STRUCTURE.md → docs/STRUCTURE.md RENAMED
File without changes
TESTING.md → docs/TESTING.md RENAMED
File without changes
VLLM_MIGRATION.md → docs/VLLM_MIGRATION.md RENAMED
File without changes
docs/api/RAG_API.md ADDED
@@ -0,0 +1,244 @@
1
+ # RAG API Documentation
2
+
3
+ Fast API endpoint for querying the product design RAG system with <3 second response times.
4
+
5
+ ## Quick Start
6
+
7
+ ### Deploy the API
8
+
9
+ ```bash
10
+ # Deploy to Modal
11
+ modal deploy src/rag/rag_api.py
12
+
13
+ # Get the URL
14
+ modal app list
15
+ ```
16
+
17
+ ### Use the API
18
+
19
+ ```python
20
+ from src.rag.api_client import RAGAPIClient
21
+
22
+ client = RAGAPIClient(base_url="https://your-modal-url.modal.run")
23
+ result = client.query("What are the three product tiers?")
24
+ print(result['answer'])
25
+ ```
26
+
27
+ ## API Endpoints
28
+
29
+ ### Health Check
30
+
31
+ ```http
32
+ GET /health
33
+ ```
34
+
35
+ **Response:**
36
+ ```json
37
+ {
38
+ "status": "healthy",
39
+ "service": "rag-api"
40
+ }
41
+ ```
42
+
43
+ ### Query
44
+
45
+ ```http
46
+ POST /query
47
+ Content-Type: application/json
48
+
49
+ {
50
+ "question": "What are the three product tiers?",
51
+ "top_k": 5,
52
+ "max_tokens": 1024
53
+ }
54
+ ```
55
+
56
+ **Response:**
57
+ ```json
58
+ {
59
+ "answer": "The three product tiers are...",
60
+ "retrieval_time": 0.45,
61
+ "generation_time": 1.23,
62
+ "total_time": 1.68,
63
+ "sources": [
64
+ {
65
+ "content": "...",
66
+ "metadata": {...}
67
+ }
68
+ ],
69
+ "success": true
70
+ }
71
+ ```
72
+
73
+ ## Performance Optimization
74
+
75
+ ### Target: <3 Second Responses
76
+
77
+ The API is optimized for fast responses:
78
+
79
+ 1. **Warm Containers**: `min_containers=1` keeps a container ready
80
+ 2. **Optimized LLM**: Reduced max_tokens (1024 vs 1536)
81
+ 3. **Limited Context**: Top 3 documents, 800 chars each
82
+ 4. **Prefix Caching**: Enabled for faster generation
83
+ 5. **Concurrent Requests**: Up to 10 concurrent requests
84
+
85
+ ### Response Time Breakdown
86
+
87
+ - **Retrieval**: 0.3-0.8 seconds
88
+ - **Generation**: 1.0-2.0 seconds
89
+ - **Total**: 1.5-3.0 seconds (target: <3s)
90
+
91
+ ## Usage Examples
92
+
93
+ ### Python Client
94
+
95
+ ```python
96
+ from src.rag.api_client import RAGAPIClient
97
+
98
+ # Initialize
99
+ client = RAGAPIClient(base_url="https://your-api-url.modal.run")
100
+
101
+ # Health check
102
+ health = client.health_check()
103
+ print(health)
104
+
105
+ # Query
106
+ result = client.query("What are the premium ranges?")
107
+ print(result['answer'])
108
+
109
+ # Fast query (optimized for speed)
110
+ result = client.query_fast("What are the three tiers?")
111
+ print(result['answer'])
112
+ ```
113
+
114
+ ### cURL
115
+
116
+ ```bash
117
+ # Health check
118
+ curl https://your-api-url.modal.run/health
119
+
120
+ # Query
121
+ curl -X POST https://your-api-url.modal.run/query \
122
+ -H "Content-Type: application/json" \
123
+ -d '{
124
+ "question": "What are the three product tiers?",
125
+ "top_k": 5,
126
+ "max_tokens": 1024
127
+ }'
128
+ ```
129
+
130
+ ### JavaScript/TypeScript
131
+
132
+ ```javascript
133
+ const response = await fetch('https://your-api-url.modal.run/query', {
134
+ method: 'POST',
135
+ headers: {
136
+ 'Content-Type': 'application/json',
137
+ },
138
+ body: JSON.stringify({
139
+ question: 'What are the three product tiers?',
140
+ top_k: 5,
141
+ max_tokens: 1024
142
+ })
143
+ });
144
+
145
+ const data = await response.json();
146
+ console.log(data.answer);
147
+ ```
148
+
149
+ ## Configuration
150
+
151
+ ### Environment Variables
152
+
153
+ - `MODAL_APP_NAME`: App name (default: "insurance-rag-api")
154
+ - `MODAL_VOLUME_NAME`: Volume name (default: "mcp-hack-ins-products")
155
+
156
+ ### API Parameters
157
+
158
+ - `question` (required): The question to ask
159
+ - `top_k` (optional, default: 5): Number of documents to retrieve
160
+ - `max_tokens` (optional, default: 1024): Maximum response length
161
+
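The defaults above can be applied up front with a small validation helper. A minimal sketch: the function name and the upper bounds (20 and 2048) are assumptions for illustration, not limits enforced by the API.

```python
def validate_query_params(payload: dict) -> dict:
    """Apply the documented defaults and reject malformed requests."""
    question = payload.get("question")
    if not isinstance(question, str) or not question.strip():
        raise ValueError("'question' is required and must be a non-empty string")
    top_k = int(payload.get("top_k", 5))          # documented default: 5
    max_tokens = int(payload.get("max_tokens", 1024))  # documented default: 1024
    if not 1 <= top_k <= 20:                      # assumed bound for this sketch
        raise ValueError("'top_k' must be between 1 and 20")
    if not 1 <= max_tokens <= 2048:               # assumed bound for this sketch
        raise ValueError("'max_tokens' must be between 1 and 2048")
    return {"question": question.strip(), "top_k": top_k, "max_tokens": max_tokens}

print(validate_query_params({"question": "What are the premium ranges?"}))
# → {'question': 'What are the premium ranges?', 'top_k': 5, 'max_tokens': 1024}
```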
162
+ ## Performance Tips
163
+
164
+ 1. **Use Fast Query**: For speed-critical applications, use the `query_fast()` method
165
+ 2. **Reduce top_k**: Lower `top_k` (e.g., 3) for faster retrieval
166
+ 3. **Reduce max_tokens**: Lower `max_tokens` (e.g., 512) for faster generation
167
+ 4. **Cache Results**: Cache common queries client-side
168
+ 5. **Batch Requests**: If possible, batch multiple queries
169
+
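Tip 4 (client-side caching) can be as simple as memoizing the query function. A sketch using the standard library; `fake_query` stands in for the real API call:

```python
from functools import lru_cache

def make_cached_query(query_fn, maxsize=128):
    """Wrap a query function so repeated identical questions skip the network."""
    @lru_cache(maxsize=maxsize)
    def cached(question: str):
        return query_fn(question)
    return cached

calls = []
def fake_query(question):          # stand-in for the real HTTP call
    calls.append(question)
    return f"answer to {question}"

cached_query = make_cached_query(fake_query)
cached_query("What are the three tiers?")
cached_query("What are the three tiers?")  # served from the cache
print(len(calls))  # → 1: the backend was hit only once
```

Note that `lru_cache` requires hashable arguments, so this works for plain question strings but not for dict payloads.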
170
+ ## Error Handling
171
+
172
+ ```python
173
+ result = client.query("your question")
174
+
175
+ if result.get("success"):
176
+ print(result['answer'])
177
+ else:
178
+ print(f"Error: {result.get('error', 'Unknown error')}")
179
+ ```
180
+
181
+ ## Monitoring
182
+
183
+ ### Response Times
184
+
185
+ Monitor the `total_time` field in responses:
186
+ - Under 2s: Excellent
187
+ - 2-3s: Good (target)
188
+ - Over 3s: May need optimization
189
+
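Those thresholds translate directly into a small classifier you can run on each response's `total_time` (a sketch; the function name is illustrative):

```python
def classify_latency(total_time: float) -> str:
    """Bucket a response's total_time per the thresholds above."""
    if total_time < 2.0:
        return "excellent"
    if total_time <= 3.0:
        return "good"
    return "needs optimization"

print(classify_latency(1.4), classify_latency(2.5), classify_latency(4.2))
# → excellent good needs optimization
```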
190
+ ### Health Monitoring
191
+
192
+ ```python
193
+ health = client.health_check()
194
+ if health.get("status") != "healthy":
195
+ # Handle unhealthy state
196
+ pass
197
+ ```
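When waiting for a cold deployment to come up, it can help to poll the health check with exponential backoff rather than failing on the first unhealthy response. A sketch; `wait_until_healthy` and its parameters are illustrative, and `check` is any callable returning the `/health` payload:

```python
import time

def wait_until_healthy(check, attempts=5, base_delay=0.5):
    """Poll `check` with exponential backoff; return True once healthy."""
    for attempt in range(attempts):
        if check().get("status") == "healthy":
            return True
        time.sleep(base_delay * (2 ** attempt))
    return False

# Simulated health endpoint that recovers on the third call
states = iter([{"status": "starting"}, {"status": "starting"}, {"status": "healthy"}])
print(wait_until_healthy(lambda: next(states), base_delay=0.01))  # → True
```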
198
+
199
+ ## Deployment
200
+
201
+ ### Modal Deployment
202
+
203
+ ```bash
204
+ # Deploy
205
+ modal deploy src/rag/rag_api.py
206
+
207
+ # Get URL
208
+ modal app show insurance-rag-api
209
+ ```
210
+
211
+ ### Local Testing
212
+
213
+ ```bash
214
+ # Run locally (for development)
215
+ modal serve src/rag/rag_api.py
216
+ ```
217
+
218
+ ## Rate Limiting
219
+
220
+ The API supports up to 10 concurrent requests. For higher throughput:
221
+ - Deploy multiple instances
222
+ - Use a load balancer
223
+ - Implement client-side rate limiting
224
+
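Client-side rate limiting can be sketched as a sliding-window limiter that blocks until a request slot frees up. The class name and parameters below are illustrative, not part of the API:

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most `max_calls` within any `period`-second window."""
    def __init__(self, max_calls=10, period=1.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # timestamps of recent calls

    def acquire(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.calls and now - self.calls[0] > self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call ages out, then retire it
            time.sleep(self.period - (now - self.calls[0]))
            self.calls.popleft()
        self.calls.append(time.monotonic())

limiter = RateLimiter(max_calls=10, period=1.0)
limiter.acquire()  # call before each request to stay under the limit
```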
225
+ ## Security
226
+
227
+ - Add authentication (e.g., an API key header) before exposing the endpoint publicly
228
+ - Use HTTPS in production
229
+ - Implement rate limiting
230
+ - Validate input questions
231
+
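For the last point, validating questions before forwarding them is cheap. A minimal sketch; the length cap and function name are assumptions for illustration:

```python
MAX_QUESTION_LEN = 500  # assumed cap chosen for this sketch

def sanitize_question(raw: str) -> str:
    """Basic input validation before forwarding a question to the API."""
    if not isinstance(raw, str):
        raise TypeError("question must be a string")
    question = " ".join(raw.split())  # collapse runs of whitespace
    if not question:
        raise ValueError("question must not be empty")
    if len(question) > MAX_QUESTION_LEN:
        raise ValueError(f"question longer than {MAX_QUESTION_LEN} characters")
    return question

print(sanitize_question("  What are   the premium\nranges? "))
# → What are the premium ranges?
```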
232
+ ## Troubleshooting
233
+
234
+ ### Slow Responses (>3s)
235
+ - Check if container is warm (`min_containers=1`)
236
+ - Reduce `max_tokens`
237
+ - Reduce `top_k`
238
+ - Check network latency
239
+
240
+ ### Errors
241
+ - Verify documents are indexed
242
+ - Check Modal app status
243
+ - Review error messages in response
244
+
docs/deployment/ADD_GUIDES_TO_RAG.md DELETED
@@ -1,146 +0,0 @@
1
- # RAG Indexing Configuration
2
-
3
- ## Overview
4
-
5
- The RAG system indexes **only Word, PDF, and Excel files** containing product design information. **All markdown files are excluded** from indexing to keep the RAG focused on structured product documents.
6
-
7
- ## Currently Indexed Files
8
-
9
- The system automatically indexes files that match these patterns:
10
-
11
- 1. **Word Documents (.docx):**
12
- - Files with `tokyo_auto_insurance` or `product_design` in the filename
13
- - Example: `tokyo_auto_insurance_product_design.docx`
14
-
15
- 2. **PDF Documents (.pdf):**
16
- - Files with `tokyo_auto_insurance` or `product_design` in the filename
17
- - Example: `tokyo_auto_insurance_product_design.pdf`
18
-
19
- 3. **Excel Spreadsheets (.xlsx, .xls):**
20
- - Files with `tokyo_auto_insurance` or `product_design` in the filename
21
- - Example: `tokyo_auto_insurance_product_design.xlsx`
22
-
23
- ## Excluded Files
24
-
25
- The following files are **NOT indexed**:
26
-
27
- - ❌ **All markdown files** (`.md`, `.markdown`) - completely excluded
28
- - ❌ Guide files (e.g., `QUICK_START_RAG.md`, `PRODUCT_DECISION_GUIDE.md`)
29
- - ❌ Setup guides (e.g., `setup_product_design_rag.md`)
30
- - ❌ Troubleshooting guides
31
- - ❌ Web interface guides
32
- - ❌ Any other file types (`.txt`, `.csv`, `.json`, etc.)
33
-
34
- ## Files That Will Be Indexed
35
-
36
- Based on the current repository structure:
37
-
38
- βœ… **Will be indexed (if uploaded to Modal volume):**
39
- - `tokyo_auto_insurance_product_design.docx` (Word document)
40
- - `tokyo_auto_insurance_product_design.pdf` (PDF document)
41
- - `tokyo_auto_insurance_product_design.xlsx` (Excel spreadsheet)
42
- - `tokyo_auto_insurance_product_design.xls` (Excel 97-2003)
43
-
44
- ❌ **Will NOT be indexed (all excluded):**
45
- - `tokyo_auto_insurance_product_design.md` (markdown - excluded)
46
- - `tokyo_auto_insurance_product_design_filled.md` (markdown - excluded)
47
- - `QUICK_START_RAG.md` (markdown - excluded)
48
- - `PRODUCT_DECISION_GUIDE.md` (markdown - excluded)
49
- - `setup_product_design_rag.md` (markdown - excluded)
50
- - `TROUBLESHOOTING.md` (markdown - excluded)
51
- - `WEB_INTERFACE.md` (markdown - excluded)
52
- - All other markdown and non-supported file types
53
-
54
- ## How to Add More Product Design Files
55
-
56
- ### Option 1: Use Supported File Formats
57
- Convert your files to one of the supported formats:
58
- - **Word**: `.docx` format
59
- - **PDF**: `.pdf` format
60
- - **Excel**: `.xlsx` or `.xls` format
61
-
62
- **Important:**
63
- - The file must contain `tokyo_auto_insurance` **OR** `product_design` in the filename
64
- - Markdown files (`.md`) are **not supported** and will be ignored
65
-
66
- ### Option 2: Update the Loader
67
- Edit `src/rag/modal-rag-product-design.py` and modify the pattern matching:
68
-
69
- ```python
70
- # Current pattern for PDF files (line ~81):
71
- if 'tokyo_auto_insurance' in file_lower or 'product_design' in file_lower:
72
- pdf_files.append(full_path)
73
-
74
- # To add more patterns, modify to:
75
- if ('tokyo_auto_insurance' in file_lower or
76
- 'product_design' in file_lower or
77
- 'your_custom_pattern' in file_lower):
78
- pdf_files.append(full_path)
79
- ```
80
-
81
- **Note:** All markdown files are intentionally excluded. Only Word, PDF, and Excel files are processed.
82
-
83
- ## Uploading to Modal Volume
84
-
85
- To index product design documents, upload **only Word, PDF, or Excel files** to the Modal volume:
86
-
87
- ```bash
88
- # Upload Word document
89
- modal volume put mcp-hack-ins-products \
90
- docs/product-design/tokyo_auto_insurance_product_design.docx \
91
- docs/product-design/tokyo_auto_insurance_product_design.docx
92
-
93
- # Upload PDF document (if you have one)
94
- modal volume put mcp-hack-ins-products \
95
- docs/product-design/tokyo_auto_insurance_product_design.pdf \
96
- docs/product-design/tokyo_auto_insurance_product_design.pdf
97
-
98
- # Upload Excel spreadsheet (if you have one)
99
- modal volume put mcp-hack-ins-products \
100
- docs/product-design/tokyo_auto_insurance_product_design.xlsx \
101
- docs/product-design/tokyo_auto_insurance_product_design.xlsx
102
- ```
103
-
104
- **Important Notes:**
105
- - ❌ **Do NOT upload markdown files** (`.md`) - they will be ignored
106
- - βœ… Only `.docx`, `.pdf`, `.xlsx`, and `.xls` files are processed
107
- - βœ… Files must contain `tokyo_auto_insurance` or `product_design` in the filename
108
-
109
- ## Re-indexing
110
-
111
- After uploading new files, re-index:
112
-
113
- ```bash
114
- # Using CLI
115
- python src/web/query_product_design.py --index
116
-
117
- # Or direct Modal command
118
- modal run src/rag/modal-rag-product-design.py::index_product_design
119
- ```
120
-
121
- ## Benefits of Current Approach
122
-
123
- By focusing only on Word, PDF, and Excel files:
124
- - βœ… RAG answers are focused on structured product documents
125
- - βœ… No confusion from markdown guide/instruction content
126
- - βœ… Faster retrieval (smaller, more focused document set)
127
- - βœ… More accurate product-related answers from official documents
128
- - βœ… Better handling of tables and structured data (Excel, Word tables)
129
- - βœ… Cleaner source citations
130
- - βœ… Support for professional document formats
131
-
132
- ## Example Queries
133
-
134
- With product design documents indexed, you can ask:
135
-
136
- ```
137
- "What are the three product tiers and their premium ranges?"
138
- "What is the Year 3 premium volume projection?"
139
- "What are the FSA licensing requirements?"
140
- "What coverage does the Standard tier include?"
141
- "What is the target market size in Tokyo?"
142
- "Who are the main competitors?"
143
- ```
144
-
145
- The RAG system will retrieve relevant sections from the product design documents only, ensuring answers are focused on product information.
146
-
 
docs/guides/HOW_TO_RUN.md DELETED
@@ -1,215 +0,0 @@
1
- # How to Run the Fine-Tuning Pipeline
2
-
3
- This guide walks you through the complete pipeline from data generation to model deployment.
4
-
5
- ---
6
-
7
- ## πŸ“Š Dataset Generation Results
8
-
9
- ### Final Statistics
10
- - **Training Samples**: 201,651
11
- - **Validation Samples**: 22,407
12
- - **Total Dataset**: 224,058 high-quality QA pairs
13
- - **Improvement**: 150x more data than previous approach
14
-
15
- ### Batch Performance
16
- | Batch | Files | Data Points | Status |
17
- |-------|-------|-------------|--------|
18
- | 1 | 1,000 | 100,611 | βœ… Excellent |
19
- | 2 | 1,000 | 39,960 | βœ… Good |
20
- | 3 | 1,000 | 0 | ⚠️ Complex files |
21
- | 4 | 1,000 | 600 | ⚠️ Runner issue |
22
- | 5 | 1,000 | 54,627 | βœ… Excellent |
23
- | 6 | 1,000 | 5,400 | βœ… Good |
24
- | 7 | 888 | 22,860 | βœ… Good |
25
-
26
- ---
27
-
28
- ## πŸš€ Step-by-Step Instructions
29
-
30
- ### Step 1: Fine-Tune the Model
31
-
32
- Run the fine-tuning job on Modal with H200 GPU:
33
-
34
- ```bash
35
- cd /Users/veeru/agents/mcp-hack
36
-
37
- # Start fine-tuning in detached mode
38
- ./venv/bin/modal run --detach docs/finetune_modal.py
39
- ```
40
-
41
- **What happens:**
42
- - Loads 201,651 training samples from `finetune-dataset` volume
43
- - Trains Phi-3-mini-4k-instruct with LoRA on H200 GPU
44
- - Runs for ~90-120 minutes
45
- - Saves model to `model-checkpoints` volume
46
-
47
- **Monitor progress:**
48
- ```bash
49
- # View live logs
50
- modal app logs mcp-hack::finetune-phi3-modal
51
- ```
52
-
53
- ---
54
-
55
- ### Step 2: Evaluate the Model
56
-
57
- After training completes, test the model:
58
-
59
- ```bash
60
- ./venv/bin/modal run docs/eval_finetuned.py
61
- ```
62
-
63
- This will run sample questions and show the model's answers.
64
-
65
- ---
66
-
67
- ### Step 3: Deploy API Endpoint
68
-
69
- Deploy the inference API:
70
-
71
- **Option A: GPU Endpoint (A10G)**
72
- ```bash
73
- ./venv/bin/modal deploy docs/api_endpoint.py
74
- ```
75
-
76
- **Option B: CPU Endpoint**
77
- ```bash
78
- ./venv/bin/modal deploy docs/api_endpoint_cpu.py
79
- ```
80
-
81
- **Get the endpoint URL:**
82
- ```bash
83
- modal app list
84
- ```
85
-
86
- ---
87
-
88
- ### Step 4: Test the API
89
-
90
- ```bash
91
- # Example API call
92
- curl -X POST https://YOUR-MODAL-URL/ask \
93
- -H "Content-Type: application/json" \
94
- -d '{
95
- "question": "What is the population of Tokyo?",
96
- "context": "Japan Census data"
97
- }'
98
- ```
99
-
100
- ---
101
-
102
- ## πŸ“ Key Files
103
-
104
- ### Data Processing
105
- - `docs/prepare_finetune_data.py` - Generates dataset from CSV files
106
- - `docs/clean_sample.py` - Local testing script for data cleaning
107
-
108
- ### Model Training
109
- - `docs/finetune_modal.py` - Fine-tuning script (H200 GPU)
110
- - `docs/eval_finetuned.py` - Evaluation script
111
-
112
- ### API Deployment
113
- - `docs/api_endpoint.py` - GPU inference endpoint (A10G)
114
- - `docs/api_endpoint_cpu.py` - CPU inference endpoint
115
-
116
- ### Documentation
117
- - `diagrams/finetuning.svg` - Visual pipeline diagram
118
- - `finetune/04-evaluation.md` - Evaluation results
119
-
120
- ---
121
-
122
- ## πŸ”§ Modal Volumes
123
-
124
- The pipeline uses these Modal volumes:
125
-
126
- | Volume | Purpose | Size |
127
- |--------|---------|------|
128
- | `census-data` | Raw census CSV files | 6,838 files |
129
- | `economy-labor-data` | Raw economy CSV files | 50 files |
130
- | `finetune-dataset` | Generated JSONL training data | 224K samples |
131
- | `model-checkpoints` | Fine-tuned model weights | ~7GB |
132
-
133
- ---
134
-
135
- ## πŸ’‘ Tips
136
-
137
- ### If Training Fails
138
- ```bash
139
- # Check logs for errors
140
- modal app logs mcp-hack::finetune-phi3-modal
141
-
142
- # Restart training
143
- ./venv/bin/modal run --detach docs/finetune_modal.py
144
- ```
145
-
146
- ### If You Need to Regenerate Data
147
- ```bash
148
- # Clear existing dataset
149
- ./venv/bin/modal run docs/clear_dataset.py
150
-
151
- # Regenerate with new logic
152
- ./venv/bin/modal run --detach docs/prepare_finetune_data.py
153
- ```
154
-
155
- ### View Volume Contents
156
- ```bash
157
- # List files in a volume
158
- modal volume ls finetune-dataset
159
-
160
- # Download a file
161
- modal volume get finetune-dataset train.jsonl finetune/train.jsonl
162
- ```
163
-
164
- ---
165
-
166
- ## πŸ“ˆ Expected Timeline
167
-
168
- | Step | Duration | Notes |
169
- |------|----------|-------|
170
- | Data Generation | βœ… Complete | 224K samples ready |
171
- | Fine-Tuning | ~90-120 min | H200 GPU |
172
- | Evaluation | ~5 min | Quick tests |
173
- | API Deployment | ~2 min | Instant after deploy |
174
-
175
- ---
176
-
177
- ## 🎯 Next Steps
178
-
179
- 1. **Run fine-tuning** (see Step 1 above)
180
- 2. **Wait for completion** (~2 hours)
181
- 3. **Evaluate results** (see Step 2)
182
- 4. **Deploy API** (see Step 3)
183
- 5. **Test with real queries** (see Step 4)
184
-
185
- ---
186
-
187
- ## πŸ“ž Troubleshooting
188
-
189
- **Issue**: "Volume not found"
190
- ```bash
191
- # List all volumes
192
- modal volume list
193
- ```
194
-
195
- **Issue**: "Out of memory during training"
196
- - Reduce `per_device_train_batch_size` in `finetune_modal.py`
197
- - Current: 2 (already optimized for H200)
198
-
199
- **Issue**: "Model not loading in API"
200
- - Ensure fine-tuning completed successfully
201
- - Check `model-checkpoints` volume has files
202
-
203
- ---
204
-
205
- ## βœ… Success Criteria
206
-
207
- After completing all steps, you should have:
208
- - βœ… Fine-tuned Phi-3-mini model
209
- - βœ… Deployed API endpoint
210
- - βœ… Model answering questions about Japanese census/economy data
211
- - βœ… Improved accuracy over base model
212
-
213
- ---
214
-
215
- **Ready to start?** Run the fine-tuning command from Step 1!
 
docs/guides/SETUP_SUCCESS.md DELETED
@@ -1,63 +0,0 @@
1
- # βœ… RAG Setup Successful!
2
-
3
- ## Status: Working
4
-
5
- The product design RAG system is now fully operational!
6
-
7
- ### What Was Fixed
8
-
9
- 1. **File Detection**: Updated to find files in both root and `docs/` subdirectory
10
- 2. **GPU Fallback**: Added CPU fallback for embeddings (works without GPU)
11
- 3. **Word Document**: Markdown file works perfectly (Word file has python-docx issue but markdown has all content)
12
- 4. **Modal Command**: Auto-detects Modal in venv
13
-
14
- ### Current Status
15
-
16
- βœ… **Indexed**: 1 document (markdown), 56 chunks
17
- βœ… **Vector DB**: Created in ChromaDB collection `product_design`
18
- βœ… **Queries**: Working! Tested successfully
19
-
20
- ### Test Results
21
-
22
- ```bash
23
- $ python3 query_product_design.py --query "What are the three product tiers?"
24
- ```
25
-
26
- **Result**: βœ… Successfully retrieved and answered!
27
-
28
- ## Usage
29
-
30
- ### Query the Document
31
-
32
- ```bash
33
- # Single query
34
- python3 query_product_design.py --query "What are the three product tiers?"
35
-
36
- # Interactive mode
37
- python3 query_product_design.py --interactive
38
- ```
39
-
40
- ### Example Questions
41
-
42
- - "What are the three product tiers and their premium ranges?"
43
- - "What is the Year 3 premium volume projection?"
44
- - "What coverage does the Standard tier include?"
45
- - "What are the FSA licensing requirements?"
46
-
47
- ## Known Issues
48
-
49
- 1. **Word Document**: The `.docx` file has a python-docx compatibility issue with Modal volumes, but the markdown file contains all the same content and works perfectly.
50
-
51
- 2. **Answer Truncation**: Some answers may be truncated. This is normal - the system retrieves the most relevant chunks and generates concise answers.
52
-
53
- ## Next Steps
54
-
55
- 1. βœ… **Indexing**: Complete
56
- 2. βœ… **Query System**: Working
57
- 3. 🎯 **Ready to Use**: You can now query the product design document!
58
-
59
- Try it:
60
- ```bash
61
- python3 query_product_design.py --interactive
62
- ```
63
-
 
docs/guides/SUMMARY.md DELETED
@@ -1,114 +0,0 @@
1
- # βœ… Complete Setup Summary
2
-
3
- ## What Was Accomplished
4
-
5
- ### 1. Product Design Document βœ…
6
- - **Created**: Comprehensive 1,600-line product design document
7
- - **Filled**: All sections with realistic fictional data for "TokyoDrive Insurance"
8
- - **Formats**:
9
- - Markdown: `docs/tokyo_auto_insurance_product_design_filled.md`
10
- - Word: `docs/tokyo_auto_insurance_product_design.docx`
11
- - **Content**: 12 comprehensive sections covering all aspects of product design
12
-
13
- ### 2. RAG System Extension βœ…
14
- - **Created**: `src/modal-rag-product-design.py`
15
- - **Features**:
16
- - Supports Markdown and Word documents
17
- - Separate ChromaDB collection (doesn't interfere with existing RAG)
18
- - GPU-accelerated with Phi-3 model
19
- - Integrated with existing Modal infrastructure
20
-
21
- ### 3. Query Interface βœ…
22
- - **Created**: `query_product_design.py` - Simple CLI tool
23
- - **Features**:
24
- - Interactive mode for continuous queries
25
- - Single query mode
26
- - Index command
27
- - Clean, formatted output
28
-
29
- ### 4. Documentation βœ…
30
- - `docs/QUICK_START_RAG.md` - Quick start guide
31
- - `docs/setup_product_design_rag.md` - Detailed setup
32
- - `docs/next_steps_rag_recommendation.md` - Decision guide
33
- - `docs/RAG_SETUP_COMPLETE.md` - Complete setup info
34
- - `README_RAG.md` - Quick reference
35
-
36
- ## File Structure
37
-
38
- ```
39
- mcp-hack/
40
- β”œβ”€β”€ src/
41
- β”‚ └── modal-rag-product-design.py # Extended RAG system
42
- β”œβ”€β”€ query_product_design.py # CLI query interface
43
- β”œβ”€β”€ docs/
44
- β”‚ β”œβ”€β”€ tokyo_auto_insurance_product_design_filled.md
45
- β”‚ β”œβ”€β”€ tokyo_auto_insurance_product_design.docx
46
- β”‚ β”œβ”€β”€ QUICK_START_RAG.md
47
- β”‚ β”œβ”€β”€ setup_product_design_rag.md
48
- β”‚ β”œβ”€β”€ next_steps_rag_recommendation.md
49
- β”‚ β”œβ”€β”€ RAG_SETUP_COMPLETE.md
50
- β”‚ └── SUMMARY.md (this file)
51
- └── README_RAG.md # Quick reference
52
- ```
53
-
54
- ## Next Steps to Use
55
-
56
- ### Step 1: Index Documents (One-Time)
57
- ```bash
58
- python query_product_design.py --index
59
- ```
60
- ⏱️ Takes 2-5 minutes
61
-
62
- ### Step 2: Query the Document
63
- ```bash
64
- # Single query
65
- python query_product_design.py --query "What are the three product tiers?"
66
-
67
- # Interactive mode
68
- python query_product_design.py --interactive
69
- ```
70
-
71
- ## Example Use Cases
72
-
73
- ### For Development
74
- - Extract technical requirements
75
- - Get API specifications
76
- - Understand system architecture
77
-
78
- ### For Sales/Marketing
79
- - Get pricing information
80
- - Understand product features
81
- - Compare tiers
82
-
83
- ### For Compliance
84
- - Check regulatory requirements
85
- - Get licensing info
86
- - Understand data privacy rules
87
-
88
- ### For Financial Planning
89
- - Get projections
90
- - Understand cost structure
91
- - Check break-even analysis
92
-
93
- ## Key Features
94
-
95
- βœ… **Comprehensive Document**: 12 sections, 1,600 lines, fully filled with realistic data
96
- βœ… **RAG System**: Semantic search + LLM for intelligent Q&A
97
- βœ… **Easy Interface**: Simple CLI tool, no complex setup
98
- βœ… **Fast Queries**: 3-5 seconds after initial warm-up
99
- βœ… **Separate Collection**: Doesn't interfere with existing insurance products RAG
100
-
101
- ## Status
102
-
103
- πŸŽ‰ **Everything is ready!**
104
-
105
- 1. βœ… Product design document created and filled
106
- 2. βœ… Documents uploaded to Modal volume
107
- 3. βœ… RAG system extended
108
- 4. βœ… Query interface created
109
- 5. βœ… Documentation complete
110
-
111
- **Ready to index and query!**
112
-
113
- Run: `python query_product_design.py --index`
114
-
 
docs/guides/modal-rag-optimization.md DELETED
@@ -1,370 +0,0 @@
1
- # Modal RAG Performance Optimization Guide
2
-
3
- **Current Performance**: >1 minute per query
4
- **Target Performance**: <5 seconds per query
5
-
6
- ## πŸ” Performance Bottleneck Analysis
7
-
8
- ### Current Architecture Issues
9
-
10
- 1. **Model Loading Time** (~30-45 seconds)
11
- - Mistral-7B (13GB) loads on every cold start
12
- - Embedding model loads separately
13
- - No model caching between requests
14
-
15
- 2. **LLM Inference Time** (~15-30 seconds)
16
- - Mistral-7B is slow for inference
17
- - Running on A10G GPU (good, but model is large)
18
- - No inference optimization (quantization, etc.)
19
-
20
- 3. **Network Latency** (~2-5 seconds)
21
- - Remote ChromaDB calls
22
- - Modal container communication overhead
23
-
24
- ---
25
-
26
- ## πŸš€ Optimization Strategies (Ranked by Impact)
27
-
28
- ### 1. **Keep Containers Warm** ⭐⭐⭐⭐⭐
29
- **Impact**: Eliminates 30-45s cold start time
30
-
31
- **Current**:
32
- ```python
33
- min_containers=1 # Already doing this βœ…
34
- ```
35
-
36
- **Why it helps**: Your container stays loaded with models in memory. First query after deployment is slow, but subsequent queries are fast.
37
-
38
- **Cost**: ~$0.50-1.00/hour for warm A10G container
39
-
40
- ---
41
-
42
- ### 2. **Switch to Smaller/Faster LLM** ⭐⭐⭐⭐⭐
43
- **Impact**: Reduces inference from 15-30s to 2-5s
44
-
45
- **Options**:
46
-
47
- #### Option A: Mistral-7B-Instruct-v0.2 (Quantized)
48
- ```python
49
- from transformers import AutoModelForCausalLM, BitsAndBytesConfig
50
-
51
- quantization_config = BitsAndBytesConfig(
52
- load_in_4bit=True,
53
- bnb_4bit_compute_dtype=torch.float16,
54
- bnb_4bit_use_double_quant=True,
55
- bnb_4bit_quant_type="nf4"
56
- )
57
-
58
- self.model = AutoModelForCausalLM.from_pretrained(
59
- LLM_MODEL,
60
- quantization_config=quantization_config,
61
- device_map="auto"
62
- )
63
- ```
64
- - **Speed**: 3-5x faster (5-10s β†’ 1-3s)
65
- - **Quality**: Minimal degradation
66
- - **Memory**: 13GB β†’ 3.5GB
67
-
68
- #### Option B: Switch to Phi-3-mini (3.8B)
69
- ```python
70
- LLM_MODEL = "microsoft/Phi-3-mini-4k-instruct"
71
- ```
72
- - **Speed**: 5-10x faster than Mistral-7B
73
- - **Quality**: Good for RAG tasks
74
- - **Memory**: ~8GB β†’ 4GB
75
- - **Inference**: 2-4 seconds
76
-
77
- #### Option C: Use TinyLlama-1.1B
78
- ```python
79
- LLM_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
80
- ```
81
- - **Speed**: 10-20x faster
82
- - **Quality**: Lower, but acceptable for simple queries
83
- - **Memory**: ~2GB
84
- - **Inference**: <1 second
85
-
86
- ---
87
-
88
- ### 3. **Use vLLM for Inference** ⭐⭐⭐⭐
89
- **Impact**: 2-5x faster inference
90
-
91
- ```python
92
- # Install vLLM
93
- image = modal.Image.debian_slim(python_version="3.11").pip_install(
94
- "vllm==0.6.0",
95
- # ... other packages
96
- )
97
-
98
- # In RAGModel.enter()
99
- from vllm import LLM, SamplingParams
100
-
101
- self.llm_engine = LLM(
102
- model=LLM_MODEL,
103
- tensor_parallel_size=1,
104
- gpu_memory_utilization=0.9,
105
- max_model_len=2048 # Shorter context for speed
106
- )
107
-
108
- # In query method
109
- sampling_params = SamplingParams(
110
- temperature=0.7,
111
- max_tokens=256,
112
- top_p=0.9
113
- )
114
- outputs = self.llm_engine.generate([prompt], sampling_params)
115
- ```
116
-
117
- **Benefits**:
118
- - Continuous batching
119
- - PagedAttention (efficient memory)
120
- - Optimized CUDA kernels
121
- - 2-5x faster than HuggingFace pipeline
122
-
123
- ---
124
-
125
- ### 4. **Optimize Embedding Generation** ⭐⭐⭐
126
- **Impact**: Reduces query embedding time from 1-2s to 0.2-0.5s
127
-
128
- #### Option A: Use Smaller Embedding Model
129
- ```python
130
- EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
131
- # 384 dimensions vs 384 (bge-small is already good)
132
- ```
133
-
134
- #### Option B: Use ONNX Runtime
135
- ```python
136
- from optimum.onnxruntime import ORTModelForFeatureExtraction
137
-
138
- self.embeddings = ORTModelForFeatureExtraction.from_pretrained(
139
- EMBEDDING_MODEL,
140
- export=True,
141
- provider="CUDAExecutionProvider"
142
- )
143
- ```
144
- - **Speed**: 2-3x faster
145
- - **Quality**: Identical
146
-
147
- ---
148
-
149
- ### 5. **Reduce Context Window** ⭐⭐⭐
150
- **Impact**: Faster LLM processing
151
-
152
- ```python
153
- # In query method
154
- sampling_params = SamplingParams(
155
- max_tokens=128, # Instead of 256 or 512
156
- temperature=0.7
157
- )
158
-
159
- # Reduce retrieved documents
160
- top_k = 2 # Instead of 3
161
- ```
162
-
163
- **Why**: Less tokens to process = faster inference
164
-
165
- ---
166
-
167
- ### 6. **Cache ChromaDB Queries** ⭐⭐
168
- **Impact**: Saves 1-2s on repeated queries
169
-
170
- ```python
171
- from functools import lru_cache
172
- import hashlib
173
-
174
- @lru_cache(maxsize=100)
175
- def get_cached_docs(query_hash):
176
- return self.retriever.get_relevant_documents(query)
177
-
178
- # In query method
179
- query_hash = hashlib.md5(question.encode()).hexdigest()
180
- docs = get_cached_docs(query_hash)
181
- ```
182
-
183
- ---
184
-
185
- ### 7. **Use Faster GPU** ⭐⭐
186
- **Impact**: 1.5-2x faster inference
187
-
188
- ```python
189
- @app.cls(
190
- gpu="A100", # Instead of A10G
191
- # or
192
- gpu="H100", # Even faster
193
- )
194
- ```
195
-
196
- **Cost**: A100 is 2-3x more expensive than A10G
197
-
198
- ---
199
-
200
- ### 8. **Parallel Processing** ⭐⭐
201
- **Impact**: Overlap embedding + retrieval
202
-
203
- ```python
204
- import asyncio
205
-
206
- async def query_async(self, question: str):
207
- # Run embedding and LLM prep in parallel
208
- embedding_task = asyncio.create_task(
209
- self.get_query_embedding(question)
210
- )
211
-
212
- # ... rest of async pipeline
213
- ```
214
-
215
- ---
216
-
217
- ## 🎯 Recommended Implementation Plan
218
-
219
- ### Phase 1: Quick Wins (Get to <10s)
220
- 1. βœ… **Keep containers warm** (already done)
221
- 2. **Add 4-bit quantization** to Mistral-7B
222
- 3. **Reduce max_tokens** to 128
223
- 4. **Use top_k=2** instead of 3
224
-
225
- **Expected**: 60s β†’ 8-12s
226
-
227
- ---
228
-
229
- ### Phase 2: Major Speedup (Get to <5s)
230
- 1. **Switch to vLLM** for inference
231
- 2. **Use Phi-3-mini** instead of Mistral-7B
232
- 3. **Optimize embeddings** with ONNX
233
-
234
- **Expected**: 8-12s β†’ 3-5s
235
-
236
- ---
237
-
238
- ### Phase 3: Ultra-Fast (Get to <2s)
239
- 1. **Use TinyLlama** for simple queries
240
- 2. **Implement query caching**
241
- 3. **Upgrade to A100 GPU**
242
-
243
- **Expected**: 3-5s β†’ 1-2s
244
-
245
- ---
246
-
247
- ## πŸ“Š Performance Comparison Table
248
-
249
- | Configuration | Cold Start | Warm Query | Cost/Hour | Quality |
250
- |--------------|------------|------------|-----------|---------|
251
- | **Current** (Mistral-7B, A10G) | 45s | 15-30s | $0.50 | ⭐⭐⭐⭐⭐ |
252
- | **Phase 1** (Quantized, warm) | 30s | 8-12s | $0.50 | ⭐⭐⭐⭐ |
253
- | **Phase 2** (vLLM + Phi-3) | 20s | 3-5s | $0.50 | ⭐⭐⭐⭐ |
254
- | **Phase 3** (TinyLlama, A100) | 10s | 1-2s | $1.50 | ⭐⭐⭐ |
255
-
256
- ---
257
-
258
- ## πŸ”§ Code Changes for Phase 2 (Recommended)
259
-
260
- ### 1. Update model configuration
261
- ```python
262
- LLM_MODEL = "microsoft/Phi-3-mini-4k-instruct"
263
- EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5" # Keep same
264
- ```
265
-
266
- ### 2. Add vLLM to dependencies
267
- ```python
268
- image = modal.Image.debian_slim(python_version="3.11").pip_install(
269
- "vllm==0.6.0",
270
- "langchain==0.3.7",
271
- # ... rest
272
- )
273
- ```
274
-
275
- ### 3. Update RAGModel.enter()
276
- ```python
277
- from vllm import LLM, SamplingParams
278
-
279
- self.llm_engine = LLM(
280
- model=LLM_MODEL,
281
- tensor_parallel_size=1,
282
- gpu_memory_utilization=0.85,
283
- max_model_len=2048
284
- )
285
-
286
- self.sampling_params = SamplingParams(
287
- temperature=0.7,
288
- max_tokens=128,
289
- top_p=0.9
290
- )
291
- ```
292
-
293
- ### 4. Update query method
294
- ```python
295
- # Build prompt
296
- prompt = f"""Use the following context to answer the question.
297
-
298
- Context: {context}
299
-
300
- Question: {question}
301
-
302
- Answer:"""
303
-
304
- # Generate with vLLM
305
- outputs = self.llm_engine.generate([prompt], self.sampling_params)
306
- answer = outputs[0].outputs[0].text
307
- ```
308
-
309
- ---
310
-
311
- ## πŸ’° Cost vs Performance Trade-offs
312
-
313
- | Approach | Speed Gain | Cost Change | Implementation |
314
- |----------|-----------|-------------|----------------|
315
- | Quantization | 3-5x | $0 | Easy |
316
- | vLLM | 2-5x | $0 | Medium |
317
- | Smaller model | 5-10x | $0 | Easy |
318
- | A100 GPU | 1.5-2x | +200% | Easy |
319
- | Caching | Variable | $0 | Medium |
320
-
321
- ---
322
-
323
- ## 🎬 Next Steps
324
-
325
- 1. **Measure current performance** with logging
326
- 2. **Implement Phase 1** (quantization + reduce tokens)
327
- 3. **Test and measure** improvement
328
- 4. **Implement Phase 2** if needed (vLLM + Phi-3)
329
- 5. **Monitor** and iterate
330
-
331
- ---
332
-
333
- ## πŸ“ Performance Monitoring Code
334
-
335
- Add this to track performance:
336
-
337
- ```python
338
- import time
339
-
340
- @modal.method()
341
- def query(self, question: str, top_k: int = 2):
342
- start = time.time()
343
-
344
- # Embedding time
345
- embed_start = time.time()
346
- retriever = self.RemoteChromaRetriever(...)
347
- embed_time = time.time() - embed_start
348
-
349
- # Retrieval time
350
- retrieval_start = time.time()
351
- docs = retriever.get_relevant_documents(question)
352
- retrieval_time = time.time() - retrieval_start
353
-
354
- # LLM time
355
- llm_start = time.time()
356
- result = chain.invoke({"question": question})
357
- llm_time = time.time() - llm_start
358
-
359
- total_time = time.time() - start
360
-
361
- print(f"⏱️ Performance:")
362
- print(f" Embedding: {embed_time:.2f}s")
363
- print(f" Retrieval: {retrieval_time:.2f}s")
364
- print(f" LLM: {llm_time:.2f}s")
365
- print(f" Total: {total_time:.2f}s")
366
-
367
- return result
368
- ```
369
-
370
- This will help you identify the exact bottleneck!
 
docs/guides/modal-rag-sequence.md DELETED
@@ -1,168 +0,0 @@
- # Modal RAG System - Sequence Diagrams
-
- This document provides sequence diagrams for the Modal RAG (Retrieval-Augmented Generation) application.
-
- ## 1. Indexing Flow (create_vector_db)
-
- ```mermaid
- sequenceDiagram
-     participant User
-     participant Modal
-     participant CreateVectorDB as create_vector_db()
-     participant PDFLoader
-     participant TextSplitter
-     participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
-     participant ChromaDB as Remote ChromaDB
-
-     User->>Modal: modal run modal-rag.py::index
-     Modal->>CreateVectorDB: Execute function
-
-     CreateVectorDB->>PDFLoader: Load PDFs from /insurance-data
-     PDFLoader-->>CreateVectorDB: Return documents
-
-     CreateVectorDB->>TextSplitter: Split documents (chunk_size=1000)
-     TextSplitter-->>CreateVectorDB: Return chunks
-
-     CreateVectorDB->>Embeddings: Initialize (device='cuda')
-     CreateVectorDB->>Embeddings: Generate embeddings for chunks
-     Embeddings-->>CreateVectorDB: Return embeddings
-
-     CreateVectorDB->>ChromaDB: Connect to remote service
-     CreateVectorDB->>ChromaDB: Upsert chunks + embeddings
-     ChromaDB-->>CreateVectorDB: Confirm storage
-
-     CreateVectorDB-->>Modal: Complete
-     Modal-->>User: Success message
- ```
-
- ## 2. Query Flow (RAGModel.query)
-
- ```mermaid
- sequenceDiagram
-     participant User
-     participant Modal
-     participant QueryEntrypoint as query()
-     participant RAGModel
-     participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
-     participant ChromaRetriever as RemoteChromaRetriever
-     participant ChromaDB as Remote ChromaDB
-     participant LLM as Mistral-7B<br/>(A10G GPU)
-     participant RAGChain as LangChain RAG
-
-     User->>Modal: modal run modal-rag.py::query --question "..."
-     Modal->>QueryEntrypoint: Execute local entrypoint
-     QueryEntrypoint->>RAGModel: Instantiate RAGModel()
-
-     Note over RAGModel: @modal.enter() lifecycle
-     RAGModel->>Embeddings: Load embedding model (CUDA)
-     RAGModel->>ChromaDB: Connect to remote service
-     RAGModel->>LLM: Load Mistral-7B (A10G GPU)
-     RAGModel->>RAGModel: Initialize RemoteChromaRetriever
-
-     QueryEntrypoint->>RAGModel: query.remote(question)
-
-     RAGModel->>ChromaRetriever: Create retriever instance
-     RAGModel->>RAGChain: Build RAG chain
-
-     RAGChain->>ChromaRetriever: Retrieve relevant docs
-     ChromaRetriever->>Embeddings: embed_query(question)
-     Embeddings-->>ChromaRetriever: Query embedding
-     ChromaRetriever->>ChromaDB: query(embedding, k=3)
-     ChromaDB-->>ChromaRetriever: Top-k documents
-     ChromaRetriever-->>RAGChain: Return documents
-
-     RAGChain->>LLM: Generate answer with context
-     LLM-->>RAGChain: Generated answer
-     RAGChain-->>RAGModel: Return result
-
-     RAGModel-->>QueryEntrypoint: Return {answer, sources}
-     QueryEntrypoint-->>User: Display answer + sources
- ```
-
- ## 3. Web Endpoint Flow (RAGModel.web_query)
-
- ```mermaid
- sequenceDiagram
-     participant User
-     participant Browser
-     participant Modal as Modal Platform
-     participant WebEndpoint as RAGModel.web_query
-     participant QueryMethod as RAGModel.query
-     participant RAGChain
-     participant ChromaDB
-     participant LLM
-
-     User->>Browser: GET https://.../web_query?question=...
-     Browser->>Modal: HTTP GET request
-     Modal->>WebEndpoint: Route to @modal.fastapi_endpoint
-
-     WebEndpoint->>QueryMethod: Call query.local(question)
-
-     Note over QueryMethod,LLM: Same flow as Query diagram
-     QueryMethod->>RAGChain: Build chain
-     RAGChain->>ChromaDB: Retrieve docs
-     RAGChain->>LLM: Generate answer
-     LLM-->>QueryMethod: Return result
-
-     QueryMethod-->>WebEndpoint: Return {answer, sources}
-     WebEndpoint-->>Modal: JSON response
-     Modal-->>Browser: HTTP 200 + JSON
-     Browser-->>User: Display result
- ```
-
- ## 4. Container Lifecycle (RAGModel)
-
- ```mermaid
- sequenceDiagram
-     participant Modal
-     participant Container
-     participant RAGModel
-     participant GPU as A10G GPU
-     participant Volume as Modal Volume
-     participant ChromaDB
-
-     Modal->>Container: Start container (min_containers=1)
-     Container->>GPU: Allocate GPU
-     Container->>Volume: Mount /insurance-data
-
-     Container->>RAGModel: Call @modal.enter()
-
-     Note over RAGModel: Initialization phase
-     RAGModel->>RAGModel: Load HuggingFaceEmbeddings (CUDA)
-     RAGModel->>ChromaDB: Connect to remote service
-     RAGModel->>RAGModel: Load Mistral-7B (GPU)
-     RAGModel->>RAGModel: Create RemoteChromaRetriever class
-
-     RAGModel-->>Container: Ready
-     Container-->>Modal: Container warm and ready
-
-     Note over Modal,Container: Container stays warm (min_containers=1)
-
-     loop Handle requests
-         Modal->>RAGModel: Invoke query() method
-         RAGModel-->>Modal: Return result
-     end
-
-     Note over Modal,Container: Container persists until scaled down
- ```
-
- ## Key Components
-
- ### Modal Configuration
- - **App Name**: `insurance-rag`
- - **Volume**: `mcp-hack-ins-products` mounted at `/insurance-data`
- - **GPU**: A10G for the RAGModel class
- - **Autoscaling**: `min_containers=1`, `max_containers=1` (always warm)
-
- ### Models
- - **LLM**: `mistralai/Mistral-7B-Instruct-v0.3` (GPU, float16)
- - **Embeddings**: `BAAI/bge-small-en-v1.5` (GPU, CUDA)
-
- ### Storage
- - **Vector DB**: Remote ChromaDB service (`chroma-server-v2`)
- - **Collection**: `insurance_products`
- - **Chunk Size**: 1000 characters with 200 overlap
-
- ### Endpoints
- - **Local Entrypoints**: `list`, `index`, `query`
- - **Web Endpoint**: `RAGModel.web_query` (FastAPI GET endpoint)
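The configuration listed under Key Components maps onto a Modal app skeleton roughly like the following. This is a sketch only: names are taken from the doc, but the decorator arguments follow recent Modal releases and may need adjusting for your installed version.

```python
import modal

app = modal.App("insurance-rag")
volume = modal.Volume.from_name("mcp-hack-ins-products")

@app.cls(
    gpu="A10G",
    min_containers=1,
    max_containers=1,
    volumes={"/insurance-data": volume},
)
class RAGModel:
    @modal.enter()
    def setup(self):
        # Load BAAI/bge-small-en-v1.5 embeddings on CUDA, connect to the
        # remote ChromaDB service, and load Mistral-7B in float16 on the GPU.
        ...

    @modal.method()
    def query(self, question: str):
        # Retrieve top-k chunks, then generate an answer with the LLM.
        ...

    @modal.fastapi_endpoint()
    def web_query(self, question: str):
        # Same flow as query(), exposed over HTTP GET.
        return self.query.local(question)
```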
 
docs/guides/next_steps_rag_recommendation.md DELETED
@@ -1,77 +0,0 @@
- # Next Steps: RAG for Product Design Document
-
- ## Should You Add RAG?
-
- **Recommendation: YES, but with specific use cases in mind**
-
- ### Benefits of Adding RAG:
-
- 1. **Requirements Extraction**: Quickly find specific requirements in the 1,600-line document
- 2. **Stakeholder Q&A**: Answer questions like "What's the premium for a 28-year-old in Shibuya?"
- 3. **Design Validation**: Query coverage details, pricing tiers, compliance requirements
- 4. **Development Planning**: Extract technical requirements, API specs, integration needs
- 5. **Competitive Analysis**: Compare your product features vs competitors mentioned in the doc
-
- ### When RAG is NOT Needed:
-
- - If you just need to read/search the document manually
- - If the document is small enough to navigate easily
- - If you don't need to answer complex questions across multiple sections
-
- ## Implementation Options
-
- ### Option 1: Extend Existing Modal RAG (Recommended)
- - Your existing `modal-rag.py` already handles PDFs
- - Can easily add support for markdown/Word documents
- - Leverages existing ChromaDB infrastructure
- - **Effort**: Low (30-60 minutes)
-
- ### Option 2: Simple Document Search
- - Use grep/search tools for simple queries
- - **Effort**: None (already available)
-
- ### Option 3: Full RAG with Fine-Tuning
- - Fine-tune a model on the insurance domain + your product spec
- - **Effort**: High (days/weeks)
- - **Benefit**: Best accuracy for insurance-specific queries
-
- ## Recommended Next Steps
-
- 1. **Add Product Design Doc to RAG** (30 min)
-    - Extend `modal-rag.py` to load markdown/Word docs
-    - Index the filled product design document
-    - Test with sample queries
-
- 2. **Create Query Interface** (1-2 hours)
-    - Simple CLI or web interface
-    - Example queries:
-      - "What are the three product tiers and their premium ranges?"
-      - "What coverage does the Standard tier include?"
-      - "What are the Year 3 financial projections?"
-
- 3. **Use Cases to Test**:
-    - Requirements extraction for development
-    - Pricing questions for the sales team
-    - Compliance checklist generation
-    - Feature comparison queries
-
- ## Quick Decision Matrix
-
- | Use Case | RAG Needed? | Alternative |
- |----------|-------------|-------------|
- | Find specific section | ❌ No | Use table of contents |
- | Answer "What's the premium for X?" | βœ… Yes | Manual search |
- | Extract all requirements | βœ… Yes | Manual extraction |
- | Compare product tiers | βœ… Yes | Manual comparison |
- | Generate compliance checklist | βœ… Yes | Manual review |
- | Simple fact lookup | ⚠️ Maybe | Grep/search |
-
- ## Recommendation
-
- **Start with Option 1**: Extend your existing RAG to include the product design document. It's low effort, leverages existing infrastructure, and lets you query the spec as you develop the product.
-
- Would you like me to:
- 1. Extend `modal-rag.py` to support the product design document?
- 2. Create a simple query interface?
- 3. Both?
-
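The "simple query interface" from step 2 can start as a thin CLI. A sketch only: the `answer` function here is a placeholder, not the real call to the deployed Modal endpoint.

```python
import argparse

def answer(question: str) -> str:
    # Placeholder: the real version would issue an HTTP GET to the deployed
    # RAG web endpoint (e.g. .../web_query?question=...) and return its answer.
    return f"[stub] you asked: {question}"

def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Query the product design document")
    parser.add_argument("question", help="natural-language question about the spec")
    args = parser.parse_args(argv)
    print(answer(args.question))

if __name__ == "__main__":
    main()
```

Run as, for example, `python query_cli.py "What are the three product tiers and their premium ranges?"`.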
 
{scripts β†’ src}/__init__.py RENAMED
File without changes
{docs β†’ src/data}/clean_sample.py RENAMED
File without changes
{scripts β†’ src}/data/cleanup_data.py RENAMED
File without changes
{scripts β†’ src}/data/clear_census_volume.py RENAMED
File without changes
{scripts β†’ src}/data/convert_census_to_csv.py RENAMED
File without changes
{scripts β†’ src}/data/convert_economy_labor_to_csv.py RENAMED
File without changes
{scripts β†’ src}/data/convert_to_word.py RENAMED
File without changes
{scripts β†’ src}/data/create_custom_qa.py RENAMED
File without changes
{docs β†’ src/data}/debug_parser.py RENAMED
File without changes
{scripts β†’ src}/data/delete_census_csvs.py RENAMED
File without changes
{scripts β†’ src}/data/download_census_api.py RENAMED
File without changes
{scripts β†’ src}/data/download_census_csv_modal.py RENAMED
File without changes
{scripts β†’ src}/data/download_census_data.py RENAMED
File without changes
{scripts β†’ src}/data/download_census_modal.py RENAMED
File without changes
{scripts β†’ src}/data/download_economy_labor_modal.py RENAMED
File without changes
{scripts β†’ src}/data/fix_csv_filenames.py RENAMED
File without changes
{scripts β†’ src}/data/prepare_economy_data.py RENAMED
File without changes
{scripts β†’ src}/data/prepare_finetune_data.py RENAMED
File without changes
{scripts β†’ src}/data/remove_duplicate_csvs.py RENAMED
File without changes