diff --git a/MIGRATION_GUIDE.md b/MIGRATION_GUIDE.md deleted file mode 100644 index 298982aa85dfd6b513dabe3660f8e82718896bbb..0000000000000000000000000000000000000000 --- a/MIGRATION_GUIDE.md +++ /dev/null @@ -1,81 +0,0 @@ -# Repository Restructure Migration Guide - -## What Changed - -The repository has been reorganized for better structure and maintainability. - -## File Moves - -### RAG System -- `src/modal-rag.py` → `src/rag/modal-rag.py` -- `src/modal-rag-product-design.py` → `src/rag/modal-rag-product-design.py` - -### Web Application -- `web_app.py` → `src/web/web_app.py` -- `query_product_design.py` → `src/web/query_product_design.py` -- `templates/` → `src/web/templates/` -- `static/` → `src/web/static/` - -### Scripts -- Data processing scripts → `scripts/data/` -- Setup scripts → `scripts/setup/` -- Utility scripts → `scripts/tools/` - -### Documentation -- All `.md` files → `docs/guides/` -- Product design docs → `docs/product-design/` - -### Tests -- `test_*.py` → `tests/` - -## Updated Commands - -### Old Commands (No longer work) -```bash -python web_app.py -modal run src/modal-rag-product-design.py::query_product_design -``` - -### New Commands -```bash -# Web app -python src/web/web_app.py -# Or use helper script -./scripts/setup/start_web.sh - -# Modal RAG -modal run src/rag/modal-rag-product-design.py::query_product_design --question "your question" - -# Indexing -modal run src/rag/modal-rag-product-design.py::index_product_design -``` - -## Import Path Updates - -If you have custom scripts that import from these modules, update the imports: - -```python -# Old -from query_product_design import query_rag - -# New -import sys -sys.path.insert(0, 'src/web') -from query_product_design import query_rag -``` - -## Next Steps - -1. Update any custom scripts with new import paths -2. Update CI/CD pipelines if applicable -3. Update documentation references -4. Test all functionality - -## Rollback - -If you need to roll back, all files are still in git history. You can: -```bash -git log --oneline --all -- "old/path/to/file" -git checkout <commit-hash> -- "old/path/to/file" -``` - diff --git a/NEXT_STEPS.md b/NEXT_STEPS.md deleted file mode 100644 index e7ca127a656b1e5ec4b7b7342e157b541df6ee82..0000000000000000000000000000000000000000 --- a/NEXT_STEPS.md +++ /dev/null @@ -1,174 +0,0 @@ -# Next Steps - -## Current Status - -✅ **Completed:** -- Repository restructured and organized -- RAG system configured (Word, PDF, Excel only - no markdown) -- Web interface functional -- Nebius deployment guide created -- Documentation updated - -## Immediate Next Steps - -### 1. 
Test the Updated RAG System - -**Upload Product Design Documents:** -```bash -# Upload Word document (if you have it) -modal volume put mcp-hack-ins-products \ - docs/product-design/tokyo_auto_insurance_product_design.docx \ - docs/product-design/tokyo_auto_insurance_product_design.docx - -# Upload PDF (if you have one) -modal volume put mcp-hack-ins-products \ - docs/product-design/tokyo_auto_insurance_product_design.pdf \ - docs/product-design/tokyo_auto_insurance_product_design.pdf - -# Upload Excel (if you have one) -modal volume put mcp-hack-ins-products \ - docs/product-design/tokyo_auto_insurance_product_design.xlsx \ - docs/product-design/tokyo_auto_insurance_product_design.xlsx -``` - -**Re-index Documents:** -```bash -# Using CLI -python src/web/query_product_design.py --index - -# Or direct Modal command -modal run src/rag/modal-rag-product-design.py::index_product_design -``` - -**Test Queries:** -```bash -# Test via CLI -python src/web/query_product_design.py --query "What are the three product tiers?" - -# Or start web interface -python src/web/web_app.py -# Then open http://127.0.0.1:5000 in browser -``` - -### 2. Verify File Processing - -Check that the system correctly: -- ✅ Loads Word documents -- ✅ Loads PDF documents (if uploaded) -- ✅ Loads Excel files (if uploaded) -- ❌ Ignores markdown files -- ❌ Ignores other file types - -### 3. Production Readiness - -**Option A: Continue with Modal (Current Setup)** -- ✅ Already working -- ✅ No changes needed -- Just ensure documents are uploaded and indexed - -**Option B: Deploy to Nebius** -- Review: `docs/deployment/NEBIUS_DEPLOYMENT.md` -- Set up Nebius account -- Deploy RAG service and web app -- Migrate from Modal to Nebius - -## Recommended Path Forward - -### Short Term (This Week) -1. **Upload and index documents** - - Ensure Word/PDF/Excel files are in Modal volume - - Run indexing - - Test queries - -2. **Validate RAG quality** - - Ask various product questions - - Verify answer quality and accuracy - - Check source citations - -3. **Test web interface** - - Start web app - - Test from browser - - Verify all features work - -### Medium Term (Next 2 Weeks) -1. **Optimize RAG performance** - - Monitor query times - - Adjust chunk sizes if needed - - Fine-tune retrieval parameters - -2. **Add more documents** (if needed) - - Upload additional product design files - - Re-index as needed - -3. **User testing** - - Share with team/stakeholders - - Gather feedback - - Iterate on improvements - -### Long Term (Next Month) -1. **Deploy to production** - - Choose: Modal or Nebius - - Set up monitoring - - Configure auto-scaling (if needed) - -2. **Enhance features** - - Add authentication (if needed) - - Add query history - - Add export functionality - - Add analytics - -3. **Scale and optimize** - - Monitor costs - - Optimize for performance - - Add caching if needed - -## Quick Commands Reference - -```bash -# Index documents -python src/web/query_product_design.py --index - -# Query via CLI -python src/web/query_product_design.py --query "your question" - -# Start web interface -python src/web/web_app.py -# Or use helper script: -./scripts/setup/start_web.sh - -# Check Modal volume contents -modal volume ls mcp-hack-ins-products -``` - -## Decision Points - -1. **Deployment Platform:** - - [ ] Stay with Modal (current) - - [ ] Migrate to Nebius - - [ ] Use both (hybrid) - -2. **Document Management:** - - [ ] Keep documents in Modal volume - - [ ] Move to object storage (S3, etc.) - - [ ] Use version control - -3. 
**Access Control:** - - [ ] Public access (current) - - [ ] Add authentication - - [ ] Add role-based access - -## Questions to Consider - -- Do you have Word/PDF/Excel versions of your product design documents? -- Do you need to convert markdown files to Word/PDF format? -- Are you ready to deploy to production? -- Do you need authentication/access control? -- What's your target user base? - -## Getting Help - -- **Documentation:** See `docs/` directory -- **Troubleshooting:** See `docs/guides/TROUBLESHOOTING.md` -- **Deployment:** See `docs/deployment/NEBIUS_DEPLOYMENT.md` -- **Quick Start:** See `QUICK_START.md` - diff --git a/diagrams/1-indexing-flow.mmd b/diagrams/1-indexing-flow.mmd deleted file mode 100644 index a4be2fa43c1f2b07d94192d3ce3a394f5b8776f6..0000000000000000000000000000000000000000 --- a/diagrams/1-indexing-flow.mmd +++ /dev/null @@ -1,28 +0,0 @@ -sequenceDiagram - participant User - participant Modal - participant CreateVectorDB as create_vector_db() - participant PDFLoader - participant TextSplitter - participant Embeddings as HuggingFaceEmbeddings
(CUDA) - participant ChromaDB as Remote ChromaDB - - User->>Modal: modal run modal-rag.py::index - Modal->>CreateVectorDB: Execute function - - CreateVectorDB->>PDFLoader: Load PDFs from /insurance-data - PDFLoader-->>CreateVectorDB: Return documents - - CreateVectorDB->>TextSplitter: Split documents (chunk_size=1000) - TextSplitter-->>CreateVectorDB: Return chunks - - CreateVectorDB->>Embeddings: Initialize (device='cuda') - CreateVectorDB->>Embeddings: Generate embeddings for chunks - Embeddings-->>CreateVectorDB: Return embeddings - - CreateVectorDB->>ChromaDB: Connect to remote service - CreateVectorDB->>ChromaDB: Upsert chunks + embeddings - ChromaDB-->>CreateVectorDB: Confirm storage - - CreateVectorDB-->>Modal: Complete - Modal-->>User: Success message diff --git a/diagrams/1-indexing-flow.svg b/diagrams/1-indexing-flow.svg index 20fa3fbabe654964249f4157f926b6d41dc3da00..bccdedef49627bbe1543c9106a56f402626f3f73 100644 --- a/diagrams/1-indexing-flow.svg +++ b/diagrams/1-indexing-flow.svg @@ -1 +1,99 @@ -Remote ChromaDBHuggingFaceEmbeddings(CUDA)TextSplitterPDFLoadercreate_vector_db()ModalUserRemote ChromaDBHuggingFaceEmbeddings(CUDA)TextSplitterPDFLoadercreate_vector_db()ModalUsermodal run modal-rag.py::indexExecute functionLoad PDFs from /insurance-dataReturn documentsSplit documents (chunk_size=1000)Return chunksInitialize (device='cuda')Generate embeddings for chunksReturn embeddingsConnect to remote serviceUpsert chunks + embeddingsConfirm storageCompleteSuccess message \ No newline at end of file + + + + + + + + + + RAG Indexing Flow - Document Processing + Insurance Product Documents → Vector Database + + + + + 1. PDF DOCUMENTS + 📄 Insurance PDFs + MetLife, AIG, + Japan Post, Sonpo + + + + + + + + 2. TEXT EXTRACTION + PyPDF Loader + Page-by-page + → Documents + + + + + + + + 3. TEXT CHUNKING + RecursiveCharacterSplitter + Chunk: 1000 chars + → 3,766 chunks + + + + + + + + 4. EMBEDDINGS + 📊 bge-small-en-v1.5 + 384-dim vectors + GPU accelerated + + + + + + + + 5. VECTOR DB + 💾 ChromaDB + Collection: langchain + Persisted locally + + + + + + + + 6. READY ✅ + Fast similarity search + ~400ms retrieval + + + + Statistics + Total Docs: + 3,766 + Chunk Size: + 1,000 + Overlap: + 200 + Vector Dim: + 384 + + + + LangChain 0.3.7 • ChromaDB 0.5.20 • PyPDF 5.1.0 • sentence-transformers 3.3.0 + \ No newline at end of file diff --git a/diagrams/2-query-flow-medium.mmd b/diagrams/2-query-flow-medium.mmd deleted file mode 100644 index 0aa35692e8b4958e45453c761530e639a22ab907..0000000000000000000000000000000000000000 --- a/diagrams/2-query-flow-medium.mmd +++ /dev/null @@ -1,25 +0,0 @@ -sequenceDiagram - participant User - participant Modal - participant RAGModel - participant Embeddings - participant ChromaDB - participant LLM - - User->>Modal: modal run query --question "..." 
- - Note over Modal,RAGModel: Container Startup (if cold) - Modal->>RAGModel: Initialize - RAGModel->>Embeddings: Load embedding model (GPU) - RAGModel->>LLM: Load Mistral-7B (GPU) - - Note over Modal,LLM: Query Processing - Modal->>RAGModel: Process question - RAGModel->>Embeddings: Convert question to vector - RAGModel->>ChromaDB: Search similar documents - ChromaDB-->>RAGModel: Top 3 matching docs - - RAGModel->>LLM: Generate answer + context - LLM-->>RAGModel: Answer - - RAGModel-->>User: Display answer + sources diff --git a/diagrams/2-query-flow-medium.svg b/diagrams/2-query-flow-medium.svg index d024cefc9782066a86b69186b96f89ef2832b0fe..20a6f0a7accd1f41b1bf27263cd8736d78a74008 100644 --- a/diagrams/2-query-flow-medium.svg +++ b/diagrams/2-query-flow-medium.svg @@ -1 +1,106 @@ -LLMChromaDBEmbeddingsRAGModelModalUserLLMChromaDBEmbeddingsRAGModelModalUserContainer Startup (if cold)Query Processingmodal run query --question "..."InitializeLoad embedding model (GPU)Load Mistral-7B (GPU)Process questionConvert question to vectorSearch similar documentsTop 3 matching docsGenerate answer + contextAnswerDisplay answer + sources \ No newline at end of file + + + + + + + + + + RAG Query Flow - Medium Detail + Optimized Retrieval-Augmented Generation Pipeline + + + + + USER QUERY + 💬 Question + via API + + + + + + + EMBEDDING + bge-small-en-v1.5 + ~2ms + + + + + + + VECTOR SEARCH + ChromaDB (Local) + ~400ms ⚡ + + + + + + + CONTEXT + Top-3 documents + + metadata + + + + + + + PROMPT + Alpaca Template + Context + Question + + + + + + + vLLM ENGINE + Phi-3-mini (Fine-tuned) + AsyncLLMEngine + ~2-3s + + + + + + + RESPONSE + ✅ Answer + 📄 Sources + Total: <3s + + + + Latency + Embed: + 2ms + Search: + 400ms + Generate: + 2-3s + + <3s ⚡ + Modal A10G + + + + Technology Stack + vLLM 0.6.3 • ChromaDB 0.5.20 • LangChain 0.3.7 • sentence-transformers 3.3.0 • FastAPI + + + Endpoint: rag-vllm-optimized | Updated: 2025-11-30 + \ No newline at end of file diff --git a/diagrams/2-query-flow-simple.mmd b/diagrams/2-query-flow-simple.mmd deleted file mode 100644 index 19b7abc9fd590ab43c4df2d6fb3e18d6504762b6..0000000000000000000000000000000000000000 --- a/diagrams/2-query-flow-simple.mmd +++ /dev/null @@ -1,19 +0,0 @@ -sequenceDiagram - participant User - participant Modal - participant RAGModel - participant ChromaDB - participant LLM as Mistral-7B - - User->>Modal: Ask question - Modal->>RAGModel: Initialize (warm container) - - Note over RAGModel: Load models on GPU - - RAGModel->>ChromaDB: Search for relevant docs - ChromaDB-->>RAGModel: Return top 3 documents - - RAGModel->>LLM: Generate answer with context - LLM-->>RAGModel: Generated answer - - RAGModel-->>User: Answer + Sources diff --git a/diagrams/2-query-flow-simple.svg b/diagrams/2-query-flow-simple.svg index ec035acaa8d7ddb6b3a5b42daefc99dac2702eec..32060b85a9fca1b0dfa41df0f86e3809d5b78ae8 100644 --- a/diagrams/2-query-flow-simple.svg +++ b/diagrams/2-query-flow-simple.svg @@ -1 +1,105 @@ -Mistral-7BChromaDBRAGModelModalUserMistral-7BChromaDBRAGModelModalUserLoad models on GPUAsk questionInitialize (warm container)Search for relevant docsReturn top 3 documentsGenerate answer with contextGenerated answerAnswer + Sources \ No newline at end of file + + + + + + + + + + RAG Query Flow - vLLM Optimized + High-Performance Retrieval-Augmented Generation + + + + + 1. USER QUERY + 💬 + "What insurance + products available?" + + + + + + + + 2. EMBEDDING + 📊 bge-small-en-v1.5 + GPU: CUDA + ~2ms + + + + + + + + 3. 
VECTOR SEARCH + 💾 ChromaDB (Local) + 3,766 documents + ~400ms ⚡ + + + + + + + + 4. CONTEXT + Top 3 documents + + metadata + + + + + + + + 5. vLLM GENERATION + 🤖 Phi-3-mini (Fine-tuned) + AsyncLLMEngine + GPU Memory: 70% + ~2-3s ⚡ + + + + + + + + 6. RESPONSE + ✅ Answer + 📄 Sources + ⏱️ Metrics + Total: <3s + + + + Performance Breakdown + Embedding: + ~2ms + Vector Search: + ~400ms + LLM Generation: + ~2-3s + TOTAL: <3s ✨ + + + + Architecture: Modal A10G GPU + vLLM 0.6.3 • ChromaDB 0.5.20 • LangChain 0.3.7 • sentence-transformers 3.3.0 + + + Endpoint: rag-vllm-optimized | Updated: 2025-11-30 + \ No newline at end of file diff --git a/diagrams/2-query-flow.mmd b/diagrams/2-query-flow.mmd deleted file mode 100644 index 60b089f68b689e74aa8c99da0767839771e76b9a..0000000000000000000000000000000000000000 --- a/diagrams/2-query-flow.mmd +++ /dev/null @@ -1,39 +0,0 @@ -sequenceDiagram - participant User - participant Modal - participant QueryEntrypoint as query() - participant RAGModel - participant Embeddings as HuggingFaceEmbeddings
(CUDA) - participant ChromaRetriever as RemoteChromaRetriever - participant ChromaDB as Remote ChromaDB - participant LLM as Mistral-7B
(A10G GPU) - participant RAGChain as LangChain RAG - - User->>Modal: modal run modal-rag.py::query --question "..." - Modal->>QueryEntrypoint: Execute local entrypoint - QueryEntrypoint->>RAGModel: Instantiate RAGModel() - - Note over RAGModel: @modal.enter() lifecycle - RAGModel->>Embeddings: Load embedding model (CUDA) - RAGModel->>ChromaDB: Connect to remote service - RAGModel->>LLM: Load Mistral-7B (A10G GPU) - RAGModel->>RAGModel: Initialize RemoteChromaRetriever - - QueryEntrypoint->>RAGModel: query.remote(question) - - RAGModel->>ChromaRetriever: Create retriever instance - RAGModel->>RAGChain: Build RAG chain - - RAGChain->>ChromaRetriever: Retrieve relevant docs - ChromaRetriever->>Embeddings: embed_query(question) - Embeddings-->>ChromaRetriever: Query embedding - ChromaRetriever->>ChromaDB: query(embedding, k=3) - ChromaDB-->>ChromaRetriever: Top-k documents - ChromaRetriever-->>RAGChain: Return documents - - RAGChain->>LLM: Generate answer with context - LLM-->>RAGChain: Generated answer - RAGChain-->>RAGModel: Return result - - RAGModel-->>QueryEntrypoint: Return {answer, sources} - QueryEntrypoint-->>User: Display answer + sources diff --git a/diagrams/2-query-flow.svg b/diagrams/2-query-flow.svg index c7a2cc4c5d0012adad9bfb48f6ee070a4a0b03a2..ec1219e45886634ef4496dfc64c892d02a2840a0 100644 --- a/diagrams/2-query-flow.svg +++ b/diagrams/2-query-flow.svg @@ -1 +1,138 @@ -LangChain RAGMistral-7B(A10G GPU)Remote ChromaDBRemoteChromaRetrieverHuggingFaceEmbeddings(CUDA)RAGModelquery()ModalUserLangChain RAGMistral-7B(A10G GPU)Remote ChromaDBRemoteChromaRetrieverHuggingFaceEmbeddings(CUDA)RAGModelquery()ModalUser@modal.enter() lifecyclemodal run modal-rag.py::query --question "..."Execute local entrypointInstantiate RAGModel()Load embedding model (CUDA)Connect to remote serviceLoad Mistral-7B (A10G GPU)Initialize RemoteChromaRetrieverquery.remote(question)Create retriever instanceBuild RAG chainRetrieve relevant docsembed_query(question)Query embeddingquery(embedding, k=3)Top-k documentsReturn documentsGenerate answer with contextGenerated answerReturn resultReturn {answer, sources}Display answer + sources \ No newline at end of file + + + + + + + + + + RAG Query Flow - Detailed Architecture + vLLM-Optimized Retrieval-Augmented Generation System + + + + + USER REQUEST + 💬 Question + + + + + + + API ENDPOINT + FastAPI POST + + + + + + RAG Model Container (Modal A10G GPU) + + + + + EMBEDDING MODEL + bge-small-en-v1.5 + GPU: CUDA + ~2ms + + + + + + + VECTOR DB + ChromaDB (Local) + 3,766 docs + ~400ms + + + + + + + CONTEXT BUILDER + Top-3 documents + + metadata + + + + + + + PROMPT BUILDER + Alpaca Template + Context + Question + + + + + + + vLLM AsyncLLMEngine + + + Fine-tuned Model + Phi-3-mini-4k + merged_model/ + + + Generation + GPU: 70% + ~2-3s + + + + + + + RESPONSE FORMATTER + Answer + Sources + Metrics + JSON Response + + + + Performance Metrics + + Embedding Generation: + ~2ms + + Vector Search (Local): + ~400ms + + LLM Generation (vLLM): + ~2-3s + + + TOTAL LATENCY: <3s ⚡ + + + + Technology Stack + + Inference: + • vLLM 0.6.3 (AsyncLLMEngine) + + Retrieval: + • ChromaDB 0.5.20 (Local) + • sentence-transformers 3.3.0 + + Framework: + • LangChain 0.3.7 + + + Endpoint: rag-vllm-optimized | Infrastructure: Modal A10G GPU | Updated: 2025-11-30 + \ No newline at end of file diff --git a/diagrams/3-web-endpoint-flow.mmd b/diagrams/3-web-endpoint-flow.mmd deleted file mode 100644 index e6b7f942a72d721bfbfafb39a5efe97c43e32808..0000000000000000000000000000000000000000 --- 
a/diagrams/3-web-endpoint-flow.mmd +++ /dev/null @@ -1,26 +0,0 @@ -sequenceDiagram - participant User - participant Browser - participant Modal as Modal Platform - participant WebEndpoint as RAGModel.web_query - participant QueryMethod as RAGModel.query - participant RAGChain - participant ChromaDB - participant LLM - - User->>Browser: GET https://.../web_query?question=... - Browser->>Modal: HTTP GET request - Modal->>WebEndpoint: Route to @modal.fastapi_endpoint - - WebEndpoint->>QueryMethod: Call query.local(question) - - Note over QueryMethod,LLM: Same flow as Query diagram - QueryMethod->>RAGChain: Build chain - RAGChain->>ChromaDB: Retrieve docs - RAGChain->>LLM: Generate answer - LLM-->>QueryMethod: Return result - - QueryMethod-->>WebEndpoint: Return {answer, sources} - WebEndpoint-->>Modal: JSON response - Modal-->>Browser: HTTP 200 + JSON - Browser-->>User: Display result diff --git a/diagrams/3-web-endpoint-flow.svg b/diagrams/3-web-endpoint-flow.svg index d3894960789320bbcdbe60d23fc34588cec454d6..04fe74de0f39143d11c946b52887319df767fb8d 100644 --- a/diagrams/3-web-endpoint-flow.svg +++ b/diagrams/3-web-endpoint-flow.svg @@ -1 +1,94 @@ -LLMChromaDBRAGChainRAGModel.queryRAGModel.web_queryModal PlatformBrowserUserLLMChromaDBRAGChainRAGModel.queryRAGModel.web_queryModal PlatformBrowserUserSame flow as Query diagramGET https://.../web_query?question=...HTTP GET requestRoute to @modal.fastapi_endpointCall query.local(question)Build chainRetrieve docsGenerate answerReturn resultReturn {answer, sources}JSON responseHTTP 200 + JSONDisplay result \ No newline at end of file + + + + + + + + + + Web Endpoint Flow - FastAPI Integration + HTTP API for RAG and Fine-tuned Model Inference + + + + + CLIENT REQUEST + 🌐 HTTP POST + JSON payload + + + + + + + FASTAPI ENDPOINT + @modal.fastapi_endpoint + Async handler + + + + + + Modal Container (GPU) + + + + + RAG ENDPOINT + rag-vllm-optimized + ChromaDB + vLLM + <3s latency + + + + + FINE-TUNED API + phi3-inference-vllm + Merged model + <3s latency + + + + + PROCESSING LAYER + • Request validation • Model inference + • Response formatting • Error handling + + + + GPU Resources: A10G + Shared: Embeddings + vLLM Engine + + + + + RESPONSE + ✅ Answer + 📊 Metrics + 📄 Sources (RAG) + JSON format + + + + + + Available Endpoints + RAG: + rag-vllm-optimized-ragmodel-query + Fine-tuned: + phi3-inference-vllm-model-ask + + + Infrastructure: Modal • Framework: FastAPI • Updated: 2025-11-30 + \ No newline at end of file diff --git a/diagrams/4-container-lifecycle.mmd b/diagrams/4-container-lifecycle.mmd deleted file mode 100644 index f813c5d1dfcc69cc1977f73a5843b011f7cc9380..0000000000000000000000000000000000000000 --- a/diagrams/4-container-lifecycle.mmd +++ /dev/null @@ -1,31 +0,0 @@ -sequenceDiagram - participant Modal - participant Container - participant RAGModel - participant GPU as A10G GPU - participant Volume as Modal Volume - participant ChromaDB - - Modal->>Container: Start container (min_containers=1) - Container->>GPU: Allocate GPU - Container->>Volume: Mount /insurance-data - - Container->>RAGModel: Call @modal.enter() - - Note over RAGModel: Initialization phase - RAGModel->>RAGModel: Load HuggingFaceEmbeddings (CUDA) - RAGModel->>ChromaDB: Connect to remote service - RAGModel->>RAGModel: Load Mistral-7B (GPU) - RAGModel->>RAGModel: Create RemoteChromaRetriever class - - RAGModel-->>Container: Ready - Container-->>Modal: Container warm and ready - - Note over Modal,Container: Container stays warm (min_containers=1) - - loop Handle requests - 
Modal->>RAGModel: Invoke query() method - RAGModel-->>Modal: Return result - end - - Note over Modal,Container: Container persists until scaled down diff --git a/diagrams/4-container-lifecycle.svg b/diagrams/4-container-lifecycle.svg index aa3d4fe075eccfa7fedd8782874bacbd811b042d..5784d43874632c4824abc8d1e32cbe00153f5099 100644 --- a/diagrams/4-container-lifecycle.svg +++ b/diagrams/4-container-lifecycle.svg @@ -1 +1,118 @@ -ChromaDBModal VolumeA10G GPURAGModelContainerModalChromaDBModal VolumeA10G GPURAGModelContainerModalInitialization phaseContainer stays warm (min_containers=1)loop[Handle requests]Container persists until scaled downStart container (min_containers=1)Allocate GPUMount /insurance-dataCall @modal.enter()Load HuggingFaceEmbeddings (CUDA)Connect to remote serviceLoad Mistral-7B (GPU)Create RemoteChromaRetriever classReadyContainer warm and readyInvoke query() methodReturn result \ No newline at end of file + + + + + + + + + + Modal Container Lifecycle - vLLM Optimized + GPU Container Management for RAG and Fine-tuned Models + + + + + 1. COLD START + Container creation + Image pull + ~30-60s + + + + + + + 2. LOAD RESOURCES + 📚 Embeddings + 💾 ChromaDB + 🤖 vLLM Engine + ~20-40s + + + + + + + 3. READY ✅ + Accepting requests + GPU warmed up + <3s latency + + + + + + + 4. PROCESSING + Handling queries + GPU inference + Concurrent requests + Async execution + + + + + 5. IDLE TIMER + No requests + Scaledown window: + 300 seconds + (5 minutes) + + + + + + + 6. SHUTDOWN + Container stopped + Resources freed + Cost savings + + + + + + New request + + + + Container Configuration + + GPU: + A10G (24GB VRAM) + + Scaledown: + 300s idle timeout + + Memory: + 70% GPU utilization + + Concurrency: + Async (multiple requests) + + Warm Start: + <100ms (if cached) + + Cold Start: + ~50-100s total + + + + Benefits of Modal Container Management + ✅ Auto-scaling • 💰 Cost optimization • ⚡ Fast warm starts • 🔄 Automatic restarts • 📊 Built-in monitoring + + + Infrastructure: Modal • GPU: A10G • Updated: 2025-11-30 + \ No newline at end of file diff --git a/diagrams/finetuning.svg b/diagrams/finetuning.svg index df1f55bf62e5e6bd208514bd0cf1d147b495b554..4a5f22660ed1e52955555f9316fda240948a15ac 100644 --- a/diagrams/finetuning.svg +++ b/diagrams/finetuning.svg @@ -1,179 +1,114 @@ - - + - - - - Fine-tuning Pipeline: CSV → Trained Model - - - - 1. Raw Data Sources - 📁 Census CSVs (6,838 files) - 📁 Economy/Labor CSVs (50 files) - • Multi-row headers with metadata - • Codes (13103) instead of names - - - - 2. CSV Structure - Row 0: Title, Unnamed:1, Unnamed:2... - Row 1-7: Metadata, notes... - Row 8: Code, Name, Population... - Row 9: 13103, Minato-ku, 260071... - ⚠️ Real data starts at Row 8+ - - - - - - - 3. Smart Parser - 📝 prepare_finetune_data.py - ✓ Skip rows with "Unnamed" - ✓ Detect header row (Row 8) - ✓ Clean values (remove codes) - ✓ Filter valid columns - - - - - - - 4. Data Extraction - For each CSV file: - 1. Read file → find header row - 2. Extract data rows (9+) - 3. Sample 500 rows per file - 4. Generate QA pairs - Result: ~1,350 training samples - - - - - - - 5. QA Generation (Current) - ❌ Problem: Uses random columns - row_label = "13103" (code!) - column = "Members per household" - value = "2.56" - Q: "What is X for 13103?" - A: "The X for 13103 is 2.56." - - - - - - - 6. Better Approach - ✓ Always use name column - row_label = "Minato-ku, Tokyo" - column = "Members per household" - value = "2.56" - Q: "What is X for Minato-ku?" - A: "The X for Minato-ku is 2.56." 
- - - - needs fix - - - - 7. Training Data (JSONL Format) - { - "instruction": "What is the Members per household for 1231?", - "input": "Context: Japan Census data...", - "output": "The Members per household for 1231 is 3.56." - } - - - - - - - 8. Fine-tuning - 📝 finetune_modal.py - • Model: Phi-3-mini-4k-instruct - • GPU: H200 (90 mins) - • Method: LoRA + Unsloth - - - - - - - 9. Fine-tuned Model - 🎯 Saved to Modal Volume: - model-checkpoints/ - Ready for inference! - - - - - - - 10. Inference API - 📝 api_endpoint.py (GPU - A10G) - 📝 api_endpoint_cpu.py (CPU) - POST /ask → Get answers - - - - - - - 11. User Query - Q: "Population of Tokyo?" - A: "The population for - 13100 is 13,960,000." - - - - - - - 📊 Pipeline Summary - Input: 6,888 CSV files with complex headers → Output: Fine-tuned model that answers questions - ✓ Smart header detection (skip metadata rows) - ✓ QA pair generation (1,350 samples) - ⚠️ Current issue: Uses codes (13103) instead of names (Minato-ku) - - - - 📈 Current Metrics - • Total CSV files: 6,888 - • Training samples: 1,351 - • Validation samples: 151 - • Training time: ~90 minutes (H200) - - - - ⚠️ Issues & Solutions - Issue: Row labels use codes (13103) - Solution: Always use name column - Issue: Only 1,351 samples (too small) - Solution: Fix census file parsing + Fine-Tuning Pipeline - Phi-3-mini with vLLM + High-Performance Model Training & Deployment + + + + + 1. DATA PREPARATION + 📊 Japan Census CSV + 201,651 samples + QA Generation (Gemini) + Train/Val Split (80/20) + → train.jsonl / val.jsonl + + + + + + + + 2. FINE-TUNING + 🖥️ Modal H200 GPU + Phi-3-mini-4k-instruct + LoRA (r=16, α=16) + 4-bit Quantization + 10,000 steps → adapter + + + + + + + + 3. MODEL MERGING + 🔄 Modal A10G GPU + Merge LoRA + Base + Save as bfloat16 + → merged_model/ + model-checkpoints volume + + + + + + + + 4. DEPLOYMENT + + + + Option A: Standard API + Transformers + PEFT + + + + Option B: vLLM API ⚡ + <3s latency + + + + + + + Performance Metrics + + Training Time: ~2-3 hours + GPU Memory: ~40GB (H200) + Dataset Size: 201,651 samples + + Inference Latency: + Standard API: ~10s + vLLM API: <3s ✨ + + + + Technology Stack + Modal • PyTorch 2.4.0 • Transformers 4.44.2 • PEFT 0.12.0 • vLLM 0.6.3 • bitsandbytes 0.43.3 + LoRA Fine-Tuning • 4-bit Quantization • Async Inference Engine + + + + API Endpoints + + Standard: + phi3-inference-gpu + + vLLM (Optimized): + phi3-inference-vllm + + Evaluation: + eval-finetuned + + All on Modal A10G GPU - Files: prepare_finetune_data.py → finetune_modal.py → api_endpoint.py - Modal Volumes: census-data, economy-labor-data, finetune-dataset, model-checkpoints + Updated: 2025-11-30 | Architecture: vLLM-Optimized Pipeline diff --git a/docs/NEXT_STEPS.md b/docs/NEXT_STEPS.md new file mode 100644 index 0000000000000000000000000000000000000000..4a6e7f6d35ca77ce57de1cb3c4dc3d37d0f7c578 --- /dev/null +++ b/docs/NEXT_STEPS.md @@ -0,0 +1,181 @@ +# Next Steps & Roadmap + +## ✅ Current Status + +**Completed:** +- Fine-tuning pipeline with vLLM optimization +- RAG system with local ChromaDB +- High-performance inference (<3s latency) +- Model merging for production deployment +- Comprehensive documentation + +## 🎯 Immediate Next Steps + +### 1. Test Fine-Tuned Model Performance + +```bash +# Test the vLLM-optimized endpoint +curl -X POST https://mcp-hack--phi3-inference-vllm-model-ask.modal.run \ + -H "Content-Type: application/json" \ + -d '{"question": "What is the population of Tokyo?", "context": "Japan Census data"}' +``` + +### 2. 
Test RAG System + +```bash +# Test the RAG endpoint +curl -X POST https://mcp-hack--rag-vllm-optimized-ragmodel-query.modal.run \ + -H "Content-Type: application/json" \ + -d '{"question": "What insurance products are available?"}' +``` + +### 3. Monitor Performance + +- Check latency metrics in responses +- Verify <3s response times +- Monitor GPU utilization on Modal dashboard + +## 🚀 Short Term (This Week) + +### Fine-Tuning Improvements +- [ ] Run evaluation script to assess model quality +- [ ] Collect more training data if needed +- [ ] Experiment with different LoRA parameters +- [ ] Test on diverse queries + +### RAG Enhancements +- [ ] Add more insurance documents to volume +- [ ] Re-index with updated documents +- [ ] Test retrieval quality +- [ ] Optimize chunk sizes if needed + +### Documentation +- [ ] Add API usage examples +- [ ] Create deployment guide +- [ ] Document troubleshooting steps + +## 📊 Medium Term (Next 2 Weeks) + +### Model Optimization +1. **Fine-tuning iterations** + - Analyze evaluation results + - Adjust training parameters + - Re-train if needed + +2. **RAG improvements** + - Experiment with different embedding models + - Optimize retrieval parameters (top-k, similarity threshold) + - Add query rewriting + +3. **Performance monitoring** + - Set up logging + - Track latency trends + - Monitor costs + +### Feature Additions +- [ ] Add streaming responses +- [ ] Implement caching layer +- [ ] Add query history +- [ ] Create admin dashboard + +## 🎨 Long Term (Next Month) + +### Production Readiness +1. **Deployment** + - Set up CI/CD pipeline + - Configure monitoring and alerts + - Implement rate limiting + - Add authentication if needed + +2. **Scaling** + - Optimize container scaling + - Implement load balancing + - Add caching (Redis) + - Set up CDN for static assets + +3. **Advanced Features** + - Multi-modal support (images, tables) + - Batch processing + - A/B testing framework + - Analytics dashboard + +## 🔧 Technical Debt + +- [ ] Remove `bkp/` directory (old backup files) +- [ ] Clean up unused dependencies +- [ ] Add comprehensive tests +- [ ] Improve error handling +- [ ] Add input validation + +## 📈 Metrics to Track + +**Performance:** +- Inference latency (target: <3s) +- Retrieval accuracy +- GPU utilization +- Cost per query + +**Quality:** +- Model accuracy on evaluation set +- RAG relevance scores +- User satisfaction (if applicable) + +## 🤔 Decision Points + +1. **Model Selection:** + - [ ] Continue with Phi-3-mini + - [ ] Experiment with larger models + - [ ] Try different base models + +2. **Infrastructure:** + - [ ] Stay with Modal (current) + - [ ] Migrate to other platform + - [ ] Self-hosted deployment + +3. 
**Data Strategy:** + - [ ] Expand training dataset + - [ ] Add domain-specific data + - [ ] Implement data versioning + +## 📚 Quick Reference + +### Key Commands +```bash +# Fine-tuning +./venv/bin/modal run src/finetune/finetune_modal.py + +# Model merging +./venv/bin/modal run src/finetune/merge_model.py + +# Deploy vLLM endpoint (fine-tuned) +./venv/bin/modal deploy src/finetune/api_endpoint_vllm.py + +# Deploy RAG endpoint +./venv/bin/modal deploy src/rag/rag_vllm.py + +# Evaluation +./venv/bin/modal run src/finetune/eval_finetuned.py +``` + +### Documentation +- **Main Guide:** `docs/HOW_TO_RUN.md` +- **Architecture:** `diagrams/` folder +- **Testing:** `docs/TESTING.md` +- **Agent Design:** `docs/agentdesign.md` + +## 🎯 Success Criteria + +**Phase 1 (Current):** +- ✅ <3s inference latency +- ✅ vLLM optimization working +- ✅ RAG retrieval functional + +**Phase 2 (Next):** +- [ ] >90% accuracy on evaluation set +- [ ] <2s average latency +- [ ] Production deployment complete + +**Phase 3 (Future):** +- [ ] Multi-user support +- [ ] Advanced analytics +- [ ] Cost optimization (<$X per 1K queries) diff --git a/QUICK_START.md b/docs/QUICK_START.md similarity index 100% rename from QUICK_START.md rename to docs/QUICK_START.md diff --git a/docs/QUICK_START_API.md b/docs/QUICK_START_API.md new file mode 100644 index 0000000000000000000000000000000000000000..5f998bd8302fcee841815fb609eb1d2d513b9369 --- /dev/null +++ b/docs/QUICK_START_API.md @@ -0,0 +1,75 @@ +# Quick Start: RAG API + +Fast API endpoint for querying product design documents with <3 second response times. + +## Deploy the API + +```bash +# Deploy to Modal +modal deploy src/rag/rag_api.py + +# Get the API URL +modal app show insurance-rag-api +``` + +## Use the API + +### Python Client + +```python +from src.rag.api_client import RAGAPIClient + +# Initialize client +client = RAGAPIClient(base_url="https://your-api-url.modal.run") + +# Query +result = client.query("What are the three product tiers?") +print(result['answer']) +print(f"Response time: {result['total_time']:.2f}s") +``` + +### cURL + +```bash +curl -X POST https://your-api-url.modal.run/query \ + -H "Content-Type: application/json" \ + -d '{"question": "What are the three product tiers?"}' +``` + +### JavaScript + +```javascript +const response = await fetch('https://your-api-url.modal.run/query', { + method: 'POST', + headers: { 'Content-Type': 'application/json' }, + body: JSON.stringify({ question: 'What are the three product tiers?' }) +}); + +const data = await response.json(); +console.log(data.answer); +``` + +## Test Performance + +```bash +# Test with default URL +python tests/test_api.py + +# Test with custom URL +python tests/test_api.py --url https://your-api-url.modal.run +``` + +## Performance Target + +- **Target**: <3 seconds per query +- **Typical**: 1.5-2.5 seconds +- **Optimizations**: Warm containers, reduced tokens, limited context + +## API Endpoints + +- `GET /health` - Health check +- `POST /query` - Query the RAG system +- `GET /` - API information + +See `docs/api/RAG_API.md` for full documentation. + diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000000000000000000000000000000000000..276d8409a3a305af2663b61840304d7f73f47ed5 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,58 @@ +# Documentation Index + +This directory contains all project documentation. 
+ +## 📚 Main Guides + +### Getting Started +- **[HOW_TO_RUN.md](HOW_TO_RUN.md)** - Complete guide to running the fine-tuning pipeline +- **[QUICK_START.md](QUICK_START.md)** - Quick start guide for the project +- **[QUICK_START_API.md](QUICK_START_API.md)** - API quick start guide + +### Fine-Tuning +- **[finetune/](../finetune/)** - Fine-tuning documentation and guides + - Data preparation + - Dataset generation + - Model training + - Evaluation + +### RAG System +- **[README_RAG.md](README_RAG.md)** - RAG system overview +- **[guides/QUICK_START_RAG.md](guides/QUICK_START_RAG.md)** - RAG quick start +- **[guides/RAG_SETUP_COMPLETE.md](guides/RAG_SETUP_COMPLETE.md)** - Complete RAG setup guide +- **[api/RAG_API.md](api/RAG_API.md)** - RAG API documentation + +### Deployment +- **[deployment/](deployment/)** - Deployment guides + - **[README.md](deployment/README.md)** - Deployment overview + - **[NEBIUS_DEPLOYMENT.md](deployment/NEBIUS_DEPLOYMENT.md)** - Nebius deployment guide + +### Reference +- **[STRUCTURE.md](STRUCTURE.md)** - Project structure overview +- **[TESTING.md](TESTING.md)** - Testing guide +- **[MIGRATION_GUIDE.md](MIGRATION_GUIDE.md)** - Migration guide +- **[VLLM_MIGRATION.md](VLLM_MIGRATION.md)** - vLLM migration guide +- **[NEXT_STEPS.md](NEXT_STEPS.md)** - Next steps and roadmap + +### Agent Design +- **[agentdesign.md](agentdesign.md)** - AI agent design for automated development workflow + +### Product Design +- **[product-design/](product-design/)** - Product design guides and examples + - Product decision guide + - RAG setup for product design + - Example: Tokyo auto insurance product design + +## 🔧 Additional Resources + +### Data Sources +- **[guides/estat_api_guide.md](guides/estat_api_guide.md)** - e-Stat API guide +- **[guides/source_data.md](guides/source_data.md)** - Data source documentation +- **[guides/ft_process.md](guides/ft_process.md)** - Fine-tuning process details + +### Troubleshooting +- **[guides/TROUBLESHOOTING.md](guides/TROUBLESHOOTING.md)** - General troubleshooting +- **[guides/WEB_TROUBLESHOOTING.md](guides/WEB_TROUBLESHOOTING.md)** - Web interface troubleshooting + +### Web Interface +- **[guides/WEB_INTERFACE.md](guides/WEB_INTERFACE.md)** - Web interface documentation diff --git a/README_RAG.md b/docs/README_RAG.md similarity index 100% rename from README_RAG.md rename to docs/README_RAG.md diff --git a/STRUCTURE.md b/docs/STRUCTURE.md similarity index 100% rename from STRUCTURE.md rename to docs/STRUCTURE.md diff --git a/TESTING.md b/docs/TESTING.md similarity index 100% rename from TESTING.md rename to docs/TESTING.md diff --git a/VLLM_MIGRATION.md b/docs/VLLM_MIGRATION.md similarity index 100% rename from VLLM_MIGRATION.md rename to docs/VLLM_MIGRATION.md diff --git a/docs/api/RAG_API.md b/docs/api/RAG_API.md new file mode 100644 index 0000000000000000000000000000000000000000..313669e32c0c1cfa390dcff94db9b3bb678c34a8 --- /dev/null +++ b/docs/api/RAG_API.md @@ -0,0 +1,244 @@ +# RAG API Documentation + +Fast API endpoint for querying the product design RAG system with <3 second response times. 
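+
+As a first sanity check after deploying, the snippet below is a minimal, dependency-light smoke test (a sketch, not part of the shipped client: it assumes only the `requests` package, your own deployed URL, and the `/query` endpoint and response fields documented below):
+
+```python
+import requests
+
+API_URL = "https://your-api-url.modal.run"  # replace with your deployed URL
+
+# POST a question to the /query endpoint and check the latency target
+resp = requests.post(f"{API_URL}/query", json={"question": "What are the three product tiers?"})
+resp.raise_for_status()
+data = resp.json()
+print(data["answer"])
+print(f"Total time: {data['total_time']:.2f}s")  # target is <3s
+```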
+ +## Quick Start + +### Deploy the API + +```bash +# Deploy to Modal +modal deploy src/rag/rag_api.py + +# Get the URL +modal app list +``` + +### Use the API + +```python +from src.rag.api_client import RAGAPIClient + +client = RAGAPIClient(base_url="https://your-modal-url.modal.run") +result = client.query("What are the three product tiers?") +print(result['answer']) +``` + +## API Endpoints + +### Health Check + +```http +GET /health +``` + +**Response:** +```json +{ + "status": "healthy", + "service": "rag-api" +} +``` + +### Query + +```http +POST /query +Content-Type: application/json + +{ + "question": "What are the three product tiers?", + "top_k": 5, + "max_tokens": 1024 +} +``` + +**Response:** +```json +{ + "answer": "The three product tiers are...", + "retrieval_time": 0.45, + "generation_time": 1.23, + "total_time": 1.68, + "sources": [ + { + "content": "...", + "metadata": {...} + } + ], + "success": true +} +``` + +## Performance Optimization + +### Target: <3 Second Responses + +The API is optimized for fast responses: + +1. **Warm Containers**: `min_containers=1` keeps a container ready +2. **Optimized LLM**: Reduced max_tokens (1024 vs 1536) +3. **Limited Context**: Top 3 documents, 800 chars each +4. **Prefix Caching**: Enabled for faster generation +5. **Concurrent Requests**: Up to 10 concurrent requests + +### Response Time Breakdown + +- **Retrieval**: 0.3-0.8 seconds +- **Generation**: 1.0-2.0 seconds +- **Total**: 1.5-3.0 seconds (target: <3s) + +## Usage Examples + +### Python Client + +```python +from src.rag.api_client import RAGAPIClient + +# Initialize +client = RAGAPIClient(base_url="https://your-api-url.modal.run") + +# Health check +health = client.health_check() +print(health) + +# Query +result = client.query("What are the premium ranges?") +print(result['answer']) + +# Fast query (optimized for speed) +result = client.query_fast("What are the three tiers?") +print(result['answer']) +``` + +### cURL + +```bash +# Health check +curl https://your-api-url.modal.run/health + +# Query +curl -X POST https://your-api-url.modal.run/query \ + -H "Content-Type: application/json" \ + -d '{ + "question": "What are the three product tiers?", + "top_k": 5, + "max_tokens": 1024 + }' +``` + +### JavaScript/TypeScript + +```javascript +const response = await fetch('https://your-api-url.modal.run/query', { + method: 'POST', + headers: { + 'Content-Type': 'application/json', + }, + body: JSON.stringify({ + question: 'What are the three product tiers?', + top_k: 5, + max_tokens: 1024 + }) +}); + +const data = await response.json(); +console.log(data.answer); +``` + +## Configuration + +### Environment Variables + +- `MODAL_APP_NAME`: App name (default: "insurance-rag-api") +- `MODAL_VOLUME_NAME`: Volume name (default: "mcp-hack-ins-products") + +### API Parameters + +- `question` (required): The question to ask +- `top_k` (optional, default: 5): Number of documents to retrieve +- `max_tokens` (optional, default: 1024): Maximum response length + +## Performance Tips + +1. **Use Fast Query**: For speed-critical applications, use `query_fast()` method +2. **Reduce top_k**: Lower `top_k` (e.g., 3) for faster retrieval +3. **Reduce max_tokens**: Lower `max_tokens` (e.g., 512) for faster generation +4. **Cache Results**: Cache common queries client-side +5. 
**Batch Requests**: If possible, batch multiple queries + +## Error Handling + +```python +result = client.query("your question") + +if result.get("success"): + print(result['answer']) +else: + print(f"Error: {result.get('error', 'Unknown error')}") +``` + +## Monitoring + +### Response Times + +Monitor the `total_time` field in responses: +- < 2s: Excellent +- 2-3s: Good (target) +- > 3s: May need optimization + +### Health Monitoring + +```python +health = client.health_check() +if health.get("status") != "healthy": + # Handle unhealthy state + pass +``` + +## Deployment + +### Modal Deployment + +```bash +# Deploy +modal deploy src/rag/rag_api.py + +# Get URL +modal app show insurance-rag-api +``` + +### Local Testing + +```bash +# Run locally (for development) +modal serve src/rag/rag_api.py +``` + +## Rate Limiting + +The API supports up to 10 concurrent requests. For higher throughput: +- Deploy multiple instances +- Use load balancer +- Implement client-side rate limiting + +## Security + +- Add authentication if needed +- Use HTTPS in production +- Implement rate limiting +- Validate input questions + +## Troubleshooting + +### Slow Responses (>3s) +- Check if container is warm (`min_containers=1`) +- Reduce `max_tokens` +- Reduce `top_k` +- Check network latency + +### Errors +- Verify documents are indexed +- Check Modal app status +- Review error messages in response + diff --git a/docs/deployment/ADD_GUIDES_TO_RAG.md b/docs/deployment/ADD_GUIDES_TO_RAG.md deleted file mode 100644 index aa673a586e45ef5be3961775d5e174785d3a1902..0000000000000000000000000000000000000000 --- a/docs/deployment/ADD_GUIDES_TO_RAG.md +++ /dev/null @@ -1,146 +0,0 @@ -# RAG Indexing Configuration - -## Overview - -The RAG system indexes **only Word, PDF, and Excel files** containing product design information. **All markdown files are excluded** from indexing to keep the RAG focused on structured product documents. - -## Currently Indexed Files - -The system automatically indexes files that match these patterns: - -1. **Word Documents (.docx):** - - Files with `tokyo_auto_insurance` or `product_design` in the filename - - Example: `tokyo_auto_insurance_product_design.docx` - -2. **PDF Documents (.pdf):** - - Files with `tokyo_auto_insurance` or `product_design` in the filename - - Example: `tokyo_auto_insurance_product_design.pdf` - -3. **Excel Spreadsheets (.xlsx, .xls):** - - Files with `tokyo_auto_insurance` or `product_design` in the filename - - Example: `tokyo_auto_insurance_product_design.xlsx` - -## Excluded Files - -The following files are **NOT indexed**: - -- ❌ **All markdown files** (`.md`, `.markdown`) - completely excluded -- ❌ Guide files (e.g., `QUICK_START_RAG.md`, `PRODUCT_DECISION_GUIDE.md`) -- ❌ Setup guides (e.g., `setup_product_design_rag.md`) -- ❌ Troubleshooting guides -- ❌ Web interface guides -- ❌ Any other file types (`.txt`, `.csv`, `.json`, etc.) 
- -## Files That Will Be Indexed - -Based on the current repository structure: - -✅ **Will be indexed (if uploaded to Modal volume):** -- `tokyo_auto_insurance_product_design.docx` (Word document) -- `tokyo_auto_insurance_product_design.pdf` (PDF document) -- `tokyo_auto_insurance_product_design.xlsx` (Excel spreadsheet) -- `tokyo_auto_insurance_product_design.xls` (Excel 97-2003) - -❌ **Will NOT be indexed (all excluded):** -- `tokyo_auto_insurance_product_design.md` (markdown - excluded) -- `tokyo_auto_insurance_product_design_filled.md` (markdown - excluded) -- `QUICK_START_RAG.md` (markdown - excluded) -- `PRODUCT_DECISION_GUIDE.md` (markdown - excluded) -- `setup_product_design_rag.md` (markdown - excluded) -- `TROUBLESHOOTING.md` (markdown - excluded) -- `WEB_INTERFACE.md` (markdown - excluded) -- All other markdown and non-supported file types - -## How to Add More Product Design Files - -### Option 1: Use Supported File Formats -Convert your files to one of the supported formats: -- **Word**: `.docx` format -- **PDF**: `.pdf` format -- **Excel**: `.xlsx` or `.xls` format - -**Important:** -- The file must contain `tokyo_auto_insurance` **OR** `product_design` in the filename -- Markdown files (`.md`) are **not supported** and will be ignored - -### Option 2: Update the Loader -Edit `src/rag/modal-rag-product-design.py` and modify the pattern matching: - -```python -# Current pattern for PDF files (line ~81): -if 'tokyo_auto_insurance' in file_lower or 'product_design' in file_lower: - pdf_files.append(full_path) - -# To add more patterns, modify to: -if ('tokyo_auto_insurance' in file_lower or - 'product_design' in file_lower or - 'your_custom_pattern' in file_lower): - pdf_files.append(full_path) -``` - -**Note:** All markdown files are intentionally excluded. Only Word, PDF, and Excel files are processed. 
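-
-Putting the extension and filename rules together, the effective selection logic looks roughly like this (a minimal sketch for illustration; `should_index` is a hypothetical helper, not a function in the actual loader):
-
-```python
-SUPPORTED_EXTENSIONS = (".docx", ".pdf", ".xlsx", ".xls")
-NAME_PATTERNS = ("tokyo_auto_insurance", "product_design")
-
-def should_index(filename: str) -> bool:
-    """Return True only for supported formats whose name matches a known pattern."""
-    file_lower = filename.lower()
-    if not file_lower.endswith(SUPPORTED_EXTENSIONS):
-        return False  # markdown and all other file types are skipped
-    return any(pattern in file_lower for pattern in NAME_PATTERNS)
-```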
- -## Uploading to Modal Volume - -To index product design documents, upload **only Word, PDF, or Excel files** to the Modal volume: - -```bash -# Upload Word document -modal volume put mcp-hack-ins-products \ - docs/product-design/tokyo_auto_insurance_product_design.docx \ - docs/product-design/tokyo_auto_insurance_product_design.docx - -# Upload PDF document (if you have one) -modal volume put mcp-hack-ins-products \ - docs/product-design/tokyo_auto_insurance_product_design.pdf \ - docs/product-design/tokyo_auto_insurance_product_design.pdf - -# Upload Excel spreadsheet (if you have one) -modal volume put mcp-hack-ins-products \ - docs/product-design/tokyo_auto_insurance_product_design.xlsx \ - docs/product-design/tokyo_auto_insurance_product_design.xlsx -``` - -**Important Notes:** -- ❌ **Do NOT upload markdown files** (`.md`) - they will be ignored -- ✅ Only `.docx`, `.pdf`, `.xlsx`, and `.xls` files are processed -- ✅ Files must contain `tokyo_auto_insurance` or `product_design` in the filename - -## Re-indexing - -After uploading new files, re-index: - -```bash -# Using CLI -python src/web/query_product_design.py --index - -# Or direct Modal command -modal run src/rag/modal-rag-product-design.py::index_product_design -``` - -## Benefits of Current Approach - -By focusing only on Word, PDF, and Excel files: -- ✅ RAG answers are focused on structured product documents -- ✅ No confusion from markdown guide/instruction content -- ✅ Faster retrieval (smaller, more focused document set) -- ✅ More accurate product-related answers from official documents -- ✅ Better handling of tables and structured data (Excel, Word tables) -- ✅ Cleaner source citations -- ✅ Support for professional document formats - -## Example Queries - -With product design documents indexed, you can ask: - -``` -"What are the three product tiers and their premium ranges?" -"What is the Year 3 premium volume projection?" -"What are the FSA licensing requirements?" -"What coverage does the Standard tier include?" -"What is the target market size in Tokyo?" -"Who are the main competitors?" -``` - -The RAG system will retrieve relevant sections from the product design documents only, ensuring answers are focused on product information. - diff --git a/docs/guides/HOW_TO_RUN.md b/docs/guides/HOW_TO_RUN.md deleted file mode 100644 index 9ba7f6bf316ecb5f67f26ee6a94084aed8b8cc4f..0000000000000000000000000000000000000000 --- a/docs/guides/HOW_TO_RUN.md +++ /dev/null @@ -1,215 +0,0 @@ -# How to Run the Fine-Tuning Pipeline - -This guide walks you through the complete pipeline from data generation to model deployment. 
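-
-For orientation, each record in the generated dataset is a single instruction-style QA pair in JSONL form. The example below is illustrative (representative field values, not an actual dataset row):
-
-```json
-{"instruction": "What is the Members per household for Minato-ku, Tokyo?", "input": "Context: Japan Census data", "output": "The Members per household for Minato-ku, Tokyo is 2.56."}
-```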
- ---- - -## 📊 Dataset Generation Results - -### Final Statistics -- **Training Samples**: 201,651 -- **Validation Samples**: 22,407 -- **Total Dataset**: 224,058 high-quality QA pairs -- **Improvement**: 150x more data than previous approach - -### Batch Performance -| Batch | Files | Data Points | Status | -|-------|-------|-------------|--------| -| 1 | 1,000 | 100,611 | ✅ Excellent | -| 2 | 1,000 | 39,960 | ✅ Good | -| 3 | 1,000 | 0 | ⚠️ Complex files | -| 4 | 1,000 | 600 | ⚠️ Runner issue | -| 5 | 1,000 | 54,627 | ✅ Excellent | -| 6 | 1,000 | 5,400 | ✅ Good | -| 7 | 888 | 22,860 | ✅ Good | - ---- - -## 🚀 Step-by-Step Instructions - -### Step 1: Fine-Tune the Model - -Run the fine-tuning job on Modal with H200 GPU: - -```bash -cd /Users/veeru/agents/mcp-hack - -# Start fine-tuning in detached mode -./venv/bin/modal run --detach docs/finetune_modal.py -``` - -**What happens:** -- Loads 201,651 training samples from `finetune-dataset` volume -- Trains Phi-3-mini-4k-instruct with LoRA on H200 GPU -- Runs for ~90-120 minutes -- Saves model to `model-checkpoints` volume - -**Monitor progress:** -```bash -# View live logs -modal app logs mcp-hack::finetune-phi3-modal -``` - ---- - -### Step 2: Evaluate the Model - -After training completes, test the model: - -```bash -./venv/bin/modal run docs/eval_finetuned.py -``` - -This will run sample questions and show the model's answers. - ---- - -### Step 3: Deploy API Endpoint - -Deploy the inference API: - -**Option A: GPU Endpoint (A10G)** -```bash -./venv/bin/modal deploy docs/api_endpoint.py -``` - -**Option B: CPU Endpoint** -```bash -./venv/bin/modal deploy docs/api_endpoint_cpu.py -``` - -**Get the endpoint URL:** -```bash -modal app list -``` - ---- - -### Step 4: Test the API - -```bash -# Example API call -curl -X POST https://YOUR-MODAL-URL/ask \ - -H "Content-Type: application/json" \ - -d '{ - "question": "What is the population of Tokyo?", - "context": "Japan Census data" - }' -``` - ---- - -## 📁 Key Files - -### Data Processing -- `docs/prepare_finetune_data.py` - Generates dataset from CSV files -- `docs/clean_sample.py` - Local testing script for data cleaning - -### Model Training -- `docs/finetune_modal.py` - Fine-tuning script (H200 GPU) -- `docs/eval_finetuned.py` - Evaluation script - -### API Deployment -- `docs/api_endpoint.py` - GPU inference endpoint (A10G) -- `docs/api_endpoint_cpu.py` - CPU inference endpoint - -### Documentation -- `diagrams/finetuning.svg` - Visual pipeline diagram -- `finetune/04-evaluation.md` - Evaluation results - ---- - -## 🔧 Modal Volumes - -The pipeline uses these Modal volumes: - -| Volume | Purpose | Size | -|--------|---------|------| -| `census-data` | Raw census CSV files | 6,838 files | -| `economy-labor-data` | Raw economy CSV files | 50 files | -| `finetune-dataset` | Generated JSONL training data | 224K samples | -| `model-checkpoints` | Fine-tuned model weights | ~7GB | - ---- - -## 💡 Tips - -### If Training Fails -```bash -# Check logs for errors -modal app logs mcp-hack::finetune-phi3-modal - -# Restart training -./venv/bin/modal run --detach docs/finetune_modal.py -``` - -### If You Need to Regenerate Data -```bash -# Clear existing dataset -./venv/bin/modal run docs/clear_dataset.py - -# Regenerate with new logic -./venv/bin/modal run --detach docs/prepare_finetune_data.py -``` - -### View Volume Contents -```bash -# List files in a volume -modal volume ls finetune-dataset - -# Download a file -modal volume get finetune-dataset train.jsonl finetune/train.jsonl -``` - ---- - -## 📈 
Expected Timeline - -| Step | Duration | Notes | -|------|----------|-------| -| Data Generation | ✅ Complete | 224K samples ready | -| Fine-Tuning | ~90-120 min | H200 GPU | -| Evaluation | ~5 min | Quick tests | -| API Deployment | ~2 min | Instant after deploy | - --- - -## 🎯 Next Steps - -1. **Run fine-tuning** (see Step 1 above) -2. **Wait for completion** (~2 hours) -3. **Evaluate results** (see Step 2) -4. **Deploy API** (see Step 3) -5. **Test with real queries** (see Step 4) - --- - -## 📞 Troubleshooting - -**Issue**: "Volume not found" -```bash -# List all volumes -modal volume list -``` - -**Issue**: "Out of memory during training" -- Reduce `per_device_train_batch_size` in `finetune_modal.py` -- Current: 2 (already optimized for H200) - -**Issue**: "Model not loading in API" -- Ensure fine-tuning completed successfully -- Check `model-checkpoints` volume has files - --- - -## ✅ Success Criteria - -After completing all steps, you should have: -- ✅ Fine-tuned Phi-3-mini model -- ✅ Deployed API endpoint -- ✅ Model answering questions about Japanese census/economy data -- ✅ Improved accuracy over base model - --- - -**Ready to start?** Run the fine-tuning command from Step 1! diff --git a/docs/guides/SETUP_SUCCESS.md b/docs/guides/SETUP_SUCCESS.md deleted file mode 100644 index 5c732394ce99d7522a3e25c9b35e614a95a4d6f7..0000000000000000000000000000000000000000 --- a/docs/guides/SETUP_SUCCESS.md +++ /dev/null @@ -1,63 +0,0 @@ -# ✅ RAG Setup Successful! - -## Status: Working - -The product design RAG system is now fully operational! - -### What Was Fixed - -1. **File Detection**: Updated to find files in both root and `docs/` subdirectory -2. **GPU Fallback**: Added CPU fallback for embeddings (works without GPU) -3. **Word Document**: Worked around a python-docx issue with the `.docx` file by indexing the markdown copy, which contains all the same content -4. **Modal Command**: Auto-detects Modal in venv - -### Current Status - -✅ **Indexed**: 1 document (markdown), 56 chunks -✅ **Vector DB**: Created in ChromaDB collection `product_design` -✅ **Queries**: Working! Tested successfully - -### Test Results - -```bash -$ python3 query_product_design.py --query "What are the three product tiers?" -``` - -**Result**: ✅ Successfully retrieved and answered! - -## Usage - -### Query the Document - -```bash -# Single query -python3 query_product_design.py --query "What are the three product tiers?" - -# Interactive mode -python3 query_product_design.py --interactive -``` - -### Example Questions - -- "What are the three product tiers and their premium ranges?" -- "What is the Year 3 premium volume projection?" -- "What coverage does the Standard tier include?" -- "What are the FSA licensing requirements?" - -## Known Issues - -1. **Word Document**: The `.docx` file has a python-docx compatibility issue with Modal volumes, but the markdown file contains all the same content and works perfectly. - -2. **Answer Truncation**: Some answers may be truncated. This is normal - the system retrieves the most relevant chunks and generates concise answers. - -## Next Steps - -1. ✅ **Indexing**: Complete -2. ✅ **Query System**: Working -3. 🎯 **Ready to Use**: You can now query the product design document! 
- -Try it: -```bash -python3 query_product_design.py --interactive -``` - diff --git a/docs/guides/SUMMARY.md b/docs/guides/SUMMARY.md deleted file mode 100644 index e7664df886e8d21071e468501f2a6174a8ceddf6..0000000000000000000000000000000000000000 --- a/docs/guides/SUMMARY.md +++ /dev/null @@ -1,114 +0,0 @@ -# ✅ Complete Setup Summary - -## What Was Accomplished - -### 1. Product Design Document ✅ -- **Created**: Comprehensive 1,600-line product design document -- **Filled**: All sections with realistic fictional data for "TokyoDrive Insurance" -- **Formats**: - - Markdown: `docs/tokyo_auto_insurance_product_design_filled.md` - - Word: `docs/tokyo_auto_insurance_product_design.docx` -- **Content**: 12 comprehensive sections covering all aspects of product design - -### 2. RAG System Extension ✅ -- **Created**: `src/modal-rag-product-design.py` -- **Features**: - - Supports Markdown and Word documents - - Separate ChromaDB collection (doesn't interfere with existing RAG) - - GPU-accelerated with Phi-3 model - - Integrated with existing Modal infrastructure - -### 3. Query Interface ✅ -- **Created**: `query_product_design.py` - Simple CLI tool -- **Features**: - - Interactive mode for continuous queries - - Single query mode - - Index command - - Clean, formatted output - -### 4. Documentation ✅ -- `docs/QUICK_START_RAG.md` - Quick start guide -- `docs/setup_product_design_rag.md` - Detailed setup -- `docs/next_steps_rag_recommendation.md` - Decision guide -- `docs/RAG_SETUP_COMPLETE.md` - Complete setup info -- `README_RAG.md` - Quick reference - -## File Structure - -``` -mcp-hack/ -├── src/ -│ └── modal-rag-product-design.py # Extended RAG system -├── query_product_design.py # CLI query interface -├── docs/ -│ ├── tokyo_auto_insurance_product_design_filled.md -│ ├── tokyo_auto_insurance_product_design.docx -│ ├── QUICK_START_RAG.md -│ ├── setup_product_design_rag.md -│ ├── next_steps_rag_recommendation.md -│ ├── RAG_SETUP_COMPLETE.md -│ └── SUMMARY.md (this file) -└── README_RAG.md # Quick reference -``` - -## Next Steps to Use - -### Step 1: Index Documents (One-Time) -```bash -python query_product_design.py --index -``` -⏱️ Takes 2-5 minutes - -### Step 2: Query the Document -```bash -# Single query -python query_product_design.py --query "What are the three product tiers?" - -# Interactive mode -python query_product_design.py --interactive -``` - -## Example Use Cases - -### For Development -- Extract technical requirements -- Get API specifications -- Understand system architecture - -### For Sales/Marketing -- Get pricing information -- Understand product features -- Compare tiers - -### For Compliance -- Check regulatory requirements -- Get licensing info -- Understand data privacy rules - -### For Financial Planning -- Get projections -- Understand cost structure -- Check break-even analysis - -## Key Features - -✅ **Comprehensive Document**: 12 sections, 1,600 lines, fully filled with realistic data -✅ **RAG System**: Semantic search + LLM for intelligent Q&A -✅ **Easy Interface**: Simple CLI tool, no complex setup -✅ **Fast Queries**: 3-5 seconds after initial warm-up -✅ **Separate Collection**: Doesn't interfere with existing insurance products RAG - -## Status - -🎉 **Everything is ready!** - -1. ✅ Product design document created and filled -2. ✅ Documents uploaded to Modal volume -3. ✅ RAG system extended -4. ✅ Query interface created -5. 
-5. ✅ Documentation complete
-
-**Ready to index and query!**
-
-Run: `python query_product_design.py --index`
-
diff --git a/docs/guides/modal-rag-optimization.md b/docs/guides/modal-rag-optimization.md
deleted file mode 100644
index b69ed0875284b79f40a87ed51203d5ab9fe8c428..0000000000000000000000000000000000000000
--- a/docs/guides/modal-rag-optimization.md
+++ /dev/null
@@ -1,370 +0,0 @@
-# Modal RAG Performance Optimization Guide
-
-**Current Performance**: >1 minute per query
-**Target Performance**: <5 seconds per query
-
-## 🔍 Performance Bottleneck Analysis
-
-### Current Architecture Issues
-
-1. **Model Loading Time** (~30-45 seconds)
-   - Mistral-7B (13GB) loads on every cold start
-   - Embedding model loads separately
-   - No model caching between requests
-
-2. **LLM Inference Time** (~15-30 seconds)
-   - Mistral-7B is slow for inference
-   - Running on A10G GPU (good, but model is large)
-   - No inference optimization (quantization, etc.)
-
-3. **Network Latency** (~2-5 seconds)
-   - Remote ChromaDB calls
-   - Modal container communication overhead
-
----
-
-## 🚀 Optimization Strategies (Ranked by Impact)
-
-### 1. **Keep Containers Warm** ⭐⭐⭐⭐⭐
-**Impact**: Eliminates 30-45s cold start time
-
-**Current**:
-```python
-min_containers=1  # Already doing this ✅
-```
-
-**Why it helps**: Your container stays loaded with models in memory. First query after deployment is slow, but subsequent queries are fast.
-
-**Cost**: ~$0.50-1.00/hour for warm A10G container
-
----
-
-### 2. **Switch to Smaller/Faster LLM** ⭐⭐⭐⭐⭐
-**Impact**: Reduces inference from 15-30s to 2-5s
-
-**Options**:
-
-#### Option A: Mistral-7B-Instruct-v0.2 (Quantized)
-```python
-import torch
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_compute_dtype=torch.float16,
-    bnb_4bit_use_double_quant=True,
-    bnb_4bit_quant_type="nf4"
-)
-
-self.model = AutoModelForCausalLM.from_pretrained(
-    LLM_MODEL,
-    quantization_config=quantization_config,
-    device_map="auto"
-)
-```
-- **Speed**: 3-5x faster (5-10s → 1-3s)
-- **Quality**: Minimal degradation
-- **Memory**: 13GB → 3.5GB
-
-#### Option B: Switch to Phi-3-mini (3.8B)
-```python
-LLM_MODEL = "microsoft/Phi-3-mini-4k-instruct"
-```
-- **Speed**: 5-10x faster than Mistral-7B
-- **Quality**: Good for RAG tasks
-- **Memory**: ~8GB → 4GB
-- **Inference**: 2-4 seconds
-
-#### Option C: Use TinyLlama-1.1B
-```python
-LLM_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
-```
-- **Speed**: 10-20x faster
-- **Quality**: Lower, but acceptable for simple queries
-- **Memory**: ~2GB
-- **Inference**: <1 second
-
----
-
-### 3. **Use vLLM for Inference** ⭐⭐⭐⭐
-**Impact**: 2-5x faster inference
-
-```python
-# Install vLLM
-image = modal.Image.debian_slim(python_version="3.11").pip_install(
-    "vllm==0.6.0",
-    # ... other packages
-)
-
-# In RAGModel.enter()
-from vllm import LLM, SamplingParams
-
-self.llm_engine = LLM(
-    model=LLM_MODEL,
-    tensor_parallel_size=1,
-    gpu_memory_utilization=0.9,
-    max_model_len=2048  # Shorter context for speed
-)
-
-# In query method
-sampling_params = SamplingParams(
-    temperature=0.7,
-    max_tokens=256,
-    top_p=0.9
-)
-outputs = self.llm_engine.generate([prompt], sampling_params)
-```
-
-**Benefits**:
-- Continuous batching
-- PagedAttention (efficient memory)
-- Optimized CUDA kernels
-- 2-5x faster than HuggingFace pipeline
-
----
-
-### 4. **Optimize Embedding Generation** ⭐⭐⭐
-**Impact**: Reduces query embedding time from 1-2s to 0.2-0.5s
-
-#### Option A: Use Smaller Embedding Model
-```python
-EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
-# Both models are 384-dimensional; bge-small-en-v1.5 is already comparably
-# small and fast, so expect only a modest gain from this swap
-```
-
-#### Option B: Use ONNX Runtime
-```python
-from optimum.onnxruntime import ORTModelForFeatureExtraction
-
-self.embeddings = ORTModelForFeatureExtraction.from_pretrained(
-    EMBEDDING_MODEL,
-    export=True,
-    provider="CUDAExecutionProvider"
-)
-```
-- **Speed**: 2-3x faster
-- **Quality**: Identical
-
----
-
-### 5. **Reduce Context Window** ⭐⭐⭐
-**Impact**: Faster LLM processing
-
-```python
-# In query method
-sampling_params = SamplingParams(
-    max_tokens=128,  # Instead of 256 or 512
-    temperature=0.7
-)
-
-# Reduce retrieved documents
-top_k = 2  # Instead of 3
-```
-
-**Why**: Fewer tokens to process = faster inference
-
----
-
-### 6. **Cache ChromaDB Queries** ⭐⭐
-**Impact**: Saves 1-2s on repeated queries
-
-```python
-from functools import lru_cache
-
-def make_cached_retriever(retriever):
-    # Key the cache on the question text itself (lru_cache needs hashable args)
-    @lru_cache(maxsize=100)
-    def get_cached_docs(question: str):
-        return retriever.get_relevant_documents(question)
-    return get_cached_docs
-
-# In RAGModel.enter()
-self.get_cached_docs = make_cached_retriever(self.retriever)
-
-# In query method
-docs = self.get_cached_docs(question)
-```
-
----
-
-### 7. **Use Faster GPU** ⭐⭐
-**Impact**: 1.5-2x faster inference
-
-```python
-@app.cls(
-    gpu="A100",  # Instead of A10G
-    # or
-    gpu="H100",  # Even faster
-)
-```
-
-**Cost**: A100 is 2-3x more expensive than A10G
-
----
-
-### 8. **Parallel Processing** ⭐⭐
-**Impact**: Overlap embedding + retrieval
-
-```python
-import asyncio
-
-async def query_async(self, question: str):
-    # Kick off the query embedding while the rest of the pipeline is prepared
-    embedding_task = asyncio.create_task(
-        self.get_query_embedding(question)
-    )
-
-    # ... rest of async pipeline
-```
-
----
-
-## 🎯 Recommended Implementation Plan
-
-### Phase 1: Quick Wins (Get to <10s)
-1. ✅ **Keep containers warm** (already done)
-2. **Add 4-bit quantization** to Mistral-7B
-3. **Reduce max_tokens** to 128
-4. **Use top_k=2** instead of 3
-
-**Expected**: 60s → 8-12s (a consolidated sketch of these changes follows)
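-
-A minimal sketch of Phase 1 applied together, assuming the existing
-HuggingFace loading path in `modal-rag.py` (names like `retriever`, `question`,
-and the prompt shape are illustrative, not the exact code):
-
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-
-# Strategy 2, Option A: load the LLM in 4-bit
-bnb_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_compute_dtype=torch.float16,
-    bnb_4bit_use_double_quant=True,
-    bnb_4bit_quant_type="nf4",
-)
-model = AutoModelForCausalLM.from_pretrained(
-    LLM_MODEL, quantization_config=bnb_config, device_map="auto"
-)
-tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL)
-
-# Strategy 5: fewer retrieved docs and shorter generations
-docs = retriever.get_relevant_documents(question)[:2]   # top_k=2
-context = "\n\n".join(doc.page_content for doc in docs)
-prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
-inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
-answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
-```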
-
----
-
-### Phase 2: Major Speedup (Get to <5s)
-1. **Switch to vLLM** for inference
-2. **Use Phi-3-mini** instead of Mistral-7B
-3. **Optimize embeddings** with ONNX
-
-**Expected**: 8-12s → 3-5s
-
----
-
-### Phase 3: Ultra-Fast (Get to <2s)
-1. **Use TinyLlama** for simple queries
-2. **Implement query caching**
-3. **Upgrade to A100 GPU**
-
-**Expected**: 3-5s → 1-2s
-
----
-
-## 📊 Performance Comparison Table
-
-| Configuration | Cold Start | Warm Query | Cost/Hour | Quality |
-|--------------|------------|------------|-----------|---------|
-| **Current** (Mistral-7B, A10G) | 45s | 15-30s | $0.50 | ⭐⭐⭐⭐⭐ |
-| **Phase 1** (Quantized, warm) | 30s | 8-12s | $0.50 | ⭐⭐⭐⭐ |
-| **Phase 2** (vLLM + Phi-3) | 20s | 3-5s | $0.50 | ⭐⭐⭐⭐ |
-| **Phase 3** (TinyLlama, A100) | 10s | 1-2s | $1.50 | ⭐⭐⭐ |
-
----
-
-## 🔧 Code Changes for Phase 2 (Recommended)
-
-### 1. Update model configuration
-```python
-LLM_MODEL = "microsoft/Phi-3-mini-4k-instruct"
-EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"  # Keep same
-```
-
-### 2. Add vLLM to dependencies
-```python
-image = modal.Image.debian_slim(python_version="3.11").pip_install(
-    "vllm==0.6.0",
-    "langchain==0.3.7",
-    # ... rest
-)
-```
-
-### 3. Update RAGModel.enter()
-```python
-from vllm import LLM, SamplingParams
-
-self.llm_engine = LLM(
-    model=LLM_MODEL,
-    tensor_parallel_size=1,
-    gpu_memory_utilization=0.85,
-    max_model_len=2048
-)
-
-self.sampling_params = SamplingParams(
-    temperature=0.7,
-    max_tokens=128,
-    top_p=0.9
-)
-```
-
-### 4. Update query method
-```python
-# Build prompt
-prompt = f"""Use the following context to answer the question.
-
-Context: {context}
-
-Question: {question}
-
-Answer:"""
-
-# Generate with vLLM
-outputs = self.llm_engine.generate([prompt], self.sampling_params)
-answer = outputs[0].outputs[0].text
-```
-
----
-
-## 💰 Cost vs Performance Trade-offs
-
-| Approach | Speed Gain | Cost Change | Implementation |
-|----------|-----------|-------------|----------------|
-| Quantization | 3-5x | $0 | Easy |
-| vLLM | 2-5x | $0 | Medium |
-| Smaller model | 5-10x | $0 | Easy |
-| A100 GPU | 1.5-2x | +200% | Easy |
-| Caching | Variable | $0 | Medium |
-
----
-
-## 🎬 Next Steps
-
-1. **Measure current performance** with logging
-2. **Implement Phase 1** (quantization + reduce tokens)
-3. **Test and measure** improvement
-4. **Implement Phase 2** if needed (vLLM + Phi-3)
-5. **Monitor** and iterate
-
----
-
-## 📝 Performance Monitoring Code
-
-Add this to track performance:
-
-```python
-import time
-
-@modal.method()
-def query(self, question: str, top_k: int = 2):
-    start = time.time()
-
-    # Retriever setup (the embedding model itself is already loaded)
-    setup_start = time.time()
-    retriever = self.RemoteChromaRetriever(...)
-    setup_time = time.time() - setup_start
-
-    # Retrieval time (includes embedding the query)
-    retrieval_start = time.time()
-    docs = retriever.get_relevant_documents(question)
-    retrieval_time = time.time() - retrieval_start
-
-    # LLM time
-    llm_start = time.time()
-    result = chain.invoke({"question": question})
-    llm_time = time.time() - llm_start
-
-    total_time = time.time() - start
-
-    print(f"⏱️ Performance:")
-    print(f"  Setup: {setup_time:.2f}s")
-    print(f"  Retrieval: {retrieval_time:.2f}s")
-    print(f"  LLM: {llm_time:.2f}s")
-    print(f"  Total: {total_time:.2f}s")
-
-    return result
-```
-
-This will help you identify the exact bottleneck!
diff --git a/docs/guides/modal-rag-sequence.md b/docs/guides/modal-rag-sequence.md
deleted file mode 100644
index 9be43fb1f36d6bfb766cbf753e967d037e6d1ea6..0000000000000000000000000000000000000000
--- a/docs/guides/modal-rag-sequence.md
+++ /dev/null
@@ -1,168 +0,0 @@
-# Modal RAG System - Sequence Diagrams
-
-This document provides sequence diagrams for the Modal RAG (Retrieval Augmented Generation) application.
-
-## 1. Indexing Flow (create_vector_db)
-
-```mermaid
-sequenceDiagram
-    participant User
-    participant Modal
-    participant CreateVectorDB as create_vector_db()
-    participant PDFLoader
-    participant TextSplitter
-    participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
-    participant ChromaDB as Remote ChromaDB
-
-    User->>Modal: modal run modal-rag.py::index
-    Modal->>CreateVectorDB: Execute function
-
-    CreateVectorDB->>PDFLoader: Load PDFs from /insurance-data
-    PDFLoader-->>CreateVectorDB: Return documents
-
-    CreateVectorDB->>TextSplitter: Split documents (chunk_size=1000)
-    TextSplitter-->>CreateVectorDB: Return chunks
-
-    CreateVectorDB->>Embeddings: Initialize (device='cuda')
-    CreateVectorDB->>Embeddings: Generate embeddings for chunks
-    Embeddings-->>CreateVectorDB: Return embeddings
-
-    CreateVectorDB->>ChromaDB: Connect to remote service
-    CreateVectorDB->>ChromaDB: Upsert chunks + embeddings
-    ChromaDB-->>CreateVectorDB: Confirm storage
-
-    CreateVectorDB-->>Modal: Complete
-    Modal-->>User: Success message
-```
-
-## 2. Query Flow (RAGModel.query)
-
-```mermaid
-sequenceDiagram
-    participant User
-    participant Modal
-    participant QueryEntrypoint as query()
-    participant RAGModel
-    participant Embeddings as HuggingFaceEmbeddings<br/>(CUDA)
-    participant ChromaRetriever as RemoteChromaRetriever
-    participant ChromaDB as Remote ChromaDB
-    participant LLM as Mistral-7B<br/>(A10G GPU)
-    participant RAGChain as LangChain RAG
-
-    User->>Modal: modal run modal-rag.py::query --question "..."
-    Modal->>QueryEntrypoint: Execute local entrypoint
-    QueryEntrypoint->>RAGModel: Instantiate RAGModel()
-
-    Note over RAGModel: @modal.enter() lifecycle
-    RAGModel->>Embeddings: Load embedding model (CUDA)
-    RAGModel->>ChromaDB: Connect to remote service
-    RAGModel->>LLM: Load Mistral-7B (A10G GPU)
-    RAGModel->>RAGModel: Initialize RemoteChromaRetriever
-
-    QueryEntrypoint->>RAGModel: query.remote(question)
-
-    RAGModel->>ChromaRetriever: Create retriever instance
-    RAGModel->>RAGChain: Build RAG chain
-
-    RAGChain->>ChromaRetriever: Retrieve relevant docs
-    ChromaRetriever->>Embeddings: embed_query(question)
-    Embeddings-->>ChromaRetriever: Query embedding
-    ChromaRetriever->>ChromaDB: query(embedding, k=3)
-    ChromaDB-->>ChromaRetriever: Top-k documents
-    ChromaRetriever-->>RAGChain: Return documents
-
-    RAGChain->>LLM: Generate answer with context
-    LLM-->>RAGChain: Generated answer
-    RAGChain-->>RAGModel: Return result
-
-    RAGModel-->>QueryEntrypoint: Return {answer, sources}
-    QueryEntrypoint-->>User: Display answer + sources
-```
-
-## 3. Web Endpoint Flow (RAGModel.web_query)
-
-```mermaid
-sequenceDiagram
-    participant User
-    participant Browser
-    participant Modal as Modal Platform
-    participant WebEndpoint as RAGModel.web_query
-    participant QueryMethod as RAGModel.query
-    participant RAGChain
-    participant ChromaDB
-    participant LLM
-
-    User->>Browser: GET https://.../web_query?question=...
-    Browser->>Modal: HTTP GET request
-    Modal->>WebEndpoint: Route to @modal.fastapi_endpoint
-
-    WebEndpoint->>QueryMethod: Call query.local(question)
-
-    Note over QueryMethod,LLM: Same flow as Query diagram
-    QueryMethod->>RAGChain: Build chain
-    RAGChain->>ChromaDB: Retrieve docs
-    RAGChain->>LLM: Generate answer
-    LLM-->>QueryMethod: Return result
-
-    QueryMethod-->>WebEndpoint: Return {answer, sources}
-    WebEndpoint-->>Modal: JSON response
-    Modal-->>Browser: HTTP 200 + JSON
-    Browser-->>User: Display result
-```
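-
-The `web_query` handler this diagram assumes can be sketched as follows; an
-illustrative reconstruction, not the exact code in `modal-rag.py` (note that
-`query.local()` runs inside the same warm container, so no extra cold start):
-
-```python
-@modal.fastapi_endpoint(method="GET")
-def web_query(self, question: str):
-    # Reuse the warm container's already-loaded models via a local call
-    return self.query.local(question)
-```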
-
-## 4. Container Lifecycle (RAGModel)
-
-```mermaid
-sequenceDiagram
-    participant Modal
-    participant Container
-    participant RAGModel
-    participant GPU as A10G GPU
-    participant Volume as Modal Volume
-    participant ChromaDB
-
-    Modal->>Container: Start container (min_containers=1)
-    Container->>GPU: Allocate GPU
-    Container->>Volume: Mount /insurance-data
-
-    Container->>RAGModel: Call @modal.enter()
-
-    Note over RAGModel: Initialization phase
-    RAGModel->>RAGModel: Load HuggingFaceEmbeddings (CUDA)
-    RAGModel->>ChromaDB: Connect to remote service
-    RAGModel->>RAGModel: Load Mistral-7B (GPU)
-    RAGModel->>RAGModel: Create RemoteChromaRetriever class
-
-    RAGModel-->>Container: Ready
-    Container-->>Modal: Container warm and ready
-
-    Note over Modal,Container: Container stays warm (min_containers=1)
-
-    loop Handle requests
-        Modal->>RAGModel: Invoke query() method
-        RAGModel-->>Modal: Return result
-    end
-
-    Note over Modal,Container: Container persists until scaled down
-```
-
-## Key Components
-
-### Modal Configuration
-- **App Name**: `insurance-rag`
-- **Volume**: `mcp-hack-ins-products` mounted at `/insurance-data`
-- **GPU**: A10G for RAGModel class
-- **Autoscaling**: `min_containers=1`, `max_containers=1` (always warm)
-
-### Models
-- **LLM**: `mistralai/Mistral-7B-Instruct-v0.3` (GPU, float16)
-- **Embeddings**: `BAAI/bge-small-en-v1.5` (GPU, CUDA)
-
-### Storage
-- **Vector DB**: Remote ChromaDB service (`chroma-server-v2`)
-- **Collection**: `insurance_products`
-- **Chunk Size**: 1000 characters with 200 overlap
-
-### Endpoints
-- **Local Entrypoints**: `list`, `index`, `query`
-- **Web Endpoint**: `RAGModel.web_query` (FastAPI GET endpoint)
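-
-Pulled together, the configuration above corresponds roughly to this Modal
-declaration (a sketch assembled from the bullets, not a verbatim excerpt of
-`modal-rag.py`):
-
-```python
-import modal
-
-app = modal.App("insurance-rag")
-vol = modal.Volume.from_name("mcp-hack-ins-products")
-
-@app.cls(
-    gpu="A10G",
-    volumes={"/insurance-data": vol},
-    min_containers=1,  # always keep one warm container (see diagram 4)
-    max_containers=1,
-)
-class RAGModel:
-    @modal.enter()
-    def enter(self):
-        # Load the BAAI/bge-small-en-v1.5 embeddings and Mistral-7B once per container
-        ...
-```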
diff --git a/docs/guides/next_steps_rag_recommendation.md b/docs/guides/next_steps_rag_recommendation.md
deleted file mode 100644
index 6451d29dba8ac30f742165bffcf74c72ef7b6a10..0000000000000000000000000000000000000000
--- a/docs/guides/next_steps_rag_recommendation.md
+++ /dev/null
@@ -1,77 +0,0 @@
-# Next Steps: RAG for Product Design Document
-
-## Should You Add RAG?
-
-**Recommendation: YES, but with specific use cases in mind**
-
-### Benefits of Adding RAG:
-
-1. **Requirements Extraction**: Quickly find specific requirements from the 1,600-line document
-2. **Stakeholder Q&A**: Answer questions like "What's the premium for a 28-year-old in Shibuya?"
-3. **Design Validation**: Query coverage details, pricing tiers, compliance requirements
-4. **Development Planning**: Extract technical requirements, API specs, integration needs
-5. **Competitive Analysis**: Compare your product features vs competitors mentioned in the doc
-
-### When RAG is NOT Needed:
-
-- If you just need to read/search the document manually
-- If the document is small enough to navigate easily
-- If you don't need to answer complex questions across multiple sections
-
-## Implementation Options
-
-### Option 1: Extend Existing Modal RAG (Recommended)
-- Your existing `modal-rag.py` already handles PDFs
-- Can easily add support for markdown/Word documents (see the sketch at the end of this document)
-- Leverages existing ChromaDB infrastructure
-- **Effort**: Low (30-60 minutes)
-
-### Option 2: Simple Document Search
-- Use grep/search tools for simple queries
-- **Effort**: None (already available)
-
-### Option 3: Full RAG with Fine-Tuning
-- Fine-tune model on insurance domain + your product spec
-- **Effort**: High (days/weeks)
-- **Benefit**: Best accuracy for insurance-specific queries
-
-## Recommended Next Steps
-
-1. **Add Product Design Doc to RAG** (30 min)
-   - Extend `modal-rag.py` to load markdown/Word docs
-   - Index the filled product design document
-   - Test with sample queries
-
-2. **Create Query Interface** (1-2 hours)
-   - Simple CLI or web interface
-   - Example queries:
-     - "What are the three product tiers and their premium ranges?"
-     - "What coverage does the Standard tier include?"
-     - "What are the Year 3 financial projections?"
-
-3. **Use Cases to Test**:
-   - Requirements extraction for development
-   - Pricing questions for sales team
-   - Compliance checklist generation
-   - Feature comparison queries
-
-## Quick Decision Matrix
-
-| Use Case | RAG Needed? | Alternative |
-|----------|-------------|-------------|
-| Find specific section | ❌ No | Use table of contents |
-| Answer "What's the premium for X?" | ✅ Yes | Manual search |
-| Extract all requirements | ✅ Yes | Manual extraction |
-| Compare product tiers | ✅ Yes | Manual comparison |
-| Generate compliance checklist | ✅ Yes | Manual review |
-| Simple fact lookup | ⚠️ Maybe | Grep/search |
-
-## Recommendation
-
-**Start with Option 1**: Extend your existing RAG to include the product design document. It's low effort, leverages existing infrastructure, and gives you the ability to query the spec as you develop the product.
-
-Would you like me to:
-1. Extend `modal-rag.py` to support the product design document?
-2. Create a simple query interface?
-3. Both?
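-
-A sketch of what the Option 1 loader extension could look like, assuming
-LangChain's community loaders are available in the image (`Docx2txtLoader`
-and `UnstructuredMarkdownLoader` are assumptions, not code already in
-`modal-rag.py`):
-
-```python
-from langchain_community.document_loaders import (
-    Docx2txtLoader,
-    UnstructuredMarkdownLoader,
-)
-
-def load_product_design_docs(data_dir: str = "/insurance-data"):
-    """Load the markdown/Word product design docs alongside the existing PDFs."""
-    docs = []
-    docs += UnstructuredMarkdownLoader(
-        f"{data_dir}/tokyo_auto_insurance_product_design_filled.md"
-    ).load()
-    docs += Docx2txtLoader(
-        f"{data_dir}/tokyo_auto_insurance_product_design.docx"
-    ).load()
-    return docs
-```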
diff --git a/scripts/__init__.py b/src/__init__.py
similarity index 100%
rename from scripts/__init__.py
rename to src/__init__.py
diff --git a/docs/clean_sample.py b/src/data/clean_sample.py
similarity index 100%
rename from docs/clean_sample.py
rename to src/data/clean_sample.py
diff --git a/scripts/data/cleanup_data.py b/src/data/cleanup_data.py
similarity index 100%
rename from scripts/data/cleanup_data.py
rename to src/data/cleanup_data.py
diff --git a/scripts/data/clear_census_volume.py b/src/data/clear_census_volume.py
similarity index 100%
rename from scripts/data/clear_census_volume.py
rename to src/data/clear_census_volume.py
diff --git a/scripts/data/convert_census_to_csv.py b/src/data/convert_census_to_csv.py
similarity index 100%
rename from scripts/data/convert_census_to_csv.py
rename to src/data/convert_census_to_csv.py
diff --git a/scripts/data/convert_economy_labor_to_csv.py b/src/data/convert_economy_labor_to_csv.py
similarity index 100%
rename from scripts/data/convert_economy_labor_to_csv.py
rename to src/data/convert_economy_labor_to_csv.py
diff --git a/scripts/data/convert_to_word.py b/src/data/convert_to_word.py
similarity index 100%
rename from scripts/data/convert_to_word.py
rename to src/data/convert_to_word.py
diff --git a/scripts/data/create_custom_qa.py b/src/data/create_custom_qa.py
similarity index 100%
rename from scripts/data/create_custom_qa.py
rename to src/data/create_custom_qa.py
diff --git a/docs/debug_parser.py b/src/data/debug_parser.py
similarity index 100%
rename from docs/debug_parser.py
rename to src/data/debug_parser.py
diff --git a/scripts/data/delete_census_csvs.py b/src/data/delete_census_csvs.py
similarity index 100%
rename from scripts/data/delete_census_csvs.py
rename to src/data/delete_census_csvs.py
diff --git a/scripts/data/download_census_api.py b/src/data/download_census_api.py
similarity index 100%
rename from scripts/data/download_census_api.py
rename to src/data/download_census_api.py
diff --git a/scripts/data/download_census_csv_modal.py b/src/data/download_census_csv_modal.py
similarity index 100%
rename from scripts/data/download_census_csv_modal.py
rename to src/data/download_census_csv_modal.py
diff --git a/scripts/data/download_census_data.py b/src/data/download_census_data.py
similarity index 100%
rename from scripts/data/download_census_data.py
rename to src/data/download_census_data.py
diff --git a/scripts/data/download_census_modal.py b/src/data/download_census_modal.py
similarity index 100%
rename from scripts/data/download_census_modal.py
rename to src/data/download_census_modal.py
diff --git a/scripts/data/download_economy_labor_modal.py b/src/data/download_economy_labor_modal.py
similarity index 100%
rename from scripts/data/download_economy_labor_modal.py
rename to src/data/download_economy_labor_modal.py
diff --git a/scripts/data/fix_csv_filenames.py b/src/data/fix_csv_filenames.py
similarity index 100%
rename from scripts/data/fix_csv_filenames.py
rename to src/data/fix_csv_filenames.py
diff --git a/scripts/data/prepare_economy_data.py b/src/data/prepare_economy_data.py
similarity index 100%
rename from scripts/data/prepare_economy_data.py
rename to src/data/prepare_economy_data.py
diff --git a/scripts/data/prepare_finetune_data.py b/src/data/prepare_finetune_data.py
similarity index 100%
rename from scripts/data/prepare_finetune_data.py
rename to src/data/prepare_finetune_data.py
diff --git a/scripts/data/remove_duplicate_csvs.py b/src/data/remove_duplicate_csvs.py
similarity index 100%
rename from scripts/data/remove_duplicate_csvs.py
rename to src/data/remove_duplicate_csvs.py
diff --git a/src/rag/api_client.py b/src/rag/api_client.py
new file mode 100644
index 0000000000000000000000000000000000000000..867066b22838f546668ae384a2a7f671236da161
--- /dev/null
+++ b/src/rag/api_client.py
@@ -0,0 +1,103 @@
+"""
+Client library for the RAG API
+Use this to call the API from Python code
+"""
+
+import requests
+from typing import Dict
+
+class RAGAPIClient:
+    """Client for the Product Design RAG API"""
+
+    def __init__(self, base_url: str = "http://localhost:8000"):
+        """
+        Initialize the API client
+
+        Args:
+            base_url: Base URL of the RAG API
+        """
+        self.base_url = base_url.rstrip('/')
+
+    def health_check(self) -> Dict:
+        """Check if the API is healthy"""
+        try:
+            response = requests.get(f"{self.base_url}/health", timeout=5)
+            response.raise_for_status()
+            return response.json()
+        except Exception as e:
+            return {"status": "unhealthy", "error": str(e)}
+
+    def query(
+        self,
+        question: str,
+        top_k: int = 5,
+        max_tokens: int = 1024,
+        timeout: int = 5
+    ) -> Dict:
+        """
+        Query the RAG system
+
+        Args:
+            question: The question to ask
+            top_k: Number of documents to retrieve
+            max_tokens: Maximum tokens in response
+            timeout: Request timeout in seconds
+
+        Returns:
+            Dictionary with answer, timing, and sources
+        """
+        try:
+            response = requests.post(
+                f"{self.base_url}/query",
+                json={
+                    "question": question,
+                    "top_k": top_k,
+                    "max_tokens": max_tokens
+                },
+                timeout=timeout
+            )
+            response.raise_for_status()
+            return response.json()
+        except requests.exceptions.Timeout:
+            return {
+                "success": False,
+                "error": f"Request timed out after {timeout} seconds"
+            }
+        except requests.exceptions.RequestException as e:
+            return {
+                "success": False,
+                "error": f"Request failed: {str(e)}"
+            }
+
+    def query_fast(self, question: str) -> Dict:
+        """
+        Fast query with optimized settings for <3 second responses
+
+        Args:
+            question: The question to ask
+
+        Returns:
+            Dictionary with answer, timing, and sources
+        """
+        return self.query(
+            question=question,
+            top_k=3,         # Fewer docs for speed
+            max_tokens=512,  # Shorter responses
+            timeout=5
+        )
+
+# Example usage
+if __name__ == "__main__":
+    # Initialize client
+    client = RAGAPIClient(base_url="http://localhost:8000")
+
+    # Health check
+    print("Health check:", client.health_check())
+
+    # Query
+    result = client.query("What are the three product tiers?")
+    print("\nQuery result:")
+    print(f"Answer: {result.get('answer', 'N/A')}")
+    print(f"Total time: {result.get('total_time', 0):.2f}s")
+    print(f"Success: {result.get('success', False)}")
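+
+# Note: when the API runs on Modal rather than locally, point the client at the
+# deployed URL instead of localhost (hypothetical URL shape shown):
+#   client = RAGAPIClient(base_url="https://<workspace>--insurance-rag-api-fastapi-app.modal.run")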
diff --git a/src/rag/rag_api.py b/src/rag/rag_api.py
new file mode 100644
index 0000000000000000000000000000000000000000..3710dee136f5237b47865a5184c8750d3b8ab559
--- /dev/null
+++ b/src/rag/rag_api.py
@@ -0,0 +1,290 @@
+"""
+Fast API endpoint for RAG system - optimized for <3 second responses
+"""
+
+import modal
+
+app = modal.App("insurance-rag-api")
+
+# Reference your specific volume
+vol = modal.Volume.from_name("mcp-hack-ins-products", create_if_missing=True)
+
+# Model configuration
+LLM_MODEL = "microsoft/Phi-3-mini-4k-instruct"
+EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"
+
+# Build image with dependencies
+image = (
+    modal.Image.debian_slim(python_version="3.11")
+    .pip_install(
+        # Core ML dependencies (compatible versions)
+        "torch>=2.0.0",
+        "transformers>=4.30.0",
+        "sentence-transformers>=2.2.0",
+        "huggingface_hub>=0.15.0",
+
+        # LangChain (compatible versions)
+        "langchain>=0.1.0",
+        "langchain-community>=0.0.13",
+
+        # Document processing
+        "pypdf>=4.0.0",
+        "python-docx>=1.1.0",
+        "openpyxl>=3.1.0",
+        "pandas>=2.0.0",
+        "xlrd>=2.0.0",
+
+        # Vector database
+        "chromadb>=0.4.0",
+
+        # Web framework
+        "fastapi>=0.100.0",
+        "uvicorn[standard]>=0.20.0",
+
+        # LLM inference (vLLM - latest stable)
+        "vllm>=0.4.0",
+
+        # Utilities
+        "cryptography>=41.0.0",
+    )
+)
+
+@app.cls(
+    image=image,
+    volumes={"/insurance-data": vol},
+    gpu="A10G",
+    timeout=30,            # Shorter timeout for API
+    max_containers=2,      # Allow scaling
+    min_containers=1,      # Keep warm for fast responses
+    scaledown_window=300,  # Keep warm for 5 minutes
+)
+class FastRAGService:
+    """Optimized RAG service for fast API responses"""
+
+    @modal.enter()
+    def enter(self):
+        from langchain_community.embeddings import HuggingFaceEmbeddings
+        from vllm import LLM, SamplingParams
+        from langchain.schema import Document
+
+        print("🚀 Initializing Fast RAG Service...")
+
+        # Initialize embeddings (faster model)
+        self.embeddings = HuggingFaceEmbeddings(
+            model_name=EMBEDDING_MODEL,
+            model_kwargs={'device': 'cuda'},
+            encode_kwargs={'normalize_embeddings': True}
+        )
+
+        # Connect to Chroma
+        self.chroma_service = modal.Cls.from_name("chroma-server-v2", "ChromaDB")()
+
+        # Custom retriever against the remote ChromaDB service
+        class RemoteChromaRetriever:
+            def __init__(self, chroma_service, embeddings, k=5):
+                self.chroma_service = chroma_service
+                self.embeddings = embeddings
+                self.k = k
+
+            def get_relevant_documents(self, query: str):
+                query_embedding = self.embeddings.embed_query(query)
+                results = self.chroma_service.query.remote(
+                    collection_name="product_design",
+                    query_embeddings=[query_embedding],
+                    n_results=self.k
+                )
+
+                docs = []
+                if results and 'documents' in results and len(results['documents']) > 0:
+                    for i, doc_text in enumerate(results['documents'][0]):
+                        metadata = results.get('metadatas', [[{}]])[0][i] if 'metadatas' in results else {}
+                        docs.append(Document(page_content=doc_text, metadata=metadata))
+
+                return docs
+
+        self.Retriever = RemoteChromaRetriever
+
+        # Load LLM with optimized settings for speed
+        print("Loading LLM (optimized for speed)...")
+        self.llm_engine = LLM(
+            model=LLM_MODEL,
+            dtype="float16",
+            gpu_memory_utilization=0.9,  # Higher utilization for speed
+            max_model_len=4096,
+            trust_remote_code=True,
+            enforce_eager=True,          # Skip CUDA graph capture for faster cold starts
+            enable_prefix_caching=True,  # Cache prefixes for faster generation
+        )
+
+        # Default sampling settings (query() builds a per-request copy so callers
+        # can override max_tokens)
+        self.default_sampling_params = SamplingParams(
+            temperature=0.7,
+            max_tokens=1024,  # Reduced from 1536 for faster responses
+            top_p=0.9,
+            stop=["\n\n\n", "Question:", "Context:", "<|end|>"]
+        )
+
+        print("✅ Fast RAG Service ready!")
+
+    @modal.method()
+    def query(self, question: str, top_k: int = 5, max_tokens: int = 1024):
+        """Fast query method optimized for <3 second responses"""
+        import time
+        start_time = time.time()
+
+        # Retrieve documents
+        retrieval_start = time.time()
+        retriever = self.Retriever(
+            chroma_service=self.chroma_service,
+            embeddings=self.embeddings,
+            k=top_k
+        )
+        docs = retriever.get_relevant_documents(question)
+        retrieval_time = time.time() - retrieval_start
+
+        if not docs:
+            return {
+                "answer": "No relevant information found in the product design document.",
+                "retrieval_time": retrieval_time,
+                "generation_time": 0,
+                "total_time": time.time() - start_time,
+                "sources": [],
+                "success": False
+            }
+
+        # Build context (limit size for speed)
+        context = "\n\n".join([doc.page_content[:800] for doc in docs[:3]])  # Top 3 docs, 800 chars each
+
+        # Create prompt (Phi-3 chat format)
+        prompt = f"""<|system|>
+You are a helpful AI assistant. Answer questions about the TokyoDrive Insurance product design document concisely and accurately.<|end|>
+<|user|>
+Context:
+{context}
+
+Question:
+{question}<|end|>
+<|assistant|>"""
+
+        # Generate with optimized params
+        from vllm import SamplingParams
+        sampling_params = SamplingParams(
+            temperature=0.7,
+            max_tokens=max_tokens,
+            top_p=0.9,
+            stop=["\n\n\n", "Question:", "Context:", "<|end|>"]
+        )
+
+        gen_start = time.time()
+        outputs = self.llm_engine.generate(prompts=[prompt], sampling_params=sampling_params)
+        answer = outputs[0].outputs[0].text.strip()
+        generation_time = time.time() - gen_start
+
+        # Prepare sources (limited for speed)
+        sources = []
+        for doc in docs[:3]:  # Limit to 3 sources
+            sources.append({
+                "content": doc.page_content[:300],
+                "metadata": doc.metadata
+            })
+
+        total_time = time.time() - start_time
+
+        return {
+            "answer": answer,
+            "retrieval_time": retrieval_time,
+            "generation_time": generation_time,
+            "total_time": total_time,
+            "sources": sources,
+            "success": True
+        }
+
+# Deploy as web endpoint
+@app.function(
+    image=image,
+    volumes={"/insurance-data": vol},
+    allow_concurrent_inputs=10,  # Handle multiple requests
+)
+@modal.asgi_app()
+def fastapi_app():
+    """Deploy FastAPI app - all imports inside to avoid local dependency issues"""
+    from fastapi import FastAPI, HTTPException
+    from fastapi.middleware.cors import CORSMiddleware
+    from pydantic import BaseModel
+
+    # Request/Response models
+    class QueryRequest(BaseModel):
+        question: str
+        top_k: int = 5
+        max_tokens: int = 1024  # Reduced for faster responses
+
+    class QueryResponse(BaseModel):
+        answer: str
+        retrieval_time: float
+        generation_time: float
+        total_time: float
+        sources: list
+        success: bool
+
+    # FastAPI app
+    web_app = FastAPI(title="Product Design RAG API", version="1.0.0")
+
+    # CORS
+    web_app.add_middleware(
+        CORSMiddleware,
+        allow_origins=["*"],
+        allow_credentials=True,
+        allow_methods=["*"],
+        allow_headers=["*"],
+    )
+
+    # Initialize RAG service
+    rag_service = FastRAGService()
+
+    @web_app.get("/health")
+    async def health():
+        """Health check endpoint"""
+        return {"status": "healthy", "service": "rag-api"}
+
+    @web_app.post("/query", response_model=QueryResponse)
+    async def query_rag(request: QueryRequest):
+        """
+        Query the RAG system - optimized for <3 second responses
+
+        Args:
+            question: The question to ask
+            top_k: Number of documents to retrieve (default: 5)
+            max_tokens: Maximum tokens in response (default: 1024)
+
+        Returns:
+            QueryResponse with answer, timing, and sources
+        """
+        try:
+            result = rag_service.query.remote(
+                question=request.question,
+                top_k=request.top_k,
+                max_tokens=request.max_tokens
+            )
+
+            if not result.get("success", True):
+                raise HTTPException(status_code=404, detail="No relevant information found")
+
+            return QueryResponse(**result)
+
+        except HTTPException:
+            # Re-raise as-is so the 404 above isn't rewrapped as a 500
+            raise
+        except Exception as e:
+            raise HTTPException(status_code=500, detail=f"Error processing query: {str(e)}")
+
+    @web_app.get("/")
+    async def root():
+        """API root endpoint"""
+        return {
+            "service": "Product Design RAG API",
+            "version": "1.0.0",
+            "endpoints": {
+                "health": "/health",
+                "query": "/query (POST)"
+            },
+            "target_response_time": "<3 seconds"
+        }
+
+    return web_app
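+
+# Deployment follows the standard Modal CLI flow, e.g.:
+#   modal deploy src/rag/rag_api.py
+# Modal prints the public URL for `fastapi_app`; use it as the base_url in
+# src/rag/api_client.py's RAGAPIClient.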
diff --git a/scripts/tools/api_endpoint.py b/src/tools/api_endpoint.py
similarity index 100%
rename from scripts/tools/api_endpoint.py
rename to src/tools/api_endpoint.py
diff --git a/scripts/tools/api_endpoint_cpu.py b/src/tools/api_endpoint_cpu.py
similarity index 100%
rename from scripts/tools/api_endpoint_cpu.py
rename to src/tools/api_endpoint_cpu.py
diff --git a/scripts/tools/ask_model.py b/src/tools/ask_model.py
similarity index 100%
rename from scripts/tools/ask_model.py
rename to src/tools/ask_model.py
diff --git a/scripts/tools/debug_list_csv.py b/src/tools/debug_list_csv.py
similarity index 100%
rename from scripts/tools/debug_list_csv.py
rename to src/tools/debug_list_csv.py
diff --git a/scripts/tools/eval_finetuned.py b/src/tools/eval_finetuned.py
similarity index 100%
rename from scripts/tools/eval_finetuned.py
rename to src/tools/eval_finetuned.py
diff --git a/scripts/tools/fill_product_design.py b/src/tools/fill_product_design.py
similarity index 100%
rename from scripts/tools/fill_product_design.py
rename to src/tools/fill_product_design.py
diff --git a/scripts/tools/finetune_modal.py b/src/tools/finetune_modal.py
similarity index 100%
rename from scripts/tools/finetune_modal.py
rename to src/tools/finetune_modal.py
diff --git a/scripts/tools/finetune_modal_simple.py b/src/tools/finetune_modal_simple.py
similarity index 100%
rename from scripts/tools/finetune_modal_simple.py
rename to src/tools/finetune_modal_simple.py
diff --git a/tests/test_api.py b/tests/test_api.py
new file mode 100755
index 0000000000000000000000000000000000000000..f98cc26b7546f94bc2e5a0fce8766be98e7b1dd5
--- /dev/null
+++ b/tests/test_api.py
@@ -0,0 +1,106 @@
+#!/usr/bin/env python3
+"""
+Test the RAG API for <3 second response times
+"""
+
+import sys
+import time
+from pathlib import Path
+
+# Add the repo root to sys.path so `src.rag` is importable
+sys.path.insert(0, str(Path(__file__).parent.parent))
+
+from src.rag.api_client import RAGAPIClient
+
+def test_api_performance(api_url: str = "http://localhost:8000"):
+    """Test API performance"""
+    print("="*70)
+    print("🧪 RAG API Performance Test")
+    print("="*70)
+
+    client = RAGAPIClient(base_url=api_url)
+
+    # Test 1: Health check
+    print("\n1. Health Check...")
+    health = client.health_check()
+    print(f"   Status: {health.get('status', 'unknown')}")
+
+    if health.get("status") != "healthy":
+        print("❌ API is not healthy. Make sure it's deployed and running.")
+        return
+
+    # Test 2: Performance test
+    print("\n2. Performance Test (<3s target)...")
+    test_questions = [
+        "What are the three product tiers?",
+        "What is the Year 3 premium volume?",
+        "What coverage does the Standard tier include?",
+    ]
+
+    results = []
+    for i, question in enumerate(test_questions, 1):
+        print(f"\n   Query {i}: {question[:50]}...")
+        start = time.time()
+        result = client.query(question)
+        elapsed = time.time() - start
+
+        if result.get("success"):
+            total_time = result.get("total_time", elapsed)
+            retrieval = result.get("retrieval_time", 0)
+            generation = result.get("generation_time", 0)
+
+            status = "✅" if total_time < 3.0 else "⚠️"
+            print(f"   {status} Total: {total_time:.2f}s (Retrieval: {retrieval:.2f}s, Generation: {generation:.2f}s)")
+
+            if total_time < 3.0:
+                print(f"   ✅ Meets <3s target!")
+            else:
+                print(f"   ⚠️ Exceeds 3s target by {total_time - 3.0:.2f}s")
+
+            results.append({
+                "question": question,
+                "total_time": total_time,
+                "retrieval_time": retrieval,
+                "generation_time": generation,
+                "success": True
+            })
+        else:
+            print(f"   ❌ Failed: {result.get('error', 'Unknown error')}")
+            results.append({"success": False})
+
+    # Summary
+    print("\n" + "="*70)
+    print("📊 Performance Summary")
+    print("="*70)
+
+    successful = [r for r in results if r.get("success")]
+    if successful:
+        avg_time = sum(r["total_time"] for r in successful) / len(successful)
+        fastest = min(r["total_time"] for r in successful)
+        slowest = max(r["total_time"] for r in successful)
+
+        print(f"Average response time: {avg_time:.2f}s")
+        print(f"Fastest: {fastest:.2f}s")
+        print(f"Slowest: {slowest:.2f}s")
+        print(f"Target: <3.0s")
+
+        if avg_time < 3.0:
+            print("\n🎉 API meets performance target!")
+        else:
+            print(f"\n⚠️ API exceeds target by {avg_time - 3.0:.2f}s on average")
+
+    print("\n" + "="*70)
+
+if __name__ == "__main__":
+    import argparse
+
+    parser = argparse.ArgumentParser(description="Test RAG API performance")
+    parser.add_argument(
+        "--url",
+        default="http://localhost:8000",
+        help="API URL (default: http://localhost:8000)"
+    )
+
+    args = parser.parse_args()
+    test_api_performance(args.url)
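+
+# Example run against a deployed instance (hypothetical URL):
+#   python tests/test_api.py --url https://<workspace>--insurance-rag-api-fastapi-app.modal.run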