Julian Vanecek committed
Commit 3151bfa
Parent(s): 6edaf19
init
- backend/FAQ_MANAGEMENT.md +77 -0
- backend/IMPORTANT_API_CHANGES.md +43 -0
- backend/__init__.py +1 -0
- backend/add_to_vector_store.py +195 -0
- backend/chatbot_backend.py +274 -0
- backend/document_reader.py +195 -0
- backend/test_pdf_mapping.py +32 -0
- backend/upload_versioned_pdfs.py +239 -0
- backend/vector_store_manager.py +178 -0
backend/FAQ_MANAGEMENT.md
ADDED
@@ -0,0 +1,77 @@
# FAQ Management Guide

This guide explains how to manage FAQ documents in the OpenAI Chatbot MCP system.

## Initial Setup (Without FAQ Documents)

1. **Upgrade the OpenAI library**:
   ```bash
   pip install --upgrade "openai>=1.50.0"
   ```

2. **Create vector stores (skipping the empty FAQ store)**:
   ```bash
   python backend/upload_versioned_pdfs.py
   ```
   This will:
   - Create vector stores for all versions with PDFs
   - Skip the general_faq store, since no FAQ documents exist yet
   - Save the configuration with the actual vector store IDs

## Adding FAQ Documents Later

### Option 1: Add to an Existing FAQ Store

If you created an empty FAQ store:
```bash
# Add a single FAQ document
python backend/add_to_vector_store.py add general_faq /path/to/faq.pdf

# Add multiple FAQ documents
python backend/add_to_vector_store.py add general_faq /path/to/faq1.pdf /path/to/faq2.pdf
```

### Option 2: Create the FAQ Store First

If you skipped the FAQ store initially:
```bash
# Create the FAQ store
python backend/add_to_vector_store.py create general_faq \
    --name "General FAQ and Overview" \
    --description "General information, FAQs, and cross-version content"

# Then add documents
python backend/add_to_vector_store.py add general_faq /path/to/faq.pdf
```

## Listing Available Stores

To see all configured vector stores:
```bash
python backend/add_to_vector_store.py list
```

## FAQ Document Naming

For automatic detection in future runs, name FAQ documents with one of these keywords:
- `faq` - e.g., `product_faq.pdf`
- `general` - e.g., `general_overview.pdf`
- `overview` - e.g., `platform_overview.pdf`
- `comparison` - e.g., `version_comparison.pdf`

## Full Re-upload with FAQ

Once you have FAQ documents in the `/pdfs` directory:
```bash
# This will detect and upload FAQ documents automatically
python backend/upload_versioned_pdfs.py
```

## Forcing Empty Store Creation

To create all stores, including empty ones:
```bash
python backend/upload_versioned_pdfs.py --create-empty
```

This is useful if you want every store ready even before documents exist.
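The keyword detection described above can be sketched as a simple substring check on the filename. This is an illustrative helper with a hypothetical name; the actual detection logic lives in `upload_versioned_pdfs.py` and may differ in detail.

```python
from pathlib import Path

# Keywords that mark a PDF as a general/FAQ document (see "FAQ Document Naming")
FAQ_KEYWORDS = ("faq", "general", "overview", "comparison")

def is_faq_document(pdf_path: Path) -> bool:
    """Return True if the filename suggests a general/FAQ document."""
    name = pdf_path.stem.lower()
    return any(keyword in name for keyword in FAQ_KEYWORDS)

print(is_faq_document(Path("product_faq.pdf")))      # True
print(is_faq_document(Path("v2_installation.pdf")))  # False
```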
backend/IMPORTANT_API_CHANGES.md
ADDED
@@ -0,0 +1,43 @@
# Important: OpenAI API Changes

## Vector Stores API Location

As of OpenAI Python SDK 1.93.x, the vector stores API has moved out of the beta namespace:

- **OLD**: `client.beta.vector_stores`
- **NEW**: `client.vector_stores`

## How Vector Stores Work

Vector stores are designed to work with the Assistants API:

1. **Create vector stores**: `client.vector_stores.create()`
2. **Upload files to stores**: `client.vector_stores.files.create()`
3. **Use with assistants**: Vector stores are queried through assistants via the `file_search` tool

## The Architecture

```
Vector Stores (storage) -> Assistants (query interface) -> Threads (conversations)
```

## Current Implementation Status

1. **upload_versioned_pdfs.py**: ✅ Fixed to use `client.vector_stores`
2. **add_to_vector_store.py**: ✅ Fixed to use `client.vector_stores`
3. **vector_store_manager.py**: ❌ Needs assistant creation for querying

## Next Steps

To properly use vector stores for querying, you need to:

1. Create an assistant with the `file_search` capability
2. Attach vector stores to the assistant
3. Use threads to query the assistant

Alternative approach:
- Use the OpenAI embeddings API directly
- Store embeddings in a local database
- Implement your own similarity search

This would avoid the complexity of the Assistants API but would require more implementation work.
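The similarity-search half of that alternative can be sketched in pure Python. In practice the vectors would come from the embeddings API (e.g. `client.embeddings.create(...)`); the function names and the tiny example vectors below are illustrative, not part of this codebase.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=3):
    """Rank stored (doc_id, vector) pairs by similarity to the query vector."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

docs = [("intro", [0.9, 0.1]), ("install", [0.1, 0.9])]
print(top_k([1.0, 0.0], docs, k=1)[0][0])  # intro
```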
backend/__init__.py
ADDED
@@ -0,0 +1 @@
# Backend package
backend/add_to_vector_store.py
ADDED
@@ -0,0 +1,195 @@
#!/usr/bin/env python3
"""
Add documents to existing OpenAI vector stores.
Useful for adding FAQ documents or updating existing stores.
"""

import os
import json
import time
import argparse
import sys
from pathlib import Path
from typing import List, Optional

from openai import OpenAI, __version__ as openai_version
from packaging import version


class VectorStoreUpdater:
    def __init__(self, api_key: Optional[str] = None):
        """Initialize the updater with an OpenAI client."""
        # Check the OpenAI library version
        if version.parse(openai_version) < version.parse("1.50.0"):
            print(f"Error: OpenAI library version {openai_version} is too old.")
            print("Vector stores require version 1.50.0 or higher.")
            print('Please run: pip install --upgrade "openai>=1.50.0"')
            sys.exit(1)

        self.client = OpenAI(api_key=api_key or os.getenv("OPENAI_API_KEY"))
        self.config_path = Path(__file__).parent.parent / "config" / "vector_stores.json"
        self.load_config()

    def load_config(self):
        """Load the vector store configuration."""
        if not self.config_path.exists():
            print(f"Error: Configuration file not found at {self.config_path}")
            print("Please run upload_versioned_pdfs.py first to create vector stores.")
            sys.exit(1)

        with open(self.config_path, 'r') as f:
            self.config = json.load(f)
        self.vector_stores = self.config.get('vector_stores', {})

    def list_stores(self):
        """List all available vector stores."""
        print("\nAvailable vector stores:")
        for store_name, store_id in self.vector_stores.items():
            print(f"  - {store_name}: {store_id}")

    def add_file_to_store(self, store_name: str, file_path: Path) -> bool:
        """Add a file to an existing vector store."""
        if store_name not in self.vector_stores:
            print(f"Error: Vector store '{store_name}' not found.")
            self.list_stores()
            return False

        store_id = self.vector_stores[store_name]
        print(f"Adding {file_path.name} to {store_name} ({store_id})...")

        try:
            # Upload the file
            with open(file_path, "rb") as file:
                file_upload = self.client.files.create(
                    file=file,
                    purpose="assistants"
                )

            # Attach the file to the vector store
            self.client.vector_stores.files.create(
                vector_store_id=store_id,
                file_id=file_upload.id
            )

            # Wait for processing
            while True:
                file_status = self.client.vector_stores.files.retrieve(
                    vector_store_id=store_id,
                    file_id=file_upload.id
                )
                if file_status.status == "completed":
                    print(f"✓ Successfully added {file_path.name}")
                    return True
                elif file_status.status == "failed":
                    print(f"✗ Failed to process {file_path.name}")
                    return False
                time.sleep(2)

        except Exception as e:
            print(f"✗ Error adding file: {str(e)}")
            return False

    def add_multiple_files(self, store_name: str, file_paths: List[Path]):
        """Add multiple files to a vector store."""
        if not file_paths:
            print("No files to add.")
            return

        print(f"\nAdding {len(file_paths)} files to {store_name}...")
        success_count = 0

        for file_path in file_paths:
            if self.add_file_to_store(store_name, file_path):
                success_count += 1

        print(f"\n✓ Successfully added {success_count}/{len(file_paths)} files")

    def create_empty_store(self, store_name: str, name: str, description: str) -> Optional[str]:
        """Create a new empty vector store."""
        if store_name in self.vector_stores:
            print(f"Error: Vector store '{store_name}' already exists.")
            return None

        print(f"Creating new vector store: {name}")
        # Note: the description parameter is no longer supported by the API,
        # so the description is stored in the local config instead
        try:
            vector_store = self.client.vector_stores.create(
                name=name
            )

            # Update the config
            self.vector_stores[store_name] = vector_store.id
            self.config['vector_stores'] = self.vector_stores

            # Store the description in the config since the API no longer supports it
            if 'descriptions' not in self.config:
                self.config['descriptions'] = {}
            self.config['descriptions'][store_name] = description

            with open(self.config_path, 'w') as f:
                json.dump(self.config, f, indent=2)

            print(f"✓ Created vector store: {store_name} ({vector_store.id})")
            return vector_store.id

        except Exception as e:
            print(f"✗ Error creating vector store: {str(e)}")
            return None


def main():
    """Main entry point."""
    parser = argparse.ArgumentParser(description="Add documents to OpenAI vector stores")

    subparsers = parser.add_subparsers(dest='command', help='Commands')

    # List command
    subparsers.add_parser('list', help='List available vector stores')

    # Add command
    add_parser = subparsers.add_parser('add', help='Add files to a vector store')
    add_parser.add_argument('store_name', help='Name of the vector store (e.g., general_faq)')
    add_parser.add_argument('files', nargs='+', help='Files to add')

    # Create command
    create_parser = subparsers.add_parser('create', help='Create a new empty vector store')
    create_parser.add_argument('store_name', help='Internal name (e.g., general_faq)')
    create_parser.add_argument('--name', required=True, help='Display name')
    create_parser.add_argument('--description', required=True, help='Description')

    args = parser.parse_args()

    if not args.command:
        parser.print_help()
        return

    # Check for the API key
    if not os.getenv("OPENAI_API_KEY"):
        print("Error: OPENAI_API_KEY environment variable not set")
        return

    updater = VectorStoreUpdater()

    if args.command == 'list':
        updater.list_stores()

    elif args.command == 'add':
        # Resolve and validate file paths
        file_paths = []
        for file_arg in args.files:
            file_path = Path(file_arg)
            if not file_path.exists():
                print(f"Warning: File not found: {file_path}")
            else:
                file_paths.append(file_path)

        if file_paths:
            updater.add_multiple_files(args.store_name, file_paths)
        else:
            print("No valid files to add.")

    elif args.command == 'create':
        updater.create_empty_store(args.store_name, args.name, args.description)


if __name__ == "__main__":
    main()
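The `while True` status loop in `add_file_to_store` polls every 2 seconds with no upper bound, so a stuck file would hang the script. A bounded polling helper along these lines could cap the wait; the name `poll_until` and the parameter choices are hypothetical, not part of this codebase.

```python
import time

def poll_until(check, interval=2.0, timeout=120.0):
    """Call check() every `interval` seconds until it returns a non-None
    terminal value, or raise TimeoutError once `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError("polling timed out")

# Simulated status endpoint that completes on the third call
calls = []
def fake_status():
    calls.append(1)
    return "completed" if len(calls) >= 3 else None

print(poll_until(fake_status, interval=0.01))  # completed
```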
backend/chatbot_backend.py
ADDED
@@ -0,0 +1,274 @@
"""
OpenAI Chatbot Backend with Multi-Vector Store Support and MCP-style Tools
"""

import os
import json
import time
import logging
from typing import Dict, List, Optional, Tuple, Generator
from pathlib import Path
from openai import OpenAI
import tiktoken

from .vector_store_manager import VectorStoreManager
from .document_reader import DocumentReader
from ..tools.vector_search_tool import (
    get_vector_search_tool_definition,
    execute_vector_search,
    format_search_results_for_context
)
from ..tools.document_reader_tool import (
    get_document_reader_tool_definition,
    execute_document_read,
    format_document_content_for_context
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class ChatbotBackend:
    def __init__(self, api_key: Optional[str] = None):
        """Initialize the chatbot backend."""
        self.client = OpenAI(api_key=api_key or os.getenv("OPENAI_API_KEY"))
        self.vector_store_manager = VectorStoreManager(self.client)
        self.document_reader = DocumentReader()

        # Load configuration
        config_path = Path(__file__).parent.parent / "config" / "openai_config.json"
        with open(config_path, 'r') as f:
            self.config = json.load(f)

        # Initialize the tokenizer for token counting
        self.encoding = tiktoken.encoding_for_model("gpt-4o")

        # Define available tools
        self.tools = [
            get_vector_search_tool_definition(),
            get_document_reader_tool_definition()
        ]

    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.encoding.encode(text))

    def query_with_version(self, query: str, product: str, version: str,
                           custom_prompt: Optional[str] = None,
                           model: str = "gpt-4o",
                           temperature: float = 0.7,
                           max_tokens: int = 4000) -> Generator[Dict, None, None]:
        """
        Query the chatbot with automatic version-specific and general context.
        Yields streaming responses.
        """
        start_time = time.time()

        # Query both the version-specific and the general vector stores
        version_results, general_results = self.vector_store_manager.query_version_and_general(
            product, version, query, max_results=self.config.get("max_chunks", 10)
        )

        # Format context from the vector store results
        context = self.vector_store_manager.format_search_results(
            version_results, general_results, product, version
        )

        # Build the enhanced query
        enhanced_query = f"{context}\n\nUser Question: {query}"

        # Prepend the custom prompt if provided
        if custom_prompt:
            enhanced_query = f"{custom_prompt}\n\n{enhanced_query}"

        # Create messages
        messages = [
            {
                "role": "system",
                "content": (
                    f"You are an expert assistant for {product.capitalize()} version {version}. "
                    "You have access to version-specific documentation and general information. "
                    "You can use the provided tools to search for more information or read specific document pages. "
                    "Always provide accurate, version-specific answers based on the documentation."
                )
            },
            {"role": "user", "content": enhanced_query}
        ]

        # Count input tokens
        input_tokens = sum(self.count_tokens(msg["content"]) for msg in messages)

        # Stream the response with function calling
        try:
            stream = self.client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
                stream=True,
                tools=self.tools,
                tool_choice="auto"
            )

            # Track usage
            output_tokens = 0
            full_response = ""
            tool_calls = []
            current_tool_call = None

            for chunk in stream:
                delta = chunk.choices[0].delta

                # Handle tool calls
                if delta.tool_calls:
                    for tool_call_delta in delta.tool_calls:
                        if tool_call_delta.id:
                            # New tool call
                            if current_tool_call:
                                tool_calls.append(current_tool_call)
                            current_tool_call = {
                                "id": tool_call_delta.id,
                                "type": "function",
                                "function": {
                                    "name": tool_call_delta.function.name if tool_call_delta.function else "",
                                    "arguments": ""
                                }
                            }

                        if tool_call_delta.function and tool_call_delta.function.arguments:
                            current_tool_call["function"]["arguments"] += tool_call_delta.function.arguments

                # Handle regular content
                if delta.content:
                    output_tokens += self.count_tokens(delta.content)
                    full_response += delta.content

                    yield {
                        "type": "content",
                        "content": delta.content,
                        "done": False
                    }

                # Check whether the stream finished with tool calls
                if chunk.choices[0].finish_reason == "tool_calls":
                    # Add the last tool call
                    if current_tool_call:
                        tool_calls.append(current_tool_call)

                    # Execute the tool calls
                    tool_results = self._execute_tool_calls(tool_calls)

                    # Continue the conversation with the tool results
                    messages.append({
                        "role": "assistant",
                        "content": full_response,
                        "tool_calls": tool_calls
                    })

                    for tool_result in tool_results:
                        messages.append({
                            "role": "tool",
                            "tool_call_id": tool_result["tool_call_id"],
                            "content": tool_result["content"]
                        })

                    # Get the follow-up response
                    follow_up_stream = self.client.chat.completions.create(
                        model=model,
                        messages=messages,
                        temperature=temperature,
                        max_tokens=max_tokens,
                        stream=True
                    )

                    for follow_up_chunk in follow_up_stream:
                        if follow_up_chunk.choices[0].delta.content:
                            content = follow_up_chunk.choices[0].delta.content
                            output_tokens += self.count_tokens(content)
                            full_response += content

                            yield {
                                "type": "content",
                                "content": content,
                                "done": False
                            }

            # Calculate final metrics
            end_time = time.time()
            response_time = end_time - start_time

            # Calculate costs
            model_info = self.config["models"].get(model, {})
            input_cost = (input_tokens / 1_000_000) * model_info.get("input_cost", 0)
            output_cost = (output_tokens / 1_000_000) * model_info.get("output_cost", 0)
            total_cost = input_cost + output_cost

            # Yield final metadata
            yield {
                "type": "metadata",
                "done": True,
                "usage": {
                    "input_tokens": input_tokens,
                    "output_tokens": output_tokens,
                    "total_tokens": input_tokens + output_tokens
                },
                "cost": {
                    "input": round(input_cost, 4),
                    "output": round(output_cost, 4),
                    "total": round(total_cost, 4)
                },
                "response_time": round(response_time, 2),
                "model": model,
                "version_context": f"{product.capitalize()} {version}"
            }

        except Exception as e:
            logger.error(f"Error in chat completion: {str(e)}")
            yield {
                "type": "error",
                "error": str(e),
                "done": True
            }

    def _execute_tool_calls(self, tool_calls: List[Dict]) -> List[Dict]:
        """Execute tool calls and return their results."""
        results = []

        for tool_call in tool_calls:
            function_name = tool_call["function"]["name"]
            arguments = json.loads(tool_call["function"]["arguments"])

            if function_name == "search_vector_store":
                result = execute_vector_search(
                    self.vector_store_manager,
                    arguments["query"],
                    arguments["vector_store_name"],
                    arguments.get("max_results", 5)
                )
                content = format_search_results_for_context(result)

            elif function_name == "read_document_pages":
                result = execute_document_read(
                    self.document_reader,
                    arguments["document_name"],
                    arguments.get("page_numbers")
                )
                content = format_document_content_for_context(result)

            else:
                content = f"Unknown function: {function_name}"

            results.append({
                "tool_call_id": tool_call["id"],
                "content": content
            })

        return results

    def get_available_versions(self) -> Dict[str, List[str]]:
        """Get all available product versions."""
        return self.vector_store_manager.list_available_versions()

    def get_available_models(self) -> Dict[str, Dict]:
        """Get available models and their information."""
        return self.config["models"]
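The cost arithmetic near the end of `query_with_version` (rate per million tokens, rounded to 4 decimal places) can be checked in isolation. The rates below are illustrative placeholders, not actual model prices; real values would come from `openai_config.json`.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_cost_per_m: float, output_cost_per_m: float) -> float:
    """Total request cost: (tokens / 1_000_000) * rate, summed over input and output."""
    input_cost = (input_tokens / 1_000_000) * input_cost_per_m
    output_cost = (output_tokens / 1_000_000) * output_cost_per_m
    return round(input_cost + output_cost, 4)

# Illustrative rates only: 2.50 / 10.00 currency units per million tokens
print(estimate_cost(1_200, 800, 2.50, 10.00))  # 0.011
```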
backend/document_reader.py
ADDED
@@ -0,0 +1,195 @@
"""
Document Reader for page-level document access.
"""

import os
import json
from typing import List, Optional, Dict, Union
from pathlib import Path
import logging

logger = logging.getLogger(__name__)


class DocumentReader:
    def __init__(self, pages_dir: Optional[Path] = None):
        """Initialize the document reader."""
        self.pages_dir = pages_dir or Path(__file__).parent.parent / "pages"
        self.document_index = self._load_document_index()

    def _load_document_index(self) -> Dict:
        """Load the document index if available."""
        index_path = self.pages_dir / "document_index.json"
        if index_path.exists():
            try:
                with open(index_path, 'r') as f:
                    return json.load(f)
            except Exception as e:
                logger.error(f"Error loading document index: {e}")
        return {}

    def _normalize_document_name(self, document_name: str) -> str:
        """Normalize a document name for consistent file matching."""
        # Remove common prefixes/suffixes
        name = document_name.strip()
        name = name.replace(" ", "_")
        name = name.replace(".", "_")

        # Handle different formats
        if not name.endswith(("UserGuide", "InstallationGuide", "QuickStartGuide")):
            # Try to identify the document type
            if "user" in name.lower() and "guide" in name.lower():
                if not name.endswith("UserGuide"):
                    name = name.replace("User_Guide", "UserGuide")
            elif "installation" in name.lower() and "guide" in name.lower():
                if not name.endswith("InstallationGuide"):
                    name = name.replace("Installation_Guide", "InstallationGuide")
            elif "quick" in name.lower() and "start" in name.lower():
                if not name.endswith("QuickStartGuide"):
                    name = name.replace("Quick_Start_Guide", "QuickStartGuide")

        return name

    def get_table_of_contents(self, document_name: str) -> Optional[str]:
        """Get the table of contents for a document."""
        normalized_name = self._normalize_document_name(document_name)
        toc_filename = f"{normalized_name}_TOC.txt"
        toc_path = self.pages_dir / toc_filename

        if not toc_path.exists():
            # Try alternative naming conventions
            alternatives = [
                f"{document_name}_TOC.txt",
                f"{document_name.replace(' ', '_')}_TOC.txt",
                f"{document_name.replace('.', '_')}_TOC.txt"
            ]

            for alt in alternatives:
                alt_path = self.pages_dir / alt
                if alt_path.exists():
                    toc_path = alt_path
                    break

        if toc_path.exists():
            try:
                with open(toc_path, 'r', encoding='utf-8') as f:
                    return f.read()
            except Exception as e:
                logger.error(f"Error reading TOC file {toc_path}: {e}")
                return None

        logger.warning(f"TOC file not found for document: {document_name}")
        return None

    def read_pages(self, document_name: str, page_numbers: Optional[List[int]] = None) -> Union[str, Dict[int, str]]:
        """
        Read specific pages from a document.
        If page_numbers is None, returns the table of contents.
        """
        if page_numbers is None:
            # Return the table of contents
            toc = self.get_table_of_contents(document_name)
            if toc:
                return f"Table of Contents for {document_name}:\n\n{toc}"
            else:
                return f"Table of contents not found for document: {document_name}"

        # Read specific pages
        normalized_name = self._normalize_document_name(document_name)
        pages_content = {}

        for page_num in page_numbers:
            page_filename = f"{normalized_name}_page_{page_num:03d}.txt"
            page_path = self.pages_dir / page_filename

            if not page_path.exists():
                # Try alternative formats
                alternatives = [
                    f"{document_name}_page_{page_num:03d}.txt",
                    f"{document_name.replace(' ', '_')}_page_{page_num:03d}.txt",
                    f"{document_name.replace('.', '_')}_page_{page_num:03d}.txt"
                ]

                for alt in alternatives:
                    alt_path = self.pages_dir / alt
                    if alt_path.exists():
                        page_path = alt_path
                        break

            if page_path.exists():
                try:
                    with open(page_path, 'r', encoding='utf-8') as f:
                        pages_content[page_num] = f.read()
                except Exception as e:
                    logger.error(f"Error reading page {page_num} from {document_name}: {e}")
                    pages_content[page_num] = f"Error reading page {page_num}"
            else:
                pages_content[page_num] = f"Page {page_num} not found"

        # Format the output
        if len(pages_content) == 1:
            page_num = list(pages_content.keys())[0]
            return f"Page {page_num} of {document_name}:\n\n{pages_content[page_num]}"
        else:
            formatted_pages = []
            for page_num in sorted(pages_content.keys()):
                formatted_pages.append(f"=== Page {page_num} ===\n{pages_content[page_num]}")
            return f"Pages from {document_name}:\n\n" + "\n\n".join(formatted_pages)

    def list_available_documents(self) -> List[str]:
        """List all available documents."""
        documents = set()

        # Scan for TOC files
+
for toc_file in self.pages_dir.glob("*_TOC.txt"):
|
| 145 |
+
doc_name = toc_file.stem.replace("_TOC", "")
|
| 146 |
+
documents.add(doc_name)
|
| 147 |
+
|
| 148 |
+
# Also check document index
|
| 149 |
+
if self.document_index:
|
| 150 |
+
documents.update(self.document_index.keys())
|
| 151 |
+
|
| 152 |
+
return sorted(list(documents))
|
| 153 |
+
|
| 154 |
+
def get_document_info(self, document_name: str) -> Dict[str, any]:
|
| 155 |
+
"""Get information about a document (number of pages, etc.)."""
|
| 156 |
+
normalized_name = self._normalize_document_name(document_name)
|
| 157 |
+
info = {
|
| 158 |
+
"name": document_name,
|
| 159 |
+
"normalized_name": normalized_name,
|
| 160 |
+
"has_toc": False,
|
| 161 |
+
"page_count": 0,
|
| 162 |
+
"available_pages": []
|
| 163 |
+
}
|
| 164 |
+
|
| 165 |
+
# Check for TOC
|
| 166 |
+
toc_path = self.pages_dir / f"{normalized_name}_TOC.txt"
|
| 167 |
+
info["has_toc"] = toc_path.exists()
|
| 168 |
+
|
| 169 |
+
# Count pages
|
| 170 |
+
page_pattern = f"{normalized_name}_page_*.txt"
|
| 171 |
+
page_files = list(self.pages_dir.glob(page_pattern))
|
| 172 |
+
|
| 173 |
+
if not page_files:
|
| 174 |
+
# Try alternative patterns
|
| 175 |
+
for alt_pattern in [f"{document_name}_page_*.txt",
|
| 176 |
+
f"{document_name.replace(' ', '_')}_page_*.txt"]:
|
| 177 |
+
page_files = list(self.pages_dir.glob(alt_pattern))
|
| 178 |
+
if page_files:
|
| 179 |
+
break
|
| 180 |
+
|
| 181 |
+
if page_files:
|
| 182 |
+
page_numbers = []
|
| 183 |
+
for page_file in page_files:
|
| 184 |
+
try:
|
| 185 |
+
# Extract page number from filename
|
| 186 |
+
page_num_str = page_file.stem.split("_page_")[-1]
|
| 187 |
+
page_num = int(page_num_str)
|
| 188 |
+
page_numbers.append(page_num)
|
| 189 |
+
except:
|
| 190 |
+
pass
|
| 191 |
+
|
| 192 |
+
info["page_count"] = len(page_numbers)
|
| 193 |
+
info["available_pages"] = sorted(page_numbers)
|
| 194 |
+
|
| 195 |
+
return info
|
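A minimal sketch of the page-file naming convention that `read_pages()` and `get_document_info()` rely on: pages are stored as `<normalized name>_page_NNN.txt`, with the page number zero-padded to three digits. The document name below is a hypothetical example.

```python
# Page files follow "<normalized name>_page_NNN.txt" with a 3-digit,
# zero-padded page number, matching the f-string in read_pages().
def page_filename(normalized_name: str, page_num: int) -> str:
    return f"{normalized_name}_page_{page_num:03d}.txt"

print(page_filename("Harmony_UserGuide", 7))    # Harmony_UserGuide_page_007.txt
print(page_filename("Harmony_UserGuide", 123))  # Harmony_UserGuide_page_123.txt
```

Because the padding is fixed at three digits, a plain lexicographic sort of the filenames also sorts pages numerically for documents up to 999 pages.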
backend/test_pdf_mapping.py
ADDED
|
@@ -0,0 +1,32 @@
#!/usr/bin/env python3
"""Test script to verify PDF mapping before uploading."""

from upload_versioned_pdfs import VectorStoreUploader
from pathlib import Path

def main():
    """Test PDF file detection and mapping."""
    uploader = VectorStoreUploader()

    print("PDF Directory:", uploader.pdf_directory)
    print("Directory exists:", uploader.pdf_directory.exists())
    print()

    if uploader.pdf_directory.exists():
        all_pdfs = list(uploader.pdf_directory.glob("*.pdf"))
        print(f"Total PDFs found: {len(all_pdfs)}")
        print("\nAll PDF files:")
        for pdf in sorted(all_pdfs):
            print(f"  - {pdf.name}")
        print()

    pdf_mapping = uploader.get_pdf_files()

    print("\nPDF Mapping by Version:")
    for store_name, pdf_files in pdf_mapping.items():
        print(f"\n{store_name}: ({len(pdf_files)} files)")
        for pdf in pdf_files:
            print(f"  - {pdf.name}")

if __name__ == "__main__":
    main()
backend/upload_versioned_pdfs.py
ADDED
|
@@ -0,0 +1,239 @@
#!/usr/bin/env python3
"""
Upload versioned PDFs to separate OpenAI vector stores.
Creates one vector store per version and one general/FAQ store.
"""

import os
import json
import time
import sys
from pathlib import Path
from typing import Dict, List, Optional
from openai import OpenAI, __version__ as openai_version
from datetime import datetime
from packaging import version


class VectorStoreUploader:
    def __init__(self, api_key: Optional[str] = None, skip_empty: bool = True):
        """Initialize the uploader with OpenAI client.

        Args:
            api_key: OpenAI API key
            skip_empty: Skip creation of empty vector stores
        """
        # Check OpenAI version
        if version.parse(openai_version) < version.parse("1.50.0"):
            print(f"Error: OpenAI library version {openai_version} is too old.")
            print("Vector stores require version 1.50.0 or higher.")
            print("Please run: pip install --upgrade openai>=1.50.0")
            sys.exit(1)

        self.client = OpenAI(api_key=api_key or os.getenv("OPENAI_API_KEY"))
        self.config_path = Path(__file__).parent.parent / "config" / "vector_stores.json"
        self.pdf_directory = Path("/Users/jsv/Work/ataya/concert-master/pdfs")
        self.skip_empty = skip_empty

    def create_vector_store(self, name: str, description: str) -> str:
        """Create a new vector store and return its ID."""
        print(f"Creating vector store: {name}")
        # Note: description parameter no longer supported in API
        vector_store = self.client.vector_stores.create(
            name=name
        )
        return vector_store.id

    def upload_file_to_store(self, vector_store_id: str, file_path: Path) -> str:
        """Upload a file to a vector store."""
        print(f"  Uploading {file_path.name}...")

        # Upload file
        with open(file_path, "rb") as file:
            file_upload = self.client.files.create(
                file=file,
                purpose="assistants"
            )

        # Add file to vector store
        self.client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=file_upload.id
        )

        # Wait for processing
        while True:
            file_status = self.client.vector_stores.files.retrieve(
                vector_store_id=vector_store_id,
                file_id=file_upload.id
            )
            if file_status.status == "completed":
                print(f"  ✓ {file_path.name} processed successfully")
                break
            elif file_status.status == "failed":
                print(f"  ✗ {file_path.name} failed to process")
                break
            time.sleep(2)

        return file_upload.id

    def get_pdf_files(self) -> Dict[str, List[Path]]:
        """Organize PDF files by version."""
        pdf_mapping = {
            "harmony_1_2": [],
            "harmony_1_5": [],
            "harmony_1_6": [],
            "harmony_1_8": [],
            "chorus_1_1": [],
            "general_faq": []
        }

        if not self.pdf_directory.exists():
            print(f"PDF directory not found: {self.pdf_directory}")
            return pdf_mapping

        # Map file patterns to versions
        for pdf_file in self.pdf_directory.glob("*.pdf"):
            filename = pdf_file.name.lower()

            # Check for Harmony versions
            if "harmony" in filename:
                if "1.2" in filename or "r1.2" in filename:
                    pdf_mapping["harmony_1_2"].append(pdf_file)
                elif "1.5" in filename or "r1.5" in filename:
                    pdf_mapping["harmony_1_5"].append(pdf_file)
                elif "1.6" in filename or "r1.6" in filename:
                    pdf_mapping["harmony_1_6"].append(pdf_file)
                elif "1.8" in filename or "r1.8" in filename:
                    pdf_mapping["harmony_1_8"].append(pdf_file)

            # Check for Chorus versions
            elif "chorus" in filename:
                if "1.1" in filename or "r1.1" in filename:
                    pdf_mapping["chorus_1_1"].append(pdf_file)

            # General/FAQ documents
            elif any(keyword in filename for keyword in ["faq", "general", "overview", "comparison"]):
                pdf_mapping["general_faq"].append(pdf_file)

        return pdf_mapping

    def upload_all_pdfs(self):
        """Create vector stores and upload all PDFs."""
        pdf_mapping = self.get_pdf_files()
        vector_stores = {}
        descriptions = {}

        # Create vector stores and upload files
        for store_name, pdf_files in pdf_mapping.items():
            if not pdf_files:
                if self.skip_empty:
                    print(f"\nNo PDFs found for {store_name}, skipping...")
                    continue
                else:
                    print(f"\nNo PDFs found for {store_name}, but creating empty store...")

            # Create descriptive name and description
            if store_name == "general_faq":
                name = "General FAQ and Overview"
                description = "General information, FAQs, and cross-version content"
            else:
                # "ver" avoids shadowing the packaging "version" module imported above
                product, ver = store_name.split("_", 1)
                version_display = ver.replace("_", ".")
                name = f"{product.capitalize()} {version_display}"
                description = f"Documentation for {product.capitalize()} version {version_display}"

            # Create vector store
            vector_store_id = self.create_vector_store(name, description)
            vector_stores[store_name] = vector_store_id
            descriptions[store_name] = description

            # Upload files
            print(f"\nUploading {len(pdf_files)} files to {name}:")
            for pdf_file in pdf_files:
                self.upload_file_to_store(vector_store_id, pdf_file)

        # Save configuration
        self.save_config(vector_stores, descriptions)

        return vector_stores

    def save_config(self, vector_stores: Dict[str, str], descriptions: Dict[str, str]):
        """Save vector store configuration."""
        config = {
            "vector_stores": vector_stores,
            "descriptions": descriptions,
            "latest_versions": {
                "harmony": "1.8",
                "chorus": "1.1"
            },
            "created_at": datetime.now().isoformat(),
            "chunk_size": 1000,
            "max_chunks": 10
        }

        # Ensure config directory exists
        self.config_path.parent.mkdir(parents=True, exist_ok=True)

        # Save configuration
        with open(self.config_path, "w") as f:
            json.dump(config, f, indent=2)

        print(f"\nConfiguration saved to: {self.config_path}")
        print(json.dumps(config, indent=2))


def main():
    """Main function to run the upload process."""
    import argparse

    parser = argparse.ArgumentParser(description="Upload PDFs to OpenAI vector stores")
    parser.add_argument(
        "--create-empty",
        action="store_true",
        help="Create empty vector stores even if no PDFs are found"
    )
    parser.add_argument(
        "--no-confirm",
        action="store_true",
        help="Skip confirmation prompt"
    )
    args = parser.parse_args()

    print("OpenAI Chatbot MCP - Vector Store Setup")
    print("=" * 50)

    # Check for API key
    if not os.getenv("OPENAI_API_KEY"):
        print("Error: OPENAI_API_KEY environment variable not set")
        return

    # Create uploader and run
    uploader = VectorStoreUploader(skip_empty=not args.create_empty)

    # First, let's check what PDFs we have
    print("\nScanning for PDF files...")
    pdf_mapping = uploader.get_pdf_files()

    print("\nFound PDFs:")
    for store_name, pdf_files in pdf_mapping.items():
        print(f"\n{store_name}:")
        for pdf in pdf_files:
            print(f"  - {pdf.name}")

    # Confirm before proceeding
    if not args.no_confirm:
        response = input("\nProceed with vector store creation? (yes/no): ")
        if response.lower() != "yes":
            print("Aborted.")
            return

    # Upload all PDFs
    vector_stores = uploader.upload_all_pdfs()

    print("\n✅ Vector store setup complete!")
    print(f"Created {len(vector_stores)} vector stores")


if __name__ == "__main__":
    main()
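The filename-to-store mapping in `get_pdf_files()` can be sketched as a standalone function, so the classification logic is testable without an OpenAI client or a real PDF directory. The filenames below are hypothetical examples, not files from this repository.

```python
from typing import Optional

# Mirrors the branch order in get_pdf_files(): Harmony versions first,
# then Chorus, then general/FAQ keywords; unmatched files return None.
def classify_pdf(filename: str) -> Optional[str]:
    name = filename.lower()
    if "harmony" in name:
        for ver in ("1.2", "1.5", "1.6", "1.8"):
            if ver in name or f"r{ver}" in name:
                return f"harmony_{ver.replace('.', '_')}"
    elif "chorus" in name:
        if "1.1" in name or "r1.1" in name:
            return "chorus_1_1"
    elif any(k in name for k in ("faq", "general", "overview", "comparison")):
        return "general_faq"
    return None

print(classify_pdf("Harmony_R1.8_UserGuide.pdf"))  # harmony_1_8
print(classify_pdf("Product_FAQ.pdf"))             # general_faq
```

Note that because the `elif` chain checks product names first, a file named `Harmony_FAQ.pdf` lands in a Harmony store (or nowhere, if no version token matches) rather than in `general_faq`.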
backend/vector_store_manager.py
ADDED
|
@@ -0,0 +1,178 @@
"""
Vector Store Manager for handling multiple version-specific vector stores.
"""

import os
import json
from typing import Dict, List, Optional, Tuple
from pathlib import Path
from openai import OpenAI
import logging

logger = logging.getLogger(__name__)


class VectorStoreManager:
    def __init__(self, client: OpenAI, config_path: Optional[Path] = None):
        """Initialize the vector store manager."""
        self.client = client
        self.config_path = config_path or Path(__file__).parent.parent / "config" / "vector_stores.json"
        self.vector_stores = {}
        self.latest_versions = {}
        self.load_config()

    def load_config(self):
        """Load vector store configuration from file."""
        if not self.config_path.exists():
            logger.warning(f"Vector store config not found at {self.config_path}")
            return

        try:
            with open(self.config_path, 'r') as f:
                config = json.load(f)
            self.vector_stores = config.get('vector_stores', {})
            self.latest_versions = config.get('latest_versions', {})
            logger.info(f"Loaded {len(self.vector_stores)} vector stores from config")
        except Exception as e:
            logger.error(f"Error loading vector store config: {e}")

    def get_store_name_from_version(self, product: str, version: str) -> str:
        """Convert product and version to store name."""
        # Normalize version (e.g., "1.8" -> "1_8")
        version_normalized = version.replace(".", "_")
        return f"{product.lower()}_{version_normalized}"

    def get_vector_store_id(self, store_name: str) -> Optional[str]:
        """Get vector store ID by name."""
        return self.vector_stores.get(store_name)

    def query_vector_store(self, store_name: str, query: str, max_results: int = 5) -> List[Dict]:
        """Query a specific vector store."""
        store_id = self.get_vector_store_id(store_name)
        if not store_id:
            logger.warning(f"Vector store '{store_name}' not found")
            return []

        # Check for placeholder IDs
        if store_id.startswith("vs_PLACEHOLDER"):
            logger.warning(f"Vector store '{store_name}' has placeholder ID: {store_id}")
            logger.warning("Please run upload_versioned_pdfs.py to create actual vector stores")
            return []

        try:
            # Create a thread for the query
            thread = self.client.beta.threads.create()

            # Add the query as a message
            self.client.beta.threads.messages.create(
                thread_id=thread.id,
                role="user",
                content=query
            )

            # Run the assistant with the specific vector store
            run = self.client.beta.threads.runs.create_and_poll(
                thread_id=thread.id,
                assistant_id="asst_temp",  # This will be replaced with actual assistant ID
                tools=[{"type": "file_search"}],
                tool_resources={
                    "file_search": {
                        "vector_store_ids": [store_id]
                    }
                }
            )

            # Get the messages
            messages = self.client.beta.threads.messages.list(
                thread_id=thread.id,
                order="asc"
            )

            # Extract search results
            results = []
            for message in messages:
                if message.role == "assistant":
                    for content in message.content:
                        if content.type == "text":
                            # Parse file search annotations
                            annotations = content.text.annotations
                            for annotation in annotations:
                                if annotation.type == "file_citation":
                                    results.append({
                                        "text": annotation.text,
                                        "file_id": annotation.file_citation.file_id,
                                        "quote": annotation.file_citation.quote
                                    })

            return results[:max_results]

        except Exception as e:
            logger.error(f"Error querying vector store '{store_name}': {e}")
            return []

    def query_version_and_general(self, product: str, version: str, query: str, max_results: int = 5) -> Tuple[List[Dict], List[Dict]]:
        """Query both version-specific and general vector stores."""
        # Query version-specific store
        store_name = self.get_store_name_from_version(product, version)
        version_results = self.query_vector_store(store_name, query, max_results)

        # Query general/FAQ store
        general_results = self.query_vector_store("general_faq", query, max_results)

        return version_results, general_results

    def search_across_stores(self, query: str, store_names: Optional[List[str]] = None, max_results_per_store: int = 3) -> Dict[str, List[Dict]]:
        """Search across multiple vector stores."""
        if store_names is None:
            store_names = list(self.vector_stores.keys())

        results = {}
        for store_name in store_names:
            if store_name in self.vector_stores:
                store_results = self.query_vector_store(store_name, query, max_results_per_store)
                if store_results:
                    results[store_name] = store_results

        return results

    def get_latest_version(self, product: str) -> Optional[str]:
        """Get the latest version for a product."""
        return self.latest_versions.get(product.lower())

    def list_available_versions(self) -> Dict[str, List[str]]:
        """List all available product versions."""
        versions = {"harmony": [], "chorus": []}

        for store_name in self.vector_stores.keys():
            if store_name == "general_faq":
                continue

            parts = store_name.split("_", 1)
            if len(parts) == 2:
                product, version = parts
                version_display = version.replace("_", ".")
                if product in versions:
                    versions[product].append(version_display)

        # Sort versions
        for product in versions:
            versions[product].sort(key=lambda x: [int(p) for p in x.split(".")])

        return versions

    def format_search_results(self, version_results: List[Dict], general_results: List[Dict], product: str, version: str) -> str:
        """Format search results for appending to user query."""
        formatted = []

        if version_results:
            formatted.append(f"Based on {product.capitalize()} {version} documentation:")
            for i, result in enumerate(version_results, 1):
                formatted.append(f"{i}. {result.get('quote', result.get('text', ''))}")
            formatted.append("")

        if general_results:
            formatted.append("Additional general information:")
            for i, result in enumerate(general_results, 1):
                formatted.append(f"{i}. {result.get('quote', result.get('text', ''))}")

        return "\n".join(formatted)
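Two conventions from `VectorStoreManager` can be sketched in isolation: store names are `<product>_<version with dots replaced by underscores>` (see `get_store_name_from_version`), and `list_available_versions` sorts versions numerically per component rather than lexicographically. The "1.10" value below is a hypothetical version used only to show why the numeric sort matters.

```python
# Store-name convention: lowercase product, dots in the version become underscores.
def store_name(product: str, version: str) -> str:
    return f"{product.lower()}_{version.replace('.', '_')}"

# Per-component numeric sort, as in list_available_versions():
# a plain string sort would put "1.10" before "1.2".
versions = ["1.8", "1.2", "1.10", "1.5"]
versions.sort(key=lambda v: [int(p) for p in v.split(".")])

print(store_name("Harmony", "1.8"))  # harmony_1_8
print(versions)                      # ['1.2', '1.5', '1.8', '1.10']
```

This is also why the sort key raises `ValueError` on non-numeric version parts, so store names must keep the strict `product_major_minor` shape.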