Julian Vanecek committed
Commit 3151bfa
Parent(s): 6edaf19
init
- backend/FAQ_MANAGEMENT.md +77 -0
- backend/IMPORTANT_API_CHANGES.md +43 -0
- backend/__init__.py +1 -0
- backend/add_to_vector_store.py +195 -0
- backend/chatbot_backend.py +274 -0
- backend/document_reader.py +195 -0
- backend/test_pdf_mapping.py +32 -0
- backend/upload_versioned_pdfs.py +239 -0
- backend/vector_store_manager.py +178 -0
backend/FAQ_MANAGEMENT.md
ADDED
@@ -0,0 +1,77 @@
# FAQ Management Guide

This guide explains how to manage FAQ documents in the OpenAI Chatbot MCP system.

## Initial Setup (Without FAQ Documents)

1. **Upgrade the OpenAI library**:
   ```bash
   pip install --upgrade "openai>=1.50.0"
   ```

2. **Create vector stores (skipping the empty FAQ store)**:
   ```bash
   python backend/upload_versioned_pdfs.py
   ```
   This will:
   - Create vector stores for all versions with PDFs
   - Skip the general_faq store, since no FAQ documents exist yet
   - Save the configuration with the actual vector store IDs

## Adding FAQ Documents Later

### Option 1: Add to an Existing FAQ Store

If you created an empty FAQ store:
```bash
# Add a single FAQ document
python backend/add_to_vector_store.py add general_faq /path/to/faq.pdf

# Add multiple FAQ documents
python backend/add_to_vector_store.py add general_faq /path/to/faq1.pdf /path/to/faq2.pdf
```

### Option 2: Create the FAQ Store First

If you skipped the FAQ store initially:
```bash
# Create the FAQ store
python backend/add_to_vector_store.py create general_faq \
    --name "General FAQ and Overview" \
    --description "General information, FAQs, and cross-version content"

# Then add documents
python backend/add_to_vector_store.py add general_faq /path/to/faq.pdf
```

## Listing Available Stores

To see all configured vector stores:
```bash
python backend/add_to_vector_store.py list
```

## FAQ Document Naming

For automatic detection in future runs, name FAQ documents with one of these keywords:
- `faq` - e.g., `product_faq.pdf`
- `general` - e.g., `general_overview.pdf`
- `overview` - e.g., `platform_overview.pdf`
- `comparison` - e.g., `version_comparison.pdf`

## Full Re-upload with FAQ

Once you have FAQ documents in the `/pdfs` directory:
```bash
# This will detect and upload FAQ documents automatically
python backend/upload_versioned_pdfs.py
```

## Forcing Empty Store Creation

To create all stores, including empty ones:
```bash
python backend/upload_versioned_pdfs.py --create-empty
```

This is useful if you want every store ready even before documents exist.
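The keyword detection described above can be sketched as a simple substring check on the filename. This is an illustrative helper with a hypothetical name; the actual detection logic lives in `upload_versioned_pdfs.py` and may differ in detail.

```python
from pathlib import Path

# Keywords that mark a PDF as a general/FAQ document (see "FAQ Document Naming")
FAQ_KEYWORDS = ("faq", "general", "overview", "comparison")

def is_faq_document(pdf_path: Path) -> bool:
    """Return True if the filename suggests a general/FAQ document."""
    name = pdf_path.stem.lower()
    return any(keyword in name for keyword in FAQ_KEYWORDS)

print(is_faq_document(Path("product_faq.pdf")))      # True
print(is_faq_document(Path("v2_installation.pdf")))  # False
```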
backend/IMPORTANT_API_CHANGES.md
ADDED
@@ -0,0 +1,43 @@
# Important: OpenAI API Changes

## Vector Stores API Location

As of OpenAI Python SDK 1.93.x, the vector stores API has moved out of the beta namespace:

- **OLD**: `client.beta.vector_stores`
- **NEW**: `client.vector_stores`

## How Vector Stores Work

Vector stores are designed to work with the Assistants API:

1. **Create vector stores**: `client.vector_stores.create()`
2. **Upload files to stores**: `client.vector_stores.files.create()`
3. **Use with assistants**: Vector stores are queried through assistants via the `file_search` tool

## The Architecture

```
Vector Stores (storage) -> Assistants (query interface) -> Threads (conversations)
```

## Current Implementation Status

1. **upload_versioned_pdfs.py**: ✅ Fixed to use `client.vector_stores`
2. **add_to_vector_store.py**: ✅ Fixed to use `client.vector_stores`
3. **vector_store_manager.py**: ❌ Needs assistant creation for querying

## Next Steps

To properly use vector stores for querying, you need to:

1. Create an assistant with the `file_search` capability
2. Attach vector stores to the assistant
3. Use threads to query the assistant

Alternative approach:
- Use the OpenAI embeddings API directly
- Store embeddings in a local database
- Implement your own similarity search

This would avoid the complexity of the Assistants API but would require more implementation work.
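The similarity-search half of that alternative can be sketched in pure Python. In practice the vectors would come from the embeddings API (e.g. `client.embeddings.create(...)`); the function names and the tiny example vectors below are illustrative, not part of this codebase.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=3):
    """Rank stored (doc_id, vector) pairs by similarity to the query vector."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

docs = [("intro", [0.9, 0.1]), ("install", [0.1, 0.9])]
print(top_k([1.0, 0.0], docs, k=1)[0][0])  # intro
```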
backend/__init__.py
ADDED
@@ -0,0 +1 @@
# Backend package
backend/add_to_vector_store.py
ADDED
@@ -0,0 +1,195 @@
#!/usr/bin/env python3
"""
Add documents to existing OpenAI vector stores.
Useful for adding FAQ documents or updating existing stores.
"""

import os
import json
import time
import argparse
import sys
from pathlib import Path
from typing import List, Optional

from openai import OpenAI, __version__ as openai_version
from packaging import version


class VectorStoreUpdater:
    def __init__(self, api_key: Optional[str] = None):
        """Initialize the updater with an OpenAI client."""
        # Check the OpenAI library version
        if version.parse(openai_version) < version.parse("1.50.0"):
            print(f"Error: OpenAI library version {openai_version} is too old.")
            print("Vector stores require version 1.50.0 or higher.")
            print('Please run: pip install --upgrade "openai>=1.50.0"')
            sys.exit(1)

        self.client = OpenAI(api_key=api_key or os.getenv("OPENAI_API_KEY"))
        self.config_path = Path(__file__).parent.parent / "config" / "vector_stores.json"
        self.load_config()

    def load_config(self):
        """Load the vector store configuration."""
        if not self.config_path.exists():
            print(f"Error: Configuration file not found at {self.config_path}")
            print("Please run upload_versioned_pdfs.py first to create vector stores.")
            sys.exit(1)

        with open(self.config_path, 'r') as f:
            self.config = json.load(f)
        self.vector_stores = self.config.get('vector_stores', {})

    def list_stores(self):
        """List all available vector stores."""
        print("\nAvailable vector stores:")
        for store_name, store_id in self.vector_stores.items():
            print(f"  - {store_name}: {store_id}")

    def add_file_to_store(self, store_name: str, file_path: Path) -> bool:
        """Add a file to an existing vector store."""
        if store_name not in self.vector_stores:
            print(f"Error: Vector store '{store_name}' not found.")
            self.list_stores()
            return False

        store_id = self.vector_stores[store_name]
        print(f"Adding {file_path.name} to {store_name} ({store_id})...")

        try:
            # Upload the file
            with open(file_path, "rb") as file:
                file_upload = self.client.files.create(
                    file=file,
                    purpose="assistants"
                )

            # Attach the file to the vector store
            self.client.vector_stores.files.create(
                vector_store_id=store_id,
                file_id=file_upload.id
            )

            # Wait for processing
            while True:
                file_status = self.client.vector_stores.files.retrieve(
                    vector_store_id=store_id,
                    file_id=file_upload.id
                )
                if file_status.status == "completed":
                    print(f"✓ Successfully added {file_path.name}")
                    return True
                elif file_status.status == "failed":
                    print(f"✗ Failed to process {file_path.name}")
                    return False
                time.sleep(2)

        except Exception as e:
            print(f"✗ Error adding file: {str(e)}")
            return False

    def add_multiple_files(self, store_name: str, file_paths: List[Path]):
        """Add multiple files to a vector store."""
        if not file_paths:
            print("No files to add.")
            return

        print(f"\nAdding {len(file_paths)} files to {store_name}...")
        success_count = 0

        for file_path in file_paths:
            if self.add_file_to_store(store_name, file_path):
                success_count += 1

        print(f"\n✓ Successfully added {success_count}/{len(file_paths)} files")

    def create_empty_store(self, store_name: str, name: str, description: str) -> Optional[str]:
        """Create a new empty vector store."""
        if store_name in self.vector_stores:
            print(f"Error: Vector store '{store_name}' already exists.")
            return None

        print(f"Creating new vector store: {name}")
        # Note: the description parameter is no longer supported by the API,
        # so the description is stored in the local config instead
        try:
            vector_store = self.client.vector_stores.create(
                name=name
            )

            # Update the config
            self.vector_stores[store_name] = vector_store.id
            self.config['vector_stores'] = self.vector_stores

            # Store the description in the config since the API no longer supports it
            if 'descriptions' not in self.config:
                self.config['descriptions'] = {}
            self.config['descriptions'][store_name] = description

            with open(self.config_path, 'w') as f:
                json.dump(self.config, f, indent=2)

            print(f"✓ Created vector store: {store_name} ({vector_store.id})")
            return vector_store.id

        except Exception as e:
            print(f"✗ Error creating vector store: {str(e)}")
            return None


def main():
    """Main entry point."""
    parser = argparse.ArgumentParser(description="Add documents to OpenAI vector stores")

    subparsers = parser.add_subparsers(dest='command', help='Commands')

    # List command
    subparsers.add_parser('list', help='List available vector stores')

    # Add command
    add_parser = subparsers.add_parser('add', help='Add files to a vector store')
    add_parser.add_argument('store_name', help='Name of the vector store (e.g., general_faq)')
    add_parser.add_argument('files', nargs='+', help='Files to add')

    # Create command
    create_parser = subparsers.add_parser('create', help='Create a new empty vector store')
    create_parser.add_argument('store_name', help='Internal name (e.g., general_faq)')
    create_parser.add_argument('--name', required=True, help='Display name')
    create_parser.add_argument('--description', required=True, help='Description')

    args = parser.parse_args()

    if not args.command:
        parser.print_help()
        return

    # Check for the API key
    if not os.getenv("OPENAI_API_KEY"):
        print("Error: OPENAI_API_KEY environment variable not set")
        return

    updater = VectorStoreUpdater()

    if args.command == 'list':
        updater.list_stores()

    elif args.command == 'add':
        # Resolve and validate file paths
        file_paths = []
        for file_arg in args.files:
            file_path = Path(file_arg)
            if not file_path.exists():
                print(f"Warning: File not found: {file_path}")
            else:
                file_paths.append(file_path)

        if file_paths:
            updater.add_multiple_files(args.store_name, file_paths)
        else:
            print("No valid files to add.")

    elif args.command == 'create':
        updater.create_empty_store(args.store_name, args.name, args.description)


if __name__ == "__main__":
    main()
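The `while True` status loop in `add_file_to_store` polls every 2 seconds with no upper bound, so a stuck file would hang the script. A bounded polling helper along these lines could cap the wait; the name `poll_until` and the parameter choices are hypothetical, not part of this codebase.

```python
import time

def poll_until(check, interval=2.0, timeout=120.0):
    """Call check() every `interval` seconds until it returns a non-None
    terminal value, or raise TimeoutError once `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError("polling timed out")

# Simulated status endpoint that completes on the third call
calls = []
def fake_status():
    calls.append(1)
    return "completed" if len(calls) >= 3 else None

print(poll_until(fake_status, interval=0.01))  # completed
```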
backend/chatbot_backend.py
ADDED
@@ -0,0 +1,274 @@
"""
OpenAI Chatbot Backend with Multi-Vector Store Support and MCP-style Tools
"""

import os
import json
import time
import logging
from typing import Dict, List, Optional, Tuple, Generator
from pathlib import Path
from openai import OpenAI
import tiktoken

from .vector_store_manager import VectorStoreManager
from .document_reader import DocumentReader
from ..tools.vector_search_tool import (
    get_vector_search_tool_definition,
    execute_vector_search,
    format_search_results_for_context
)
from ..tools.document_reader_tool import (
    get_document_reader_tool_definition,
    execute_document_read,
    format_document_content_for_context
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class ChatbotBackend:
    def __init__(self, api_key: Optional[str] = None):
        """Initialize the chatbot backend."""
        self.client = OpenAI(api_key=api_key or os.getenv("OPENAI_API_KEY"))
        self.vector_store_manager = VectorStoreManager(self.client)
        self.document_reader = DocumentReader()

        # Load configuration
        config_path = Path(__file__).parent.parent / "config" / "openai_config.json"
        with open(config_path, 'r') as f:
            self.config = json.load(f)

        # Initialize the tokenizer for token counting
        self.encoding = tiktoken.encoding_for_model("gpt-4o")

        # Define available tools
        self.tools = [
            get_vector_search_tool_definition(),
            get_document_reader_tool_definition()
        ]

    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.encoding.encode(text))

    def query_with_version(self, query: str, product: str, version: str,
                           custom_prompt: Optional[str] = None,
                           model: str = "gpt-4o",
                           temperature: float = 0.7,
                           max_tokens: int = 4000) -> Generator[Dict, None, None]:
        """
        Query the chatbot with automatic version-specific and general context.
        Yields streaming responses.
        """
        start_time = time.time()

        # Query both the version-specific and the general vector stores
        version_results, general_results = self.vector_store_manager.query_version_and_general(
            product, version, query, max_results=self.config.get("max_chunks", 10)
        )

        # Format context from the vector store results
        context = self.vector_store_manager.format_search_results(
            version_results, general_results, product, version
        )

        # Build the enhanced query
        enhanced_query = f"{context}\n\nUser Question: {query}"

        # Prepend the custom prompt if provided
        if custom_prompt:
            enhanced_query = f"{custom_prompt}\n\n{enhanced_query}"

        # Create messages
        messages = [
            {
                "role": "system",
                "content": (
                    f"You are an expert assistant for {product.capitalize()} version {version}. "
                    "You have access to version-specific documentation and general information. "
                    "You can use the provided tools to search for more information or read specific document pages. "
                    "Always provide accurate, version-specific answers based on the documentation."
                )
            },
            {"role": "user", "content": enhanced_query}
        ]

        # Count input tokens
        input_tokens = sum(self.count_tokens(msg["content"]) for msg in messages)

        # Stream the response with function calling
        try:
            stream = self.client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
                stream=True,
                tools=self.tools,
                tool_choice="auto"
            )

            # Track usage
            output_tokens = 0
            full_response = ""
            tool_calls = []
            current_tool_call = None

            for chunk in stream:
                delta = chunk.choices[0].delta

                # Handle tool calls
                if delta.tool_calls:
                    for tool_call_delta in delta.tool_calls:
                        if tool_call_delta.id:
                            # New tool call
                            if current_tool_call:
                                tool_calls.append(current_tool_call)
                            current_tool_call = {
                                "id": tool_call_delta.id,
                                "type": "function",
                                "function": {
                                    "name": tool_call_delta.function.name if tool_call_delta.function else "",
                                    "arguments": ""
                                }
                            }

                        if tool_call_delta.function and tool_call_delta.function.arguments:
                            current_tool_call["function"]["arguments"] += tool_call_delta.function.arguments

                # Handle regular content
                if delta.content:
                    output_tokens += self.count_tokens(delta.content)
                    full_response += delta.content

                    yield {
                        "type": "content",
                        "content": delta.content,
                        "done": False
                    }

                # Check whether the stream finished with tool calls
                if chunk.choices[0].finish_reason == "tool_calls":
                    # Add the last tool call
                    if current_tool_call:
                        tool_calls.append(current_tool_call)

                    # Execute the tool calls
                    tool_results = self._execute_tool_calls(tool_calls)

                    # Continue the conversation with the tool results
                    messages.append({
                        "role": "assistant",
                        "content": full_response,
                        "tool_calls": tool_calls
                    })

                    for tool_result in tool_results:
                        messages.append({
                            "role": "tool",
                            "tool_call_id": tool_result["tool_call_id"],
                            "content": tool_result["content"]
                        })

                    # Get the follow-up response
                    follow_up_stream = self.client.chat.completions.create(
                        model=model,
                        messages=messages,
                        temperature=temperature,
                        max_tokens=max_tokens,
                        stream=True
                    )

                    for follow_up_chunk in follow_up_stream:
                        if follow_up_chunk.choices[0].delta.content:
                            content = follow_up_chunk.choices[0].delta.content
                            output_tokens += self.count_tokens(content)
                            full_response += content

                            yield {
                                "type": "content",
                                "content": content,
                                "done": False
                            }

            # Calculate final metrics
            end_time = time.time()
            response_time = end_time - start_time

            # Calculate costs
            model_info = self.config["models"].get(model, {})
            input_cost = (input_tokens / 1_000_000) * model_info.get("input_cost", 0)
            output_cost = (output_tokens / 1_000_000) * model_info.get("output_cost", 0)
            total_cost = input_cost + output_cost

            # Yield final metadata
            yield {
                "type": "metadata",
                "done": True,
                "usage": {
                    "input_tokens": input_tokens,
                    "output_tokens": output_tokens,
                    "total_tokens": input_tokens + output_tokens
                },
                "cost": {
                    "input": round(input_cost, 4),
                    "output": round(output_cost, 4),
                    "total": round(total_cost, 4)
                },
                "response_time": round(response_time, 2),
                "model": model,
                "version_context": f"{product.capitalize()} {version}"
            }

        except Exception as e:
            logger.error(f"Error in chat completion: {str(e)}")
            yield {
                "type": "error",
                "error": str(e),
                "done": True
            }

    def _execute_tool_calls(self, tool_calls: List[Dict]) -> List[Dict]:
        """Execute tool calls and return their results."""
        results = []

        for tool_call in tool_calls:
            function_name = tool_call["function"]["name"]
            arguments = json.loads(tool_call["function"]["arguments"])

            if function_name == "search_vector_store":
                result = execute_vector_search(
                    self.vector_store_manager,
                    arguments["query"],
                    arguments["vector_store_name"],
                    arguments.get("max_results", 5)
                )
                content = format_search_results_for_context(result)

            elif function_name == "read_document_pages":
                result = execute_document_read(
                    self.document_reader,
                    arguments["document_name"],
                    arguments.get("page_numbers")
                )
                content = format_document_content_for_context(result)

            else:
                content = f"Unknown function: {function_name}"

            results.append({
                "tool_call_id": tool_call["id"],
                "content": content
            })

        return results

    def get_available_versions(self) -> Dict[str, List[str]]:
        """Get all available product versions."""
        return self.vector_store_manager.list_available_versions()

    def get_available_models(self) -> Dict[str, Dict]:
        """Get available models and their information."""
        return self.config["models"]
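The cost arithmetic near the end of `query_with_version` (rate per million tokens, rounded to 4 decimal places) can be checked in isolation. The rates below are illustrative placeholders, not actual model prices; real values would come from `openai_config.json`.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_cost_per_m: float, output_cost_per_m: float) -> float:
    """Total request cost: (tokens / 1_000_000) * rate, summed over input and output."""
    input_cost = (input_tokens / 1_000_000) * input_cost_per_m
    output_cost = (output_tokens / 1_000_000) * output_cost_per_m
    return round(input_cost + output_cost, 4)

# Illustrative rates only: 2.50 / 10.00 currency units per million tokens
print(estimate_cost(1_200, 800, 2.50, 10.00))  # 0.011
```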
backend/document_reader.py
ADDED
@@ -0,0 +1,195 @@
"""
Document Reader for page-level document access.
"""

import os
import json
from typing import List, Optional, Dict, Union
from pathlib import Path
import logging

logger = logging.getLogger(__name__)


class DocumentReader:
    def __init__(self, pages_dir: Optional[Path] = None):
        """Initialize the document reader."""
        self.pages_dir = pages_dir or Path(__file__).parent.parent / "pages"
        self.document_index = self._load_document_index()

    def _load_document_index(self) -> Dict:
        """Load the document index if available."""
        index_path = self.pages_dir / "document_index.json"
        if index_path.exists():
            try:
                with open(index_path, 'r') as f:
                    return json.load(f)
            except Exception as e:
                logger.error(f"Error loading document index: {e}")
        return {}

    def _normalize_document_name(self, document_name: str) -> str:
        """Normalize a document name for consistent file matching."""
        # Remove common prefixes/suffixes
        name = document_name.strip()
        name = name.replace(" ", "_")
        name = name.replace(".", "_")

        # Handle different formats
        if not name.endswith(("UserGuide", "InstallationGuide", "QuickStartGuide")):
            # Try to identify the document type
            if "user" in name.lower() and "guide" in name.lower():
                if not name.endswith("UserGuide"):
                    name = name.replace("User_Guide", "UserGuide")
            elif "installation" in name.lower() and "guide" in name.lower():
                if not name.endswith("InstallationGuide"):
                    name = name.replace("Installation_Guide", "InstallationGuide")
            elif "quick" in name.lower() and "start" in name.lower():
                if not name.endswith("QuickStartGuide"):
                    name = name.replace("Quick_Start_Guide", "QuickStartGuide")

        return name

    def get_table_of_contents(self, document_name: str) -> Optional[str]:
        """Get the table of contents for a document."""
        normalized_name = self._normalize_document_name(document_name)
        toc_filename = f"{normalized_name}_TOC.txt"
        toc_path = self.pages_dir / toc_filename

        if not toc_path.exists():
            # Try alternative naming conventions
            alternatives = [
                f"{document_name}_TOC.txt",
                f"{document_name.replace(' ', '_')}_TOC.txt",
                f"{document_name.replace('.', '_')}_TOC.txt"
            ]

            for alt in alternatives:
                alt_path = self.pages_dir / alt
                if alt_path.exists():
                    toc_path = alt_path
                    break

        if toc_path.exists():
            try:
                with open(toc_path, 'r', encoding='utf-8') as f:
                    return f.read()
            except Exception as e:
                logger.error(f"Error reading TOC file {toc_path}: {e}")
                return None

        logger.warning(f"TOC file not found for document: {document_name}")
        return None

    def read_pages(self, document_name: str, page_numbers: Optional[List[int]] = None) -> Union[str, Dict[int, str]]:
        """
        Read specific pages from a document.
        If page_numbers is None, returns the table of contents.
        """
        if page_numbers is None:
            # Return the table of contents
            toc = self.get_table_of_contents(document_name)
            if toc:
                return f"Table of Contents for {document_name}:\n\n{toc}"
            else:
                return f"Table of contents not found for document: {document_name}"

        # Read specific pages
        normalized_name = self._normalize_document_name(document_name)
        pages_content = {}

        for page_num in page_numbers:
            page_filename = f"{normalized_name}_page_{page_num:03d}.txt"
            page_path = self.pages_dir / page_filename

            if not page_path.exists():
                # Try alternative formats
                alternatives = [
                    f"{document_name}_page_{page_num:03d}.txt",
                    f"{document_name.replace(' ', '_')}_page_{page_num:03d}.txt",
                    f"{document_name.replace('.', '_')}_page_{page_num:03d}.txt"
                ]

                for alt in alternatives:
                    alt_path = self.pages_dir / alt
                    if alt_path.exists():
                        page_path = alt_path
                        break

            if page_path.exists():
                try:
                    with open(page_path, 'r', encoding='utf-8') as f:
                        pages_content[page_num] = f.read()
                except Exception as e:
                    logger.error(f"Error reading page {page_num} from {document_name}: {e}")
                    pages_content[page_num] = f"Error reading page {page_num}"
            else:
                pages_content[page_num] = f"Page {page_num} not found"

        # Format the output
        if len(pages_content) == 1:
            page_num = list(pages_content.keys())[0]
            return f"Page {page_num} of {document_name}:\n\n{pages_content[page_num]}"
        else:
            formatted_pages = []
            for page_num in sorted(pages_content.keys()):
                formatted_pages.append(f"=== Page {page_num} ===\n{pages_content[page_num]}")
            return f"Pages from {document_name}:\n\n" + "\n\n".join(formatted_pages)

    def list_available_documents(self) -> List[str]:
        """List all available documents."""
        documents = set()

        # Scan for TOC files
+
for toc_file in self.pages_dir.glob("*_TOC.txt"):
|
| 145 |
+
doc_name = toc_file.stem.replace("_TOC", "")
|
| 146 |
+
documents.add(doc_name)
|
| 147 |
+
|
| 148 |
+
# Also check document index
|
| 149 |
+
if self.document_index:
|
| 150 |
+
documents.update(self.document_index.keys())
|
| 151 |
+
|
| 152 |
+
return sorted(list(documents))
|
| 153 |
+
|
| 154 |
+
def get_document_info(self, document_name: str) -> Dict[str, any]:
|
| 155 |
+
"""Get information about a document (number of pages, etc.)."""
|
| 156 |
+
normalized_name = self._normalize_document_name(document_name)
|
| 157 |
+
info = {
|
| 158 |
+
"name": document_name,
|
| 159 |
+
"normalized_name": normalized_name,
|
| 160 |
+
"has_toc": False,
|
| 161 |
+
"page_count": 0,
|
| 162 |
+
"available_pages": []
|
| 163 |
+
}
|
| 164 |
+
|
| 165 |
+
# Check for TOC
|
| 166 |
+
toc_path = self.pages_dir / f"{normalized_name}_TOC.txt"
|
| 167 |
+
info["has_toc"] = toc_path.exists()
|
| 168 |
+
|
| 169 |
+
# Count pages
|
| 170 |
+
page_pattern = f"{normalized_name}_page_*.txt"
|
| 171 |
+
page_files = list(self.pages_dir.glob(page_pattern))
|
| 172 |
+
|
| 173 |
+
if not page_files:
|
| 174 |
+
# Try alternative patterns
|
| 175 |
+
for alt_pattern in [f"{document_name}_page_*.txt",
|
| 176 |
+
f"{document_name.replace(' ', '_')}_page_*.txt"]:
|
| 177 |
+
page_files = list(self.pages_dir.glob(alt_pattern))
|
| 178 |
+
if page_files:
|
| 179 |
+
break
|
| 180 |
+
|
| 181 |
+
if page_files:
|
| 182 |
+
page_numbers = []
|
| 183 |
+
for page_file in page_files:
|
| 184 |
+
try:
|
| 185 |
+
# Extract page number from filename
|
| 186 |
+
page_num_str = page_file.stem.split("_page_")[-1]
|
| 187 |
+
page_num = int(page_num_str)
|
| 188 |
+
page_numbers.append(page_num)
|
| 189 |
+
except:
|
| 190 |
+
pass
|
| 191 |
+
|
| 192 |
+
info["page_count"] = len(page_numbers)
|
| 193 |
+
info["available_pages"] = sorted(page_numbers)
|
| 194 |
+
|
| 195 |
+
return info
|
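A minimal sketch of the page-file naming convention that `read_pages()` and `get_document_info()` rely on: pages are stored as `<normalized name>_page_NNN.txt`, with the page number zero-padded to three digits. The document name below is a hypothetical example.

```python
# Page files follow "<normalized name>_page_NNN.txt" with a 3-digit,
# zero-padded page number, matching the f-string in read_pages().
def page_filename(normalized_name: str, page_num: int) -> str:
    return f"{normalized_name}_page_{page_num:03d}.txt"

print(page_filename("Harmony_UserGuide", 7))    # Harmony_UserGuide_page_007.txt
print(page_filename("Harmony_UserGuide", 123))  # Harmony_UserGuide_page_123.txt
```

Because the padding is fixed at three digits, a plain lexicographic sort of the filenames also sorts pages numerically for documents up to 999 pages.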
backend/test_pdf_mapping.py
ADDED
|
@@ -0,0 +1,32 @@
#!/usr/bin/env python3
"""Test script to verify PDF mapping before uploading."""

from upload_versioned_pdfs import VectorStoreUploader
from pathlib import Path

def main():
    """Test PDF file detection and mapping."""
    uploader = VectorStoreUploader()

    print("PDF Directory:", uploader.pdf_directory)
    print("Directory exists:", uploader.pdf_directory.exists())
    print()

    if uploader.pdf_directory.exists():
        all_pdfs = list(uploader.pdf_directory.glob("*.pdf"))
        print(f"Total PDFs found: {len(all_pdfs)}")
        print("\nAll PDF files:")
        for pdf in sorted(all_pdfs):
            print(f"  - {pdf.name}")
        print()

    pdf_mapping = uploader.get_pdf_files()

    print("\nPDF Mapping by Version:")
    for store_name, pdf_files in pdf_mapping.items():
        print(f"\n{store_name}: ({len(pdf_files)} files)")
        for pdf in pdf_files:
            print(f"  - {pdf.name}")

if __name__ == "__main__":
    main()
backend/upload_versioned_pdfs.py
ADDED
|
@@ -0,0 +1,239 @@
#!/usr/bin/env python3
"""
Upload versioned PDFs to separate OpenAI vector stores.
Creates one vector store per version and one general/FAQ store.
"""

import os
import json
import time
import sys
from pathlib import Path
from typing import Dict, List, Optional
from openai import OpenAI, __version__ as openai_version
from datetime import datetime
from packaging import version


class VectorStoreUploader:
    def __init__(self, api_key: Optional[str] = None, skip_empty: bool = True):
        """Initialize the uploader with OpenAI client.

        Args:
            api_key: OpenAI API key
            skip_empty: Skip creation of empty vector stores
        """
        # Check OpenAI version
        if version.parse(openai_version) < version.parse("1.50.0"):
            print(f"Error: OpenAI library version {openai_version} is too old.")
            print("Vector stores require version 1.50.0 or higher.")
            print("Please run: pip install --upgrade openai>=1.50.0")
            sys.exit(1)

        self.client = OpenAI(api_key=api_key or os.getenv("OPENAI_API_KEY"))
        self.config_path = Path(__file__).parent.parent / "config" / "vector_stores.json"
        self.pdf_directory = Path("/Users/jsv/Work/ataya/concert-master/pdfs")
        self.skip_empty = skip_empty

    def create_vector_store(self, name: str, description: str) -> str:
        """Create a new vector store and return its ID."""
        print(f"Creating vector store: {name}")
        # Note: description parameter no longer supported in API
        vector_store = self.client.vector_stores.create(
            name=name
        )
        return vector_store.id

    def upload_file_to_store(self, vector_store_id: str, file_path: Path) -> str:
        """Upload a file to a vector store."""
        print(f"  Uploading {file_path.name}...")

        # Upload file
        with open(file_path, "rb") as file:
            file_upload = self.client.files.create(
                file=file,
                purpose="assistants"
            )

        # Add file to vector store
        self.client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=file_upload.id
        )

        # Wait for processing
        while True:
            file_status = self.client.vector_stores.files.retrieve(
                vector_store_id=vector_store_id,
                file_id=file_upload.id
            )
            if file_status.status == "completed":
                print(f"  ✓ {file_path.name} processed successfully")
                break
            elif file_status.status == "failed":
                print(f"  ✗ {file_path.name} failed to process")
                break
            time.sleep(2)

        return file_upload.id

    def get_pdf_files(self) -> Dict[str, List[Path]]:
        """Organize PDF files by version."""
        pdf_mapping = {
            "harmony_1_2": [],
            "harmony_1_5": [],
            "harmony_1_6": [],
            "harmony_1_8": [],
            "chorus_1_1": [],
            "general_faq": []
        }

        if not self.pdf_directory.exists():
            print(f"PDF directory not found: {self.pdf_directory}")
            return pdf_mapping

        # Map file patterns to versions
        for pdf_file in self.pdf_directory.glob("*.pdf"):
            filename = pdf_file.name.lower()

            # Check for Harmony versions
            if "harmony" in filename:
                if "1.2" in filename or "r1.2" in filename:
                    pdf_mapping["harmony_1_2"].append(pdf_file)
                elif "1.5" in filename or "r1.5" in filename:
                    pdf_mapping["harmony_1_5"].append(pdf_file)
                elif "1.6" in filename or "r1.6" in filename:
                    pdf_mapping["harmony_1_6"].append(pdf_file)
                elif "1.8" in filename or "r1.8" in filename:
                    pdf_mapping["harmony_1_8"].append(pdf_file)

            # Check for Chorus versions
            elif "chorus" in filename:
                if "1.1" in filename or "r1.1" in filename:
                    pdf_mapping["chorus_1_1"].append(pdf_file)

            # General/FAQ documents
            elif any(keyword in filename for keyword in ["faq", "general", "overview", "comparison"]):
                pdf_mapping["general_faq"].append(pdf_file)

        return pdf_mapping

    def upload_all_pdfs(self):
        """Create vector stores and upload all PDFs."""
        pdf_mapping = self.get_pdf_files()
        vector_stores = {}
        descriptions = {}

        # Create vector stores and upload files
        for store_name, pdf_files in pdf_mapping.items():
            if not pdf_files:
                if self.skip_empty:
                    print(f"\nNo PDFs found for {store_name}, skipping...")
                    continue
                else:
                    print(f"\nNo PDFs found for {store_name}, but creating empty store...")

            # Create descriptive name and description
            if store_name == "general_faq":
                name = "General FAQ and Overview"
                description = "General information, FAQs, and cross-version content"
            else:
                # "ver" avoids shadowing the packaging "version" module imported above
                product, ver = store_name.split("_", 1)
                version_display = ver.replace("_", ".")
                name = f"{product.capitalize()} {version_display}"
                description = f"Documentation for {product.capitalize()} version {version_display}"

            # Create vector store
            vector_store_id = self.create_vector_store(name, description)
            vector_stores[store_name] = vector_store_id
            descriptions[store_name] = description

            # Upload files
            print(f"\nUploading {len(pdf_files)} files to {name}:")
            for pdf_file in pdf_files:
                self.upload_file_to_store(vector_store_id, pdf_file)

        # Save configuration
        self.save_config(vector_stores, descriptions)

        return vector_stores

    def save_config(self, vector_stores: Dict[str, str], descriptions: Dict[str, str]):
        """Save vector store configuration."""
        config = {
            "vector_stores": vector_stores,
            "descriptions": descriptions,
            "latest_versions": {
                "harmony": "1.8",
                "chorus": "1.1"
            },
            "created_at": datetime.now().isoformat(),
            "chunk_size": 1000,
            "max_chunks": 10
        }

        # Ensure config directory exists
        self.config_path.parent.mkdir(parents=True, exist_ok=True)

        # Save configuration
        with open(self.config_path, "w") as f:
            json.dump(config, f, indent=2)

        print(f"\nConfiguration saved to: {self.config_path}")
        print(json.dumps(config, indent=2))


def main():
    """Main function to run the upload process."""
    import argparse

    parser = argparse.ArgumentParser(description="Upload PDFs to OpenAI vector stores")
    parser.add_argument(
        "--create-empty",
        action="store_true",
        help="Create empty vector stores even if no PDFs are found"
    )
    parser.add_argument(
        "--no-confirm",
        action="store_true",
        help="Skip confirmation prompt"
    )
    args = parser.parse_args()

    print("OpenAI Chatbot MCP - Vector Store Setup")
    print("=" * 50)

    # Check for API key
    if not os.getenv("OPENAI_API_KEY"):
        print("Error: OPENAI_API_KEY environment variable not set")
        return

    # Create uploader and run
    uploader = VectorStoreUploader(skip_empty=not args.create_empty)

    # First, let's check what PDFs we have
    print("\nScanning for PDF files...")
    pdf_mapping = uploader.get_pdf_files()

    print("\nFound PDFs:")
    for store_name, pdf_files in pdf_mapping.items():
        print(f"\n{store_name}:")
        for pdf in pdf_files:
            print(f"  - {pdf.name}")

    # Confirm before proceeding
    if not args.no_confirm:
        response = input("\nProceed with vector store creation? (yes/no): ")
        if response.lower() != "yes":
            print("Aborted.")
            return

    # Upload all PDFs
    vector_stores = uploader.upload_all_pdfs()

    print("\n✅ Vector store setup complete!")
    print(f"Created {len(vector_stores)} vector stores")


if __name__ == "__main__":
    main()
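The filename-to-store mapping in `get_pdf_files()` can be sketched as a standalone function, so the classification logic is testable without an OpenAI client or a real PDF directory. The filenames below are hypothetical examples, not files from this repository.

```python
from typing import Optional

# Mirrors the branch order in get_pdf_files(): Harmony versions first,
# then Chorus, then general/FAQ keywords; unmatched files return None.
def classify_pdf(filename: str) -> Optional[str]:
    name = filename.lower()
    if "harmony" in name:
        for ver in ("1.2", "1.5", "1.6", "1.8"):
            if ver in name or f"r{ver}" in name:
                return f"harmony_{ver.replace('.', '_')}"
    elif "chorus" in name:
        if "1.1" in name or "r1.1" in name:
            return "chorus_1_1"
    elif any(k in name for k in ("faq", "general", "overview", "comparison")):
        return "general_faq"
    return None

print(classify_pdf("Harmony_R1.8_UserGuide.pdf"))  # harmony_1_8
print(classify_pdf("Product_FAQ.pdf"))             # general_faq
```

Note that because the `elif` chain checks product names first, a file named `Harmony_FAQ.pdf` lands in a Harmony store (or nowhere, if no version token matches) rather than in `general_faq`.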
backend/vector_store_manager.py
ADDED
|
@@ -0,0 +1,178 @@
"""
Vector Store Manager for handling multiple version-specific vector stores.
"""

import os
import json
from typing import Dict, List, Optional, Tuple
from pathlib import Path
from openai import OpenAI
import logging

logger = logging.getLogger(__name__)


class VectorStoreManager:
    def __init__(self, client: OpenAI, config_path: Optional[Path] = None):
        """Initialize the vector store manager."""
        self.client = client
        self.config_path = config_path or Path(__file__).parent.parent / "config" / "vector_stores.json"
        self.vector_stores = {}
        self.latest_versions = {}
        self.load_config()

    def load_config(self):
        """Load vector store configuration from file."""
        if not self.config_path.exists():
            logger.warning(f"Vector store config not found at {self.config_path}")
            return

        try:
            with open(self.config_path, 'r') as f:
                config = json.load(f)
            self.vector_stores = config.get('vector_stores', {})
            self.latest_versions = config.get('latest_versions', {})
            logger.info(f"Loaded {len(self.vector_stores)} vector stores from config")
        except Exception as e:
            logger.error(f"Error loading vector store config: {e}")

    def get_store_name_from_version(self, product: str, version: str) -> str:
        """Convert product and version to store name."""
        # Normalize version (e.g., "1.8" -> "1_8")
        version_normalized = version.replace(".", "_")
        return f"{product.lower()}_{version_normalized}"

    def get_vector_store_id(self, store_name: str) -> Optional[str]:
        """Get vector store ID by name."""
        return self.vector_stores.get(store_name)

    def query_vector_store(self, store_name: str, query: str, max_results: int = 5) -> List[Dict]:
        """Query a specific vector store."""
        store_id = self.get_vector_store_id(store_name)
        if not store_id:
            logger.warning(f"Vector store '{store_name}' not found")
            return []

        # Check for placeholder IDs
        if store_id.startswith("vs_PLACEHOLDER"):
            logger.warning(f"Vector store '{store_name}' has placeholder ID: {store_id}")
            logger.warning("Please run upload_versioned_pdfs.py to create actual vector stores")
            return []

        try:
            # Create a thread for the query
            thread = self.client.beta.threads.create()

            # Add the query as a message
            self.client.beta.threads.messages.create(
                thread_id=thread.id,
                role="user",
                content=query
            )

            # Run the assistant with the specific vector store
            run = self.client.beta.threads.runs.create_and_poll(
                thread_id=thread.id,
                assistant_id="asst_temp",  # This will be replaced with actual assistant ID
                tools=[{"type": "file_search"}],
                tool_resources={
                    "file_search": {
                        "vector_store_ids": [store_id]
                    }
                }
            )

            # Get the messages
            messages = self.client.beta.threads.messages.list(
                thread_id=thread.id,
                order="asc"
            )

            # Extract search results
            results = []
            for message in messages:
                if message.role == "assistant":
                    for content in message.content:
                        if content.type == "text":
                            # Parse file search annotations
                            annotations = content.text.annotations
                            for annotation in annotations:
                                if annotation.type == "file_citation":
                                    results.append({
                                        "text": annotation.text,
                                        "file_id": annotation.file_citation.file_id,
                                        "quote": annotation.file_citation.quote
                                    })

            return results[:max_results]

        except Exception as e:
            logger.error(f"Error querying vector store '{store_name}': {e}")
            return []

    def query_version_and_general(self, product: str, version: str, query: str, max_results: int = 5) -> Tuple[List[Dict], List[Dict]]:
        """Query both version-specific and general vector stores."""
        # Query version-specific store
        store_name = self.get_store_name_from_version(product, version)
        version_results = self.query_vector_store(store_name, query, max_results)

        # Query general/FAQ store
        general_results = self.query_vector_store("general_faq", query, max_results)

        return version_results, general_results

    def search_across_stores(self, query: str, store_names: Optional[List[str]] = None, max_results_per_store: int = 3) -> Dict[str, List[Dict]]:
        """Search across multiple vector stores."""
        if store_names is None:
            store_names = list(self.vector_stores.keys())

        results = {}
        for store_name in store_names:
            if store_name in self.vector_stores:
                store_results = self.query_vector_store(store_name, query, max_results_per_store)
                if store_results:
                    results[store_name] = store_results

        return results

    def get_latest_version(self, product: str) -> Optional[str]:
        """Get the latest version for a product."""
        return self.latest_versions.get(product.lower())

    def list_available_versions(self) -> Dict[str, List[str]]:
        """List all available product versions."""
        versions = {"harmony": [], "chorus": []}

        for store_name in self.vector_stores.keys():
            if store_name == "general_faq":
                continue

            parts = store_name.split("_", 1)
            if len(parts) == 2:
                product, version = parts
                version_display = version.replace("_", ".")
                if product in versions:
                    versions[product].append(version_display)

        # Sort versions
        for product in versions:
            versions[product].sort(key=lambda x: [int(p) for p in x.split(".")])

        return versions

    def format_search_results(self, version_results: List[Dict], general_results: List[Dict], product: str, version: str) -> str:
        """Format search results for appending to user query."""
        formatted = []

        if version_results:
            formatted.append(f"Based on {product.capitalize()} {version} documentation:")
            for i, result in enumerate(version_results, 1):
                formatted.append(f"{i}. {result.get('quote', result.get('text', ''))}")
            formatted.append("")

        if general_results:
            formatted.append("Additional general information:")
            for i, result in enumerate(general_results, 1):
                formatted.append(f"{i}. {result.get('quote', result.get('text', ''))}")

        return "\n".join(formatted)
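Two conventions from `VectorStoreManager` can be sketched in isolation: store names are `<product>_<version with dots replaced by underscores>` (see `get_store_name_from_version`), and `list_available_versions` sorts versions numerically per component rather than lexicographically. The "1.10" value below is a hypothetical version used only to show why the numeric sort matters.

```python
# Store-name convention: lowercase product, dots in the version become underscores.
def store_name(product: str, version: str) -> str:
    return f"{product.lower()}_{version.replace('.', '_')}"

# Per-component numeric sort, as in list_available_versions():
# a plain string sort would put "1.10" before "1.2".
versions = ["1.8", "1.2", "1.10", "1.5"]
versions.sort(key=lambda v: [int(p) for p in v.split(".")])

print(store_name("Harmony", "1.8"))  # harmony_1_8
print(versions)                      # ['1.2', '1.5', '1.8', '1.10']
```

This is also why the sort key raises `ValueError` on non-numeric version parts, so store names must keep the strict `product_major_minor` shape.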