# LlamaIndex Integration Guide
Complete guide to the knowledge base indexing and retrieval system powered by LlamaIndex.
## Overview
The LlamaIndex integration provides:
- **Knowledge Base Indexing**: Foundation for indexing documents and products
- **Vector Similarity Search**: Semantic search across indexed content
- **Document Retrieval**: Easy retrieval of relevant documents
## Components
### 1. Core Modules
#### `KnowledgeBase` (knowledge_base.py)
Low-level interface for index management.
```python
from src.core import KnowledgeBase, IndexConfig
# Initialize with custom config
config = IndexConfig(
embedding_model="text-embedding-3-small",
chunk_size=1024,
use_pinecone=False,
)
kb = KnowledgeBase(config)
# Index documents
kb.index_documents("./docs")
# Search
results = kb.search("your query", top_k=5)
# Query with QA
response = kb.query("What is the main feature?")
```
#### `DocumentLoader` (document_loader.py)
Load documents from various sources.
```python
from src.core import DocumentLoader
# Load from directory
docs = DocumentLoader.load_markdown_documents("./docs")
docs += DocumentLoader.load_text_documents("./docs")
# Load products
products = [
{
"id": "prod_001",
"name": "Product Name",
"description": "Description",
"price": "$99",
"category": "Category",
"features": ["Feature 1", "Feature 2"],
}
]
product_docs = DocumentLoader.create_product_documents(products)
# Load from URLs
urls = ["https://example.com/page1", "https://example.com/page2"]
url_docs = DocumentLoader.load_documents_from_urls(urls)
# Load all at once
all_docs = DocumentLoader.load_all_documents(
docs_dir="./docs",
products=products,
urls=urls,
)
```
#### `VectorSearchEngine` (vector_search.py)
High-level search interface with advanced features.
```python
from src.core import VectorSearchEngine
search_engine = VectorSearchEngine(kb)
# Basic search
results = search_engine.search("query", top_k=5)
# Product search only
products = search_engine.search_products("laptop", top_k=10)
# Documentation search only
docs = search_engine.search_documentation("how to setup", top_k=5)
# Semantic search with threshold
results = search_engine.semantic_search(
"installation guide",
top_k=5,
similarity_threshold=0.5,
)
# Hierarchical search across types
results = search_engine.hierarchical_search("e-commerce")
# Returns: {"products": [...], "documentation": [...]}
# Weighted combined search
results = search_engine.combined_search(
"shopping platform",
weights={"product": 0.6, "documentation": 0.4},
)
# Contextual search
results = search_engine.contextual_search(
"laptop",
context={"category": "electronics", "price_range": "$1000-2000"},
top_k=5,
)
# Get recommendations
recs = search_engine.get_recommendations("laptop under $1000", limit=5)
```
### 2. High-Level Integration
#### `EcoMCPKnowledgeBase` (llama_integration.py)
Complete integration for the EcoMCP application.
```python
from src.core import EcoMCPKnowledgeBase, initialize_knowledge_base
# Initialize
kb = EcoMCPKnowledgeBase()
# Auto-initialize with documents
kb.initialize("./docs")
# Add products
kb.add_products(products)
# Add URLs
kb.add_urls(["https://example.com"])
# Search
results = kb.search("query", top_k=5)
# Search specific types
products = kb.search_products("laptop", top_k=10)
docs = kb.search_documentation("deploy", top_k=5)
# Get recommendations
recs = kb.get_recommendations("gaming laptop", limit=5)
# Natural language query
answer = kb.query("What is the platform about?")
# Save and load
kb.save("./kb_index")
kb.load("./kb_index")
# Get stats
stats = kb.get_stats()
```
### 3. Global Singleton Pattern
```python
from src.core import initialize_knowledge_base, get_knowledge_base
# Initialize globally
kb = initialize_knowledge_base("./docs")
# Access from anywhere
kb = get_knowledge_base()
results = kb.search("query")
```
## Configuration
### IndexConfig Options
```python
config = IndexConfig(
# Embedding model (OpenAI)
embedding_model="text-embedding-3-small", # or "text-embedding-3-large"
# Chunking settings
chunk_size=1024, # Size of text chunks
chunk_overlap=20, # Overlap between chunks
# Vector store backend
use_pinecone=False, # True to use Pinecone
pinecone_index_name="ecomcp-knowledge",
pinecone_dimension=1536,
)
```
## Installation
Add the following to `requirements.txt`:
```
llama-index>=0.9.0
llama-index-embeddings-openai>=0.1.0
llama-index-vector-stores-pinecone>=0.1.0
```
Environment variables:
```bash
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=... # Optional, only if using Pinecone
```
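Missing credentials are a common source of confusing startup failures, so it can help to validate them before initializing the knowledge base. Below is a minimal sketch of such a check; `require_env` is a hypothetical helper, not part of the package:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is required; set it before initializing the knowledge base"
        )
    return value

# Fail fast at startup instead of deep inside an embedding call:
# openai_key = require_env("OPENAI_API_KEY")
```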
## Usage Examples
### Example 1: Basic Document Indexing
```python
from src.core import EcoMCPKnowledgeBase
kb = EcoMCPKnowledgeBase()
kb.initialize("./docs")
# Search
results = kb.search("deployment guide", top_k=3)
for result in results:
print(f"Score: {result.score:.2f}")
print(f"Content: {result.content[:200]}")
```
### Example 2: Product Recommendation
```python
from src.core import EcoMCPKnowledgeBase
kb = EcoMCPKnowledgeBase()
products = [
{
"id": "1",
"name": "Wireless Headphones",
"description": "Noise-canceling",
"price": "$299",
"category": "Electronics",
"features": ["ANC", "30h Battery"],
"tags": ["audio", "wireless"]
},
# ... more products
]
kb.add_products(products)
# Get recommendations
recs = kb.get_recommendations("best headphones for music", limit=3)
for rec in recs:
print(f"Rank: {rec['rank']}")
print(f"Confidence: {rec['confidence']:.2f}")
```
### Example 3: Semantic Search with Filtering
```python
from src.core import VectorSearchEngine
search = VectorSearchEngine(kb)
# Search with context
results = search.contextual_search(
"laptop computer",
context={
"category": "computers",
"price_range": "$500-1000",
"processor": "Intel"
},
top_k=5
)
```
### Example 4: Knowledge Base Persistence
```python
from src.core import EcoMCPKnowledgeBase
# Create and save
kb1 = EcoMCPKnowledgeBase()
kb1.initialize("./docs")
kb1.save("./kb_backup")
# Load later
kb2 = EcoMCPKnowledgeBase()
kb2.load("./kb_backup")
# Use immediately
results = kb2.search("something")
```
## Integration with Server
### In Your Server/MCP Implementation
```python
from src.core import initialize_knowledge_base, get_knowledge_base
# During startup
def initialize_app():
kb = initialize_knowledge_base("./docs")
kb.add_products(get_all_products()) # Your product source
# In your handlers
def search_handler(query: str):
kb = get_knowledge_base()
results = kb.search(query)
return results
def recommend_handler(user_query: str):
kb = get_knowledge_base()
recommendations = kb.get_recommendations(user_query)
return recommendations
```
## Advanced Features
### Custom Metadata
```python
from llama_index.core.schema import Document
doc = Document(
text="Content here",
metadata={
"source": "custom_source",
"author": "John Doe",
"date": "2024-01-01",
"category": "tutorial",
}
)
# `kb.kb` is the underlying low-level KnowledgeBase instance
kb.kb.add_documents([doc])
```
### Pinecone Integration
```python
config = IndexConfig(use_pinecone=True)
kb = EcoMCPKnowledgeBase(config=config)
# Automatically creates/uses Pinecone index
kb.initialize("./docs")
```
### Custom Query Engine
```python
# Low-level query with custom settings
# (`kb.kb.index` exposes the underlying LlamaIndex index)
query_engine = kb.kb.index.as_query_engine(
similarity_top_k=10,
response_mode="compact" # or "tree_summarize", "refine"
)
response = query_engine.query("Your question")
```
## Performance Tips
1. **Chunk Size**: Use larger chunks (e.g., 2048) for long, uniform documents and smaller chunks (e.g., 512) for short or varied content
2. **Vector Store**: Use Pinecone for production deployments
3. **Batch Processing**: Index documents in batches for large datasets
4. **Caching**: Persist the index with `save()` and load it from disk instead of re-indexing on every startup
5. **Top-K**: Start with top_k=5, adjust based on relevance
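Tip 3 (batch processing) can be sketched with a generic batching helper; the `kb.add_products` call in the comment assumes the API shown earlier in this guide:

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: List[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive fixed-size batches from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical usage for indexing a large product catalog:
# for batch in batched(products, 100):
#     kb.add_products(batch)
```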
## Troubleshooting
### No OpenAI API Key
```
Error: OPENAI_API_KEY not set
Solution: run `export OPENAI_API_KEY=sk-...` in your environment
```
### Pinecone Connection Failed
```
Error: Pinecone connection failed
Solution: Check PINECONE_API_KEY and network connectivity
Falls back to in-memory indexing automatically
```
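The automatic fallback described above follows a try-and-degrade pattern. The sketch below illustrates that pattern only; the store classes are stand-ins, not the actual implementation:

```python
class InMemoryStore:
    """Stand-in for the default in-memory vector store."""
    backend = "memory"

class PineconeStore:
    """Stand-in for a Pinecone-backed store that fails to connect."""
    backend = "pinecone"
    def __init__(self):
        # Without valid credentials or network access, connecting fails.
        raise ConnectionError("Pinecone unreachable")

def build_vector_store(use_pinecone: bool):
    """Prefer Pinecone when requested, but degrade to in-memory on failure."""
    if use_pinecone:
        try:
            return PineconeStore()
        except ConnectionError:
            pass  # fall through to the in-memory backend
    return InMemoryStore()
```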
### Out of Memory with Large Datasets
```
Solution:
- Reduce chunk_size in IndexConfig
- Process documents in batches
- Use Pinecone backend (scales to millions of documents)
```
## Testing
Run tests:
```bash
pytest tests/test_llama_integration.py -v
```
## API Reference
Detailed API documentation is available in the docstrings under `src/core/`.
## Files Structure
```
src/core/
β”œβ”€β”€ __init__.py # Package exports
β”œβ”€β”€ knowledge_base.py # Core KnowledgeBase class
β”œβ”€β”€ document_loader.py # Document loading utilities
β”œβ”€β”€ vector_search.py # VectorSearchEngine with advanced features
β”œβ”€β”€ llama_integration.py # EcoMCP integration wrapper
└── examples.py # Usage examples
```
## Related Documentation
- OpenAI API: https://platform.openai.com/docs
- LlamaIndex: https://docs.llamaindex.ai
- Pinecone: https://docs.pinecone.io