LlamaIndex Framework Integration - Refined

Implementation refined based on official LlamaIndex framework documentation and best practices.

Key Framework Concepts Implemented

1. Ingestion Pipeline

Modern LlamaIndex Pattern: Processing documents through transformations before indexing

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.extractors import TitleExtractor, KeywordExtractor

# Pipeline automatically:
# - Parses documents into nodes
# - Extracts metadata (titles, keywords)
# - Handles deduplication
# - Manages state across runs

pipeline = IngestionPipeline(
    transformations=[
        SimpleNodeParser(chunk_size=1024, chunk_overlap=20),
        TitleExtractor(nodes=5),
        KeywordExtractor(keywords=10),
    ]
)

nodes = pipeline.run(documents=documents)
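
The deduplication and cross-run state management noted above come from attaching a document store to the pipeline. A minimal sketch, assuming the same `documents` list as above (`SimpleDocumentStore` is LlamaIndex's in-memory docstore):

```python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.storage.docstore import SimpleDocumentStore

# With a docstore attached, the pipeline hashes each document and
# skips unchanged documents on subsequent runs.
pipeline = IngestionPipeline(
    transformations=[SimpleNodeParser(chunk_size=1024, chunk_overlap=20)],
    docstore=SimpleDocumentStore(),
)

nodes = pipeline.run(documents=documents)

# Persist pipeline state so deduplication survives restarts
pipeline.persist("./pipeline_storage")
```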

2. Storage Context

Modern LlamaIndex Pattern: Unified storage management

from llama_index.core import StorageContext, VectorStoreIndex

# Default (in-memory with local persistence)
storage_context = StorageContext.from_defaults()

# Pinecone backend
storage_context = StorageContext.from_defaults(
    vector_store=pinecone_vector_store
)

# Create index with storage context
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    show_progress=True
)

# Persist to disk
index.storage_context.persist(persist_dir="./kb_storage")
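
A persisted index can be reloaded later without re-running embeddings; a sketch assuming the `./kb_storage` directory from above:

```python
from llama_index.core import StorageContext, load_index_from_storage

# Rebuild the StorageContext from the persisted directory,
# then reload the index from disk.
storage_context = StorageContext.from_defaults(persist_dir="./kb_storage")
index = load_index_from_storage(storage_context)
```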

3. Query Engines

Modern LlamaIndex Pattern: End-to-end QA with response synthesis

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

# Create query engine with response synthesis
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact"  # Options: compact, tree_summarize, refine
)

response = query_engine.query("What is the main feature?")
# Returns: Response object with answer and source nodes

Response modes:

  • compact: Concise, single-pass synthesis
  • tree_summarize: Hierarchical summarization
  • refine: Iterative refinement across results

4. Chat Engines

Modern LlamaIndex Pattern: Multi-turn conversational interface

# Create chat engine for conversation
chat_engine = index.as_chat_engine()

# Multi-turn conversation
response = chat_engine.chat("What's the main topic?")
response = chat_engine.chat("Tell me more about it")
# Maintains conversation history automatically
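
The chat engine's behavior can be tuned via `chat_mode`, and its history can be cleared between sessions; a sketch using one of the built-in modes:

```python
# "condense_plus_context" rewrites each follow-up into a standalone
# question, then retrieves context before answering.
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    similarity_top_k=5,
)

response = chat_engine.chat("What's the main topic?")

# Clear the conversation history before starting a new session
chat_engine.reset()
```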

5. Global Settings

Modern LlamaIndex Pattern: Centralized configuration

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Configure globally
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-5")
Settings.chunk_size = 1024
Settings.chunk_overlap = 20

# All components use these settings automatically
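
Global `Settings` can also be overridden per component when a single engine needs a different model; a sketch reusing the models configured in this document:

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Global default
Settings.llm = OpenAI(model="gpt-3.5-turbo")

# Local override: this engine alone uses a different LLM,
# without touching the global Settings.
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-5"))
```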

Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          EcoMCPKnowledgeBase                    β”‚
β”‚  (High-level integration wrapper)               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ DocumentLoader                          β”‚    β”‚
β”‚  β”‚ - Load markdown, text, JSON, URLs       β”‚    β”‚
β”‚  β”‚ - Create product documents              β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                    β”‚                            β”‚
β”‚                    β–Ό                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ IngestionPipeline                       β”‚    β”‚
β”‚  β”‚ - Node parsing                          β”‚    β”‚
β”‚  β”‚ - Metadata extraction (title, keywords) β”‚    β”‚
β”‚  β”‚ - Transformations                       β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                    β”‚                            β”‚
β”‚                    β–Ό                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ VectorStoreIndex                        β”‚    β”‚
β”‚  β”‚ (with StorageContext)                   β”‚    β”‚
β”‚  β”‚ - In-memory or Pinecone backend         β”‚    β”‚
β”‚  β”‚ - Embeddings                            β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚               β”‚                β”‚                β”‚
β”‚               β–Ό                β–Ό                β”‚
β”‚        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚        β”‚ QueryEngine β”‚  β”‚ ChatEngine       β”‚    β”‚
β”‚        β”‚ (QA mode)   β”‚  β”‚ (Conversational) β”‚    β”‚
β”‚        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β–Ό
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚ VectorSearchEngine      β”‚
          β”‚ (Advanced search)       β”‚
          β”‚ - Product search        β”‚
          β”‚ - Documentation search  β”‚
          β”‚ - Semantic search       β”‚
          β”‚ - Recommendations       β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Usage Patterns

Pattern 1: Question-Answering

from src.core import EcoMCPKnowledgeBase

kb = EcoMCPKnowledgeBase()
kb.initialize("./docs")

# Query with automatic response synthesis
answer = kb.query("How do I deploy this?")
print(answer)  # Returns full answer with context

Pattern 2: Conversational

kb = EcoMCPKnowledgeBase()
kb.initialize("./docs")

# Multi-turn conversation
messages = [
    {"role": "user", "content": "What are the main features?"}
]
response = kb.chat(messages)
print(response)

# Continue conversation
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Tell me more about feature X"})
response = kb.chat(messages)
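
When managing the `messages` list by hand as above, it helps to cap its length so long sessions don't outgrow the model's context window. A small helper (hypothetical; not part of `EcoMCPKnowledgeBase`):

```python
def add_turn(messages, user_text, assistant_text, max_messages=20):
    """Append one user/assistant exchange, dropping the oldest
    messages once the history exceeds max_messages entries."""
    messages = messages + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]
    return messages[-max_messages:]
```

Each call returns a new list, so earlier snapshots of the history stay intact.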

Pattern 3: Semantic Search

kb = EcoMCPKnowledgeBase()
kb.initialize("./docs")

# Get search results with scores
results = kb.search("setup guide", top_k=5)
for result in results:
    print(f"Score: {result.score:.2f}")
    print(f"Content: {result.content[:200]}")

Pattern 4: Product Recommendations

kb = EcoMCPKnowledgeBase()
products = [...]
kb.add_products(products)

# Get recommendations with confidence scores
recs = kb.get_recommendations("laptop under $1000", limit=5)
for rec in recs:
    print(f"Confidence: {rec['confidence']:.2f}")
    print(f"Product: {rec['content']}")

Configuration Best Practices

from src.core import IndexConfig, EcoMCPKnowledgeBase

# Development
dev_config = IndexConfig(
    embedding_model="text-embedding-3-small",
    llm_model="gpt-3.5-turbo",
    chunk_size=512,
    use_pinecone=False,
)

# Production
prod_config = IndexConfig(
    embedding_model="text-embedding-3-large",
    llm_model="gpt-5",
    chunk_size=1024,
    use_pinecone=True,
    pinecone_index_name="ecomcp-prod",
)

kb = EcoMCPKnowledgeBase(config=prod_config)

Response Synthesis Modes

Compact (Recommended for speed)

  • Single LLM call
  • Combines all retrieved context
  • Returns concise answer
  • Best for: Direct factual questions
query_engine = index.as_query_engine(response_mode="compact")

Tree Summarize

  • Hierarchical summarization
  • Better for complex topics
  • Multiple LLM calls
  • Best for: Complex multi-step answers
query_engine = index.as_query_engine(response_mode="tree_summarize")

Refine

  • Iteratively refines answer
  • Processes results one by one
  • Best for: Detailed, nuanced answers
  • Most token usage
query_engine = index.as_query_engine(response_mode="refine")

Integration with Server

MCP Server Handler

from typing import Dict, List

from src.core import initialize_knowledge_base, get_knowledge_base

# Startup
@app.on_event("startup")
def startup():
    initialize_knowledge_base("./docs")

# Query handler
@mcp.tool()
def search(query: str) -> str:
    kb = get_knowledge_base()
    results = kb.search(query, top_k=5)
    return "\n".join([r.content for r in results])

# Chat handler
@mcp.tool()
def chat(messages: List[Dict[str, str]]) -> str:
    kb = get_knowledge_base()
    return kb.chat(messages)

API Endpoint

from typing import Dict, List

from fastapi import FastAPI
from src.core import initialize_knowledge_base, get_knowledge_base

app = FastAPI()

@app.on_event("startup")
async def startup():
    initialize_knowledge_base("./docs")

@app.post("/search")
async def search(query: str, top_k: int = 5):
    kb = get_knowledge_base()
    results = kb.search(query, top_k=top_k)
    return [r.to_dict() for r in results]

@app.post("/query")
async def query(question: str):
    kb = get_knowledge_base()
    answer = kb.query(question)
    return {"answer": answer}

@app.post("/chat")
async def chat(messages: List[Dict[str, str]]):
    kb = get_knowledge_base()
    response = kb.chat(messages)
    return {"response": response}

Metadata Extraction

The ingestion pipeline automatically extracts:

  • Titles: Section titles and document headers
  • Keywords: Key terms and concepts
# Metadata available in search results
results = kb.search("topic")
for result in results:
    print(result.metadata)
    # {
    #   "source": "docs/guide.md",
    #   "title": "Getting Started Guide",
    #   "keywords": ["setup", "installation", "requirements"],
    #   "type": "markdown"
    # }
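
Extracted metadata can also drive filtered retrieval, restricting results to nodes whose metadata matches a condition. A sketch using LlamaIndex's metadata filters (the `type` key assumes the metadata layout shown above):

```python
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Only retrieve nodes whose metadata has type == "markdown"
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="type", value="markdown")]
)

query_engine = index.as_query_engine(filters=filters, similarity_top_k=5)
response = query_engine.query("setup guide")
```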

Performance Tuning

For Speed

config = IndexConfig(
    embedding_model="text-embedding-3-small",
    llm_model="gpt-3.5-turbo",
    chunk_size=1024,
    similarity_top_k=3,  # Fewer results
)
kb = EcoMCPKnowledgeBase(config=config)
query_engine = kb.kb.index.as_query_engine(response_mode="compact")

For Quality

config = IndexConfig(
    embedding_model="text-embedding-3-large",
    llm_model="gpt-5",
    chunk_size=512,  # Smaller chunks
    similarity_top_k=10,  # More results
)
kb = EcoMCPKnowledgeBase(config=config)
query_engine = kb.kb.index.as_query_engine(response_mode="refine")

For Production Scalability

config = IndexConfig(
    embedding_model="text-embedding-3-large",
    llm_model="gpt-5",
    chunk_size=1024,
    use_pinecone=True,
    pinecone_index_name="ecomcp-prod",
)
kb = EcoMCPKnowledgeBase(config=config)
# Pinecone automatically scales to millions of documents
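
Wiring the Pinecone backend into a `StorageContext` looks roughly like this, assuming the `pinecone` and `llama-index-vector-stores-pinecone` packages are installed and an index named `ecomcp-prod` already exists:

```python
import os

from pinecone import Pinecone
from llama_index.core import StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Wrap the Pinecone index so LlamaIndex can read and write vectors there
vector_store = PineconeVectorStore(pinecone_index=pc.Index("ecomcp-prod"))
storage_context = StorageContext.from_defaults(vector_store=vector_store)
```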

Error Handling

try:
    kb = EcoMCPKnowledgeBase()
    kb.initialize("./docs")
except FileNotFoundError:
    logger.error("Documentation directory not found")
except Exception as e:
    logger.error(f"Failed to initialize knowledge base: {e}")

try:
    response = kb.query("question")
except Exception as e:
    logger.error(f"Query failed: {e}")
    return "Unable to process query"
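
For transient failures such as rate limits or network errors, the bare try/except above can be extended with retries and exponential backoff. A small wrapper (hypothetical helper; not part of the package):

```python
import logging
import time

logger = logging.getLogger(__name__)


def query_with_retry(query_fn, question, max_attempts=3, base_delay=1.0):
    """Call query_fn(question), retrying with exponential backoff.

    Returns the answer, or a fallback string once all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return query_fn(question)
        except Exception as e:
            logger.warning("Attempt %d/%d failed: %s", attempt, max_attempts, e)
            if attempt == max_attempts:
                return "Unable to process query"
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Usage: `answer = query_with_retry(kb.query, "How do I deploy this?")`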

Updates from Refining

βœ… Added IngestionPipeline for metadata extraction
βœ… Enhanced StorageContext management
βœ… Added ChatEngine for multi-turn conversation
βœ… Improved Settings configuration
βœ… Better response synthesis options
βœ… Enhanced error handling
βœ… More detailed documentation