Implementing Production-Grade RAG with Multi-Agent Systems in JavaScript

Community Article Published October 16, 2025

A technical deep-dive into building retrieval-augmented generation systems using KaibanJS, LangChain.js, and vector databases


Abstract

This article presents a practical implementation of Retrieval-Augmented Generation (RAG) using JavaScript-based AI agent frameworks. We explore the architectural separation between indexing and retrieval phases, demonstrate embedding flexibility across multiple providers (OpenAI, Cohere, HuggingFace), and examine configurable retrieval strategies. The implementation leverages KaibanJS's SimpleRAGRetrieve tool with LangChain.js ecosystem integration for vector store operations.

Key Topics:

  • RAG architecture with separated indexing/retrieval phases
  • Vector store operations and embedding strategies
  • Retrieval configuration: similarity search vs. MMR
  • Multi-agent task decomposition
  • Production deployment considerations

Background: RAG in Agent-Based Systems

Retrieval-Augmented Generation has become a cornerstone technique for grounding LLM responses in factual data. Traditional RAG implementations often conflate indexing and retrieval logic, leading to monolithic architectures that are difficult to scale and maintain.

Modern agent-based systems benefit from tool-oriented RAG, where retrieval capabilities are encapsulated as discrete tools that agents can invoke. This approach offers:

  1. Composability: Agents can combine RAG with other tools (web search, APIs, calculations)
  2. Specialization: Different agents can access different knowledge bases
  3. Scalability: Indexing and retrieval can be scaled independently
  4. Flexibility: Vector stores can be swapped without changing agent logic

Architecture Overview

Two-Phase RAG Pipeline

Our implementation separates RAG into distinct phases:

┌─────────────────────────────────────────────────────────────┐
│                    PHASE 1: INDEXING                         │
│                   (Offline/ETL Process)                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Raw Documents                                              │
│       ↓                                                      │
│  Text Splitting (RecursiveCharacterTextSplitter)           │
│       ↓                                                      │
│  Embedding Generation (OpenAI/Cohere/HuggingFace)          │
│       ↓                                                      │
│  Vector Store Persistence (Pinecone/Supabase/Chroma)       │
│                                                              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                   PHASE 2: RETRIEVAL                         │
│                   (Runtime/Query Time)                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  User Query                                                 │
│       ↓                                                      │
│  Query Embedding (same model as indexing)                   │
│       ↓                                                      │
│  Vector Similarity Search (cosine/euclidean)                │
│       ↓                                                      │
│  Retrieved Contexts (top-k documents)                       │
│       ↓                                                      │
│  LLM Generation with Context                                │
│       ↓                                                      │
│  Grounded Response                                          │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Key Design Decision: SimpleRAGRetrieve operates exclusively in Phase 2, assuming a pre-populated vector store. This architectural choice enables:

  • Lightweight retrieval services
  • Independent scaling of indexing infrastructure
  • Shared knowledge bases across multiple applications
  • Simplified deployment and testing

Implementation

Environment Setup

npm install kaibanjs @kaibanjs/tools @langchain/openai @langchain/community langchain

For alternative embedding providers:

# Cohere
npm install @langchain/cohere

# HuggingFace Inference
npm install @langchain/community

# Anthropic
npm install @langchain/anthropic

Phase 1: Vector Store Indexing

Document Preparation

For this demonstration, we use a product catalog with structured metadata. In production, this data would come from databases, APIs, or document stores.

const sampleData = [
  {
    id: 1,
    name: 'UltraBook Pro 15',
    category: 'Laptop',
    content:
      'The UltraBook Pro 15 is a premium laptop featuring a 15.6-inch 4K display, Intel i9 processor, 32GB RAM, and 1TB NVMe SSD...',
    price: 2499,
    specs: ['Intel i9', '32GB RAM', '1TB SSD', '4K Display'],
    inStock: true
  }
  // ... additional products
];

Embedding Configuration

The choice of embedding model significantly impacts retrieval quality. Consider:

  • OpenAI text-embedding-3-small: 1536 dimensions, cost-effective, strong general performance
  • OpenAI text-embedding-3-large: 3072 dimensions, highest quality, higher cost
  • Cohere embed-english-v3.0: 1024 dimensions, optimized for English, good retrieval performance
  • HuggingFace all-MiniLM-L6-v2: 384 dimensions, lightweight, self-hostable

import { OpenAIEmbeddings } from '@langchain/openai';
import { CohereEmbeddings } from '@langchain/cohere';
import { HuggingFaceInferenceEmbeddings } from '@langchain/community/embeddings/hf';

// OpenAI embeddings (default)
const openaiEmbeddings = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY,
  modelName: 'text-embedding-3-small',
  dimensions: 1536 // Can be reduced for lower dimensionality
});

// Cohere embeddings
const cohereEmbeddings = new CohereEmbeddings({
  apiKey: process.env.COHERE_API_KEY,
  model: 'embed-english-v3.0',
  inputType: 'search_document' // Optimizes for indexing
});

// HuggingFace embeddings (self-hostable)
const hfEmbeddings = new HuggingFaceInferenceEmbeddings({
  apiKey: process.env.HUGGINGFACE_API_KEY,
  model: 'sentence-transformers/all-MiniLM-L6-v2'
});

Text Chunking Strategy

Chunking parameters critically affect retrieval quality:

import { RAGToolkit } from '@kaibanjs/tools';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';

const embeddings = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY
});

const vectorStore = new MemoryVectorStore(embeddings);

const ragToolkit = new RAGToolkit({
  embeddings,
  vectorStore,
  chunkOptions: {
    chunkSize: 500, // Characters per chunk
    chunkOverlap: 100 // Overlap to preserve context boundaries
  },
  env: { OPENAI_API_KEY: process.env.OPENAI_API_KEY }
});

Chunking Trade-offs:

| Chunk Size | Pros | Cons | Best For |
|---|---|---|---|
| 200-400 | Precise retrieval, lower token costs | May miss broader context | Short, factual queries |
| 500-800 | Balanced context and precision | Moderate token usage | General-purpose RAG |
| 1000-2000 | Maximum context, better for complex topics | Higher token costs, potentially noisy | Technical documentation, research papers |

Overlap Considerations:

  • Low overlap (50-100): More distinct chunks, lower storage
  • High overlap (200-300): Better context preservation, higher redundancy
  • Rule of thumb: 10-20% of chunk size
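The rule of thumb can be captured in a tiny helper (hypothetical, not part of @kaibanjs/tools) that derives chunkOverlap from chunkSize:

```javascript
// Hypothetical helper: derive chunkOverlap from chunkSize using the
// 10-20% rule of thumb, defaulting to the midpoint of 15%.
function overlapForChunkSize(chunkSize, ratio = 0.15) {
  return Math.round(chunkSize * ratio);
}

// For the 500-character chunks configured above, this yields a
// 75-character overlap.
const chunkOptions = {
  chunkSize: 500,
  chunkOverlap: overlapForChunkSize(500)
};
```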

Document Indexing

const initializeVectorStore = async () => {
  const documents = sampleData.map(item => ({
    source: item.content,
    type: 'string',
    metadata: {
      id: item.id,
      name: item.name,
      category: item.category,
      price: item.price,
      specs: item.specs,
      inStock: item.inStock,
      // Augmented text for semantic search
      fullText: `${item.name} ${item.category} ${
        item.content
      } ${item.specs.join(' ')}`
    }
  }));

  await ragToolkit.addDocuments(documents);
  console.log('✅ Indexed', documents.length, 'documents');
};

await initializeVectorStore();

Metadata Design: Rich metadata enables hybrid search strategies:

  • Semantic search via embeddings (content similarity)
  • Filtered search via metadata (category, price, availability)
  • Combined approach for optimal precision

Phase 2: Retrieval with SimpleRAGRetrieve

Tool Configuration

import { SimpleRAGRetrieve } from '@kaibanjs/tools';

const productKnowledgeBaseTool = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: vectorStore, // Pre-indexed vector store
  embeddings: embeddings, // Must match indexing embeddings
  retrieverOptions: {
    k: 4, // Number of documents to retrieve
    searchType: 'similarity', // 'similarity' or 'mmr'
    scoreThreshold: 0.7, // Minimum similarity score (0-1)
    filter: undefined // Optional metadata filters
  }
});

Retrieval Strategy: Similarity vs. MMR

Similarity Search (Cosine Similarity):

retrieverOptions: {
  k: 4,
  searchType: 'similarity'
}

  • Returns top-k most similar documents
  • Fast, straightforward
  • May return redundant/similar results
  • Best for: Factual queries, specific information lookup

MMR (Maximal Marginal Relevance):

retrieverOptions: {
  k: 4,
  searchType: 'mmr',
  lambda: 0.5  // Balance between relevance (1.0) and diversity (0.0)
}

  • Balances relevance with diversity
  • Reduces redundancy in results
  • Slightly more computational overhead
  • Best for: Comparison queries, exploratory searches, broad topics

Mathematical Foundation:

Similarity search: score = cosine(query_embedding, doc_embedding)

MMR: MMR = argmax_{d ∈ R∖S} [λ · Sim(q,d) - (1-λ) · max_{d' ∈ S} Sim(d,d')], where R is the candidate set and S holds the documents already selected

  • λ = 1.0: Pure similarity (no diversity)
  • λ = 0.5: Balanced
  • λ = 0.0: Pure diversity (may sacrifice relevance)
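The selection loop behind that formula can be made concrete with a toy sketch over precomputed similarities (for illustration only; this is not the retriever's internal code):

```javascript
// Toy MMR selection: repeatedly pick the candidate maximizing
// λ·sim(query, doc) - (1-λ)·max over selected docs of sim(doc, selected).
// querySims[i] is sim(query, doc i); docSims[i][j] is sim(doc i, doc j).
function mmrSelect(querySims, docSims, k, lambda = 0.5) {
  const selected = [];
  const candidates = querySims.map((_, i) => i);
  while (selected.length < k && candidates.length > 0) {
    let best = -1;
    let bestScore = -Infinity;
    for (const i of candidates) {
      const redundancy = selected.length
        ? Math.max(...selected.map(j => docSims[i][j]))
        : 0;
      const score = lambda * querySims[i] - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        best = i;
      }
    }
    selected.push(best);
    candidates.splice(candidates.indexOf(best), 1);
  }
  return selected;
}
```

With λ = 1.0 this degenerates to plain top-k similarity ordering; lowering λ penalizes documents that resemble ones already selected, which is how near-duplicate results get pushed out.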

Score Thresholding

retrieverOptions: {
  k: 10,              // Consider top 10
  scoreThreshold: 0.7 // Only return docs with score ≥ 0.7
}

Benefits:

  • Filters low-quality matches
  • Prevents hallucination from irrelevant context
  • Adaptive result count based on query quality

Score interpretation:

  • 0.9-1.0: Very high relevance (near-exact matches)
  • 0.7-0.9: High relevance (topically aligned)
  • 0.5-0.7: Moderate relevance (may be too broad)
  • <0.5: Low relevance (likely noise)

Metadata Filtering

Combine semantic search with structured filters:

retrieverOptions: {
  k: 4,
  filter: {
    category: 'Laptop',
    inStock: true,
    price: { $lte: 2000 }  // Vector store dependent syntax
  }
}

Hybrid Search Pattern:

  1. Apply metadata filters (fast, deterministic)
  2. Perform semantic search on filtered subset
  3. Return top-k results

This dramatically improves precision for domain-specific queries.
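The pattern can be sketched as plain functions over an in-memory corpus (illustrative only; real vector stores apply the filter inside the store rather than in application code):

```javascript
// Illustrative hybrid search: filter on metadata first, then rank the
// survivors by a semantic score and keep the top k.
function hybridSearch(docs, matchesFilter, scoreFn, k) {
  return docs
    .filter(matchesFilter)                      // 1. deterministic metadata filter
    .map(doc => ({ doc, score: scoreFn(doc) })) // 2. semantic scoring
    .sort((a, b) => b.score - a.score)          // 3. rank and truncate
    .slice(0, k)
    .map(({ doc }) => doc);
}

// Hypothetical mini-catalog with precomputed similarity scores.
const catalog = [
  { id: 1, category: 'Laptop', inStock: true, price: 1499, score: 0.82 },
  { id: 2, category: 'Laptop', inStock: false, price: 1299, score: 0.9 },
  { id: 3, category: 'Monitor', inStock: true, price: 399, score: 0.95 }
];

// In-stock laptops under $2000, ranked by semantic score.
const hits = hybridSearch(
  catalog,
  d => d.category === 'Laptop' && d.inStock && d.price <= 2000,
  d => d.score,
  4
);
```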


Multi-Agent Architecture

Agent Definition

import { Agent, Task, Team } from 'kaibanjs';

const productSpecialist = new Agent({
  name: 'Product Specialist',
  role: 'Technology Product Expert',
  goal: 'Help customers find the right products by searching our knowledge base',
  background:
    'Expert in technology products with deep knowledge of specifications',
  tools: [productKnowledgeBaseTool] // RAG tool integration
});

Task Decomposition

Breaking complex queries into sequential tasks improves response quality:

// Task 1: Information Retrieval
const searchProductTask = new Task({
  description: `Search our product knowledge base to answer: {customerQuery}
  
  Focus on finding accurate product information including specifications, features, prices, and availability.`,
  expectedOutput:
    'Detailed product information that directly addresses the customer query',
  agent: productSpecialist
});

// Task 2: Analysis and Recommendation
const recommendationTask = new Task({
  description: `Based on the product information found, provide a helpful recommendation.
  
  Customer's question: {customerQuery}
  
  If comparing products, highlight key differences. If seeking recommendations, suggest the best option based on their needs.`,
  expectedOutput:
    'A clear recommendation that helps the customer make an informed decision',
  agent: productSpecialist
});

Task Design Rationale:

  • Task 1 focuses on retrieval accuracy (RAG-heavy)
  • Task 2 focuses on reasoning and synthesis (LLM-heavy)
  • Sequential execution ensures grounding before reasoning
  • Each task has clear success criteria

Team Orchestration

const team = new Team({
  name: 'Product Support Team',
  agents: [productSpecialist],
  tasks: [searchProductTask, recommendationTask],
  inputs: {
    customerQuery:
      'I need a laptop for video editing and gaming. What do you recommend?'
  },
  env: {
    OPENAI_API_KEY: process.env.OPENAI_API_KEY
  }
});

const result = await team.start();

Production Considerations

Vector Store Selection

MemoryVectorStore (Development):

import { MemoryVectorStore } from 'langchain/vectorstores/memory';
const vectorStore = new MemoryVectorStore(embeddings);

  • ✅ Zero setup, fast for prototyping
  • ❌ Not persistent, RAM-limited
  • Use case: Development, testing, small datasets (<10K docs)

Pinecone (Production - Managed):

import { PineconeStore } from '@langchain/pinecone';
import { Pinecone } from '@pinecone-database/pinecone';

const pinecone = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY
});

const pineconeIndex = pinecone.Index('products-index');
const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex,
  namespace: 'products' // Multi-tenancy support
});

const retriever = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: vectorStore,
  embeddings: embeddings,
  retrieverOptions: {
    k: 4,
    searchType: 'similarity',
    filter: { namespace: 'products' }
  }
});

  • ✅ Fully managed, scales to billions of vectors
  • ✅ Low latency (<100ms p95)
  • ✅ Metadata filtering, namespaces
  • ❌ Cost scales with vector count
  • Use case: Production applications, large-scale deployments

Supabase (Production - Open Source):

import { SupabaseVectorStore } from '@langchain/community/vectorstores/supabase';
import { createClient } from '@supabase/supabase-js';

const supabaseClient = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_PRIVATE_KEY
);

const vectorStore = await SupabaseVectorStore.fromExistingIndex(embeddings, {
  client: supabaseClient,
  tableName: 'documents',
  queryName: 'match_documents'
});

  • ✅ Self-hostable, PostgreSQL + pgvector
  • ✅ Integrated with auth, storage, real-time
  • ✅ Cost-effective for moderate scale
  • ❌ Requires PostgreSQL management
  • Use case: Full-stack applications, self-hosted deployments

Chroma (Development/Local Production):

import { Chroma } from '@langchain/community/vectorstores/chroma';

const vectorStore = await Chroma.fromExistingCollection(embeddings, {
  collectionName: 'products',
  url: process.env.CHROMA_URL || 'http://localhost:8000'
});

  • ✅ Lightweight, easy to run locally
  • ✅ Good for development clusters
  • ❌ Less mature than alternatives
  • Use case: Local development, on-premise deployments

Embedding Model Selection Matrix

| Model | Dimensions | Speed | Cost | Quality | Best For |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Fast | $0.02/1M tokens | High | General-purpose, production |
| OpenAI text-embedding-3-large | 3072 | Medium | $0.13/1M tokens | Highest | Quality-critical applications |
| Cohere embed-english-v3.0 | 1024 | Fast | $0.10/1M tokens | High | English-only, good retrieval |
| HuggingFace all-MiniLM-L6-v2 | 384 | Very Fast | Free* | Good | Self-hosted, cost-sensitive |
| HuggingFace all-mpnet-base-v2 | 768 | Fast | Free* | Better | Self-hosted, quality balance |

*Free if self-hosted; Inference API has rate limits

Performance Optimization

1. Embedding Caching:

// Cache query embeddings for common queries
const queryCache = new Map();

async function getCachedEmbedding(text) {
  if (!queryCache.has(text)) {
    const embedding = await embeddings.embedQuery(text);
    queryCache.set(text, embedding);
  }
  return queryCache.get(text);
}

2. Batch Operations:

// Index in batches for better throughput
const BATCH_SIZE = 100;
for (let i = 0; i < documents.length; i += BATCH_SIZE) {
  const batch = documents.slice(i, i + BATCH_SIZE);
  await ragToolkit.addDocuments(batch);
}

3. Retrieval Tuning:

// Adjust k based on query complexity
const adaptiveK = (query) => {
  const tokens = query.split(' ').length;
  if (tokens < 5) return 3;      // Simple query
  if (tokens < 15) return 5;     // Medium query
  return 8;                       // Complex query
};

retrieverOptions: {
  k: adaptiveK(customerQuery),
  scoreThreshold: 0.75
}

Advanced Patterns

Multi-Vector Store RAG

Use different vector stores for different knowledge domains:

const productRAG = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: productVectorStore,
  embeddings: embeddings
});

const documentationRAG = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: docsVectorStore,
  embeddings: embeddings
});

const reviewsRAG = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: reviewsVectorStore,
  embeddings: embeddings
});

// Specialized agents for each domain
const productAgent = new Agent({
  name: 'Product Expert',
  tools: [productRAG]
});

const supportAgent = new Agent({
  name: 'Support Specialist',
  tools: [documentationRAG]
});

const sentimentAgent = new Agent({
  name: 'Review Analyst',
  tools: [reviewsRAG]
});

Reranking for Improved Precision

// After initial retrieval, rerank with a cross-encoder
import { CohereRerank } from '@langchain/cohere';

const reranker = new CohereRerank({
  apiKey: process.env.COHERE_API_KEY,
  model: 'rerank-english-v2.0',
  topN: 3 // Return top 3 after reranking
});

// In your retrieval pipeline:
// 1. Retrieve k=10 candidates with SimpleRAGRetrieve
// 2. Rerank to top 3 most relevant
// 3. Use reranked results for LLM context
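The three steps can be sketched as a generic two-stage function; the names and result shape here are illustrative, so wire in SimpleRAGRetrieve and CohereRerank per their actual APIs:

```javascript
// Two-stage pipeline sketch: a fast, wide bi-encoder retrieval followed by
// a more accurate cross-encoder rerank. `retrieve` and `rerank` are injected
// so the sketch stays independent of any particular retriever or reranker.
async function retrieveAndRerank(query, retrieve, rerank, { k = 10, topN = 3 } = {}) {
  const candidates = await retrieve(query, k);    // stage 1: recall-oriented
  const scored = await rerank(query, candidates); // stage 2: precision-oriented
  return scored
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, topN);
}
```

Because the wide retrieval only has to surface plausible candidates, k can be generous (10-20) without hurting final precision; the reranker does the fine-grained ordering.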

Benefits:

  • Initial retrieval: Fast, bi-encoder (embedding similarity)
  • Reranking: More accurate, cross-encoder (query-doc interaction)
  • Best of both worlds: Speed + accuracy

Evaluation and Monitoring

Retrieval Quality Metrics

1. Precision@k:

// Percentage of retrieved docs that are relevant
function precisionAtK(retrievedDocs, relevantDocs, k) {
  const topK = retrievedDocs.slice(0, k);
  const relevant = topK.filter(doc => relevantDocs.includes(doc.id));
  return relevant.length / k;
}

2. Recall@k:

// Percentage of relevant docs that were retrieved
function recallAtK(retrievedDocs, relevantDocs, k) {
  const topK = retrievedDocs.slice(0, k);
  const relevant = topK.filter(doc => relevantDocs.includes(doc.id));
  return relevant.length / relevantDocs.length;
}

3. MRR (Mean Reciprocal Rank):

// Position of first relevant result
function mrr(retrievedDocs, relevantDocs) {
  const firstRelevantIndex = retrievedDocs.findIndex(doc =>
    relevantDocs.includes(doc.id)
  );
  return firstRelevantIndex >= 0 ? 1 / (firstRelevantIndex + 1) : 0;
}
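These per-query metrics are most useful rolled up over a labeled evaluation set. The sketch below averages all three, operating on document ids rather than full objects for brevity (the metric functions are repeated so the snippet is self-contained):

```javascript
// Compact id-based versions of the metrics above.
function precisionAtK(retrieved, relevant, k) {
  return retrieved.slice(0, k).filter(id => relevant.includes(id)).length / k;
}

function recallAtK(retrieved, relevant, k) {
  return retrieved.slice(0, k).filter(id => relevant.includes(id)).length / relevant.length;
}

function mrr(retrieved, relevant) {
  const i = retrieved.findIndex(id => relevant.includes(id));
  return i >= 0 ? 1 / (i + 1) : 0;
}

// `runs` is an array of { retrieved: [docId], relevant: [docId] } pairs
// from a labeled query set; returns the mean of each metric.
function evaluateRetrieval(runs, k) {
  const n = runs.length;
  return {
    precision: runs.reduce((s, r) => s + precisionAtK(r.retrieved, r.relevant, k), 0) / n,
    recall: runs.reduce((s, r) => s + recallAtK(r.retrieved, r.relevant, k), 0) / n,
    mrr: runs.reduce((s, r) => s + mrr(r.retrieved, r.relevant), 0) / n
  };
}
```

Even a few dozen hand-labeled queries are enough to detect regressions when you change chunking, embeddings, or retrieval parameters.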

Production Monitoring

// Log retrieval metrics
const logRetrieval = (query, results, latency) => {
  console.log({
    timestamp: new Date().toISOString(),
    query,
    numResults: results.length,
    avgScore: results.length
      ? results.reduce((sum, r) => sum + r.score, 0) / results.length
      : null, // avoid NaN on empty result sets
    latencyMs: latency,
    hasResults: results.length > 0
  });
};

// Flag failed retrievals (empty results or uniformly low scores)
const checkRetrievalQuality = (query, results) => {
  if (results.length === 0 || results.every(r => r.score < 0.6)) {
    console.warn('Low-quality retrieval detected:', query);
    // Trigger alert or fallback behavior
  }
};

Comparison: SimpleRAGRetrieve vs. Alternative Approaches

| Aspect | SimpleRAGRetrieve | LangChain RetrievalQA | Custom RAG Implementation |
|---|---|---|---|
| Indexing/Retrieval Separation | ✅ Separate | ❌ Combined | ⚙️ Your choice |
| LangChain.js Compatibility | ✅ Full | ✅ Native | ⚙️ Manual integration |
| Agent Integration | ✅ Tool-based | ❌ Chain-based | ⚙️ Custom |
| Multi-vector store | ✅ Easy | ⚙️ Requires multiple chains | ⚙️ Custom |
| Configuration Simplicity | ✅ High | ⚙️ Medium | ❌ Low |
| Flexibility | ⚙️ Medium | ⚙️ Medium | ✅ Maximum |
| Production-Ready | ✅ Yes | ✅ Yes | ⚙️ Depends |

When to use SimpleRAGRetrieve:

  • ✅ Building agent-based systems
  • ✅ Pre-indexed vector stores
  • ✅ Need LangChain.js ecosystem compatibility
  • ✅ Want separation between indexing and retrieval

When to use alternatives:

  • ❌ Need custom retrieval logic beyond standard similarity/MMR
  • ❌ Building chain-based (non-agent) applications
  • ❌ Require features not exposed by SimpleRAGRetrieve API

Conclusion

SimpleRAGRetrieve provides a production-ready abstraction for retrieval-augmented generation in agent-based systems. By focusing exclusively on the retrieval phase and leveraging the LangChain.js ecosystem, it enables:

  1. Architectural clarity: Clear separation between indexing and retrieval
  2. Flexibility: Compatible with any LangChain.js embeddings and vector stores
  3. Agent integration: First-class support for multi-agent workflows
  4. Production readiness: Configurable retrieval strategies and vector store options

The demonstrated implementation showcases these capabilities through a product knowledge base, but the patterns extend to any domain requiring grounded LLM responses: documentation search, customer support, research assistants, and more.

Key Takeaways:

  • Use RAGToolkit for indexing, SimpleRAGRetrieve for retrieval
  • Match embedding models between indexing and retrieval phases
  • Tune chunking and retrieval parameters for your domain
  • Consider MMR for diverse results, similarity for precision
  • Monitor retrieval quality in production

Tags: #rag #retrieval-augmented-generation #vector-databases #embeddings #ai-agents #langchain #javascript #kaibanjs #semantic-search #nlp

Author's Note: This implementation was tested with Node.js 18+ and the package versions specified in the dependencies. Feedback and contributions welcome on GitHub.
