Implementing Production-Grade RAG with Multi-Agent Systems in JavaScript
A technical deep-dive into building retrieval-augmented generation systems using KaibanJS, LangChain.js, and vector databases
Abstract
This article presents a practical implementation of Retrieval-Augmented Generation (RAG) using JavaScript-based AI agent frameworks. We explore the architectural separation between indexing and retrieval phases, demonstrate embedding flexibility across multiple providers (OpenAI, Cohere, HuggingFace), and examine configurable retrieval strategies. The implementation leverages KaibanJS's SimpleRAGRetrieve tool with LangChain.js ecosystem integration for vector store operations.
Key Topics:
- RAG architecture with separated indexing/retrieval phases
- Vector store operations and embedding strategies
- Retrieval configuration: similarity search vs. MMR
- Multi-agent task decomposition
- Production deployment considerations
Background: RAG in Agent-Based Systems
Retrieval-Augmented Generation has become a cornerstone technique for grounding LLM responses in factual data. Traditional RAG implementations often conflate indexing and retrieval logic, leading to monolithic architectures that are difficult to scale and maintain.
Modern agent-based systems benefit from tool-oriented RAG, where retrieval capabilities are encapsulated as discrete tools that agents can invoke. This approach offers:
- Composability: Agents can combine RAG with other tools (web search, APIs, calculations)
- Specialization: Different agents can access different knowledge bases
- Scalability: Indexing and retrieval can be scaled independently
- Flexibility: Vector stores can be swapped without changing agent logic
Architecture Overview
Two-Phase RAG Pipeline
Our implementation separates RAG into distinct phases:
PHASE 1: INDEXING (offline / ETL process)
─────────────────────────────────────────
Raw Documents
      ↓
Text Splitting (RecursiveCharacterTextSplitter)
      ↓
Embedding Generation (OpenAI/Cohere/HuggingFace)
      ↓
Vector Store Persistence (Pinecone/Supabase/Chroma)

PHASE 2: RETRIEVAL (runtime / query time)
─────────────────────────────────────────
User Query
      ↓
Query Embedding (same model as indexing)
      ↓
Vector Similarity Search (cosine/euclidean)
      ↓
Retrieved Contexts (top-k documents)
      ↓
LLM Generation with Context
      ↓
Grounded Response
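The "Vector Similarity Search" step above reduces to comparing the query embedding against every stored document embedding and keeping the closest matches. A minimal sketch of cosine-similarity top-k ranking in plain JavaScript (the helper names are illustrative, not part of KaibanJS or LangChain.js):

```javascript
// Cosine similarity between two equal-length embedding vectors
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored documents by similarity to the query embedding
function topK(queryEmbedding, docs, k) {
  return docs
    .map(doc => ({ ...doc, score: cosineSimilarity(queryEmbedding, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Production vector stores implement the same comparison behind approximate nearest-neighbour indexes so it scales past a linear scan.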
Key Design Decision: SimpleRAGRetrieve operates exclusively in Phase 2, assuming a pre-populated vector store. This architectural choice enables:
- Lightweight retrieval services
- Independent scaling of indexing infrastructure
- Shared knowledge bases across multiple applications
- Simplified deployment and testing
Implementation
Environment Setup
npm install kaibanjs @kaibanjs/tools @langchain/openai @langchain/community langchain
For alternative embedding providers:
# Cohere
npm install @langchain/cohere
# HuggingFace Inference
npm install @langchain/community
# Anthropic
npm install @langchain/anthropic
Phase 1: Vector Store Indexing
Document Preparation
For this demonstration, we use a product catalog with structured metadata. In production, this data would come from databases, APIs, or document stores.
const sampleData = [
  {
    id: 1,
    name: 'UltraBook Pro 15',
    category: 'Laptop',
    content:
      'The UltraBook Pro 15 is a premium laptop featuring a 15.6-inch 4K display, Intel i9 processor, 32GB RAM, and 1TB NVMe SSD...',
    price: 2499,
    specs: ['Intel i9', '32GB RAM', '1TB SSD', '4K Display'],
    inStock: true
  }
  // ... additional products
];
Embedding Configuration
The choice of embedding model significantly impacts retrieval quality. Consider:
- OpenAI text-embedding-3-small: 1536 dimensions, cost-effective, strong general performance
- OpenAI text-embedding-3-large: 3072 dimensions, highest quality, higher cost
- Cohere embed-english-v3.0: 1024 dimensions, optimized for English, good retrieval performance
- HuggingFace all-MiniLM-L6-v2: 384 dimensions, lightweight, self-hostable
import { OpenAIEmbeddings } from '@langchain/openai';
import { CohereEmbeddings } from '@langchain/cohere';
import { HuggingFaceInferenceEmbeddings } from '@langchain/community/embeddings/hf';

// OpenAI embeddings (default)
const openaiEmbeddings = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY,
  modelName: 'text-embedding-3-small',
  dimensions: 1536 // Can be reduced for lower dimensionality
});

// Cohere embeddings
const cohereEmbeddings = new CohereEmbeddings({
  apiKey: process.env.COHERE_API_KEY,
  model: 'embed-english-v3.0',
  inputType: 'search_document' // Optimizes for indexing
});

// HuggingFace embeddings (self-hostable)
const hfEmbeddings = new HuggingFaceInferenceEmbeddings({
  apiKey: process.env.HUGGINGFACE_API_KEY,
  model: 'sentence-transformers/all-MiniLM-L6-v2'
});
Text Chunking Strategy
Chunking parameters critically affect retrieval quality:
import { RAGToolkit } from '@kaibanjs/tools';
import { OpenAIEmbeddings } from '@langchain/openai';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';

const embeddings = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY
});

const vectorStore = new MemoryVectorStore(embeddings);

const ragToolkit = new RAGToolkit({
  embeddings,
  vectorStore,
  chunkOptions: {
    chunkSize: 500, // Characters per chunk
    chunkOverlap: 100 // Overlap to preserve context across boundaries
  },
  env: { OPENAI_API_KEY: process.env.OPENAI_API_KEY }
});
Chunking Trade-offs:
| Chunk Size | Pros | Cons | Best For |
|---|---|---|---|
| 200-400 | Precise retrieval, lower token costs | May miss broader context | Short, factual queries |
| 500-800 | Balanced context and precision | Moderate token usage | General-purpose RAG |
| 1000-2000 | Maximum context, better for complex topics | Higher token costs, potentially noisy | Technical documentation, research papers |
Overlap Considerations:
- Low overlap (50-100): More distinct chunks, lower storage
- High overlap (200-300): Better context preservation, higher redundancy
- Rule of thumb: 10-20% of chunk size
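To make the chunkSize/chunkOverlap interaction concrete, here is a minimal character-based splitter in plain JavaScript. It is a simplification of what RecursiveCharacterTextSplitter does (it ignores separator-aware splitting and just slides a fixed window):

```javascript
// Split text into fixed-size chunks, with each chunk overlapping its
// predecessor by chunkOverlap characters
function chunkText(text, chunkSize, chunkOverlap) {
  const step = chunkSize - chunkOverlap; // How far the window advances each time
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // Final chunk reached the end
  }
  return chunks;
}
```

With chunkSize 500 and chunkOverlap 100 (as configured above), each new chunk repeats the last 100 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.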
Document Indexing
const initializeVectorStore = async () => {
  const documents = sampleData.map(item => ({
    source: item.content,
    type: 'string',
    metadata: {
      id: item.id,
      name: item.name,
      category: item.category,
      price: item.price,
      specs: item.specs,
      inStock: item.inStock,
      // Augmented text for semantic search
      fullText: `${item.name} ${item.category} ${item.content} ${item.specs.join(' ')}`
    }
  }));

  await ragToolkit.addDocuments(documents);
  console.log('✅ Indexed', documents.length, 'documents');
};

await initializeVectorStore();
Metadata Design: Rich metadata enables hybrid search strategies:
- Semantic search via embeddings (content similarity)
- Filtered search via metadata (category, price, availability)
- Combined approach for optimal precision
Phase 2: Retrieval with SimpleRAGRetrieve
Tool Configuration
import { SimpleRAGRetrieve } from '@kaibanjs/tools';

const productKnowledgeBaseTool = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: vectorStore, // Pre-indexed vector store
  embeddings: embeddings, // Must match the indexing embeddings
  retrieverOptions: {
    k: 4, // Number of documents to retrieve
    searchType: 'similarity', // 'similarity' or 'mmr'
    scoreThreshold: 0.7, // Minimum similarity score (0-1)
    filter: undefined // Optional metadata filters
  }
});
Retrieval Strategy: Similarity vs. MMR
Similarity Search (Cosine Similarity):
retrieverOptions: {
  k: 4,
  searchType: 'similarity'
}
- Returns top-k most similar documents
- Fast, straightforward
- May return redundant/similar results
- Best for: Factual queries, specific information lookup
MMR (Maximal Marginal Relevance):
retrieverOptions: {
  k: 4,
  searchType: 'mmr',
  lambda: 0.5 // Balance between relevance (1.0) and diversity (0.0)
}
- Balances relevance with diversity
- Reduces redundancy in results
- Slightly more computational overhead
- Best for: Comparison queries, exploratory searches, broad topics
Mathematical Foundation:
Similarity search: score = cosine(query_embedding, doc_embedding)
MMR: MMR = argmax over unselected candidates d of [λ · Similarity(q, d) - (1 - λ) · max over already-selected d' of Similarity(d, d')]
- λ = 1.0: Pure similarity (no diversity)
- λ = 0.5: Balanced
- λ = 0.0: Pure diversity (may sacrifice relevance)
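The MMR argmax is applied greedily: at each step the candidate chosen is the one that best trades off relevance to the query against redundancy with documents already picked. A sketch over precomputed similarities (function and parameter names are illustrative, not a library API):

```javascript
// Greedy MMR selection.
// simToQuery[i]   : similarity of candidate i to the query
// simBetween(i, j): similarity between candidates i and j
function mmrSelect(simToQuery, simBetween, k, lambda) {
  const selected = [];
  const remaining = simToQuery.map((_, i) => i);
  while (selected.length < k && remaining.length > 0) {
    let bestPos = 0, bestScore = -Infinity;
    for (let r = 0; r < remaining.length; r++) {
      const i = remaining[r];
      // Redundancy penalty: similarity to the closest already-selected doc
      const redundancy = selected.length
        ? Math.max(...selected.map(j => simBetween(i, j)))
        : 0;
      const score = lambda * simToQuery[i] - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; bestPos = r; }
    }
    selected.push(remaining.splice(bestPos, 1)[0]);
  }
  return selected;
}
```

With λ = 1.0 this degenerates to plain similarity ranking; lowering λ lets a slightly less relevant but more distinct document displace a near-duplicate of one already selected.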
Score Thresholding
retrieverOptions: {
  k: 10, // Consider the top 10 candidates
  scoreThreshold: 0.7 // Only return docs with score ≥ 0.7
}
Benefits:
- Filters low-quality matches
- Prevents hallucination from irrelevant context
- Adaptive result count based on query quality
Score interpretation:
- 0.9-1.0: Very high relevance (near-exact matches)
- 0.7-0.9: High relevance (topically aligned)
- 0.5-0.7: Moderate relevance (may be too broad)
- <0.5: Low relevance (likely noise)
Metadata Filtering
Combine semantic search with structured filters:
retrieverOptions: {
  k: 4,
  filter: {
    category: 'Laptop',
    inStock: true,
    price: { $lte: 2000 } // Filter syntax depends on the vector store
  }
}
Hybrid Search Pattern:
1. Apply metadata filters (fast, deterministic)
2. Perform semantic search on the filtered subset
3. Return the top-k results
This dramatically improves precision for domain-specific queries.
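The filter-then-rank pattern can be sketched in plain JavaScript. To keep the example self-contained, documents carry a precomputed similarity score; in practice that score comes from the embedding comparison, and managed stores apply the metadata filter inside the index rather than in application code:

```javascript
// Hybrid search sketch: metadata filter first, then rank the survivors by
// semantic score and keep the top k
function hybridSearch(docs, metadataFilter, k) {
  return docs
    .filter(doc =>
      Object.entries(metadataFilter).every(
        ([key, value]) => doc.metadata[key] === value
      )
    )
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Because the deterministic filter runs first, a highly similar but out-of-stock or over-budget product can never crowd a valid result out of the top k.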
Multi-Agent Architecture
Agent Definition
import { Agent, Task, Team } from 'kaibanjs';

const productSpecialist = new Agent({
  name: 'Product Specialist',
  role: 'Technology Product Expert',
  goal: 'Help customers find the right products by searching our knowledge base',
  background:
    'Expert in technology products with deep knowledge of specifications',
  tools: [productKnowledgeBaseTool] // RAG tool integration
});
Task Decomposition
Breaking complex queries into sequential tasks improves response quality:
// Task 1: Information Retrieval
const searchProductTask = new Task({
  description: `Search our product knowledge base to answer: {customerQuery}
Focus on finding accurate product information including specifications, features, prices, and availability.`,
  expectedOutput:
    'Detailed product information that directly addresses the customer query',
  agent: productSpecialist
});

// Task 2: Analysis and Recommendation
const recommendationTask = new Task({
  description: `Based on the product information found, provide a helpful recommendation.
Customer's question: {customerQuery}
If comparing products, highlight key differences. If seeking recommendations, suggest the best option based on their needs.`,
  expectedOutput:
    'A clear recommendation that helps the customer make an informed decision',
  agent: productSpecialist
});
Task Design Rationale:
- Task 1 focuses on retrieval accuracy (RAG-heavy)
- Task 2 focuses on reasoning and synthesis (LLM-heavy)
- Sequential execution ensures grounding before reasoning
- Each task has clear success criteria
Team Orchestration
const team = new Team({
  name: 'Product Support Team',
  agents: [productSpecialist],
  tasks: [searchProductTask, recommendationTask],
  inputs: {
    customerQuery:
      'I need a laptop for video editing and gaming. What do you recommend?'
  },
  env: {
    OPENAI_API_KEY: process.env.OPENAI_API_KEY
  }
});

const result = await team.start();
Production Considerations
Vector Store Selection
MemoryVectorStore (Development):
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
const vectorStore = new MemoryVectorStore(embeddings);
- ✅ Zero setup, fast for prototyping
- ❌ Not persistent, RAM-limited
- Use case: Development, testing, small datasets (<10K docs)
Pinecone (Production - Managed):
import { PineconeStore } from '@langchain/pinecone';
import { Pinecone } from '@pinecone-database/pinecone';

const pinecone = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY
});

const pineconeIndex = pinecone.Index('products-index');

const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex,
  namespace: 'products' // Multi-tenancy support
});

const retriever = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: vectorStore,
  embeddings: embeddings,
  retrieverOptions: {
    k: 4,
    searchType: 'similarity',
    filter: { namespace: 'products' }
  }
});
- ✅ Fully managed, scales to billions of vectors
- ✅ Low latency (<100ms p95)
- ✅ Metadata filtering, namespaces
- ❌ Cost scales with vector count
- Use case: Production applications, large-scale deployments
Supabase (Production - Open Source):
import { SupabaseVectorStore } from '@langchain/community/vectorstores/supabase';
import { createClient } from '@supabase/supabase-js';

const supabaseClient = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_PRIVATE_KEY
);

const vectorStore = await SupabaseVectorStore.fromExistingIndex(embeddings, {
  client: supabaseClient,
  tableName: 'documents',
  queryName: 'match_documents'
});
- ✅ Self-hostable, PostgreSQL + pgvector
- ✅ Integrated with auth, storage, real-time
- ✅ Cost-effective for moderate scale
- ❌ Requires PostgreSQL management
- Use case: Full-stack applications, self-hosted deployments
Chroma (Development/Local Production):
import { Chroma } from '@langchain/community/vectorstores/chroma';

const vectorStore = await Chroma.fromExistingCollection(embeddings, {
  collectionName: 'products',
  url: process.env.CHROMA_URL || 'http://localhost:8000'
});
- ✅ Lightweight, easy to run locally
- ✅ Simple setup for development environments
- ❌ Less mature than alternatives
- Use case: Local development, on-premise deployments
Embedding Model Selection Matrix
| Model | Dimensions | Speed | Cost | Quality | Best For |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Fast | $0.02/1M tokens | High | General-purpose, production |
| OpenAI text-embedding-3-large | 3072 | Medium | $0.13/1M tokens | Highest | Quality-critical applications |
| Cohere embed-english-v3.0 | 1024 | Fast | $0.10/1M tokens | High | English-only, good retrieval |
| HuggingFace all-MiniLM-L6-v2 | 384 | Very Fast | Free* | Good | Self-hosted, cost-sensitive |
| HuggingFace all-mpnet-base-v2 | 768 | Fast | Free* | Better | Self-hosted, quality balance |
*Free if self-hosted; Inference API has rate limits
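Using the prices in the table above, indexing cost is simple arithmetic: total tokens divided by one million, times the per-million rate. A quick back-of-envelope helper (corpus size and average token count below are illustrative):

```javascript
// Estimate embedding cost: (total tokens / 1M) * price per 1M tokens
function embeddingCostUSD(docCount, avgTokensPerDoc, pricePerMillionTokens) {
  const totalTokens = docCount * avgTokensPerDoc;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// 100K docs averaging 500 tokens with text-embedding-3-small ($0.02/1M)
console.log(embeddingCostUSD(100_000, 500, 0.02).toFixed(2)); // "1.00"
```

The same corpus with text-embedding-3-large ($0.13/1M) costs 6.5x more, which is why the smaller model is the usual default unless retrieval quality demonstrably suffers.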
Performance Optimization
1. Embedding Caching:
// Cache query embeddings for common queries
const queryCache = new Map();

async function getCachedEmbedding(text) {
  if (!queryCache.has(text)) {
    const embedding = await embeddings.embedQuery(text);
    queryCache.set(text, embedding);
  }
  return queryCache.get(text);
}
2. Batch Operations:
// Index in batches for better throughput
const BATCH_SIZE = 100;
for (let i = 0; i < documents.length; i += BATCH_SIZE) {
  const batch = documents.slice(i, i + BATCH_SIZE);
  await ragToolkit.addDocuments(batch);
}
3. Retrieval Tuning:
// Adjust k based on query complexity
const adaptiveK = (query) => {
  const tokens = query.split(' ').length;
  if (tokens < 5) return 3; // Simple query
  if (tokens < 15) return 5; // Medium query
  return 8; // Complex query
};

retrieverOptions: {
  k: adaptiveK(customerQuery),
  scoreThreshold: 0.75
}
Advanced Patterns
Multi-Vector Store RAG
Use different vector stores for different knowledge domains:
const productRAG = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: productVectorStore,
  embeddings: embeddings
});

const documentationRAG = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: docsVectorStore,
  embeddings: embeddings
});

const reviewsRAG = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: reviewsVectorStore,
  embeddings: embeddings
});

// Specialized agents for each domain
const productAgent = new Agent({
  name: 'Product Expert',
  tools: [productRAG]
});

const supportAgent = new Agent({
  name: 'Support Specialist',
  tools: [documentationRAG]
});

const sentimentAgent = new Agent({
  name: 'Review Analyst',
  tools: [reviewsRAG]
});
Reranking for Improved Precision
// After initial retrieval, rerank with a cross-encoder
import { CohereRerank } from '@langchain/cohere';

const reranker = new CohereRerank({
  apiKey: process.env.COHERE_API_KEY,
  model: 'rerank-english-v2.0',
  topN: 3 // Return top 3 after reranking
});

// In your retrieval pipeline:
// 1. Retrieve k=10 candidates with SimpleRAGRetrieve
// 2. Rerank to the top 3 most relevant
// 3. Use the reranked results for LLM context
Benefits:
- Initial retrieval: Fast, bi-encoder (embedding similarity)
- Reranking: More accurate, cross-encoder (query-doc interaction)
- Best of both worlds: Speed + accuracy
Evaluation and Monitoring
Retrieval Quality Metrics
1. Precision@k:
// Percentage of retrieved docs that are relevant
function precisionAtK(retrievedDocs, relevantDocs, k) {
  const topK = retrievedDocs.slice(0, k);
  const relevant = topK.filter(doc => relevantDocs.includes(doc.id));
  return relevant.length / k;
}
2. Recall@k:
// Percentage of relevant docs that were retrieved
function recallAtK(retrievedDocs, relevantDocs, k) {
  const topK = retrievedDocs.slice(0, k);
  const relevant = topK.filter(doc => relevantDocs.includes(doc.id));
  return relevant.length / relevantDocs.length;
}
3. MRR (Mean Reciprocal Rank):
// Reciprocal rank of the first relevant result
function mrr(retrievedDocs, relevantDocs) {
  const firstRelevantIndex = retrievedDocs.findIndex(doc =>
    relevantDocs.includes(doc.id)
  );
  return firstRelevantIndex >= 0 ? 1 / (firstRelevantIndex + 1) : 0;
}
Production Monitoring
// Log retrieval metrics and flag low-quality retrievals
const logRetrieval = (query, results, latency) => {
  console.log({
    timestamp: new Date().toISOString(),
    query,
    numResults: results.length,
    avgScore:
      results.reduce((sum, r) => sum + r.score, 0) /
      Math.max(results.length, 1), // Guard against empty result sets
    latencyMs: latency,
    hasResults: results.length > 0
  });

  // Track failed retrievals (low scores)
  if (results.every(r => r.score < 0.6)) {
    console.warn('Low-quality retrieval detected:', query);
    // Trigger alert or fallback behavior
  }
};
Comparison: SimpleRAGRetrieve vs. Alternative Approaches
| Aspect | SimpleRAGRetrieve | LangChain RetrievalQA | Custom RAG Implementation |
|---|---|---|---|
| Indexing/Retrieval Separation | ✅ Separate | ❌ Combined | ⚙️ Your choice |
| LangChain.js Compatibility | ✅ Full | ✅ Native | ⚙️ Manual integration |
| Agent Integration | ✅ Tool-based | ❌ Chain-based | ⚙️ Custom |
| Multi-vector store | ✅ Easy | ⚙️ Requires multiple chains | ⚙️ Custom |
| Configuration Simplicity | ✅ High | ⚙️ Medium | ❌ Low |
| Flexibility | ⚙️ Medium | ⚙️ Medium | ✅ Maximum |
| Production-Ready | ✅ Yes | ✅ Yes | ⚙️ Depends |
When to use SimpleRAGRetrieve:
- ✅ Building agent-based systems
- ✅ Pre-indexed vector stores
- ✅ Need LangChain.js ecosystem compatibility
- ✅ Want separation between indexing and retrieval
When to use alternatives:
- ❌ Need custom retrieval logic beyond standard similarity/MMR
- ❌ Building chain-based (non-agent) applications
- ❌ Require features not exposed by SimpleRAGRetrieve API
Conclusion
SimpleRAGRetrieve provides a production-ready abstraction for retrieval-augmented generation in agent-based systems. By focusing exclusively on the retrieval phase and leveraging the LangChain.js ecosystem, it enables:
- Architectural clarity: Clear separation between indexing and retrieval
- Flexibility: Compatible with any LangChain.js embeddings and vector stores
- Agent integration: First-class support for multi-agent workflows
- Production readiness: Configurable retrieval strategies and vector store options
The demonstrated implementation showcases these capabilities through a product knowledge base, but the patterns extend to any domain requiring grounded LLM responses: documentation search, customer support, research assistants, and more.
Key Takeaways:
- Use RAGToolkit for indexing, SimpleRAGRetrieve for retrieval
- Match embedding models between indexing and retrieval phases
- Tune chunking and retrieval parameters for your domain
- Consider MMR for diverse results, similarity for precision
- Monitor retrieval quality in production
Resources
- 📚 SimpleRAGRetrieve Documentation
- 🚀 KaibanJS Framework
- 🔗 LangChain.js Integration Guide
- 💾 Vector Store Comparison
- 🧪 Full Code Example
Tags: #rag #retrieval-augmented-generation #vector-databases #embeddings #ai-agents #langchain #javascript #kaibanjs #semantic-search #nlp
Author's Note: This implementation was tested with Node.js 18+ and the package versions specified in the dependencies. Feedback and contributions welcome on GitHub.