Implementing Production-Grade RAG with Multi-Agent Systems in JavaScript
A technical deep-dive into building retrieval-augmented generation systems using KaibanJS, LangChain.js, and vector databases
Abstract
This article presents a practical implementation of Retrieval-Augmented Generation (RAG) using JavaScript-based AI agent frameworks. We explore the architectural separation between indexing and retrieval phases, demonstrate embedding flexibility across multiple providers (OpenAI, Cohere, HuggingFace), and examine configurable retrieval strategies. The implementation leverages KaibanJS's SimpleRAGRetrieve tool with LangChain.js ecosystem integration for vector store operations.
Key Topics:
- RAG architecture with separated indexing/retrieval phases
- Vector store operations and embedding strategies
- Retrieval configuration: similarity search vs. MMR
- Multi-agent task decomposition
- Production deployment considerations
Background: RAG in Agent-Based Systems
Retrieval-Augmented Generation has become a cornerstone technique for grounding LLM responses in factual data. Traditional RAG implementations often conflate indexing and retrieval logic, leading to monolithic architectures that are difficult to scale and maintain.
Modern agent-based systems benefit from tool-oriented RAG, where retrieval capabilities are encapsulated as discrete tools that agents can invoke. This approach offers:
- Composability: Agents can combine RAG with other tools (web search, APIs, calculations)
- Specialization: Different agents can access different knowledge bases
- Scalability: Indexing and retrieval can be scaled independently
- Flexibility: Vector stores can be swapped without changing agent logic
Architecture Overview
Two-Phase RAG Pipeline
Our implementation separates RAG into distinct phases:
PHASE 1: INDEXING (offline / ETL process)
─────────────────────────────────────────
Raw Documents
      ↓
Text Splitting (RecursiveCharacterTextSplitter)
      ↓
Embedding Generation (OpenAI/Cohere/HuggingFace)
      ↓
Vector Store Persistence (Pinecone/Supabase/Chroma)

PHASE 2: RETRIEVAL (runtime / query time)
─────────────────────────────────────────
User Query
      ↓
Query Embedding (same model as indexing)
      ↓
Vector Similarity Search (cosine/euclidean)
      ↓
Retrieved Contexts (top-k documents)
      ↓
LLM Generation with Context
      ↓
Grounded Response
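The "Vector Similarity Search" step above reduces to comparing the query embedding against every stored document embedding and keeping the closest matches. A minimal sketch of cosine-similarity top-k ranking in plain JavaScript (the helper names are illustrative, not part of KaibanJS or LangChain.js):

```javascript
// Cosine similarity between two equal-length embedding vectors
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored documents by similarity to the query embedding
function topK(queryEmbedding, docs, k) {
  return docs
    .map(doc => ({ ...doc, score: cosineSimilarity(queryEmbedding, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Production vector stores implement the same comparison behind approximate nearest-neighbour indexes so it scales past a linear scan.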
Key Design Decision: SimpleRAGRetrieve operates exclusively in Phase 2, assuming a pre-populated vector store. This architectural choice enables:
- Lightweight retrieval services
- Independent scaling of indexing infrastructure
- Shared knowledge bases across multiple applications
- Simplified deployment and testing
Implementation
Environment Setup
npm install kaibanjs @kaibanjs/tools @langchain/openai @langchain/community langchain
For alternative embedding providers:
# Cohere
npm install @langchain/cohere
# HuggingFace Inference
npm install @langchain/community
# Anthropic
npm install @langchain/anthropic
Phase 1: Vector Store Indexing
Document Preparation
For this demonstration, we use a product catalog with structured metadata. In production, this data would come from databases, APIs, or document stores.
const sampleData = [
  {
    id: 1,
    name: 'UltraBook Pro 15',
    category: 'Laptop',
    content:
      'The UltraBook Pro 15 is a premium laptop featuring a 15.6-inch 4K display, Intel i9 processor, 32GB RAM, and 1TB NVMe SSD...',
    price: 2499,
    specs: ['Intel i9', '32GB RAM', '1TB SSD', '4K Display'],
    inStock: true
  }
  // ... additional products
];
Embedding Configuration
The choice of embedding model significantly impacts retrieval quality. Consider:
- OpenAI text-embedding-3-small: 1536 dimensions, cost-effective, strong general performance
- OpenAI text-embedding-3-large: 3072 dimensions, highest quality, higher cost
- Cohere embed-english-v3.0: 1024 dimensions, optimized for English, good retrieval performance
- HuggingFace all-MiniLM-L6-v2: 384 dimensions, lightweight, self-hostable
import { OpenAIEmbeddings } from '@langchain/openai';
import { CohereEmbeddings } from '@langchain/cohere';
import { HuggingFaceInferenceEmbeddings } from '@langchain/community/embeddings/hf';

// OpenAI embeddings (default)
const openaiEmbeddings = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY,
  modelName: 'text-embedding-3-small',
  dimensions: 1536 // Can be reduced for lower dimensionality
});

// Cohere embeddings
const cohereEmbeddings = new CohereEmbeddings({
  apiKey: process.env.COHERE_API_KEY,
  model: 'embed-english-v3.0',
  inputType: 'search_document' // Optimizes for indexing
});

// HuggingFace embeddings (self-hostable)
const hfEmbeddings = new HuggingFaceInferenceEmbeddings({
  apiKey: process.env.HUGGINGFACE_API_KEY,
  model: 'sentence-transformers/all-MiniLM-L6-v2'
});
Text Chunking Strategy
Chunking parameters critically affect retrieval quality:
import { RAGToolkit } from '@kaibanjs/tools';
import { OpenAIEmbeddings } from '@langchain/openai';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';

const embeddings = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY
});

const vectorStore = new MemoryVectorStore(embeddings);

const ragToolkit = new RAGToolkit({
  embeddings,
  vectorStore,
  chunkOptions: {
    chunkSize: 500, // Characters per chunk
    chunkOverlap: 100 // Overlap to preserve context across boundaries
  },
  env: { OPENAI_API_KEY: process.env.OPENAI_API_KEY }
});
Chunking Trade-offs:
| Chunk Size | Pros | Cons | Best For |
|---|---|---|---|
| 200-400 | Precise retrieval, lower token costs | May miss broader context | Short, factual queries |
| 500-800 | Balanced context and precision | Moderate token usage | General-purpose RAG |
| 1000-2000 | Maximum context, better for complex topics | Higher token costs, potentially noisy | Technical documentation, research papers |
Overlap Considerations:
- Low overlap (50-100): More distinct chunks, lower storage
- High overlap (200-300): Better context preservation, higher redundancy
- Rule of thumb: 10-20% of chunk size
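To make the chunkSize/chunkOverlap interaction concrete, here is a minimal character-based splitter in plain JavaScript. It is a simplification of what RecursiveCharacterTextSplitter does (it ignores separator-aware splitting and just slides a fixed window):

```javascript
// Split text into fixed-size chunks, with each chunk overlapping its
// predecessor by chunkOverlap characters
function chunkText(text, chunkSize, chunkOverlap) {
  const step = chunkSize - chunkOverlap; // How far the window advances each time
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // Final chunk reached the end
  }
  return chunks;
}
```

With chunkSize 500 and chunkOverlap 100 (as configured above), each new chunk repeats the last 100 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.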
Document Indexing
const initializeVectorStore = async () => {
  const documents = sampleData.map(item => ({
    source: item.content,
    type: 'string',
    metadata: {
      id: item.id,
      name: item.name,
      category: item.category,
      price: item.price,
      specs: item.specs,
      inStock: item.inStock,
      // Augmented text for semantic search
      fullText: `${item.name} ${item.category} ${item.content} ${item.specs.join(' ')}`
    }
  }));

  await ragToolkit.addDocuments(documents);
  console.log('✅ Indexed', documents.length, 'documents');
};

await initializeVectorStore();
Metadata Design: Rich metadata enables hybrid search strategies:
- Semantic search via embeddings (content similarity)
- Filtered search via metadata (category, price, availability)
- Combined approach for optimal precision
Phase 2: Retrieval with SimpleRAGRetrieve
Tool Configuration
import { SimpleRAGRetrieve } from '@kaibanjs/tools';

const productKnowledgeBaseTool = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: vectorStore, // Pre-indexed vector store
  embeddings: embeddings, // Must match the indexing embeddings
  retrieverOptions: {
    k: 4, // Number of documents to retrieve
    searchType: 'similarity', // 'similarity' or 'mmr'
    scoreThreshold: 0.7, // Minimum similarity score (0-1)
    filter: undefined // Optional metadata filters
  }
});
Retrieval Strategy: Similarity vs. MMR
Similarity Search (Cosine Similarity):
retrieverOptions: {
  k: 4,
  searchType: 'similarity'
}
- Returns top-k most similar documents
- Fast, straightforward
- May return redundant/similar results
- Best for: Factual queries, specific information lookup
MMR (Maximal Marginal Relevance):
retrieverOptions: {
  k: 4,
  searchType: 'mmr',
  lambda: 0.5 // Balance between relevance (1.0) and diversity (0.0)
}
- Balances relevance with diversity
- Reduces redundancy in results
- Slightly more computational overhead
- Best for: Comparison queries, exploratory searches, broad topics
Mathematical Foundation:
Similarity search: score = cosine(query_embedding, doc_embedding)
MMR: MMR = argmax over unselected candidates d of [λ · Similarity(q, d) - (1 - λ) · max over already-selected d' of Similarity(d, d')]
- λ = 1.0: Pure similarity (no diversity)
- λ = 0.5: Balanced
- λ = 0.0: Pure diversity (may sacrifice relevance)
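The MMR argmax is applied greedily: at each step the candidate chosen is the one that best trades off relevance to the query against redundancy with documents already picked. A sketch over precomputed similarities (function and parameter names are illustrative, not a library API):

```javascript
// Greedy MMR selection.
// simToQuery[i]   : similarity of candidate i to the query
// simBetween(i, j): similarity between candidates i and j
function mmrSelect(simToQuery, simBetween, k, lambda) {
  const selected = [];
  const remaining = simToQuery.map((_, i) => i);
  while (selected.length < k && remaining.length > 0) {
    let bestPos = 0, bestScore = -Infinity;
    for (let r = 0; r < remaining.length; r++) {
      const i = remaining[r];
      // Redundancy penalty: similarity to the closest already-selected doc
      const redundancy = selected.length
        ? Math.max(...selected.map(j => simBetween(i, j)))
        : 0;
      const score = lambda * simToQuery[i] - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; bestPos = r; }
    }
    selected.push(remaining.splice(bestPos, 1)[0]);
  }
  return selected;
}
```

With λ = 1.0 this degenerates to plain similarity ranking; lowering λ lets a slightly less relevant but more distinct document displace a near-duplicate of one already selected.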
Score Thresholding
retrieverOptions: {
  k: 10, // Consider the top 10 candidates
  scoreThreshold: 0.7 // Only return docs with score ≥ 0.7
}
Benefits:
- Filters low-quality matches
- Prevents hallucination from irrelevant context
- Adaptive result count based on query quality
Score interpretation:
- 0.9-1.0: Very high relevance (near-exact matches)
- 0.7-0.9: High relevance (topically aligned)
- 0.5-0.7: Moderate relevance (may be too broad)
- <0.5: Low relevance (likely noise)
Metadata Filtering
Combine semantic search with structured filters:
retrieverOptions: {
  k: 4,
  filter: {
    category: 'Laptop',
    inStock: true,
    price: { $lte: 2000 } // Filter syntax depends on the vector store
  }
}
Hybrid Search Pattern:
1. Apply metadata filters (fast, deterministic)
2. Perform semantic search on the filtered subset
3. Return the top-k results
This dramatically improves precision for domain-specific queries.
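The filter-then-rank pattern can be sketched in plain JavaScript. To keep the example self-contained, documents carry a precomputed similarity score; in practice that score comes from the embedding comparison, and managed stores apply the metadata filter inside the index rather than in application code:

```javascript
// Hybrid search sketch: metadata filter first, then rank the survivors by
// semantic score and keep the top k
function hybridSearch(docs, metadataFilter, k) {
  return docs
    .filter(doc =>
      Object.entries(metadataFilter).every(
        ([key, value]) => doc.metadata[key] === value
      )
    )
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Because the deterministic filter runs first, a highly similar but out-of-stock or over-budget product can never crowd a valid result out of the top k.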
Multi-Agent Architecture
Agent Definition
import { Agent, Task, Team } from 'kaibanjs';

const productSpecialist = new Agent({
  name: 'Product Specialist',
  role: 'Technology Product Expert',
  goal: 'Help customers find the right products by searching our knowledge base',
  background:
    'Expert in technology products with deep knowledge of specifications',
  tools: [productKnowledgeBaseTool] // RAG tool integration
});
Task Decomposition
Breaking complex queries into sequential tasks improves response quality:
// Task 1: Information Retrieval
const searchProductTask = new Task({
  description: `Search our product knowledge base to answer: {customerQuery}
Focus on finding accurate product information including specifications, features, prices, and availability.`,
  expectedOutput:
    'Detailed product information that directly addresses the customer query',
  agent: productSpecialist
});

// Task 2: Analysis and Recommendation
const recommendationTask = new Task({
  description: `Based on the product information found, provide a helpful recommendation.
Customer's question: {customerQuery}
If comparing products, highlight key differences. If seeking recommendations, suggest the best option based on their needs.`,
  expectedOutput:
    'A clear recommendation that helps the customer make an informed decision',
  agent: productSpecialist
});
Task Design Rationale:
- Task 1 focuses on retrieval accuracy (RAG-heavy)
- Task 2 focuses on reasoning and synthesis (LLM-heavy)
- Sequential execution ensures grounding before reasoning
- Each task has clear success criteria
Team Orchestration
const team = new Team({
  name: 'Product Support Team',
  agents: [productSpecialist],
  tasks: [searchProductTask, recommendationTask],
  inputs: {
    customerQuery:
      'I need a laptop for video editing and gaming. What do you recommend?'
  },
  env: {
    OPENAI_API_KEY: process.env.OPENAI_API_KEY
  }
});

const result = await team.start();
Production Considerations
Vector Store Selection
MemoryVectorStore (Development):
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
const vectorStore = new MemoryVectorStore(embeddings);
- ✅ Zero setup, fast for prototyping
- ❌ Not persistent, RAM-limited
- Use case: Development, testing, small datasets (<10K docs)
Pinecone (Production - Managed):
import { PineconeStore } from '@langchain/pinecone';
import { Pinecone } from '@pinecone-database/pinecone';

const pinecone = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY
});

const pineconeIndex = pinecone.Index('products-index');

const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex,
  namespace: 'products' // Multi-tenancy support
});

const retriever = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: vectorStore,
  embeddings: embeddings,
  retrieverOptions: {
    k: 4,
    searchType: 'similarity',
    filter: { namespace: 'products' }
  }
});
- ✅ Fully managed, scales to billions of vectors
- ✅ Low latency (<100ms p95)
- ✅ Metadata filtering, namespaces
- ❌ Cost scales with vector count
- Use case: Production applications, large-scale deployments
Supabase (Production - Open Source):
import { SupabaseVectorStore } from '@langchain/community/vectorstores/supabase';
import { createClient } from '@supabase/supabase-js';

const supabaseClient = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_PRIVATE_KEY
);

const vectorStore = await SupabaseVectorStore.fromExistingIndex(embeddings, {
  client: supabaseClient,
  tableName: 'documents',
  queryName: 'match_documents'
});
- ✅ Self-hostable, PostgreSQL + pgvector
- ✅ Integrated with auth, storage, real-time
- ✅ Cost-effective for moderate scale
- ❌ Requires PostgreSQL management
- Use case: Full-stack applications, self-hosted deployments
Chroma (Development/Local Production):
import { Chroma } from '@langchain/community/vectorstores/chroma';

const vectorStore = await Chroma.fromExistingCollection(embeddings, {
  collectionName: 'products',
  url: process.env.CHROMA_URL || 'http://localhost:8000'
});
- ✅ Lightweight, easy to run locally
- ✅ Simple setup for development environments
- ❌ Less mature than alternatives
- Use case: Local development, on-premise deployments
Embedding Model Selection Matrix
| Model | Dimensions | Speed | Cost | Quality | Best For |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Fast | $0.02/1M tokens | High | General-purpose, production |
| OpenAI text-embedding-3-large | 3072 | Medium | $0.13/1M tokens | Highest | Quality-critical applications |
| Cohere embed-english-v3.0 | 1024 | Fast | $0.10/1M tokens | High | English-only, good retrieval |
| HuggingFace all-MiniLM-L6-v2 | 384 | Very Fast | Free* | Good | Self-hosted, cost-sensitive |
| HuggingFace all-mpnet-base-v2 | 768 | Fast | Free* | Better | Self-hosted, quality balance |
*Free if self-hosted; Inference API has rate limits
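Using the prices in the table above, indexing cost is simple arithmetic: total tokens divided by one million, times the per-million rate. A quick back-of-envelope helper (corpus size and average token count below are illustrative):

```javascript
// Estimate embedding cost: (total tokens / 1M) * price per 1M tokens
function embeddingCostUSD(docCount, avgTokensPerDoc, pricePerMillionTokens) {
  const totalTokens = docCount * avgTokensPerDoc;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// 100K docs averaging 500 tokens with text-embedding-3-small ($0.02/1M)
console.log(embeddingCostUSD(100_000, 500, 0.02).toFixed(2)); // "1.00"
```

The same corpus with text-embedding-3-large ($0.13/1M) costs 6.5x more, which is why the smaller model is the usual default unless retrieval quality demonstrably suffers.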
Performance Optimization
1. Embedding Caching:
// Cache query embeddings for common queries
const queryCache = new Map();

async function getCachedEmbedding(text) {
  if (!queryCache.has(text)) {
    const embedding = await embeddings.embedQuery(text);
    queryCache.set(text, embedding);
  }
  return queryCache.get(text);
}
2. Batch Operations:
// Index in batches for better throughput
const BATCH_SIZE = 100;
for (let i = 0; i < documents.length; i += BATCH_SIZE) {
  const batch = documents.slice(i, i + BATCH_SIZE);
  await ragToolkit.addDocuments(batch);
}
3. Retrieval Tuning:
// Adjust k based on query complexity
const adaptiveK = (query) => {
  const tokens = query.split(' ').length;
  if (tokens < 5) return 3; // Simple query
  if (tokens < 15) return 5; // Medium query
  return 8; // Complex query
};

retrieverOptions: {
  k: adaptiveK(customerQuery),
  scoreThreshold: 0.75
}
Advanced Patterns
Multi-Vector Store RAG
Use different vector stores for different knowledge domains:
const productRAG = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: productVectorStore,
  embeddings: embeddings
});

const documentationRAG = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: docsVectorStore,
  embeddings: embeddings
});

const reviewsRAG = new SimpleRAGRetrieve({
  OPENAI_API_KEY: process.env.OPENAI_API_KEY,
  vectorStore: reviewsVectorStore,
  embeddings: embeddings
});

// Specialized agents for each domain
const productAgent = new Agent({
  name: 'Product Expert',
  tools: [productRAG]
});

const supportAgent = new Agent({
  name: 'Support Specialist',
  tools: [documentationRAG]
});

const sentimentAgent = new Agent({
  name: 'Review Analyst',
  tools: [reviewsRAG]
});
Reranking for Improved Precision
// After initial retrieval, rerank with a cross-encoder
import { CohereRerank } from '@langchain/cohere';

const reranker = new CohereRerank({
  apiKey: process.env.COHERE_API_KEY,
  model: 'rerank-english-v2.0',
  topN: 3 // Return top 3 after reranking
});

// In your retrieval pipeline:
// 1. Retrieve k=10 candidates with SimpleRAGRetrieve
// 2. Rerank to the top 3 most relevant
// 3. Use the reranked results for LLM context
Benefits:
- Initial retrieval: Fast, bi-encoder (embedding similarity)
- Reranking: More accurate, cross-encoder (query-doc interaction)
- Best of both worlds: Speed + accuracy
Evaluation and Monitoring
Retrieval Quality Metrics
1. Precision@k:
// Percentage of retrieved docs that are relevant
function precisionAtK(retrievedDocs, relevantDocs, k) {
  const topK = retrievedDocs.slice(0, k);
  const relevant = topK.filter(doc => relevantDocs.includes(doc.id));
  return relevant.length / k;
}
2. Recall@k:
// Percentage of relevant docs that were retrieved
function recallAtK(retrievedDocs, relevantDocs, k) {
  const topK = retrievedDocs.slice(0, k);
  const relevant = topK.filter(doc => relevantDocs.includes(doc.id));
  return relevant.length / relevantDocs.length;
}
3. MRR (Mean Reciprocal Rank):
// Reciprocal rank of the first relevant result
function mrr(retrievedDocs, relevantDocs) {
  const firstRelevantIndex = retrievedDocs.findIndex(doc =>
    relevantDocs.includes(doc.id)
  );
  return firstRelevantIndex >= 0 ? 1 / (firstRelevantIndex + 1) : 0;
}
Production Monitoring
// Log retrieval metrics and flag low-quality retrievals
const logRetrieval = (query, results, latency) => {
  console.log({
    timestamp: new Date().toISOString(),
    query,
    numResults: results.length,
    avgScore:
      results.reduce((sum, r) => sum + r.score, 0) /
      Math.max(results.length, 1), // Guard against empty result sets
    latencyMs: latency,
    hasResults: results.length > 0
  });

  // Track failed retrievals (low scores)
  if (results.every(r => r.score < 0.6)) {
    console.warn('Low-quality retrieval detected:', query);
    // Trigger alert or fallback behavior
  }
};
Comparison: SimpleRAGRetrieve vs. Alternative Approaches
| Aspect | SimpleRAGRetrieve | LangChain RetrievalQA | Custom RAG Implementation |
|---|---|---|---|
| Indexing/Retrieval Separation | ✅ Separate | ❌ Combined | ⚙️ Your choice |
| LangChain.js Compatibility | ✅ Full | ✅ Native | ⚙️ Manual integration |
| Agent Integration | ✅ Tool-based | ❌ Chain-based | ⚙️ Custom |
| Multi-vector store | ✅ Easy | ⚙️ Requires multiple chains | ⚙️ Custom |
| Configuration Simplicity | ✅ High | ⚙️ Medium | ❌ Low |
| Flexibility | ⚙️ Medium | ⚙️ Medium | ✅ Maximum |
| Production-Ready | ✅ Yes | ✅ Yes | ⚙️ Depends |
When to use SimpleRAGRetrieve:
- ✅ Building agent-based systems
- ✅ Pre-indexed vector stores
- ✅ Need LangChain.js ecosystem compatibility
- ✅ Want separation between indexing and retrieval
When to use alternatives:
- ❌ Need custom retrieval logic beyond standard similarity/MMR
- ❌ Building chain-based (non-agent) applications
- ❌ Require features not exposed by SimpleRAGRetrieve API
Conclusion
SimpleRAGRetrieve provides a production-ready abstraction for retrieval-augmented generation in agent-based systems. By focusing exclusively on the retrieval phase and leveraging the LangChain.js ecosystem, it enables:
- Architectural clarity: Clear separation between indexing and retrieval
- Flexibility: Compatible with any LangChain.js embeddings and vector stores
- Agent integration: First-class support for multi-agent workflows
- Production readiness: Configurable retrieval strategies and vector store options
The demonstrated implementation showcases these capabilities through a product knowledge base, but the patterns extend to any domain requiring grounded LLM responses: documentation search, customer support, research assistants, and more.
Key Takeaways:
- Use RAGToolkit for indexing, SimpleRAGRetrieve for retrieval
- Match embedding models between indexing and retrieval phases
- Tune chunking and retrieval parameters for your domain
- Consider MMR for diverse results, similarity for precision
- Monitor retrieval quality in production
Resources
- 📚 SimpleRAGRetrieve Documentation
- 🚀 KaibanJS Framework
- 🔗 LangChain.js Integration Guide
- 💾 Vector Store Comparison
- 🧪 Full Code Example
Tags: #rag #retrieval-augmented-generation #vector-databases #embeddings #ai-agents #langchain #javascript #kaibanjs #semantic-search #nlp
Author's Note: This implementation was tested with Node.js 18+ and the package versions specified in the dependencies. Feedback and contributions welcome on GitHub.