# The LLM Wrapper
**Part 1: Foundation - Lesson 3**
> Wrapping node-llama-cpp as a Runnable for seamless integration
## Overview
In Lesson 1, you learned about Runnables - the composable interface. In Lesson 2, you mastered Messages - the data structures. Now we'll connect these concepts by wrapping **node-llama-cpp** (our local LLM) as a Runnable that understands Messages.
By the end of this lesson, you'll have an LLM wrapper that can generate text, handle conversations, stream responses, and integrate seamlessly with chains.
## Why Does This Matter?
### The Problem: LLMs Don't Compose
node-llama-cpp is excellent at what it does - running local LLMs efficiently. But when you're building agents, you need more than just an LLM. You need components that work together seamlessly.
**Without a composable framework:**
```javascript
import { getLlama, LlamaChatSession } from 'node-llama-cpp';
// Each component is isolated - they don't know about each other
async function myAgent(userInput) {
  // Step 1: Format the prompt
  const prompt = myCustomFormatter(userInput);
  // Step 2: Call the LLM (load model, create context, open a chat session)
  const llama = await getLlama();
  const model = await llama.loadModel({ modelPath: './model.gguf' });
  const context = await model.createContext();
  const session = new LlamaChatSession({ contextSequence: context.getSequence() });
  const response = await session.prompt(prompt);
  // Step 3: Parse the response
  const parsed = myCustomParser(response);
  // Step 4: Maybe call a tool?
  if (parsed.needsTool) {
    const toolResult = await myTool(parsed.args);
    // Now what? Call the LLM again? How do we loop?
    // How do we add logging? Memory? Retries?
  }
  return parsed;
}
// Problems:
// - Can't reuse components
// - Can't chain operations
// - Hard to add logging, metrics, or debugging
// - Complex control flow for agents
// - Every new feature requires changing everything
```
**With a composable framework:**
```javascript
// Components that work together
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
// Simple usage
const response = await llm.invoke([
  new SystemMessage("You are helpful"),
  new HumanMessage("Hi")
]);
// Returns: AIMessage("Hello! How can I help you?")
// But the real power is composition
const agent = promptTemplate
  .pipe(llm)
  .pipe(outputParser)
  .pipe(toolExecutor);
// Now you can:
// ✅ Reuse components in different chains
// ✅ Add logging with callbacks (no code changes)
// ✅ Build complex agents that use tools
// ✅ Test each component independently
// ✅ Swap LLMs without rewriting everything
```
### What the Wrapper Provides
The LLM wrapper isn't about making node-llama-cpp easier - it's about making it **work with everything else**:
1. **Common Interface**: Same `invoke()` / `stream()` / `batch()` as every other component
2. **Message Support**: Understands HumanMessage, AIMessage, SystemMessage
3. **Composability**: Works with `.pipe()` to chain operations
4. **Observability**: Callbacks work automatically for logging/metrics
5. **Configuration**: Runtime settings pass through cleanly
6. **History Isolation**: Proper batch processing without contamination
Think of it as an adapter that lets node-llama-cpp play nicely with the rest of your agent system.
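As a reminder of the contract the wrapper has to satisfy, here is a minimal sketch of the `Runnable` base class assumed from Lesson 1 - the real class in `src/llm/runnable.js` may add callbacks, configuration handling, and a richer `pipe()`:
```javascript
// Minimal sketch of the Runnable interface assumed from Lesson 1.
// The real implementation in src/llm/runnable.js may differ in details.
export class Runnable {
  // Subclasses implement _call(input, config) and optionally _stream(input, config).
  async invoke(input, config = {}) {
    return this._call(input, config);
  }
  async batch(inputs, config = {}) {
    // Default: run inputs one after another through invoke()
    const results = [];
    for (const input of inputs) {
      results.push(await this.invoke(input, config));
    }
    return results;
  }
  async *stream(input, config = {}) {
    // Default: delegate to _stream() if provided, otherwise yield one chunk
    if (this._stream) {
      yield* this._stream(input, config);
    } else {
      yield await this.invoke(input, config);
    }
  }
  pipe(next) {
    // Composition: the output of this Runnable becomes the input of `next`
    const prev = this;
    return new (class extends Runnable {
      async _call(input, config) {
        return next.invoke(await prev.invoke(input, config), config);
      }
    })();
  }
}
```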
## Learning Objectives
By the end of this lesson, you will:
- ✅ Understand how to wrap complex libraries as Runnables
- ✅ Convert Messages to LLM chat history
- ✅ Handle model loading and lifecycle
- ✅ Implement streaming for real-time output
- ✅ Add temperature and other generation parameters
- ✅ Manage context windows and chat history
- ✅ Handle batch processing with history isolation
## Core Concepts
### What is an LLM Wrapper?
An LLM wrapper is an abstraction layer that:
1. **Hides complexity** - No need to manage contexts, sessions, or cleanup
2. **Provides a standard interface** - Same API regardless of underlying model
3. **Handles conversion** - Transforms Messages into model-specific chat history
4. **Manages resources** - Automatic initialization and cleanup
5. **Enables composition** - Works seamlessly in chains
6. **Isolates state** - Prevents history contamination in batch processing
### The Wrapper's Responsibilities
```
Input (Messages)
  ↓
[1. Convert to Chat History]
  ↓
[2. Manage System Prompt]
  ↓
[3. Call LLM]
  ↓
[4. Parse Response]
  ↓
Output (AIMessage)
```
### Key Challenges
1. **Model Loading**: Models are large and slow to load
2. **Chat History Format**: Convert Messages to node-llama-cpp format
3. **System Prompt Management**: Clear and set for each call
4. **Context Management**: Limited context windows (see the sketch after this list)
5. **Streaming**: Real-time output is complex
6. **Batch Isolation**: Prevent history contamination
7. **Error Handling**: Models can fail in various ways
8. **Chat Wrappers**: Different models need different formats
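Challenge 4 deserves a concrete picture. The wrapper below relies on node-llama-cpp's context size and doesn't trim history itself, but one simple strategy is to drop the oldest non-system messages until the conversation fits a rough budget. The helper here is a hypothetical illustration (the 4-characters-per-token estimate is a crude heuristic, not a real tokenizer):
```javascript
// Hypothetical helper: keep system messages, then keep as many of the most
// recent messages as fit within a rough token budget.
function trimToContext(messages, maxTokens = 4096) {
  const estimate = (msg) => Math.ceil(msg.content.length / 4); // ~4 chars per token
  const system = messages.filter(m => m._type === 'system');
  const rest = messages.filter(m => m._type !== 'system');
  let budget = maxTokens - system.reduce((sum, m) => sum + estimate(m), 0);
  const kept = [];
  // Walk backwards so the most recent messages survive trimming
  for (let i = rest.length - 1; i >= 0; i--) {
    budget -= estimate(rest[i]);
    if (budget < 0) break;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}
```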
## Implementation Deep Dive
Let's build the LLM wrapper step by step.
### Step 1: The Base Structure
**Location:** `src/llm/llama-cpp-llm.js`
```javascript
import { Runnable } from './runnable.js';
import { AIMessage, HumanMessage } from './message.js';
import { getLlama, LlamaChatSession } from 'node-llama-cpp';
export class LlamaCppLLM extends Runnable {
  constructor(options = {}) {
    super();
    // Model configuration
    this.modelPath = options.modelPath;
    this.temperature = options.temperature ?? 0.7;
    this.maxTokens = options.maxTokens ?? 2048;
    this.contextSize = options.contextSize ?? 4096;
    this.batchSize = options.batchSize;
    // Optional sampling parameters (undefined = use library defaults)
    this.topP = options.topP;
    this.topK = options.topK;
    this.repeatPenalty = options.repeatPenalty;
    this.stopStrings = options.stopStrings;
    // Chat wrapper configuration (auto-detects by default)
    this.chatWrapper = options.chatWrapper ?? 'auto';
    // Logging
    this.verbose = options.verbose ?? false;
    // Internal state
    this._llama = null;
    this._model = null;
    this._context = null;
    this._chatSession = null;
    this._initialized = false;
  }
  async _call(input, config) {
    // Will implement next
  }
}
```
**Key decisions**:
- Stores configuration (temperature, max tokens, sampling parameters, etc.)
- Supports custom chat wrappers (e.g., QwenChatWrapper)
- Tracks internal state (model, context, session)
- Lazy initialization (load on first use)
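Putting the options together, construction looks like this (values are just examples; nothing is loaded yet):
```javascript
const llm = new LlamaCppLLM({
  modelPath: './models/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf',
  temperature: 0.7,     // default sampling temperature
  maxTokens: 2048,      // cap on generated tokens per call
  contextSize: 4096,    // working-memory size in tokens
  chatWrapper: 'auto',  // or a custom wrapper instance (see Step 2)
  verbose: true         // log loading and generation details
});
```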
### Step 2: Model Initialization with Chat Wrapper Support
```javascript
export class LlamaCppLLM extends Runnable {
  // ... constructor ...
  /**
   * Initialize the model (lazy loading)
   */
  async _initialize() {
    if (this._initialized) return;
    if (this.verbose) {
      console.log(`Loading model: ${this.modelPath}`);
    }
    try {
      // Step 1: Get llama instance
      this._llama = await getLlama();
      // Step 2: Load the model
      this._model = await this._llama.loadModel({
        modelPath: this.modelPath
      });
      // Step 3: Create context (working memory)
      this._context = await this._model.createContext({
        contextSize: this.contextSize,
        batchSize: this.batchSize
      });
      // Step 4: Create chat session with optional chat wrapper
      const contextSequence = this._context.getSequence();
      const sessionConfig = { contextSequence };
      // Add custom chat wrapper if specified
      if (this.chatWrapper !== 'auto') {
        sessionConfig.chatWrapper = this.chatWrapper;
      }
      this._chatSession = new LlamaChatSession(sessionConfig);
      this._initialized = true;
      if (this.verbose) {
        console.log('✓ Model loaded successfully');
        if (this.chatWrapper !== 'auto') {
          console.log(`✓ Using custom chat wrapper: ${this.chatWrapper.constructor.name}`);
        }
      }
    } catch (error) {
      throw new Error(
        `Failed to initialize model at ${this.modelPath}: ${error.message}`
      );
    }
  }
  /**
   * Cleanup resources
   */
  async dispose() {
    if (this._context) {
      await this._context.dispose();
      this._context = null;
    }
    if (this._model) {
      await this._model.dispose();
      this._model = null;
    }
    this._chatSession = null;
    this._initialized = false;
    if (this.verbose) {
      console.log('✓ Model resources disposed');
    }
  }
}
```
**Why lazy loading?**
- Models take 5-30 seconds to load
- Don't load until actually needed
- Share one loaded model across multiple calls
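In practice this means constructing the wrapper is instant and the first call pays the loading cost. A quick illustration (timings are indicative only):
```javascript
const llm = new LlamaCppLLM({ modelPath: './model.gguf' }); // instant - nothing loaded yet
console.time('first call');
await llm.invoke("Hi");        // triggers _initialize(): model + context + session
console.timeEnd('first call'); // several seconds, dominated by model loading
console.time('second call');
await llm.invoke("Hi again");  // reuses the loaded model - only generation time
console.timeEnd('second call');
```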
**Chat Wrapper Support**:
- Defaults to 'auto' (library auto-detects)
- Supports custom wrappers like QwenChatWrapper for specific models
- Useful for controlling model behavior (e.g., discouraging thoughts)
### Step 3: Converting Messages to Chat History
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...
  /**
   * Convert our Message objects to node-llama-cpp chat history format
   */
  _messagesToChatHistory(messages) {
    return messages.map(msg => {
      // System messages: instructions for the AI
      if (msg._type === 'system') {
        return { type: 'system', text: msg.content };
      }
      // Human messages: user input
      else if (msg._type === 'human') {
        return { type: 'user', text: msg.content };
      }
      // AI messages: previous AI responses
      else if (msg._type === 'ai') {
        return { type: 'model', response: msg.content };
      }
      // Tool messages: results from tool execution
      else if (msg._type === 'tool') {
        return { type: 'system', text: `Tool Result: ${msg.content}` };
      }
      // Fallback: treat unknown types as user messages
      return { type: 'user', text: msg.content };
    });
  }
}
```
**Key insight**: This bridges between your standardized Message types and what node-llama-cpp expects. Different models may need different chat formats, which is why chat wrappers exist.
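To make the mapping concrete, here is what a short conversation converts into (calling the internal method directly just to show the output; the Message classes are the ones from Lesson 2):
```javascript
const history = llm._messagesToChatHistory([
  new SystemMessage("You are a helpful assistant."),
  new HumanMessage("What is 2+2?"),
  new AIMessage("2+2 equals 4."),
  new HumanMessage("And 3+3?")
]);
// [
//   { type: 'system', text: 'You are a helpful assistant.' },
//   { type: 'user',   text: 'What is 2+2?' },
//   { type: 'model',  response: '2+2 equals 4.' },
//   { type: 'user',   text: 'And 3+3?' }
// ]
```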
### Step 4: The Main Generation Method
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...
  async _call(input, config = {}) {
    // Initialize if needed
    await this._initialize();
    // Clear history if requested (important for batch processing)
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }
    // Handle different input types
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }
    // Extract system message if present
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';
    // Convert our Message objects to llama.cpp format
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);
    // ALWAYS set system prompt (either new value or empty string to clear)
    this._chatSession.systemPrompt = systemPrompt;
    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };
      // Add random seed if temperature > 0 and no seed specified
      // This ensures randomness works properly
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }
      // Generate response using prompt
      const response = await this._chatSession.prompt('', promptOptions);
      // Return as AIMessage for consistency
      return new AIMessage(response);
    } catch (error) {
      throw new Error(`Generation failed: ${error.message}`);
    }
  }
}
```
**Critical details**:
- Always sets the system prompt explicitly, even to an empty string (prevents contamination)
- Adds a random seed so temperature actually produces varied output
- Uses `customStopTriggers` (the parameter name node-llama-cpp expects)
- Supports `clearHistory` for batch processing
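Because `_call()` reads every option from `config` first and falls back to the constructor value, generation settings can be overridden per call. For example:
```javascript
// Constructor defaults: temperature 0.7, maxTokens 2048
const answer = await llm.invoke("List three colors.", {
  temperature: 0.2,        // more deterministic for this call only
  maxTokens: 64,           // keep the reply short
  stopStrings: ["\n\n"],   // mapped internally to customStopTriggers
  seed: 42,                // fixed seed -> reproducible output
  clearHistory: true       // start from a clean session
});
```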
### Step 5: Batch Processing with History Isolation
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...
  /**
   * Batch processing with history isolation
   *
   * Processes multiple inputs sequentially, ensuring each gets
   * a clean chat history to prevent contamination.
   */
  async batch(inputs, config = {}) {
    const results = [];
    for (const input of inputs) {
      // Clear history before each batch item
      const result = await this._call(input, {
        ...config,
        clearHistory: true
      });
      results.push(result);
    }
    return results;
  }
}
```
**Why sequential processing?**
- Local models can't run truly in parallel
- Sequential ensures proper history isolation
- Each item gets a clean slate
### Step 6: Streaming Support
For real-time output (like ChatGPT's typing effect):
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...
  async *_stream(input, config = {}) {
    await this._initialize();
    // Clear history if requested
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }
    // Handle input types (same as _call)
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }
    // Extract system message
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';
    // Set up chat history
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);
    // ALWAYS set system prompt
    this._chatSession.systemPrompt = systemPrompt;
    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };
      // Add random seed
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }
      // Use onTextChunk callback to collect chunks
      const self = this;
      promptOptions.onTextChunk = (chunk) => {
        self._currentStreamChunks = self._currentStreamChunks || [];
        self._currentStreamChunks.push(chunk);
      };
      // Initialize chunk collection
      this._currentStreamChunks = [];
      // Start generation
      const responsePromise = this._chatSession.prompt('', promptOptions);
      // Yield chunks as they become available
      let lastYieldedIndex = 0;
      // Poll for new chunks
      while (true) {
        // Yield any new chunks
        while (lastYieldedIndex < this._currentStreamChunks.length) {
          yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
            additionalKwargs: { chunk: true }
          });
          lastYieldedIndex++;
        }
        // Check if generation is complete
        const isDone = await Promise.race([
          responsePromise.then(() => true),
          new Promise(resolve => setTimeout(() => resolve(false), 10))
        ]);
        if (isDone) {
          // Yield any remaining chunks
          while (lastYieldedIndex < this._currentStreamChunks.length) {
            yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
              additionalKwargs: { chunk: true }
            });
            lastYieldedIndex++;
          }
          break;
        }
      }
      // Wait for completion
      await responsePromise;
      // Clean up
      delete this._currentStreamChunks;
    } catch (error) {
      throw new Error(`Streaming failed: ${error.message}`);
    }
  }
}
```
**Streaming challenges**:
- `onTextChunk` is a synchronous callback
- Can't yield directly from the callback
- A polling loop yields chunks as they arrive (an alternative queue-based sketch follows below)
- The 10 ms polling interval balances responsiveness against CPU usage
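If you'd rather avoid the timer, one alternative is to push chunks into a small async queue and await them. This is an optional sketch, not part of the wrapper above; `ChunkQueue` is a hypothetical name:
```javascript
// Hypothetical alternative to polling: a tiny async queue bridging the
// synchronous onTextChunk callback and the async generator.
class ChunkQueue {
  constructor() { this.items = []; this.waiters = []; this.closed = false; }
  push(chunk) {
    const waiter = this.waiters.shift();
    if (waiter) waiter({ value: chunk, done: false });
    else this.items.push(chunk);
  }
  close() {
    this.closed = true;
    while (this.waiters.length) this.waiters.shift()({ value: undefined, done: true });
  }
  next() {
    if (this.items.length) return Promise.resolve({ value: this.items.shift(), done: false });
    if (this.closed) return Promise.resolve({ value: undefined, done: true });
    return new Promise(resolve => this.waiters.push(resolve));
  }
}
// Inside _stream(), the polling loop could then become:
//   const queue = new ChunkQueue();
//   promptOptions.onTextChunk = (chunk) => queue.push(chunk);
//   const responsePromise = this._chatSession.prompt('', promptOptions)
//     .finally(() => queue.close());
//   for (let item = await queue.next(); !item.done; item = await queue.next()) {
//     yield new AIMessage(item.value, { additionalKwargs: { chunk: true } });
//   }
//   await responsePromise;
```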
## Real-World Examples
### Example 1: Simple Text Generation
```javascript
const llm = new LlamaCppLLM({
  modelPath: './models/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf',
  temperature: 0.7,
  maxTokens: 100
});
// Simple string input
const response = await llm.invoke("What is 2+2?");
console.log(response.content); // "2+2 equals 4."
```
### Example 2: Conversation with System Prompt
```javascript
const messages = [
  new SystemMessage("You are a helpful math tutor."),
  new HumanMessage("What is 5*5?")
];
const response = await llm.invoke(messages);
console.log(response.content);
// "5 times 5 is 25. Here's a simple explanation..."
```
### Example 3: Using Qwen Chat Wrapper
```javascript
import { QwenChatWrapper } from 'node-llama-cpp';
const llm = new LlamaCppLLM({
  modelPath: './models/Qwen3-1.7B-Q6_K.gguf',
  temperature: 0.7,
  chatWrapper: new QwenChatWrapper({
    thoughts: 'discourage' // Prevents thinking tokens
  })
});
const response = await llm.invoke("What is AI?");
// Response won't include <think> tokens
```
### Example 4: Temperature Comparison
```javascript
const question = "Give me one adjective to describe winter:";
// Low temperature - consistent answers
const lowTemp = await llm.invoke(question, { temperature: 0.1, clearHistory: true });
// Likely: "cold"
// High temperature - varied answers
const highTemp = await llm.invoke(question, { temperature: 0.9, clearHistory: true });
// Could be: "frosty", "snowy", "icy", "chilly"
```
### Example 5: Streaming Output
```javascript
console.log('Response: ');
for await (const chunk of llm.stream("Tell me a fun fact about space")) {
  process.stdout.write(chunk.content); // No newline
}
console.log('\n');
// Output streams in real-time as it's generated
```
### Example 6: Batch Processing
```javascript
const questions = [
  "What is Python?",
  "What is JavaScript?",
  "What is Rust?"
];
const answers = await llm.batch(questions);
questions.forEach((q, i) => {
  console.log(`Q: ${q}`);
  console.log(`A: ${answers[i].content}`);
  console.log();
});
// Each answer is independent - no history contamination!
```
### Example 7: Using in a Pipeline
```javascript
import { PromptTemplate } from '../prompts/prompt-template.js';
const prompt = PromptTemplate.fromTemplate(
  "Translate the following to {language}: {text}"
);
const chain = prompt.pipe(llm);
const result = await chain.invoke({
  language: "Spanish",
  text: "Hello, how are you?"
});
console.log(result.content); // "Hola, ¿cómo estás?"
```
## Advanced Patterns
### Pattern 1: Model Pool (Reusing Loaded Models)
```javascript
class LLMPool {
  constructor() {
    this.models = new Map();
  }
  async get(modelPath, options = {}) {
    if (!this.models.has(modelPath)) {
      const llm = new LlamaCppLLM({ modelPath, ...options });
      await llm._initialize(); // Pre-load
      this.models.set(modelPath, llm);
    }
    return this.models.get(modelPath);
  }
  async disposeAll() {
    for (const llm of this.models.values()) {
      await llm.dispose();
    }
    this.models.clear();
  }
}
// Usage
const pool = new LLMPool();
const llm = await pool.get('./models/llama-3.1-8b.gguf');
```
### Pattern 2: Retry on Failure
```javascript
class ReliableLLM extends LlamaCppLLM {
  async _call(input, config = {}) {
    const maxRetries = config.maxRetries || 3;
    let lastError;
    for (let i = 0; i < maxRetries; i++) {
      try {
        return await super._call(input, config);
      } catch (error) {
        lastError = error;
        console.warn(`Attempt ${i + 1} failed, retrying...`);
        await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)));
      }
    }
    throw new Error(`All ${maxRetries} attempts failed: ${lastError.message}`);
  }
}
```
### Pattern 3: Token Counting
```javascript
class LlamaCppLLMWithCounting extends LlamaCppLLM {
  constructor(options) {
    super(options);
    this.totalTokens = 0;
  }
  async _call(input, config = {}) {
    const result = await super._call(input, config);
    // Rough token estimation (4 chars ≈ 1 token)
    const promptTokens = Math.ceil(JSON.stringify(input).length / 4);
    const completionTokens = Math.ceil(result.content.length / 4);
    this.totalTokens += promptTokens + completionTokens;
    result.additionalKwargs.usage = {
      promptTokens,
      completionTokens,
      totalTokens: promptTokens + completionTokens
    };
    return result;
  }
  getUsage() {
    return { totalTokens: this.totalTokens };
  }
}
```
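Usage of the counting subclass is the same as the base wrapper; the running total accumulates across calls:
```javascript
const llm = new LlamaCppLLMWithCounting({ modelPath: './model.gguf' });
const reply = await llm.invoke("Explain recursion in one sentence.");
console.log(reply.additionalKwargs.usage); // { promptTokens, completionTokens, totalTokens }
console.log(llm.getUsage());               // running total across all calls
```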
## Common Patterns and Best Practices
### ✅ DO:
```javascript
// Initialize once, use many times
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm.invoke("Question 1");
await llm.invoke("Question 2");
await llm.dispose(); // Cleanup when done
// Use Messages for structure
const messages = [
  new SystemMessage("You are helpful"),
  new HumanMessage("Hi")
];
// Clear history for independent calls
const response = await llm.invoke(messages, { clearHistory: true });
// Handle errors gracefully
try {
  const result = await llm.invoke(messages);
} catch (error) {
  console.error('Generation failed:', error);
}
```
### ❌ DON'T:
```javascript
// Don't create new LLM for each request (slow!)
for (const question of questions) {
  const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
  await llm.invoke(question); // Loads model every time!
}
// Don't forget to clear history in batch processing
// This will cause history contamination!
for (const q of questions) {
  await llm.invoke(q); // Sees all previous questions!
}
// Don't forget cleanup
// Missing: await llm.dispose()
```
## Performance Tips
### Tip 1: Preload Models
```javascript
// Load during app startup
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm._initialize(); // Force load now
// Later requests are instant
await llm.invoke("Fast response!");
```
### Tip 2: Use Batch Properly
```javascript
// This correctly isolates each question
const answers = await llm.batch(questions);
// Not: Sequential with contamination
for (const q of questions) {
  await llm.invoke(q); // History builds up!
}
```
### Tip 3: Adjust Context Size
```javascript
// Smaller context = faster, less memory
const fastLLM = new LlamaCppLLM({
  modelPath: './model.gguf',
  contextSize: 2048 // vs default 4096
});
```
### Tip 4: Use Appropriate Temperature
```javascript
// Factual answers: low temperature
const fact = await llm.invoke(query, { temperature: 0.1 });
// Creative writing: high temperature
const story = await llm.invoke(query, { temperature: 0.9 });
```
## Debugging Tips
### Tip 1: Enable Verbose Mode
```javascript
const llm = new LlamaCppLLM({
  modelPath: './model.gguf',
  verbose: true // Shows loading and generation details
});
```
### Tip 2: Test History Isolation
```javascript
// Test batch processing
const questions = ["Q1", "Q2", "Q3"];
const answers = await llm.batch(questions);
// Each answer should be independent
// If Q2 mentions Q1, history contamination occurred!
```
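A more concrete contamination check is to feed the model a fact in one batch item and probe for it in the next; if batching is isolated correctly, the second answer should not know the fact:
```javascript
const [first, second] = await llm.batch([
  "My name is Alice. Please greet me by name.",
  "What is my name?"
]);
console.log(first.content);  // should greet Alice
console.log(second.content); // should NOT mention Alice if history is isolated
```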
### Tip 3: Verify Streaming
```javascript
// Verify streaming works
console.log('Testing stream:');
for await (const chunk of llm.stream("Count to 5")) {
  console.log('Chunk:', chunk.content);
}
```
## Common Mistakes
### ❌ Mistake 1: Not Clearing History in Batches
```javascript
// Bad: History contamination
const pipeline = formatter.pipe(llm).pipe(parser);
const results = await pipeline.batch(inputs); // Q2 sees Q1!
```
**Fix**: The LlamaCppLLM.batch() method automatically clears history:
```javascript
// Good: Each input is isolated
const results = await llm.batch(inputs);
```
### ❌ Mistake 2: Forgetting the Random Seed
```javascript
// Bad: Temperature doesn't work
const response = await llm.invoke(prompt, { temperature: 0.9 });
// Without a random seed, you might get the same answer every time
```
**Fix**: Our implementation automatically adds a random seed:
```javascript
// Good: Randomness works properly
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}
```
### ❌ Mistake 3: Not Setting the System Prompt Properly
```javascript
// Bad: System prompt persists between calls
await llm.invoke([new SystemMessage("Be creative"), ...]);
await llm.invoke([new HumanMessage("Hi")]); // Still "creative"!
```
**Fix**: Always set the system prompt (empty string to clear):
```javascript
// Good: Always explicitly set or clear
this._chatSession.systemPrompt = systemPrompt || '';
```
## Mental Model
Think of the LLM wrapper as managing a conversation session:
```
Call 1: [System: "Be helpful", User: "Hi"]
  ↓
Model generates response
  ↓
Returns: AIMessage("Hello!")
Call 2: [User: "How are you?"]
  ↓
PROBLEM: Still has "Be helpful" system prompt!
PROBLEM: Might remember "Hi" conversation!
SOLUTION: Clear history + reset system prompt between calls
```
The wrapper handles:
- Loading models once
- Converting Messages to chat history
- Managing system prompts
- Clearing history when needed
- Streaming chunks
- Random seeds for temperature
- Error handling
## Summary
Congratulations! You now understand how to wrap a complex LLM library as a clean, composable Runnable with proper state management.
### Key Takeaways
1. **Lazy loading saves time**: Load models only when needed
2. **Messages enable structure**: Proper conversation formatting
3. **History isolation prevents bugs**: Critical for batch processing
4. **System prompts must be managed**: Always set or clear explicitly
5. **Streaming improves UX**: Real-time output feels responsive
6. **Random seeds enable temperature**: Required for randomness
7. **Chat wrappers add flexibility**: Support different models
8. **Sequential batch processing**: Local models can't truly parallelize
### What You Built
An LLM wrapper that:
- ✅ Loads models lazily
- ✅ Handles Messages properly
- ✅ Manages chat history correctly
- ✅ Isolates batches
- ✅ Supports streaming
- ✅ Handles system prompts
- ✅ Supports chat wrappers
- ✅ Adds random seeds for temperature
- ✅ Provides good error messages
### Critical Implementation Details
```javascript
// 1. Always clear and set system prompt
this._chatSession.systemPrompt = systemPrompt || '';
// 2. Use clearHistory for batch isolation
async batch(inputs, config = {}) {
  const results = [];
  for (const input of inputs) {
    const result = await this._call(input, {
      ...config,
      clearHistory: true
    });
    results.push(result);
  }
  return results;
}
// 3. Add random seed for temperature
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}
// 4. Use correct parameter names
customStopTriggers: config.stopStrings ?? this.stopStrings
```
### What's Next
In the next lesson, we'll explore **Context & Configuration** - how to pass state and settings through chains.
**Preview**: You'll learn:
- RunnableConfig object
- Callback systems
- Metadata tracking
- Debug modes
➡️ [Continue to Lesson 4: Context & Configuration](04-context.md)
## Additional Resources
- [node-llama-cpp Documentation](https://node-llama-cpp.withcat.ai)
- [Chat Wrappers Guide](https://node-llama-cpp.withcat.ai/guide/chat-wrapper)
- [Temperature Guide](https://node-llama-cpp.withcat.ai/guide/chat-session#temperature)
- [GGUF Model Format](https://huggingface.co/docs/hub/gguf)
## Questions & Discussion
**Q: Why do we always set the system prompt instead of only when present?**
A: To prevent contamination. If call 1 sets a system prompt but call 2 doesn't, call 2 would still use call 1's system prompt. Always setting it (even to an empty string) ensures clean state.
**Q: Why sequential batch processing instead of parallel?**
A: Local models (node-llama-cpp) can't run true parallel inference on a single model instance. The library serializes requests internally, so a parallel Promise.all() provides no benefit and can cause race conditions on the shared chat session.
**Q: Why do we need random seeds for temperature?**
A: The node-llama-cpp library states: "The randomness of the temperature can be controlled by the seed parameter. Setting a specific seed and a specific temperature will yield the same response every time for the same input." Without a random seed, high temperature might still give deterministic results.
**Q: Can I use multiple models simultaneously?**
A: Yes! Each LlamaCppLLM instance can have a different model. Just be aware of memory constraints - each model takes several GB of RAM.
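For example, two wrappers can coexist as long as both models fit in memory (the model paths below are illustrative):
```javascript
// Each instance loads and manages its own model
const chatLLM = new LlamaCppLLM({ modelPath: './models/llama-3.1-8b.gguf' });
const codeLLM = new LlamaCppLLM({ modelPath: './models/qwen2.5-coder-7b.gguf' });
const answer = await chatLLM.invoke("Summarize what a mutex is.");
const snippet = await codeLLM.invoke("Write a JavaScript debounce function.");
// Free several GB of RAM per model when finished
await chatLLM.dispose();
await codeLLM.dispose();
```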
**Q: What's the difference between customStopTriggers and stopStrings?**
A: `customStopTriggers` is the correct parameter name in node-llama-cpp. We accept `stopStrings` in our config for a more intuitive API, then map it to `customStopTriggers` internally.
---
**Built with ❤️ for learners who want to understand AI agents deeply**
[← Previous: Messages](02-messages.md) | [Tutorial Index](../README.md) | [Next: Context →](04-context.md)