The LLM Wrapper

Part 1: Foundation - Lesson 3

Wrapping node-llama-cpp as a Runnable for seamless integration

Overview

In Lesson 1, you learned about Runnables - the composable interface. In Lesson 2, you mastered Messages - the data structures. Now we'll connect these concepts by wrapping node-llama-cpp (our local LLM) as a Runnable that understands Messages.

By the end of this lesson, you'll have an LLM wrapper that can generate text, handle conversations, stream responses, and integrate seamlessly with chains.

Why Does This Matter?

The Problem: LLMs Don't Compose

node-llama-cpp is excellent at what it does - running local LLMs efficiently. But when you're building agents, you need more than just an LLM. You need components that work together seamlessly.

Without a composable framework:

import { getLlama, LlamaChatSession } from 'node-llama-cpp';

// Each component is isolated - they don't know about each other
async function myAgent(userInput) {
    // Step 1: Format the prompt
    const prompt = myCustomFormatter(userInput);

    // Step 2: Call the LLM
    const llama = await getLlama();
    const model = await llama.loadModel({ modelPath: './model.gguf' });
    const context = await model.createContext();
    const session = new LlamaChatSession({ contextSequence: context.getSequence() });
    const response = await session.prompt(prompt);

    // Step 3: Parse the response
    const parsed = myCustomParser(response);

    // Step 4: Maybe call a tool?
    if (parsed.needsTool) {
        const toolResult = await myTool(parsed.args);
        // Now what? Call the LLM again? How do we loop?
        // How do we add logging? Memory? Retries?
    }

    return parsed;
}

// Problems:
// - Can't reuse components
// - Can't chain operations
// - Hard to add logging, metrics, or debugging
// - Complex control flow for agents
// - Every new feature requires changing everything

With a composable framework:

// Components that work together
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });

// Simple usage
const response = await llm.invoke([
    new SystemMessage("You are helpful"),
    new HumanMessage("Hi")
]);
// Returns: AIMessage("Hello! How can I help you?")

// But the real power is composition
const agent = promptTemplate
    .pipe(llm)
    .pipe(outputParser)
    .pipe(toolExecutor);

// Now you can:
// βœ… Reuse components in different chains
// βœ… Add logging with callbacks (no code changes)
// βœ… Build complex agents that use tools
// βœ… Test each component independently
// βœ… Swap LLMs without rewriting everything

What the Wrapper Provides

The LLM wrapper isn't about making node-llama-cpp easier - it's about making it work with everything else:

  1. Common Interface: Same invoke() / stream() / batch() as every other component
  2. Message Support: Understands HumanMessage, AIMessage, SystemMessage
  3. Composability: Works with .pipe() to chain operations
  4. Observability: Callbacks work automatically for logging/metrics
  5. Configuration: Runtime settings pass through cleanly
  6. History Isolation: Proper batch processing without contamination

Think of it as an adapter that lets node-llama-cpp play nicely with the rest of your agent system.
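
To make this concrete, here's a small sketch (reusing the promptTemplate and outputParser components from the example above) of what "swap LLMs without rewriting everything" looks like:

// The chain doesn't care which model sits behind the Runnable interface
const chain = promptTemplate.pipe(llm).pipe(outputParser);

// Point a second wrapper at a different GGUF file (path is illustrative)
// and reuse the same chain definition unchanged
const smallLLM = new LlamaCppLLM({ modelPath: './models/small-model.gguf' });
const smallChain = promptTemplate.pipe(smallLLM).pipe(outputParser);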

Learning Objectives

By the end of this lesson, you will:

  • βœ… Understand how to wrap complex libraries as Runnables
  • βœ… Convert Messages to LLM chat history
  • βœ… Handle model loading and lifecycle
  • βœ… Implement streaming for real-time output
  • βœ… Add temperature and other generation parameters
  • βœ… Manage context windows and chat history
  • βœ… Handle batch processing with history isolation

Core Concepts

What is an LLM Wrapper?

An LLM wrapper is an abstraction layer that:

  1. Hides complexity - No need to manage contexts, sessions, or cleanup
  2. Provides a standard interface - Same API regardless of underlying model
  3. Handles conversion - Transforms Messages into model-specific chat history
  4. Manages resources - Automatic initialization and cleanup
  5. Enables composition - Works seamlessly in chains
  6. Isolates state - Prevents history contamination in batch processing

The Wrapper's Responsibilities

Input (Messages)
      ↓
[1. Convert to Chat History]
      ↓
[2. Manage System Prompt]
      ↓
[3. Call LLM]
      ↓
[4. Parse Response]
      ↓
Output (AIMessage)

Key Challenges

  1. Model Loading: Models are large and slow to load
  2. Chat History Format: Convert Messages to node-llama-cpp format
  3. System Prompt Management: Clear and set for each call
  4. Context Management: Limited context windows
  5. Streaming: Real-time output is complex
  6. Batch Isolation: Prevent history contamination
  7. Error Handling: Models can fail in various ways
  8. Chat Wrappers: Different models need different formats

Implementation Deep Dive

Let's build the LLM wrapper step by step.

Step 1: The Base Structure

Location: src/llm/llama-cpp-llm.js

import { Runnable } from './runnable.js';
import { AIMessage, HumanMessage } from './message.js';
import { getLlama, LlamaChatSession } from 'node-llama-cpp';

export class LlamaCppLLM extends Runnable {
  constructor(options = {}) {
    super();

    // Model configuration
    this.modelPath = options.modelPath;
    this.temperature = options.temperature ?? 0.7;
    this.maxTokens = options.maxTokens ?? 2048;
    this.contextSize = options.contextSize ?? 4096;
    this.batchSize = options.batchSize;

    // Optional sampling parameters (undefined falls back to library defaults)
    this.topP = options.topP;
    this.topK = options.topK;
    this.repeatPenalty = options.repeatPenalty;
    this.stopStrings = options.stopStrings;

    // Logging
    this.verbose = options.verbose ?? false;

    // Chat wrapper configuration (auto-detects by default)
    this.chatWrapper = options.chatWrapper ?? 'auto';

    // Internal state
    this._llama = null;
    this._model = null;
    this._context = null;
    this._chatSession = null;
    this._initialized = false;
  }

  async _call(input, config) {
    // Will implement next
  }
}

Key decisions:

  • Stores configuration (temperature, max tokens, etc.)
  • Supports custom chat wrappers (e.g., QwenChatWrapper)
  • Tracks internal state (model, context, session)
  • Lazy initialization (load on first use)
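
For example, constructing the wrapper is cheap - nothing heavy happens until the first call triggers _initialize() (defined in Step 2). A quick sketch:

// Construction only stores configuration - no model is loaded yet
const llm = new LlamaCppLLM({
  modelPath: './model.gguf',
  temperature: 0.7
});

// The first call pays the load cost; later calls reuse the loaded model
const first = await llm.invoke("Hello");      // slow: loads the model
const second = await llm.invoke("Hi again");  // fast: model already in memory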

Step 2: Model Initialization with Chat Wrapper Support

export class LlamaCppLLM extends Runnable {
  // ... constructor ...

  /**
   * Initialize the model (lazy loading)
   */
  async _initialize() {
    if (this._initialized) return;

    if (this.verbose) {
      console.log(`Loading model: ${this.modelPath}`);
    }

    try {
      // Step 1: Get llama instance
      this._llama = await getLlama();

      // Step 2: Load the model
      this._model = await this._llama.loadModel({
        modelPath: this.modelPath
      });

      // Step 3: Create context (working memory)
      this._context = await this._model.createContext({
        contextSize: this.contextSize,
        batchSize: this.batchSize
      });

      // Step 4: Create chat session with optional chat wrapper
      const contextSequence = this._context.getSequence();
      const sessionConfig = { contextSequence };

      // Add custom chat wrapper if specified
      if (this.chatWrapper !== 'auto') {
        sessionConfig.chatWrapper = this.chatWrapper;
      }

      this._chatSession = new LlamaChatSession(sessionConfig);

      this._initialized = true;

      if (this.verbose) {
        console.log('βœ“ Model loaded successfully');
        if (this.chatWrapper !== 'auto') {
          console.log(`βœ“ Using custom chat wrapper: ${this.chatWrapper.constructor.name}`);
        }
      }
    } catch (error) {
      throw new Error(
        `Failed to initialize model at ${this.modelPath}: ${error.message}`
      );
    }
  }

  /**
   * Cleanup resources
   */
  async dispose() {
    if (this._context) {
      await this._context.dispose();
      this._context = null;
    }
    if (this._model) {
      await this._model.dispose();
      this._model = null;
    }
    this._chatSession = null;
    this._initialized = false;

    if (this.verbose) {
      console.log('βœ“ Model resources disposed');
    }
  }
}

Why lazy loading?

  • Models take 5-30 seconds to load
  • Don't load until actually needed
  • Share one loaded model across multiple calls

Chat Wrapper Support:

  • Defaults to 'auto' (library auto-detects)
  • Supports custom wrappers like QwenChatWrapper for specific models
  • Useful for controlling model behavior (e.g., discouraging thoughts)
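
Putting it together, a typical lifecycle looks like this (a sketch; the model path is a placeholder):

const llm = new LlamaCppLLM({ modelPath: './model.gguf', verbose: true });

try {
  await llm._initialize();            // optional: force the load up front
  const reply = await llm.invoke("Warm-up question");
  console.log(reply.content);
} finally {
  await llm.dispose();                // free the context and model memory
}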

Step 3: Converting Messages to Chat History

export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  /**
   * Convert our Message objects to node-llama-cpp chat history format
   */
  _messagesToChatHistory(messages) {
    return messages.map(msg => {
      // System messages: instructions for the AI
      if (msg._type === 'system') {
        return { type: 'system', text: msg.content };
      }
      // Human messages: user input
      else if (msg._type === 'human') {
        return { type: 'user', text: msg.content };
      }
      // AI messages: previous AI responses
      else if (msg._type === 'ai') {
        return { type: 'model', response: msg.content };
      }
      // Tool messages: results from tool execution
      else if (msg._type === 'tool') {
        return { type: 'system', text: `Tool Result: ${msg.content}` };
      }

      // Fallback: treat unknown types as user messages
      return { type: 'user', text: msg.content };
    });
  }
}

Key insight: This is the bridge between your standardized Message types and the chat-history format node-llama-cpp expects. Different models may need different chat formats, which is why chat wrappers exist.
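
To make the mapping concrete, here is an illustrative conversion (the output shape follows the mapping above):

const messages = [
  new SystemMessage("You are a helpful math tutor."),
  new HumanMessage("What is 5*5?"),
  new AIMessage("5 times 5 is 25.")
];

llm._messagesToChatHistory(messages);
// [
//   { type: 'system', text: 'You are a helpful math tutor.' },
//   { type: 'user',   text: 'What is 5*5?' },
//   { type: 'model',  response: '5 times 5 is 25.' }
// ]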

Step 4: The Main Generation Method

export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  async _call(input, config = {}) {
    // Initialize if needed
    await this._initialize();

    // Clear history if requested (important for batch processing)
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }

    // Handle different input types
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }

    // Extract system message if present
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';

    // Convert our Message objects to llama.cpp format
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);

    // ALWAYS set system prompt (either new value or empty string to clear)
    this._chatSession.systemPrompt = systemPrompt;

    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };

      // Add random seed if temperature > 0 and no seed specified
      // This ensures randomness works properly
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }

      // Generate response using prompt
      const response = await this._chatSession.prompt('', promptOptions);

      // Return as AIMessage for consistency
      return new AIMessage(response);
    } catch (error) {
      throw new Error(`Generation failed: ${error.message}`);
    }
  }
}

Critical details:

  • Always clears and sets system prompt (prevents contamination)
  • Adds random seed for proper temperature behavior
  • Uses customStopTriggers (correct parameter name)
  • Supports clearHistory for batch processing
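
Because every option reads config first and falls back to the instance default, per-call overrides work without reconfiguring the wrapper. A usage sketch:

// Instance defaults: temperature 0.7, maxTokens 2048
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });

// Override options just for this call; stopStrings is mapped to
// customStopTriggers internally
const reply = await llm.invoke("List three colors:", {
  temperature: 0.2,
  maxTokens: 64,
  stopStrings: ['\n\n']
});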

Step 5: Batch Processing with History Isolation

export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  /**
   * Batch processing with history isolation
   * 
   * Processes multiple inputs sequentially, ensuring each gets 
   * a clean chat history to prevent contamination.
   */
  async batch(inputs, config = {}) {
    const results = [];
    for (const input of inputs) {
      // Clear history before each batch item
      const result = await this._call(input, { 
        ...config, 
        clearHistory: true 
      });
      results.push(result);
    }
    return results;
  }
}

Why sequential processing?

  • Local models can't run truly in parallel
  • Sequential ensures proper history isolation
  • Each item gets a clean slate

Step 6: Streaming Support

For real-time output (like ChatGPT's typing effect):

export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  async *_stream(input, config = {}) {
    await this._initialize();

    // Clear history if requested
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }

    // Handle input types (same as _call)
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }

    // Extract system message
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';

    // Set up chat history
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);

    // ALWAYS set system prompt
    this._chatSession.systemPrompt = systemPrompt;

    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };

      // Add random seed
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }

      // Collect chunks from the synchronous onTextChunk callback
      this._currentStreamChunks = [];
      promptOptions.onTextChunk = (chunk) => {
        this._currentStreamChunks.push(chunk);
      };

      // Start generation
      const responsePromise = this._chatSession.prompt('', promptOptions);

      // Yield chunks as they become available
      let lastYieldedIndex = 0;

      // Poll for new chunks
      while (true) {
        // Yield any new chunks
        while (lastYieldedIndex < this._currentStreamChunks.length) {
          yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
            additionalKwargs: { chunk: true }
          });
          lastYieldedIndex++;
        }

        // Check if generation is complete
        const isDone = await Promise.race([
          responsePromise.then(() => true),
          new Promise(resolve => setTimeout(() => resolve(false), 10))
        ]);

        if (isDone) {
          // Yield any remaining chunks
          while (lastYieldedIndex < this._currentStreamChunks.length) {
            yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
              additionalKwargs: { chunk: true }
            });
            lastYieldedIndex++;
          }
          break;
        }
      }

      // Wait for completion
      await responsePromise;

      // Clean up
      delete this._currentStreamChunks;

    } catch (error) {
      throw new Error(`Streaming failed: ${error.message}`);
    }
  }
}

Streaming challenges:

  • onTextChunk is a synchronous callback
  • Can't yield directly from callback
  • Use polling mechanism to yield as chunks arrive
  • 10ms polling interval balances responsiveness vs CPU usage
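
If the polling feels awkward, an alternative (a sketch of a hypothetical helper, not how the wrapper above is implemented) is to bridge the callback into the generator with a small promise-based queue, so chunks are awaited instead of polled:

// Hypothetical helper: turn onTextChunk callbacks into an async generator
async function* streamChunks(session, prompt, options) {
  const queue = [];
  let finished = false;
  let wake = () => {};

  // Start generation; each chunk is pushed into the queue by the callback
  const done = session.prompt(prompt, {
    ...options,
    onTextChunk: (chunk) => {
      queue.push(chunk);
      wake();                        // wake the consumer if it's waiting
    }
  });

  // Mark completion (success or failure) so the loop below can exit
  done.catch(() => {}).then(() => { finished = true; wake(); });

  while (!finished || queue.length > 0) {
    if (queue.length === 0) {
      await new Promise(resolve => { wake = resolve; });
    } else {
      yield queue.shift();
    }
  }

  await done;                        // re-throw any generation error
}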

Real-World Examples

Example 1: Simple Text Generation

const llm = new LlamaCppLLM({
  modelPath: './models/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf',
  temperature: 0.7,
  maxTokens: 100
});

// Simple string input
const response = await llm.invoke("What is 2+2?");
console.log(response.content); // "2+2 equals 4."

Example 2: Conversation with System Prompt

const messages = [
  new SystemMessage("You are a helpful math tutor."),
  new HumanMessage("What is 5*5?")
];

const response = await llm.invoke(messages);
console.log(response.content);
// "5 times 5 is 25. Here's a simple explanation..."

Example 3: Using Qwen Chat Wrapper

import { QwenChatWrapper } from 'node-llama-cpp';

const llm = new LlamaCppLLM({
  modelPath: './models/Qwen3-1.7B-Q6_K.gguf',
  temperature: 0.7,
  chatWrapper: new QwenChatWrapper({
    thoughts: 'discourage'  // Prevents thinking tokens
  })
});

const response = await llm.invoke("What is AI?");
// Response won't include <think> tokens

Example 4: Temperature Comparison

const question = "Give me one adjective to describe winter:";

// Low temperature - consistent answers
const lowTemp = await llm.invoke(question, { temperature: 0.1, clearHistory: true });
// Likely: "cold"

// High temperature - varied answers
const highTemp = await llm.invoke(question, { temperature: 0.9, clearHistory: true });
// Could be: "frosty", "snowy", "icy", "chilly"

Example 5: Streaming Output

console.log('Response: ');
for await (const chunk of llm.stream("Tell me a fun fact about space")) {
  process.stdout.write(chunk.content); // No newline
}
console.log('\n');

// Output streams in real-time as it's generated

Example 6: Batch Processing

const questions = [
  "What is Python?",
  "What is JavaScript?",
  "What is Rust?"
];

const answers = await llm.batch(questions);

questions.forEach((q, i) => {
  console.log(`Q: ${q}`);
  console.log(`A: ${answers[i].content}`);
  console.log();
});

// Each answer is independent - no history contamination!

Example 7: Using in a Pipeline

import { PromptTemplate } from '../prompts/prompt-template.js';

const prompt = PromptTemplate.fromTemplate(
  "Translate the following to {language}: {text}"
);

const chain = prompt.pipe(llm);

const result = await chain.invoke({
  language: "Spanish",
  text: "Hello, how are you?"
});

console.log(result.content); // "Hola, ΒΏcΓ³mo estΓ‘s?"

Advanced Patterns

Pattern 1: Model Pool (Reusing Loaded Models)

class LLMPool {
  constructor() {
    this.models = new Map();
  }

  async get(modelPath, options = {}) {
    if (!this.models.has(modelPath)) {
      const llm = new LlamaCppLLM({ modelPath, ...options });
      await llm._initialize(); // Pre-load
      this.models.set(modelPath, llm);
    }
    return this.models.get(modelPath);
  }

  async disposeAll() {
    for (const llm of this.models.values()) {
      await llm.dispose();
    }
    this.models.clear();
  }
}

// Usage
const pool = new LLMPool();
const llm = await pool.get('./models/llama-3.1-8b.gguf');

Pattern 2: Retry on Failure

class ReliableLLM extends LlamaCppLLM {
  async _call(input, config = {}) {
    const maxRetries = config.maxRetries || 3;
    let lastError;

    for (let i = 0; i < maxRetries; i++) {
      try {
        return await super._call(input, config);
      } catch (error) {
        lastError = error;
        console.warn(`Attempt ${i + 1} failed, retrying...`);
        await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)));
      }
    }

    throw new Error(`All ${maxRetries} attempts failed: ${lastError.message}`);
  }
}

Pattern 3: Token Counting

class LlamaCppLLMWithCounting extends LlamaCppLLM {
  constructor(options) {
    super(options);
    this.totalTokens = 0;
  }

  async _call(input, config = {}) {
    const result = await super._call(input, config);

    // Rough token estimation (4 chars β‰ˆ 1 token)
    const promptTokens = Math.ceil(JSON.stringify(input).length / 4);
    const completionTokens = Math.ceil(result.content.length / 4);

    this.totalTokens += promptTokens + completionTokens;

    result.additionalKwargs.usage = {
      promptTokens,
      completionTokens,
      totalTokens: promptTokens + completionTokens
    };

    return result;
  }

  getUsage() {
    return { totalTokens: this.totalTokens };
  }
}
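
A quick usage sketch of the counting variant (token numbers are illustrative):

const llm = new LlamaCppLLMWithCounting({ modelPath: './model.gguf' });

const reply = await llm.invoke("What is the capital of France?");
console.log(reply.additionalKwargs.usage);
// e.g. { promptTokens: 12, completionTokens: 5, totalTokens: 17 }

console.log(llm.getUsage()); // accumulates across calls: { totalTokens: 17 }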

Common Patterns and Best Practices

βœ… DO:

// Initialize once, use many times
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm.invoke("Question 1");
await llm.invoke("Question 2");
await llm.dispose(); // Cleanup when done

// Use Messages for structure
const messages = [
  new SystemMessage("You are helpful"),
  new HumanMessage("Hi")
];

// Clear history for independent calls
const response = await llm.invoke(messages, { clearHistory: true });

// Handle errors gracefully
try {
  const result = await llm.invoke(messages);
} catch (error) {
  console.error('Generation failed:', error);
}

❌ DON'T:

// Don't create new LLM for each request (slow!)
for (const question of questions) {
  const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
  await llm.invoke(question); // Loads model every time!
}

// Don't forget to clear history in batch processing
// This will cause history contamination!
for (const q of questions) {
  await llm.invoke(q); // Sees all previous questions!
}

// Don't forget cleanup
// Missing: await llm.dispose()

Performance Tips

Tip 1: Preload Models

// Load during app startup
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm._initialize(); // Force load now

// Later requests are instant
await llm.invoke("Fast response!");

Tip 2: Use Batch Properly

// This correctly isolates each question
const answers = await llm.batch(questions);

// Not: Sequential with contamination
for (const q of questions) {
  await llm.invoke(q); // History builds up!
}

Tip 3: Adjust Context Size

// Smaller context = faster, less memory
const fastLLM = new LlamaCppLLM({
  modelPath: './model.gguf',
  contextSize: 2048  // vs default 4096
});

Tip 4: Use Appropriate Temperature

// Factual answers: low temperature
const fact = await llm.invoke(query, { temperature: 0.1 });

// Creative writing: high temperature
const story = await llm.invoke(query, { temperature: 0.9 });

Debugging Tips

Tip 1: Enable Verbose Mode

const llm = new LlamaCppLLM({
  modelPath: './model.gguf',
  verbose: true  // Shows loading and generation details
});

Tip 2: Test History Isolation

// Test batch processing
const questions = ["Q1", "Q2", "Q3"];
const answers = await llm.batch(questions);

// Each answer should be independent
// If Q2 mentions Q1, history contamination occurred!

Tip 3: Verify Streaming

// Verify streaming works
console.log('Testing stream:');
for await (const chunk of llm.stream("Count to 5")) {
  console.log('Chunk:', chunk.content);
}

Common Mistakes

❌ Mistake 1: Not Clearing History in Batches

// Bad: History contamination
const pipeline = formatter.pipe(llm).pipe(parser);
const results = await pipeline.batch(inputs); // Q2 sees Q1!

Fix: The LlamaCppLLM.batch() method automatically clears history:

// Good: Each input is isolated
const results = await llm.batch(inputs);

❌ Mistake 2: Forgetting Random Seed

// Bad: temperature appears to have no effect
const response = await llm.invoke(prompt, { temperature: 0.9 });
// Without a random seed, you can get the same answer every time

Fix: Our implementation automatically adds a random seed:

// Good: Randomness works properly
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}

❌ Mistake 3: Not Setting System Prompt Properly

// Bad: System prompt persists between calls
await llm.invoke([new SystemMessage("Be creative"), ...]);
await llm.invoke([new HumanMessage("Hi")]); // Still "creative"!

Fix: Always set system prompt (empty string to clear):

// Good: Always explicitly set or clear
this._chatSession.systemPrompt = systemPrompt || '';

Mental Model

Think of the LLM wrapper as managing a conversation session:

Call 1: [System: "Be helpful", User: "Hi"]
        ↓
      Model generates response
        ↓
      Returns: AIMessage("Hello!")

Call 2: [User: "How are you?"]  
        ↓
      PROBLEM: Still has "Be helpful" system prompt!
      PROBLEM: Might remember "Hi" conversation!
        
SOLUTION: Clear history + reset system prompt between calls

The wrapper handles:

  • Loading models once
  • Converting Messages to chat history
  • Managing system prompts
  • Clearing history when needed
  • Streaming chunks
  • Random seeds for temperature
  • Error handling

Summary

Congratulations! You now understand how to wrap a complex LLM library as a clean, composable Runnable with proper state management.

Key Takeaways

  1. Lazy loading saves time: Load models only when needed
  2. Messages enable structure: Proper conversation formatting
  3. History isolation prevents bugs: Critical for batch processing
  4. System prompts must be managed: Always set or clear explicitly
  5. Streaming improves UX: Real-time output feels responsive
  6. Random seeds enable temperature: Required for randomness
  7. Chat wrappers add flexibility: Support different models
  8. Sequential batch processing: Local models can't truly parallelize

What You Built

An LLM wrapper that:

  • βœ… Loads models lazily
  • βœ… Handles Messages properly
  • βœ… Manages chat history correctly
  • βœ… Isolates batches
  • βœ… Supports streaming
  • βœ… Handles system prompts
  • βœ… Supports chat wrappers
  • βœ… Adds random seeds for temperature
  • βœ… Provides good error messages

Critical Implementation Details

// 1. Always clear and set system prompt
this._chatSession.systemPrompt = systemPrompt || '';

// 2. Use clearHistory for batch isolation
async batch(inputs, config = {}) {
  const results = [];
  for (const input of inputs) {
    const result = await this._call(input, { 
      ...config, 
      clearHistory: true 
    });
    results.push(result);
  }
  return results;
}

// 3. Add random seed for temperature
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}

// 4. Use correct parameter names
customStopTriggers: config.stopStrings ?? this.stopStrings

What's Next

In the next lesson, we'll explore Context & Configuration - how to pass state and settings through chains.

Preview: You'll learn:

  • RunnableConfig object
  • Callback systems
  • Metadata tracking
  • Debug modes

➑️ Continue to Lesson 4: Context & Configuration

Additional Resources

Questions & Discussion

Q: Why do we always set system prompt instead of only when present?

A: To prevent contamination. If call 1 sets a system prompt but call 2 doesn't, call 2 would still use call 1's system prompt. Always setting (even to empty string) ensures clean state.

Q: Why sequential batch processing instead of parallel?

A: Local models (node-llama-cpp) can't run true parallel inference on a single model instance. The library serializes requests internally, so parallel Promise.all() provides no benefit and can cause race conditions on the shared chat session.

Q: Why do we need random seeds for temperature?

A: The node-llama-cpp library states: "The randomness of the temperature can be controlled by the seed parameter. Setting a specific seed and a specific temperature will yield the same response every time for the same input." Without a random seed, high temperature might still give deterministic results.
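
Conversely, if you want reproducible output, pass an explicit seed and the wrapper uses it instead of generating a random one:

// Same input + same seed + same temperature => same response every run
const reply = await llm.invoke("Name a color:", { temperature: 0.8, seed: 42 });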

Q: Can I use multiple models simultaneously?

A: Yes! Each LlamaCppLLM instance can have a different model. Just be aware of memory constraints - each model takes several GB of RAM.

Q: What's the difference between customStopTriggers and stopStrings?

A: customStopTriggers is the correct parameter name in node-llama-cpp. We accept stopStrings in our config for a more intuitive API, then map it to customStopTriggers internally.


Built with ❀️ for learners who want to understand AI agents deeply

← Previous: Messages | Tutorial Index | Next: Context β†’