The LLM Wrapper

Part 1: Foundation - Lesson 3

Wrapping node-llama-cpp as a Runnable for seamless integration

Overview

In Lesson 1, you learned about Runnables - the composable interface. In Lesson 2, you mastered Messages - the data structures. Now we'll connect these concepts by wrapping node-llama-cpp (our local LLM) as a Runnable that understands Messages.

By the end of this lesson, you'll have an LLM wrapper that can generate text, handle conversations, stream responses, and integrate seamlessly with chains.

Why Does This Matter?

The Problem: LLMs Don't Compose

node-llama-cpp is excellent at what it does - running local LLMs efficiently. But when you're building agents, you need more than just an LLM. You need components that work together seamlessly.

Without a composable framework:

import { getLlama, LlamaChatSession } from 'node-llama-cpp';

// Each component is isolated - they don't know about each other
async function myAgent(userInput) {
    // Step 1: Format the prompt
    const prompt = myCustomFormatter(userInput);

    // Step 2: Call the LLM
    const llama = await getLlama();
    const model = await llama.loadModel({ modelPath: './model.gguf' });
    const context = await model.createContext();
    const session = new LlamaChatSession({ contextSequence: context.getSequence() });
    const response = await session.prompt(prompt);

    // Step 3: Parse the response
    const parsed = myCustomParser(response);

    // Step 4: Maybe call a tool?
    if (parsed.needsTool) {
        const toolResult = await myTool(parsed.args);
        // Now what? Call the LLM again? How do we loop?
        // How do we add logging? Memory? Retries?
    }

    return parsed;
}

// Problems:
// - Can't reuse components
// - Can't chain operations
// - Hard to add logging, metrics, or debugging
// - Complex control flow for agents
// - Every new feature requires changing everything

With a composable framework:

// Components that work together
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });

// Simple usage
const response = await llm.invoke([
    new SystemMessage("You are helpful"),
    new HumanMessage("Hi")
]);
// Returns: AIMessage("Hello! How can I help you?")

// But the real power is composition
const agent = promptTemplate
    .pipe(llm)
    .pipe(outputParser)
    .pipe(toolExecutor);

// Now you can:
// βœ… Reuse components in different chains
// βœ… Add logging with callbacks (no code changes)
// βœ… Build complex agents that use tools
// βœ… Test each component independently
// βœ… Swap LLMs without rewriting everything

What the Wrapper Provides

The LLM wrapper isn't about making node-llama-cpp easier - it's about making it work with everything else:

  1. Common Interface: Same invoke() / stream() / batch() as every other component
  2. Message Support: Understands HumanMessage, AIMessage, SystemMessage
  3. Composability: Works with .pipe() to chain operations
  4. Observability: Callbacks work automatically for logging/metrics
  5. Configuration: Runtime settings pass through cleanly
  6. History Isolation: Proper batch processing without contamination

Think of it as an adapter that lets node-llama-cpp play nicely with the rest of your agent system.
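
To make this concrete, here's a small sketch (reusing the promptTemplate and outputParser components from the example above) of what "swap LLMs without rewriting everything" looks like:

// The chain doesn't care which model sits behind the Runnable interface
const chain = promptTemplate.pipe(llm).pipe(outputParser);

// Point a second wrapper at a different GGUF file (path is illustrative)
// and reuse the same chain definition unchanged
const smallLLM = new LlamaCppLLM({ modelPath: './models/small-model.gguf' });
const smallChain = promptTemplate.pipe(smallLLM).pipe(outputParser);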

Learning Objectives

By the end of this lesson, you will:

  • βœ… Understand how to wrap complex libraries as Runnables
  • βœ… Convert Messages to LLM chat history
  • βœ… Handle model loading and lifecycle
  • βœ… Implement streaming for real-time output
  • βœ… Add temperature and other generation parameters
  • βœ… Manage context windows and chat history
  • βœ… Handle batch processing with history isolation

Core Concepts

What is an LLM Wrapper?

An LLM wrapper is an abstraction layer that:

  1. Hides complexity - No need to manage contexts, sessions, or cleanup
  2. Provides a standard interface - Same API regardless of underlying model
  3. Handles conversion - Transforms Messages into model-specific chat history
  4. Manages resources - Automatic initialization and cleanup
  5. Enables composition - Works seamlessly in chains
  6. Isolates state - Prevents history contamination in batch processing

The Wrapper's Responsibilities

Input (Messages)
      ↓
[1. Convert to Chat History]
      ↓
[2. Manage System Prompt]
      ↓
[3. Call LLM]
      ↓
[4. Parse Response]
      ↓
Output (AIMessage)

Key Challenges

  1. Model Loading: Models are large and slow to load
  2. Chat History Format: Convert Messages to node-llama-cpp format
  3. System Prompt Management: Clear and set for each call
  4. Context Management: Limited context windows
  5. Streaming: Real-time output is complex
  6. Batch Isolation: Prevent history contamination
  7. Error Handling: Models can fail in various ways
  8. Chat Wrappers: Different models need different formats

Implementation Deep Dive

Let's build the LLM wrapper step by step.

Step 1: The Base Structure

Location: src/llm/llama-cpp-llm.js

import { Runnable } from './runnable.js';
import { AIMessage, HumanMessage } from './message.js';
import { getLlama, LlamaChatSession } from 'node-llama-cpp';

export class LlamaCppLLM extends Runnable {
  constructor(options = {}) {
    super();

    // Model configuration
    this.modelPath = options.modelPath;
    this.temperature = options.temperature ?? 0.7;
    this.maxTokens = options.maxTokens ?? 2048;
    this.contextSize = options.contextSize ?? 4096;
    this.batchSize = options.batchSize;

    // Optional sampling parameters (undefined falls back to library defaults)
    this.topP = options.topP;
    this.topK = options.topK;
    this.repeatPenalty = options.repeatPenalty;
    this.stopStrings = options.stopStrings;

    // Logging
    this.verbose = options.verbose ?? false;

    // Chat wrapper configuration (auto-detects by default)
    this.chatWrapper = options.chatWrapper ?? 'auto';

    // Internal state
    this._llama = null;
    this._model = null;
    this._context = null;
    this._chatSession = null;
    this._initialized = false;
  }

  async _call(input, config) {
    // Will implement next
  }
}

Key decisions:

  • Stores configuration (temperature, max tokens, etc.)
  • Supports custom chat wrappers (e.g., QwenChatWrapper)
  • Tracks internal state (model, context, session)
  • Lazy initialization (load on first use)
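
For example, constructing the wrapper is cheap - nothing heavy happens until the first call triggers _initialize() (defined in Step 2). A quick sketch:

// Construction only stores configuration - no model is loaded yet
const llm = new LlamaCppLLM({
  modelPath: './model.gguf',
  temperature: 0.7
});

// The first call pays the load cost; later calls reuse the loaded model
const first = await llm.invoke("Hello");      // slow: loads the model
const second = await llm.invoke("Hi again");  // fast: model already in memory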

Step 2: Model Initialization with Chat Wrapper Support

export class LlamaCppLLM extends Runnable {
  // ... constructor ...

  /**
   * Initialize the model (lazy loading)
   */
  async _initialize() {
    if (this._initialized) return;

    if (this.verbose) {
      console.log(`Loading model: ${this.modelPath}`);
    }

    try {
      // Step 1: Get llama instance
      this._llama = await getLlama();

      // Step 2: Load the model
      this._model = await this._llama.loadModel({
        modelPath: this.modelPath
      });

      // Step 3: Create context (working memory)
      this._context = await this._model.createContext({
        contextSize: this.contextSize,
        batchSize: this.batchSize
      });

      // Step 4: Create chat session with optional chat wrapper
      const contextSequence = this._context.getSequence();
      const sessionConfig = { contextSequence };

      // Add custom chat wrapper if specified
      if (this.chatWrapper !== 'auto') {
        sessionConfig.chatWrapper = this.chatWrapper;
      }

      this._chatSession = new LlamaChatSession(sessionConfig);

      this._initialized = true;

      if (this.verbose) {
        console.log('βœ“ Model loaded successfully');
        if (this.chatWrapper !== 'auto') {
          console.log(`βœ“ Using custom chat wrapper: ${this.chatWrapper.constructor.name}`);
        }
      }
    } catch (error) {
      throw new Error(
        `Failed to initialize model at ${this.modelPath}: ${error.message}`
      );
    }
  }

  /**
   * Cleanup resources
   */
  async dispose() {
    if (this._context) {
      await this._context.dispose();
      this._context = null;
    }
    if (this._model) {
      await this._model.dispose();
      this._model = null;
    }
    this._chatSession = null;
    this._initialized = false;

    if (this.verbose) {
      console.log('βœ“ Model resources disposed');
    }
  }
}

Why lazy loading?

  • Models take 5-30 seconds to load
  • Don't load until actually needed
  • Share one loaded model across multiple calls

Chat Wrapper Support:

  • Defaults to 'auto' (library auto-detects)
  • Supports custom wrappers like QwenChatWrapper for specific models
  • Useful for controlling model behavior (e.g., discouraging thoughts)
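
Putting it together, a typical lifecycle looks like this (a sketch; the model path is a placeholder):

const llm = new LlamaCppLLM({ modelPath: './model.gguf', verbose: true });

try {
  await llm._initialize();            // optional: force the load up front
  const reply = await llm.invoke("Warm-up question");
  console.log(reply.content);
} finally {
  await llm.dispose();                // free the context and model memory
}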

Step 3: Converting Messages to Chat History

export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  /**
   * Convert our Message objects to node-llama-cpp chat history format
   */
  _messagesToChatHistory(messages) {
    return messages.map(msg => {
      // System messages: instructions for the AI
      if (msg._type === 'system') {
        return { type: 'system', text: msg.content };
      }
      // Human messages: user input
      else if (msg._type === 'human') {
        return { type: 'user', text: msg.content };
      }
      // AI messages: previous AI responses
      else if (msg._type === 'ai') {
        return { type: 'model', response: msg.content };
      }
      // Tool messages: results from tool execution
      else if (msg._type === 'tool') {
        return { type: 'system', text: `Tool Result: ${msg.content}` };
      }

      // Fallback: treat unknown types as user messages
      return { type: 'user', text: msg.content };
    });
  }
}

Key insight: This is the bridge between your standardized Message types and the chat-history format node-llama-cpp expects. Different models may need different chat formats, which is why chat wrappers exist.
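
To make the mapping concrete, here is an illustrative conversion (the output shape follows the mapping above):

const messages = [
  new SystemMessage("You are a helpful math tutor."),
  new HumanMessage("What is 5*5?"),
  new AIMessage("5 times 5 is 25.")
];

llm._messagesToChatHistory(messages);
// [
//   { type: 'system', text: 'You are a helpful math tutor.' },
//   { type: 'user',   text: 'What is 5*5?' },
//   { type: 'model',  response: '5 times 5 is 25.' }
// ]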

Step 4: The Main Generation Method

export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  async _call(input, config = {}) {
    // Initialize if needed
    await this._initialize();

    // Clear history if requested (important for batch processing)
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }

    // Handle different input types
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }

    // Extract system message if present
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';

    // Convert our Message objects to llama.cpp format
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);

    // ALWAYS set system prompt (either new value or empty string to clear)
    this._chatSession.systemPrompt = systemPrompt;

    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };

      // Add random seed if temperature > 0 and no seed specified
      // This ensures randomness works properly
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }

      // Generate response using prompt
      const response = await this._chatSession.prompt('', promptOptions);

      // Return as AIMessage for consistency
      return new AIMessage(response);
    } catch (error) {
      throw new Error(`Generation failed: ${error.message}`);
    }
  }
}

Critical details:

  • Always clears and sets system prompt (prevents contamination)
  • Adds random seed for proper temperature behavior
  • Uses customStopTriggers (correct parameter name)
  • Supports clearHistory for batch processing
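
Because every option reads config first and falls back to the instance default, per-call overrides work without reconfiguring the wrapper. A usage sketch:

// Instance defaults: temperature 0.7, maxTokens 2048
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });

// Override options just for this call; stopStrings is mapped to
// customStopTriggers internally
const reply = await llm.invoke("List three colors:", {
  temperature: 0.2,
  maxTokens: 64,
  stopStrings: ['\n\n']
});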

Step 5: Batch Processing with History Isolation

export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  /**
   * Batch processing with history isolation
   * 
   * Processes multiple inputs sequentially, ensuring each gets 
   * a clean chat history to prevent contamination.
   */
  async batch(inputs, config = {}) {
    const results = [];
    for (const input of inputs) {
      // Clear history before each batch item
      const result = await this._call(input, { 
        ...config, 
        clearHistory: true 
      });
      results.push(result);
    }
    return results;
  }
}

Why sequential processing?

  • Local models can't run truly in parallel
  • Sequential ensures proper history isolation
  • Each item gets a clean slate

Step 6: Streaming Support

For real-time output (like ChatGPT's typing effect):

export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  async *_stream(input, config = {}) {
    await this._initialize();

    // Clear history if requested
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }

    // Handle input types (same as _call)
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }

    // Extract system message
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';

    // Set up chat history
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);

    // ALWAYS set system prompt
    this._chatSession.systemPrompt = systemPrompt;

    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };

      // Add random seed
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }

      // Collect chunks from the synchronous onTextChunk callback
      this._currentStreamChunks = [];
      promptOptions.onTextChunk = (chunk) => {
        this._currentStreamChunks.push(chunk);
      };

      // Start generation
      const responsePromise = this._chatSession.prompt('', promptOptions);

      // Yield chunks as they become available
      let lastYieldedIndex = 0;

      // Poll for new chunks
      while (true) {
        // Yield any new chunks
        while (lastYieldedIndex < this._currentStreamChunks.length) {
          yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
            additionalKwargs: { chunk: true }
          });
          lastYieldedIndex++;
        }

        // Check if generation is complete
        const isDone = await Promise.race([
          responsePromise.then(() => true),
          new Promise(resolve => setTimeout(() => resolve(false), 10))
        ]);

        if (isDone) {
          // Yield any remaining chunks
          while (lastYieldedIndex < this._currentStreamChunks.length) {
            yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
              additionalKwargs: { chunk: true }
            });
            lastYieldedIndex++;
          }
          break;
        }
      }

      // Wait for completion
      await responsePromise;

      // Clean up
      delete this._currentStreamChunks;

    } catch (error) {
      throw new Error(`Streaming failed: ${error.message}`);
    }
  }
}

Streaming challenges:

  • onTextChunk is a synchronous callback
  • Can't yield directly from callback
  • Use polling mechanism to yield as chunks arrive
  • 10ms polling interval balances responsiveness vs CPU usage
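
If the polling feels awkward, an alternative (a sketch of a hypothetical helper, not how the wrapper above is implemented) is to bridge the callback into the generator with a small promise-based queue, so chunks are awaited instead of polled:

// Hypothetical helper: turn onTextChunk callbacks into an async generator
async function* streamChunks(session, prompt, options) {
  const queue = [];
  let finished = false;
  let wake = () => {};

  // Start generation; each chunk is pushed into the queue by the callback
  const done = session.prompt(prompt, {
    ...options,
    onTextChunk: (chunk) => {
      queue.push(chunk);
      wake();                        // wake the consumer if it's waiting
    }
  });

  // Mark completion (success or failure) so the loop below can exit
  done.catch(() => {}).then(() => { finished = true; wake(); });

  while (!finished || queue.length > 0) {
    if (queue.length === 0) {
      await new Promise(resolve => { wake = resolve; });
    } else {
      yield queue.shift();
    }
  }

  await done;                        // re-throw any generation error
}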

Real-World Examples

Example 1: Simple Text Generation

const llm = new LlamaCppLLM({
  modelPath: './models/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf',
  temperature: 0.7,
  maxTokens: 100
});

// Simple string input
const response = await llm.invoke("What is 2+2?");
console.log(response.content); // "2+2 equals 4."

Example 2: Conversation with System Prompt

const messages = [
  new SystemMessage("You are a helpful math tutor."),
  new HumanMessage("What is 5*5?")
];

const response = await llm.invoke(messages);
console.log(response.content);
// "5 times 5 is 25. Here's a simple explanation..."

Example 3: Using Qwen Chat Wrapper

import { QwenChatWrapper } from 'node-llama-cpp';

const llm = new LlamaCppLLM({
  modelPath: './models/Qwen3-1.7B-Q6_K.gguf',
  temperature: 0.7,
  chatWrapper: new QwenChatWrapper({
    thoughts: 'discourage'  // Prevents thinking tokens
  })
});

const response = await llm.invoke("What is AI?");
// Response won't include <think> tokens

Example 4: Temperature Comparison

const question = "Give me one adjective to describe winter:";

// Low temperature - consistent answers
const lowTemp = await llm.invoke(question, { temperature: 0.1, clearHistory: true });
// Likely: "cold"

// High temperature - varied answers
const highTemp = await llm.invoke(question, { temperature: 0.9, clearHistory: true });
// Could be: "frosty", "snowy", "icy", "chilly"

Example 5: Streaming Output

console.log('Response: ');
for await (const chunk of llm.stream("Tell me a fun fact about space")) {
  process.stdout.write(chunk.content); // No newline
}
console.log('\n');

// Output streams in real-time as it's generated

Example 6: Batch Processing

const questions = [
  "What is Python?",
  "What is JavaScript?",
  "What is Rust?"
];

const answers = await llm.batch(questions);

questions.forEach((q, i) => {
  console.log(`Q: ${q}`);
  console.log(`A: ${answers[i].content}`);
  console.log();
});

// Each answer is independent - no history contamination!

Example 7: Using in a Pipeline

import { PromptTemplate } from '../prompts/prompt-template.js';

const prompt = PromptTemplate.fromTemplate(
  "Translate the following to {language}: {text}"
);

const chain = prompt.pipe(llm);

const result = await chain.invoke({
  language: "Spanish",
  text: "Hello, how are you?"
});

console.log(result.content); // "Hola, ΒΏcΓ³mo estΓ‘s?"

Advanced Patterns

Pattern 1: Model Pool (Reusing Loaded Models)

class LLMPool {
  constructor() {
    this.models = new Map();
  }

  async get(modelPath, options = {}) {
    if (!this.models.has(modelPath)) {
      const llm = new LlamaCppLLM({ modelPath, ...options });
      await llm._initialize(); // Pre-load
      this.models.set(modelPath, llm);
    }
    return this.models.get(modelPath);
  }

  async disposeAll() {
    for (const llm of this.models.values()) {
      await llm.dispose();
    }
    this.models.clear();
  }
}

// Usage
const pool = new LLMPool();
const llm = await pool.get('./models/llama-3.1-8b.gguf');

Pattern 2: Retry on Failure

class ReliableLLM extends LlamaCppLLM {
  async _call(input, config = {}) {
    const maxRetries = config.maxRetries || 3;
    let lastError;

    for (let i = 0; i < maxRetries; i++) {
      try {
        return await super._call(input, config);
      } catch (error) {
        lastError = error;
        console.warn(`Attempt ${i + 1} failed, retrying...`);
        await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)));
      }
    }

    throw new Error(`All ${maxRetries} attempts failed: ${lastError.message}`);
  }
}

Pattern 3: Token Counting

class LlamaCppLLMWithCounting extends LlamaCppLLM {
  constructor(options) {
    super(options);
    this.totalTokens = 0;
  }

  async _call(input, config = {}) {
    const result = await super._call(input, config);

    // Rough token estimation (4 chars β‰ˆ 1 token)
    const promptTokens = Math.ceil(JSON.stringify(input).length / 4);
    const completionTokens = Math.ceil(result.content.length / 4);

    this.totalTokens += promptTokens + completionTokens;

    result.additionalKwargs.usage = {
      promptTokens,
      completionTokens,
      totalTokens: promptTokens + completionTokens
    };

    return result;
  }

  getUsage() {
    return { totalTokens: this.totalTokens };
  }
}
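
A quick usage sketch of the counting variant (token numbers are illustrative):

const llm = new LlamaCppLLMWithCounting({ modelPath: './model.gguf' });

const reply = await llm.invoke("What is the capital of France?");
console.log(reply.additionalKwargs.usage);
// e.g. { promptTokens: 12, completionTokens: 5, totalTokens: 17 }

console.log(llm.getUsage()); // accumulates across calls: { totalTokens: 17 }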

Common Patterns and Best Practices

βœ… DO:

// Initialize once, use many times
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm.invoke("Question 1");
await llm.invoke("Question 2");
await llm.dispose(); // Cleanup when done

// Use Messages for structure
const messages = [
  new SystemMessage("You are helpful"),
  new HumanMessage("Hi")
];

// Clear history for independent calls
const response = await llm.invoke(messages, { clearHistory: true });

// Handle errors gracefully
try {
  const result = await llm.invoke(messages);
} catch (error) {
  console.error('Generation failed:', error);
}

❌ DON'T:

// Don't create new LLM for each request (slow!)
for (const question of questions) {
  const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
  await llm.invoke(question); // Loads model every time!
}

// Don't forget to clear history in batch processing
// This will cause history contamination!
for (const q of questions) {
  await llm.invoke(q); // Sees all previous questions!
}

// Don't forget cleanup
// Missing: await llm.dispose()

Performance Tips

Tip 1: Preload Models

// Load during app startup
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm._initialize(); // Force load now

// Later requests are instant
await llm.invoke("Fast response!");

Tip 2: Use Batch Properly

// This correctly isolates each question
const answers = await llm.batch(questions);

// Not: Sequential with contamination
for (const q of questions) {
  await llm.invoke(q); // History builds up!
}

Tip 3: Adjust Context Size

// Smaller context = faster, less memory
const fastLLM = new LlamaCppLLM({
  modelPath: './model.gguf',
  contextSize: 2048  // vs default 4096
});

Tip 4: Use Appropriate Temperature

// Factual answers: low temperature
const fact = await llm.invoke(query, { temperature: 0.1 });

// Creative writing: high temperature
const story = await llm.invoke(query, { temperature: 0.9 });

Debugging Tips

Tip 1: Enable Verbose Mode

const llm = new LlamaCppLLM({
  modelPath: './model.gguf',
  verbose: true  // Shows loading and generation details
});

Tip 2: Test History Isolation

// Test batch processing
const questions = ["Q1", "Q2", "Q3"];
const answers = await llm.batch(questions);

// Each answer should be independent
// If Q2 mentions Q1, history contamination occurred!

Tip 3: Verify Streaming

// Verify streaming works
console.log('Testing stream:');
for await (const chunk of llm.stream("Count to 5")) {
  console.log('Chunk:', chunk.content);
}

Common Mistakes

❌ Mistake 1: Not Clearing History in Batches

// Bad: History contamination
const pipeline = formatter.pipe(llm).pipe(parser);
const results = await pipeline.batch(inputs); // Q2 sees Q1!

Fix: The LlamaCppLLM.batch() method automatically clears history:

// Good: Each input is isolated
const results = await llm.batch(inputs);

❌ Mistake 2: Forgetting Random Seed

// Bad: temperature appears to have no effect
const response = await llm.invoke(prompt, { temperature: 0.9 });
// Without a random seed, you can get the same answer every time

Fix: Our implementation automatically adds a random seed:

// Good: Randomness works properly
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}

❌ Mistake 3: Not Setting System Prompt Properly

// Bad: System prompt persists between calls
await llm.invoke([new SystemMessage("Be creative"), ...]);
await llm.invoke([new HumanMessage("Hi")]); // Still "creative"!

Fix: Always set system prompt (empty string to clear):

// Good: Always explicitly set or clear
this._chatSession.systemPrompt = systemPrompt || '';

Mental Model

Think of the LLM wrapper as managing a conversation session:

Call 1: [System: "Be helpful", User: "Hi"]
        ↓
      Model generates response
        ↓
      Returns: AIMessage("Hello!")

Call 2: [User: "How are you?"]  
        ↓
      PROBLEM: Still has "Be helpful" system prompt!
      PROBLEM: Might remember "Hi" conversation!
        
SOLUTION: Clear history + reset system prompt between calls

The wrapper handles:

  • Loading models once
  • Converting Messages to chat history
  • Managing system prompts
  • Clearing history when needed
  • Streaming chunks
  • Random seeds for temperature
  • Error handling

Summary

Congratulations! You now understand how to wrap a complex LLM library as a clean, composable Runnable with proper state management.

Key Takeaways

  1. Lazy loading saves time: Load models only when needed
  2. Messages enable structure: Proper conversation formatting
  3. History isolation prevents bugs: Critical for batch processing
  4. System prompts must be managed: Always set or clear explicitly
  5. Streaming improves UX: Real-time output feels responsive
  6. Random seeds enable temperature: Required for randomness
  7. Chat wrappers add flexibility: Support different models
  8. Sequential batch processing: Local models can't truly parallelize

What You Built

An LLM wrapper that:

  • βœ… Loads models lazily
  • βœ… Handles Messages properly
  • βœ… Manages chat history correctly
  • βœ… Isolates batches
  • βœ… Supports streaming
  • βœ… Handles system prompts
  • βœ… Supports chat wrappers
  • βœ… Adds random seeds for temperature
  • βœ… Provides good error messages

Critical Implementation Details

// 1. Always clear and set system prompt
this._chatSession.systemPrompt = systemPrompt || '';

// 2. Use clearHistory for batch isolation
async batch(inputs, config = {}) {
  const results = [];
  for (const input of inputs) {
    const result = await this._call(input, { 
      ...config, 
      clearHistory: true 
    });
    results.push(result);
  }
  return results;
}

// 3. Add random seed for temperature
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}

// 4. Use correct parameter names
customStopTriggers: config.stopStrings ?? this.stopStrings

What's Next

In the next lesson, we'll explore Context & Configuration - how to pass state and settings through chains.

Preview: You'll learn:

  • RunnableConfig object
  • Callback systems
  • Metadata tracking
  • Debug modes

➑️ Continue to Lesson 4: Context & Configuration

Additional Resources

Questions & Discussion

Q: Why do we always set system prompt instead of only when present?

A: To prevent contamination. If call 1 sets a system prompt but call 2 doesn't, call 2 would still use call 1's system prompt. Always setting (even to empty string) ensures clean state.

Q: Why sequential batch processing instead of parallel?

A: Local models (node-llama-cpp) can't run true parallel inference on a single model instance. The library serializes requests internally, so parallel Promise.all() provides no benefit and can cause race conditions on the shared chat session.

Q: Why do we need random seeds for temperature?

A: The node-llama-cpp library states: "The randomness of the temperature can be controlled by the seed parameter. Setting a specific seed and a specific temperature will yield the same response every time for the same input." Without a random seed, high temperature might still give deterministic results.
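
Conversely, if you want reproducible output, pass an explicit seed and the wrapper uses it instead of generating a random one:

// Same input + same seed + same temperature => same response every run
const reply = await llm.invoke("Name a color:", { temperature: 0.8, seed: 42 });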

Q: Can I use multiple models simultaneously?

A: Yes! Each LlamaCppLLM instance can have a different model. Just be aware of memory constraints - each model takes several GB of RAM.

Q: What's the difference between customStopTriggers and stopStrings?

A: customStopTriggers is the correct parameter name in node-llama-cpp. We accept stopStrings in our config for a more intuitive API, then map it to customStopTriggers internally.


Built with ❀️ for learners who want to understand AI agents deeply

← Previous: Messages | Tutorial Index | Next: Context β†’