# The LLM Wrapper
**Part 1: Foundation - Lesson 3**
> Wrapping node-llama-cpp as a Runnable for seamless integration
## Overview
In Lesson 1, you learned about Runnables - the composable interface. In Lesson 2, you mastered Messages - the data structures. Now we'll connect these concepts by wrapping **node-llama-cpp** (our local LLM) as a Runnable that understands Messages.
By the end of this lesson, you'll have an LLM wrapper that can generate text, handle conversations, stream responses, and integrate seamlessly with chains.
## Why Does This Matter?
### The Problem: LLMs Don't Compose
node-llama-cpp is excellent at what it does - running local LLMs efficiently. But when you're building agents, you need more than just an LLM. You need components that work together seamlessly.
**Without a composable framework:**
```javascript
import { getLlama, LlamaChatSession } from 'node-llama-cpp';

// Each component is isolated - they don't know about each other
async function myAgent(userInput) {
  // Step 1: Format the prompt
  const prompt = myCustomFormatter(userInput);

  // Step 2: Call the LLM (load model, create context, create session)
  const llama = await getLlama();
  const model = await llama.loadModel({ modelPath: './model.gguf' });
  const context = await model.createContext();
  const session = new LlamaChatSession({ contextSequence: context.getSequence() });
  const response = await session.prompt(prompt);

  // Step 3: Parse the response
  const parsed = myCustomParser(response);

  // Step 4: Maybe call a tool?
  if (parsed.needsTool) {
    const toolResult = await myTool(parsed.args);
    // Now what? Call the LLM again? How do we loop?
    // How do we add logging? Memory? Retries?
  }

  return parsed;
}
// Problems:
// - Can't reuse components
// - Can't chain operations
// - Hard to add logging, metrics, or debugging
// - Complex control flow for agents
// - Every new feature requires changing everything
```
**With a composable framework:**
```javascript
// Components that work together
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });

// Simple usage
const response = await llm.invoke([
  new SystemMessage("You are helpful"),
  new HumanMessage("Hi")
]);
// Returns: AIMessage("Hello! How can I help you?")

// But the real power is composition
const agent = promptTemplate
  .pipe(llm)
  .pipe(outputParser)
  .pipe(toolExecutor);

// Now you can:
// ✅ Reuse components in different chains
// ✅ Add logging with callbacks (no code changes)
// ✅ Build complex agents that use tools
// ✅ Test each component independently
// ✅ Swap LLMs without rewriting everything
```
### What the Wrapper Provides
The LLM wrapper isn't about making node-llama-cpp easier - it's about making it **work with everything else**:
1. **Common Interface**: Same `invoke()` / `stream()` / `batch()` as every other component
2. **Message Support**: Understands HumanMessage, AIMessage, SystemMessage
3. **Composability**: Works with `.pipe()` to chain operations
4. **Observability**: Callbacks work automatically for logging/metrics
5. **Configuration**: Runtime settings pass through cleanly
6. **History Isolation**: Proper batch processing without contamination
Think of it as an adapter that lets node-llama-cpp play nicely with the rest of your agent system.
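Here's a quick preview of that common interface in action. This is a hedged sketch: `LlamaCppLLM` is the wrapper we build later in this lesson, and the model path is a placeholder.

```javascript
// Preview sketch of the common interface (hypothetical model path).
const llm = new LlamaCppLLM({ modelPath: './models/model.gguf' });

// invoke(): one call in, one AIMessage out
const answer = await llm.invoke("What is a Runnable?");
console.log(answer.content);

// stream(): async iterator of partial AIMessage chunks
for await (const chunk of llm.stream("Explain streaming briefly")) {
  process.stdout.write(chunk.content);
}

// batch(): several independent inputs, each with an isolated history
const answers = await llm.batch(["Define GGUF", "Define context window"]);
```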
## Learning Objectives
By the end of this lesson, you will:
- ✅ Understand how to wrap complex libraries as Runnables
- ✅ Convert Messages to LLM chat history
- ✅ Handle model loading and lifecycle
- ✅ Implement streaming for real-time output
- ✅ Add temperature and other generation parameters
- ✅ Manage context windows and chat history
- ✅ Handle batch processing with history isolation
## Core Concepts
### What is an LLM Wrapper?
An LLM wrapper is an abstraction layer that:
1. **Hides complexity** - No need to manage contexts, sessions, or cleanup
2. **Provides a standard interface** - Same API regardless of underlying model
3. **Handles conversion** - Transforms Messages into model-specific chat history
4. **Manages resources** - Automatic initialization and cleanup
5. **Enables composition** - Works seamlessly in chains
6. **Isolates state** - Prevents history contamination in batch processing
### The Wrapper's Responsibilities
```
Input (Messages)
        ↓
[1. Convert to Chat History]
        ↓
[2. Manage System Prompt]
        ↓
[3. Call LLM]
        ↓
[4. Parse Response]
        ↓
Output (AIMessage)
```
### Key Challenges
1. **Model Loading**: Models are large and slow to load
2. **Chat History Format**: Convert Messages to node-llama-cpp format
3. **System Prompt Management**: Clear and set for each call
4. **Context Management**: Limited context windows
5. **Streaming**: Real-time output is complex
6. **Batch Isolation**: Prevent history contamination
7. **Error Handling**: Models can fail in various ways
8. **Chat Wrappers**: Different models need different formats
## Implementation Deep Dive
Let's build the LLM wrapper step by step.
### Step 1: The Base Structure
**Location:** `src/llm/llama-cpp-llm.js`
```javascript
import { Runnable } from './runnable.js';
import { AIMessage, HumanMessage } from './message.js';
import { getLlama, LlamaChatSession } from 'node-llama-cpp';

export class LlamaCppLLM extends Runnable {
  constructor(options = {}) {
    super();

    // Model configuration
    this.modelPath = options.modelPath;
    this.temperature = options.temperature ?? 0.7;
    this.maxTokens = options.maxTokens ?? 2048;
    this.contextSize = options.contextSize ?? 4096;

    // Optional generation/runtime settings referenced in later steps
    this.topP = options.topP;
    this.topK = options.topK;
    this.repeatPenalty = options.repeatPenalty;
    this.stopStrings = options.stopStrings;
    this.batchSize = options.batchSize;
    this.verbose = options.verbose ?? false;

    // Chat wrapper configuration (auto-detects by default)
    this.chatWrapper = options.chatWrapper ?? 'auto';

    // Internal state
    this._llama = null;
    this._model = null;
    this._context = null;
    this._chatSession = null;
    this._initialized = false;
  }

  async _call(input, config) {
    // Will implement next
  }
}
```
**Key decisions**:
- Stores configuration (temperature, max tokens, etc.)
- Supports custom chat wrappers (e.g., QwenChatWrapper)
- Tracks internal state (model, context, session)
- Lazy initialization (load on first use)
### Step 2: Model Initialization with Chat Wrapper Support
```javascript
export class LlamaCppLLM extends Runnable {
  // ... constructor ...

  /**
   * Initialize the model (lazy loading)
   */
  async _initialize() {
    if (this._initialized) return;

    if (this.verbose) {
      console.log(`Loading model: ${this.modelPath}`);
    }

    try {
      // Step 1: Get llama instance
      this._llama = await getLlama();

      // Step 2: Load the model
      this._model = await this._llama.loadModel({
        modelPath: this.modelPath
      });

      // Step 3: Create context (working memory)
      this._context = await this._model.createContext({
        contextSize: this.contextSize,
        batchSize: this.batchSize
      });

      // Step 4: Create chat session with optional chat wrapper
      const contextSequence = this._context.getSequence();
      const sessionConfig = { contextSequence };

      // Add custom chat wrapper if specified
      if (this.chatWrapper !== 'auto') {
        sessionConfig.chatWrapper = this.chatWrapper;
      }

      this._chatSession = new LlamaChatSession(sessionConfig);
      this._initialized = true;

      if (this.verbose) {
        console.log('✓ Model loaded successfully');
        if (this.chatWrapper !== 'auto') {
          console.log(`✓ Using custom chat wrapper: ${this.chatWrapper.constructor.name}`);
        }
      }
    } catch (error) {
      throw new Error(
        `Failed to initialize model at ${this.modelPath}: ${error.message}`
      );
    }
  }

  /**
   * Cleanup resources
   */
  async dispose() {
    if (this._context) {
      await this._context.dispose();
      this._context = null;
    }
    if (this._model) {
      await this._model.dispose();
      this._model = null;
    }
    this._chatSession = null;
    this._initialized = false;

    if (this.verbose) {
      console.log('✓ Model resources disposed');
    }
  }
}
```
**Why lazy loading?**
- Models take 5-30 seconds to load
- Don't load until actually needed
- Share one loaded model across multiple calls
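To make the lifecycle concrete, here is a minimal usage sketch (the model path is a placeholder): the first call pays the load cost once, later calls reuse the loaded model, and `dispose()` releases it.

```javascript
// Lifecycle sketch (hypothetical model path): lazy load on first use,
// reuse on later calls, explicit cleanup when finished.
const llm = new LlamaCppLLM({ modelPath: './models/model.gguf', verbose: true });

try {
  const first = await llm.invoke("Hello!");   // triggers _initialize() - slow, once
  const second = await llm.invoke("Again!");  // model already loaded - fast
  console.log(first.content, second.content);
} finally {
  await llm.dispose(); // release context and model memory
}
```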
**Chat Wrapper Support**:
- Defaults to 'auto' (library auto-detects)
- Supports custom wrappers like QwenChatWrapper for specific models
- Useful for controlling model behavior (e.g., discouraging thoughts)
### Step 3: Converting Messages to Chat History
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  /**
   * Convert our Message objects to node-llama-cpp chat history format
   */
  _messagesToChatHistory(messages) {
    return messages.map(msg => {
      // System messages: instructions for the AI
      if (msg._type === 'system') {
        return { type: 'system', text: msg.content };
      }
      // Human messages: user input
      else if (msg._type === 'human') {
        return { type: 'user', text: msg.content };
      }
      // AI messages: previous AI responses
      else if (msg._type === 'ai') {
        return { type: 'model', response: msg.content };
      }
      // Tool messages: results from tool execution
      else if (msg._type === 'tool') {
        return { type: 'system', text: `Tool Result: ${msg.content}` };
      }
      // Fallback: treat unknown types as user messages
      return { type: 'user', text: msg.content };
    });
  }
}
```
**Key insight**: This bridges between your standardized Message types and what node-llama-cpp expects. Different models may need different chat formats, which is why chat wrappers exist.
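To make the mapping concrete, here's an illustrative input/output pair (a sketch that simply mirrors the branches above):

```javascript
// Illustrative conversion, following the mapping implemented above.
const history = llm._messagesToChatHistory([
  new SystemMessage("You are a helpful math tutor."),
  new HumanMessage("What is 5*5?"),
  new AIMessage("5 times 5 is 25.")
]);

console.log(history);
// [
//   { type: 'system', text: 'You are a helpful math tutor.' },
//   { type: 'user',   text: 'What is 5*5?' },
//   { type: 'model',  response: '5 times 5 is 25.' }
// ]
```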
### Step 4: The Main Generation Method
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  async _call(input, config = {}) {
    // Initialize if needed
    await this._initialize();

    // Clear history if requested (important for batch processing)
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }

    // Handle different input types
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }

    // Extract system message if present
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';

    // Convert our Message objects to llama.cpp format
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);

    // ALWAYS set system prompt (either new value or empty string to clear)
    this._chatSession.systemPrompt = systemPrompt;

    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };

      // Add random seed if temperature > 0 and no seed specified
      // This ensures randomness works properly
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }

      // Generate response using prompt
      const response = await this._chatSession.prompt('', promptOptions);

      // Return as AIMessage for consistency
      return new AIMessage(response);
    } catch (error) {
      throw new Error(`Generation failed: ${error.message}`);
    }
  }
}
```
**Critical details**:
- Always clears and sets system prompt (prevents contamination)
- Adds random seed for proper temperature behavior
- Uses `customStopTriggers` (correct parameter name)
- Supports `clearHistory` for batch processing
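Since all of these flow through the `config` argument, here is a small usage sketch showing per-call overrides (values are illustrative):

```javascript
// Per-call config overrides the constructor defaults for a single invoke().
const reply = await llm.invoke(
  [new HumanMessage("List three prime numbers")],
  {
    temperature: 0.2,       // override the default temperature
    maxTokens: 64,          // cap the response length
    stopStrings: ["\n\n"],  // mapped internally to customStopTriggers
    clearHistory: true      // start from a clean chat history
  }
);
console.log(reply.content);
```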
### Step 5: Batch Processing with History Isolation
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  /**
   * Batch processing with history isolation
   *
   * Processes multiple inputs sequentially, ensuring each gets
   * a clean chat history to prevent contamination.
   */
  async batch(inputs, config = {}) {
    const results = [];

    for (const input of inputs) {
      // Clear history before each batch item
      const result = await this._call(input, {
        ...config,
        clearHistory: true
      });
      results.push(result);
    }

    return results;
  }
}
```
**Why sequential processing?**
- Local models can't run truly in parallel
- Sequential ensures proper history isolation
- Each item gets a clean slate
### Step 6: Streaming Support
For real-time output (like ChatGPT's typing effect):
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  async *_stream(input, config = {}) {
    await this._initialize();

    // Clear history if requested
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }

    // Handle input types (same as _call)
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }

    // Extract system message
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';

    // Set up chat history
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);

    // ALWAYS set system prompt
    this._chatSession.systemPrompt = systemPrompt;

    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };

      // Add random seed
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }

      // Use onTextChunk callback to collect chunks
      const self = this;
      promptOptions.onTextChunk = (chunk) => {
        self._currentStreamChunks = self._currentStreamChunks || [];
        self._currentStreamChunks.push(chunk);
      };

      // Initialize chunk collection
      this._currentStreamChunks = [];

      // Start generation
      const responsePromise = this._chatSession.prompt('', promptOptions);

      // Yield chunks as they become available
      let lastYieldedIndex = 0;

      // Poll for new chunks
      while (true) {
        // Yield any new chunks
        while (lastYieldedIndex < this._currentStreamChunks.length) {
          yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
            additionalKwargs: { chunk: true }
          });
          lastYieldedIndex++;
        }

        // Check if generation is complete
        const isDone = await Promise.race([
          responsePromise.then(() => true),
          new Promise(resolve => setTimeout(() => resolve(false), 10))
        ]);

        if (isDone) {
          // Yield any remaining chunks
          while (lastYieldedIndex < this._currentStreamChunks.length) {
            yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
              additionalKwargs: { chunk: true }
            });
            lastYieldedIndex++;
          }
          break;
        }
      }

      // Wait for completion
      await responsePromise;

      // Clean up
      delete this._currentStreamChunks;
    } catch (error) {
      throw new Error(`Streaming failed: ${error.message}`);
    }
  }
}
```
**Streaming challenges**:
- `onTextChunk` is a synchronous callback
- Can't yield directly from callback
- Use polling mechanism to yield as chunks arrive
- 10ms polling interval balances responsiveness vs CPU usage
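From the consumer's side, none of this polling is visible; you simply iterate. A small consumption sketch (assuming the `_stream` method above) that both displays and accumulates chunks:

```javascript
// Consume the stream: show chunks live while also accumulating the full text.
let fullText = '';
for await (const chunk of llm.stream("Summarize what a context window is")) {
  process.stdout.write(chunk.content); // live "typing" effect
  fullText += chunk.content;           // keep the complete response for later
}
console.log(`\n\nReceived ${fullText.length} characters.`);
```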
## Real-World Examples
### Example 1: Simple Text Generation
```javascript
const llm = new LlamaCppLLM({
  modelPath: './models/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf',
  temperature: 0.7,
  maxTokens: 100
});

// Simple string input
const response = await llm.invoke("What is 2+2?");
console.log(response.content); // "2+2 equals 4."
```
### Example 2: Conversation with System Prompt
```javascript
const messages = [
  new SystemMessage("You are a helpful math tutor."),
  new HumanMessage("What is 5*5?")
];

const response = await llm.invoke(messages);
console.log(response.content);
// "5 times 5 is 25. Here's a simple explanation..."
```
### Example 3: Using Qwen Chat Wrapper
```javascript
import { QwenChatWrapper } from 'node-llama-cpp';

const llm = new LlamaCppLLM({
  modelPath: './models/Qwen3-1.7B-Q6_K.gguf',
  temperature: 0.7,
  chatWrapper: new QwenChatWrapper({
    thoughts: 'discourage' // Prevents thinking tokens
  })
});

const response = await llm.invoke("What is AI?");
// Response won't include <think> tokens
```
### Example 4: Temperature Comparison
```javascript
const question = "Give me one adjective to describe winter:";

// Low temperature - consistent answers
const lowTemp = await llm.invoke(question, { temperature: 0.1, clearHistory: true });
// Likely: "cold"

// High temperature - varied answers
const highTemp = await llm.invoke(question, { temperature: 0.9, clearHistory: true });
// Could be: "frosty", "snowy", "icy", "chilly"
```
### Example 5: Streaming Output
```javascript
console.log('Response: ');

for await (const chunk of llm.stream("Tell me a fun fact about space")) {
  process.stdout.write(chunk.content); // No newline
}

console.log('\n');
// Output streams in real-time as it's generated
```
### Example 6: Batch Processing
```javascript
const questions = [
  "What is Python?",
  "What is JavaScript?",
  "What is Rust?"
];

const answers = await llm.batch(questions);

questions.forEach((q, i) => {
  console.log(`Q: ${q}`);
  console.log(`A: ${answers[i].content}`);
  console.log();
});
// Each answer is independent - no history contamination!
```
### Example 7: Using in a Pipeline
```javascript
import { PromptTemplate } from '../prompts/prompt-template.js';

const prompt = PromptTemplate.fromTemplate(
  "Translate the following to {language}: {text}"
);

const chain = prompt.pipe(llm);

const result = await chain.invoke({
  language: "Spanish",
  text: "Hello, how are you?"
});

console.log(result.content); // "Hola, ¿cómo estás?"
```
## Advanced Patterns
### Pattern 1: Model Pool (Reusing Loaded Models)
```javascript
class LLMPool {
  constructor() {
    this.models = new Map();
  }

  async get(modelPath, options = {}) {
    if (!this.models.has(modelPath)) {
      const llm = new LlamaCppLLM({ modelPath, ...options });
      await llm._initialize(); // Pre-load
      this.models.set(modelPath, llm);
    }
    return this.models.get(modelPath);
  }

  async disposeAll() {
    for (const llm of this.models.values()) {
      await llm.dispose();
    }
    this.models.clear();
  }
}

// Usage
const pool = new LLMPool();
const llm = await pool.get('./models/llama-3.1-8b.gguf');
```
### Pattern 2: Retry on Failure
```javascript
class ReliableLLM extends LlamaCppLLM {
  async _call(input, config = {}) {
    const maxRetries = config.maxRetries || 3;
    let lastError;

    for (let i = 0; i < maxRetries; i++) {
      try {
        return await super._call(input, config);
      } catch (error) {
        lastError = error;
        console.warn(`Attempt ${i + 1} failed, retrying...`);
        await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)));
      }
    }

    throw new Error(`All ${maxRetries} attempts failed: ${lastError.message}`);
  }
}
```
### Pattern 3: Token Counting
```javascript
class LlamaCppLLMWithCounting extends LlamaCppLLM {
  constructor(options) {
    super(options);
    this.totalTokens = 0;
  }

  async _call(input, config = {}) {
    const result = await super._call(input, config);

    // Rough token estimation (4 chars ≈ 1 token)
    const promptTokens = Math.ceil(JSON.stringify(input).length / 4);
    const completionTokens = Math.ceil(result.content.length / 4);
    this.totalTokens += promptTokens + completionTokens;

    result.additionalKwargs.usage = {
      promptTokens,
      completionTokens,
      totalTokens: promptTokens + completionTokens
    };

    return result;
  }

  getUsage() {
    return { totalTokens: this.totalTokens };
  }
}
```
## Common Patterns and Best Practices
### ✅ DO:
```javascript
// Initialize once, use many times
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm.invoke("Question 1");
await llm.invoke("Question 2");
await llm.dispose(); // Cleanup when done

// Use Messages for structure
const messages = [
  new SystemMessage("You are helpful"),
  new HumanMessage("Hi")
];

// Clear history for independent calls
const response = await llm.invoke(messages, { clearHistory: true });

// Handle errors gracefully
try {
  const result = await llm.invoke(messages);
} catch (error) {
  console.error('Generation failed:', error);
}
```
### ❌ DON'T:
```javascript
// Don't create new LLM for each request (slow!)
for (const question of questions) {
  const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
  await llm.invoke(question); // Loads model every time!
}

// Don't forget to clear history in batch processing
// This will cause history contamination!
for (const q of questions) {
  await llm.invoke(q); // Sees all previous questions!
}

// Don't forget cleanup
// Missing: await llm.dispose()
```
## Performance Tips
### Tip 1: Preload Models
```javascript
// Load during app startup
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm._initialize(); // Force load now
// Later requests are instant
await llm.invoke("Fast response!");
```
### Tip 2: Use Batch Properly
```javascript
// This correctly isolates each question
const answers = await llm.batch(questions);

// Not: sequential calls with contamination
for (const q of questions) {
  await llm.invoke(q); // History builds up!
}
```
### Tip 3: Adjust Context Size
```javascript
// Smaller context = faster, less memory
const fastLLM = new LlamaCppLLM({
  modelPath: './model.gguf',
  contextSize: 2048 // vs default 4096
});
```
### Tip 4: Use Appropriate Temperature
```javascript
// Factual answers: low temperature
const fact = await llm.invoke(query, { temperature: 0.1 });
// Creative writing: high temperature
const story = await llm.invoke(query, { temperature: 0.9 });
```
## Debugging Tips
### Tip 1: Enable Verbose Mode
```javascript
const llm = new LlamaCppLLM({
  modelPath: './model.gguf',
  verbose: true // Shows loading and generation details
});
```
### Tip 2: Test History Isolation
```javascript
// Test batch processing
const questions = ["Q1", "Q2", "Q3"];
const answers = await llm.batch(questions);
// Each answer should be independent
// If Q2 mentions Q1, history contamination occurred!
```
### Tip 3: Verify Streaming
```javascript
// Verify streaming works
console.log('Testing stream:');
for await (const chunk of llm.stream("Count to 5")) {
  console.log('Chunk:', chunk.content);
}
```
## Common Mistakes
### ❌ Mistake 1: Not Clearing History in Batches
```javascript
// Bad: History contamination
const pipeline = formatter.pipe(llm).pipe(parser);
const results = await pipeline.batch(inputs); // Q2 sees Q1!
```
**Fix**: The LlamaCppLLM.batch() method automatically clears history:
```javascript
// Good: Each input is isolated
const results = await llm.batch(inputs);
```
### ❌ Mistake 2: Forgetting Random Seed
```javascript
// Bad: Temperature doesn't work
const response = await llm.invoke(prompt, { temperature: 0.9 });
// Without random seed, might get same answer
```
**Fix**: Our implementation automatically adds random seed:
```javascript
// Good: Randomness works properly
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}
```
### ❌ Mistake 3: Not Setting System Prompt Properly
```javascript
// Bad: System prompt persists between calls
await llm.invoke([new SystemMessage("Be creative"), ...]);
await llm.invoke([new HumanMessage("Hi")]); // Still "creative"!
```
**Fix**: Always set system prompt (empty string to clear):
```javascript
// Good: Always explicitly set or clear
this._chatSession.systemPrompt = systemPrompt || '';
```
## Mental Model
Think of the LLM wrapper as managing a conversation session:
```
Call 1: [System: "Be helpful", User: "Hi"]
        ↓
Model generates response
        ↓
Returns: AIMessage("Hello!")

Call 2: [User: "How are you?"]
        ↓
PROBLEM: Still has "Be helpful" system prompt!
PROBLEM: Might remember "Hi" conversation!

SOLUTION: Clear history + reset system prompt between calls
```
The wrapper handles:
- Loading models once
- Converting Messages to chat history
- Managing system prompts
- Clearing history when needed
- Streaming chunks
- Random seeds for temperature
- Error handling
## Summary
Congratulations! You now understand how to wrap a complex LLM library as a clean, composable Runnable with proper state management.
### Key Takeaways
1. **Lazy loading saves time**: Load models only when needed
2. **Messages enable structure**: Proper conversation formatting
3. **History isolation prevents bugs**: Critical for batch processing
4. **System prompts must be managed**: Always set or clear explicitly
5. **Streaming improves UX**: Real-time output feels responsive
6. **Random seeds enable temperature**: Required for randomness
7. **Chat wrappers add flexibility**: Support different models
8. **Sequential batch processing**: Local models can't truly parallelize
### What You Built
An LLM wrapper that:
- ✅ Loads models lazily
- ✅ Handles Messages properly
- ✅ Manages chat history correctly
- ✅ Isolates batches
- ✅ Supports streaming
- ✅ Handles system prompts
- ✅ Supports chat wrappers
- ✅ Adds random seeds for temperature
- ✅ Provides good error messages
### Critical Implementation Details
```javascript
// 1. Always clear and set system prompt
this._chatSession.systemPrompt = systemPrompt || '';

// 2. Use clearHistory for batch isolation
async batch(inputs, config = {}) {
  const results = [];
  for (const input of inputs) {
    const result = await this._call(input, {
      ...config,
      clearHistory: true
    });
    results.push(result);
  }
  return results;
}

// 3. Add random seed for temperature
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}

// 4. Use correct parameter names
customStopTriggers: config.stopStrings ?? this.stopStrings
```
### What's Next
In the next lesson, we'll explore **Context & Configuration** - how to pass state and settings through chains.
**Preview**: You'll learn:
- RunnableConfig object
- Callback systems
- Metadata tracking
- Debug modes
➡️ [Continue to Lesson 4: Context & Configuration](04-context.md)
## Additional Resources
- [node-llama-cpp Documentation](https://node-llama-cpp.withcat.ai)
- [Chat Wrappers Guide](https://node-llama-cpp.withcat.ai/guide/chat-wrapper)
- [Temperature Guide](https://node-llama-cpp.withcat.ai/guide/chat-session#temperature)
- [GGUF Model Format](https://huggingface.co/docs/hub/gguf)
## Questions & Discussion
**Q: Why do we always set system prompt instead of only when present?**
A: To prevent contamination. If call 1 sets a system prompt but call 2 doesn't, call 2 would still use call 1's system prompt. Always setting (even to empty string) ensures clean state.
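A tiny illustration of the intended behavior (outputs are hypothetical):

```javascript
// First call carries an explicit SystemMessage.
await llm.invoke([
  new SystemMessage("Answer only in French."),
  new HumanMessage("Hello!")
]); // -> "Bonjour !"

// Second call has no SystemMessage; because the wrapper resets
// systemPrompt to '', the French instruction does not leak in.
const reply = await llm.invoke([new HumanMessage("Hello!")], { clearHistory: true });
console.log(reply.content); // -> a normal English reply
```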
**Q: Why sequential batch processing instead of parallel?**
A: Local models (node-llama-cpp) can't run true parallel inference on a single model instance. The library serializes requests internally, so parallel Promise.all() provides no benefit and can cause race conditions on the shared chat session.
**Q: Why do we need random seeds for temperature?**
A: The node-llama-cpp library states: "The randomness of the temperature can be controlled by the seed parameter. Setting a specific seed and a specific temperature will yield the same response every time for the same input." Without a random seed, high temperature might still give deterministic results.
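As a sketch of what that means in practice (outputs depend on your model):

```javascript
// Same seed + same temperature + same input -> same sampled output.
const a = await llm.invoke("Name a color", { temperature: 0.9, seed: 42, clearHistory: true });
const b = await llm.invoke("Name a color", { temperature: 0.9, seed: 42, clearHistory: true });
console.log(a.content === b.content); // expected: true

// No seed given: the wrapper injects a random one, so outputs can vary.
const c = await llm.invoke("Name a color", { temperature: 0.9, clearHistory: true });
console.log(c.content);
```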
**Q: Can I use multiple models simultaneously?**
A: Yes! Each LlamaCppLLM instance can have a different model. Just be aware of memory constraints - each model takes several GB of RAM.
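For example (model paths are placeholders):

```javascript
// Two independent wrapper instances, each loading its own model.
const chatLLM = new LlamaCppLLM({ modelPath: './models/chat-model.gguf' });
const codeLLM = new LlamaCppLLM({ modelPath: './models/code-model.gguf' });

const answer = await chatLLM.invoke("Explain recursion in one sentence");
const snippet = await codeLLM.invoke("Write a recursive factorial in JavaScript");

// Each instance holds several GB of RAM - dispose both when finished.
await chatLLM.dispose();
await codeLLM.dispose();
```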
**Q: What's the difference between customStopTriggers and stopStrings?**
A: `customStopTriggers` is the correct parameter name in node-llama-cpp. We accept `stopStrings` in our config for a more intuitive API, then map it to `customStopTriggers` internally.
---
**Built with ❤️ for learners who want to understand AI agents deeply**
[⬅️ Previous: Messages](02-messages.md) | [Tutorial Index](../README.md) | [Next: Context ➡️](04-context.md)