# The LLM Wrapper

**Part 1: Foundation - Lesson 3**

> Wrapping node-llama-cpp as a Runnable for seamless integration

## Overview

In Lesson 1, you learned about Runnables - the composable interface. In Lesson 2, you mastered Messages - the data structures. Now we'll connect these concepts by wrapping **node-llama-cpp** (our local LLM) as a Runnable that understands Messages.

By the end of this lesson, you'll have an LLM wrapper that can generate text, handle conversations, stream responses, and integrate seamlessly with chains.

## Why Does This Matter?

### The Problem: LLMs Don't Compose

node-llama-cpp is excellent at what it does - running local LLMs efficiently. But when you're building agents, you need more than just an LLM. You need components that work together seamlessly.

**Without a composable framework:**

```javascript
import { getLlama, LlamaChatSession } from 'node-llama-cpp';

// Each component is isolated - they don't know about each other
async function myAgent(userInput) {
  // Step 1: Format the prompt
  const prompt = myCustomFormatter(userInput);

  // Step 2: Call the LLM
  const llama = await getLlama();
  const model = await llama.loadModel({ modelPath: './model.gguf' });
  const context = await model.createContext();
  const session = new LlamaChatSession({ contextSequence: context.getSequence() });
  const response = await session.prompt(prompt);

  // Step 3: Parse the response
  const parsed = myCustomParser(response);

  // Step 4: Maybe call a tool?
  if (parsed.needsTool) {
    const toolResult = await myTool(parsed.args);
    // Now what? Call the LLM again? How do we loop?
    // How do we add logging? Memory? Retries?
  }

  return parsed;
}

// Problems:
// - Can't reuse components
// - Can't chain operations
// - Hard to add logging, metrics, or debugging
// - Complex control flow for agents
// - Every new feature requires changing everything
```

**With a composable framework:**

```javascript
// Components that work together
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });

// Simple usage
const response = await llm.invoke([
  new SystemMessage("You are helpful"),
  new HumanMessage("Hi")
]);
// Returns: AIMessage("Hello! How can I help you?")

// But the real power is composition
const agent = promptTemplate
  .pipe(llm)
  .pipe(outputParser)
  .pipe(toolExecutor);

// Now you can:
// ✅ Reuse components in different chains
// ✅ Add logging with callbacks (no code changes)
// ✅ Build complex agents that use tools
// ✅ Test each component independently
// ✅ Swap LLMs without rewriting everything
```

### What the Wrapper Provides

The LLM wrapper isn't about making node-llama-cpp easier - it's about making it **work with everything else**:

1. **Common Interface**: Same `invoke()` / `stream()` / `batch()` as every other component
2. **Message Support**: Understands HumanMessage, AIMessage, SystemMessage
3. **Composability**: Works with `.pipe()` to chain operations
4. **Observability**: Callbacks work automatically for logging/metrics
5. **Configuration**: Runtime settings pass through cleanly
6. **History Isolation**: Proper batch processing without contamination

Think of it as an adapter that lets node-llama-cpp play nicely with the rest of your agent system.
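As a refresher, the rest of this lesson assumes the `Runnable` base class from Lesson 1 exposes that shared surface and delegates to subclass hooks. Lesson 1's exact code isn't repeated here, so treat this as a minimal sketch of the assumed shape, not the real implementation:

```javascript
// Minimal sketch of the assumed Runnable surface (names follow Lesson 1's conventions).
// Callers use invoke()/stream()/batch(); subclasses only implement _call()/_stream().
class Runnable {
  async invoke(input, config = {}) {
    return this._call(input, config);        // subclass hook
  }

  async *stream(input, config = {}) {
    yield* this._stream(input, config);      // subclass hook
  }

  async batch(inputs, config = {}) {
    const results = [];
    for (const input of inputs) {
      results.push(await this.invoke(input, config));
    }
    return results;
  }

  pipe(next) {
    // Compose: feed this runnable's output into `next`
    const prev = this;
    return new (class extends Runnable {
      async _call(input, config) {
        return next.invoke(await prev.invoke(input, config), config);
      }
    })();
  }

  async _call(input, config) {
    throw new Error('Subclasses must implement _call()');
  }

  async *_stream(input, config) {
    // Default: fall back to a single chunk from _call()
    yield await this._call(input, config);
  }
}
```

If the base class looks roughly like this, the LLM wrapper only needs to implement `_call()` and `_stream()`; piping, batching, and config plumbing come for free.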
## Learning Objectives

By the end of this lesson, you will:

- ✅ Understand how to wrap complex libraries as Runnables
- ✅ Convert Messages to LLM chat history
- ✅ Handle model loading and lifecycle
- ✅ Implement streaming for real-time output
- ✅ Add temperature and other generation parameters
- ✅ Manage context windows and chat history
- ✅ Handle batch processing with history isolation

## Core Concepts

### What is an LLM Wrapper?

An LLM wrapper is an abstraction layer that:

1. **Hides complexity** - No need to manage contexts, sessions, or cleanup
2. **Provides a standard interface** - Same API regardless of underlying model
3. **Handles conversion** - Transforms Messages into model-specific chat history
4. **Manages resources** - Automatic initialization and cleanup
5. **Enables composition** - Works seamlessly in chains
6. **Isolates state** - Prevents history contamination in batch processing

### The Wrapper's Responsibilities

```
Input (Messages)
      ↓
[1. Convert to Chat History]
      ↓
[2. Manage System Prompt]
      ↓
[3. Call LLM]
      ↓
[4. Parse Response]
      ↓
Output (AIMessage)
```

### Key Challenges

1. **Model Loading**: Models are large and slow to load
2. **Chat History Format**: Convert Messages to node-llama-cpp format
3. **System Prompt Management**: Clear and set for each call
4. **Context Management**: Limited context windows
5. **Streaming**: Real-time output is complex
6. **Batch Isolation**: Prevent history contamination
7. **Error Handling**: Models can fail in various ways
8. **Chat Wrappers**: Different models need different formats

## Implementation Deep Dive

Let's build the LLM wrapper step by step.

### Step 1: The Base Structure

**Location:** `src/llm/llama-cpp-llm.js`

```javascript
import { Runnable } from './runnable.js';
import { AIMessage, HumanMessage } from './message.js';
import { getLlama, LlamaChatSession } from 'node-llama-cpp';

export class LlamaCppLLM extends Runnable {
  constructor(options = {}) {
    super();

    // Model configuration
    this.modelPath = options.modelPath;
    this.temperature = options.temperature ?? 0.7;
    this.maxTokens = options.maxTokens ?? 2048;
    this.contextSize = options.contextSize ?? 4096;
    this.batchSize = options.batchSize;

    // Sampling parameters (undefined falls back to node-llama-cpp defaults)
    this.topP = options.topP;
    this.topK = options.topK;
    this.repeatPenalty = options.repeatPenalty;
    this.stopStrings = options.stopStrings;

    // Chat wrapper configuration (auto-detects by default)
    this.chatWrapper = options.chatWrapper ?? 'auto';

    // Logging
    this.verbose = options.verbose ?? false;

    // Internal state
    this._llama = null;
    this._model = null;
    this._context = null;
    this._chatSession = null;
    this._initialized = false;
  }

  async _call(input, config) {
    // Will implement next
  }
}
```

**Key decisions**:

- Stores configuration (temperature, max tokens, sampling parameters, etc.)
- Supports custom chat wrappers (e.g., QwenChatWrapper)
- Tracks internal state (model, context, session)
- Lazy initialization (load on first use)
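To make that last point concrete, here's a small usage sketch (the model path is a placeholder, and we peek at the internal `_initialized` flag purely for illustration): constructing the wrapper is cheap because nothing is loaded until the first call.

```javascript
// Construction is cheap: no model is loaded yet
const llm = new LlamaCppLLM({
  modelPath: './models/model.gguf',  // placeholder path
  temperature: 0.7,
  verbose: true
});

console.log(llm._initialized); // false - loading happens inside the first invoke()
```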
### Step 2: Model Initialization with Chat Wrapper Support

```javascript
export class LlamaCppLLM extends Runnable {
  // ... constructor ...

  /**
   * Initialize the model (lazy loading)
   */
  async _initialize() {
    if (this._initialized) return;

    if (this.verbose) {
      console.log(`Loading model: ${this.modelPath}`);
    }

    try {
      // Step 1: Get llama instance
      this._llama = await getLlama();

      // Step 2: Load the model
      this._model = await this._llama.loadModel({
        modelPath: this.modelPath
      });

      // Step 3: Create context (working memory)
      this._context = await this._model.createContext({
        contextSize: this.contextSize,
        batchSize: this.batchSize
      });

      // Step 4: Create chat session with optional chat wrapper
      const contextSequence = this._context.getSequence();
      const sessionConfig = { contextSequence };

      // Add custom chat wrapper if specified
      if (this.chatWrapper !== 'auto') {
        sessionConfig.chatWrapper = this.chatWrapper;
      }

      this._chatSession = new LlamaChatSession(sessionConfig);

      this._initialized = true;

      if (this.verbose) {
        console.log('✓ Model loaded successfully');
        if (this.chatWrapper !== 'auto') {
          console.log(`✓ Using custom chat wrapper: ${this.chatWrapper.constructor.name}`);
        }
      }
    } catch (error) {
      throw new Error(
        `Failed to initialize model at ${this.modelPath}: ${error.message}`
      );
    }
  }

  /**
   * Cleanup resources
   */
  async dispose() {
    if (this._context) {
      await this._context.dispose();
      this._context = null;
    }
    if (this._model) {
      await this._model.dispose();
      this._model = null;
    }
    this._chatSession = null;
    this._initialized = false;

    if (this.verbose) {
      console.log('✓ Model resources disposed');
    }
  }
}
```

**Why lazy loading?**

- Models take 5-30 seconds to load
- Don't load until actually needed
- Share one loaded model across multiple calls

**Chat Wrapper Support**:

- Defaults to 'auto' (library auto-detects)
- Supports custom wrappers like QwenChatWrapper for specific models
- Useful for controlling model behavior (e.g., discouraging thoughts)

### Step 3: Converting Messages to Chat History

```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  /**
   * Convert our Message objects to node-llama-cpp chat history format
   */
  _messagesToChatHistory(messages) {
    return messages.map(msg => {
      // System messages: instructions for the AI
      if (msg._type === 'system') {
        return { type: 'system', text: msg.content };
      }
      // Human messages: user input
      else if (msg._type === 'human') {
        return { type: 'user', text: msg.content };
      }
      // AI messages: previous AI responses
      else if (msg._type === 'ai') {
        return { type: 'model', response: msg.content };
      }
      // Tool messages: results from tool execution
      else if (msg._type === 'tool') {
        return { type: 'system', text: `Tool Result: ${msg.content}` };
      }
      // Fallback: treat unknown types as user messages
      return { type: 'user', text: msg.content };
    });
  }
}
```

**Key insight**: This bridges between your standardized Message types and what node-llama-cpp expects. Different models may need different chat formats, which is why chat wrappers exist.
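To see the conversion in action, here is an illustrative call to `_messagesToChatHistory()` (invoked directly only for demonstration; the conversation text is made up). The output follows directly from the mapping above:

```javascript
const history = llm._messagesToChatHistory([
  new SystemMessage("You are a helpful math tutor."),
  new HumanMessage("What is 5*5?"),
  new AIMessage("5 times 5 is 25.")
]);

console.log(history);
// [
//   { type: 'system', text: 'You are a helpful math tutor.' },
//   { type: 'user',   text: 'What is 5*5?' },
//   { type: 'model',  response: '5 times 5 is 25.' }
// ]
```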
### Step 4: The Main Generation Method

```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  async _call(input, config = {}) {
    // Initialize if needed
    await this._initialize();

    // Clear history if requested (important for batch processing)
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }

    // Handle different input types
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }

    // Extract system message if present
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';

    // Convert our Message objects to llama.cpp format
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);

    // ALWAYS set system prompt (either new value or empty string to clear)
    this._chatSession.systemPrompt = systemPrompt;

    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };

      // Add random seed if temperature > 0 and no seed specified
      // This ensures randomness works properly
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }

      // Generate response using prompt
      const response = await this._chatSession.prompt('', promptOptions);

      // Return as AIMessage for consistency
      return new AIMessage(response);
    } catch (error) {
      throw new Error(`Generation failed: ${error.message}`);
    }
  }
}
```

**Critical details**:

- Always clears and sets system prompt (prevents contamination)
- Adds random seed for proper temperature behavior
- Uses `customStopTriggers` (correct parameter name)
- Supports `clearHistory` for batch processing
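Because each generation parameter is read as `config.x ?? this.x`, per-call settings override the constructor defaults for that one call only (assuming, as in Lesson 1, that `invoke()` forwards its config to `_call()`). A small usage sketch with illustrative values:

```javascript
// Constructor defaults: temperature 0.7, maxTokens 2048
const answer = await llm.invoke("Name three primary colors", {
  temperature: 0.2,   // override just for this call
  maxTokens: 64,
  seed: 42            // explicit seed -> reproducible output for the same input
});
console.log(answer.content);
```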
### Step 5: Batch Processing with History Isolation

```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  /**
   * Batch processing with history isolation
   *
   * Processes multiple inputs sequentially, ensuring each gets
   * a clean chat history to prevent contamination.
   */
  async batch(inputs, config = {}) {
    const results = [];

    for (const input of inputs) {
      // Clear history before each batch item
      const result = await this._call(input, {
        ...config,
        clearHistory: true
      });
      results.push(result);
    }

    return results;
  }
}
```

**Why sequential processing?**

- Local models can't run truly in parallel
- Sequential ensures proper history isolation
- Each item gets a clean slate

### Step 6: Streaming Support

For real-time output (like ChatGPT's typing effect):

```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...

  async *_stream(input, config = {}) {
    await this._initialize();

    // Clear history if requested
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }

    // Handle input types (same as _call)
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }

    // Extract system message
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';

    // Set up chat history
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);

    // ALWAYS set system prompt
    this._chatSession.systemPrompt = systemPrompt;

    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };

      // Add random seed
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }

      // Use onTextChunk callback to collect chunks
      const self = this;
      promptOptions.onTextChunk = (chunk) => {
        self._currentStreamChunks = self._currentStreamChunks || [];
        self._currentStreamChunks.push(chunk);
      };

      // Initialize chunk collection
      this._currentStreamChunks = [];

      // Start generation
      const responsePromise = this._chatSession.prompt('', promptOptions);

      // Yield chunks as they become available
      let lastYieldedIndex = 0;

      // Poll for new chunks
      while (true) {
        // Yield any new chunks
        while (lastYieldedIndex < this._currentStreamChunks.length) {
          yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
            additionalKwargs: { chunk: true }
          });
          lastYieldedIndex++;
        }

        // Check if generation is complete
        const isDone = await Promise.race([
          responsePromise.then(() => true),
          new Promise(resolve => setTimeout(() => resolve(false), 10))
        ]);

        if (isDone) {
          // Yield any remaining chunks
          while (lastYieldedIndex < this._currentStreamChunks.length) {
            yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
              additionalKwargs: { chunk: true }
            });
            lastYieldedIndex++;
          }
          break;
        }
      }

      // Wait for completion
      await responsePromise;

      // Clean up
      delete this._currentStreamChunks;
    } catch (error) {
      throw new Error(`Streaming failed: ${error.message}`);
    }
  }
}
```

**Streaming challenges**:

- `onTextChunk` is a synchronous callback
- Can't yield directly from callback
- Use polling mechanism to yield as chunks arrive
- 10ms polling interval balances responsiveness vs CPU usage

## Real-World Examples

### Example 1: Simple Text Generation

```javascript
const llm = new LlamaCppLLM({
  modelPath: './models/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf',
  temperature: 0.7,
  maxTokens: 100
});

// Simple string input
const response = await llm.invoke("What is 2+2?");
console.log(response.content);
// "2+2 equals 4."
```

### Example 2: Conversation with System Prompt

```javascript
const messages = [
  new SystemMessage("You are a helpful math tutor."),
  new HumanMessage("What is 5*5?")
];

const response = await llm.invoke(messages);
console.log(response.content);
// "5 times 5 is 25. Here's a simple explanation..."
```
### Example 3: Using Qwen Chat Wrapper

```javascript
import { QwenChatWrapper } from 'node-llama-cpp';

const llm = new LlamaCppLLM({
  modelPath: './models/Qwen3-1.7B-Q6_K.gguf',
  temperature: 0.7,
  chatWrapper: new QwenChatWrapper({
    thoughts: 'discourage' // Prevents thinking tokens
  })
});

const response = await llm.invoke("What is AI?");
// Response won't include thinking tokens
```

### Example 4: Temperature Comparison

```javascript
const question = "Give me one adjective to describe winter:";

// Low temperature - consistent answers
const lowTemp = await llm.invoke(question, { temperature: 0.1, clearHistory: true });
// Likely: "cold"

// High temperature - varied answers
const highTemp = await llm.invoke(question, { temperature: 0.9, clearHistory: true });
// Could be: "frosty", "snowy", "icy", "chilly"
```

### Example 5: Streaming Output

```javascript
console.log('Response: ');

for await (const chunk of llm.stream("Tell me a fun fact about space")) {
  process.stdout.write(chunk.content); // No newline
}

console.log('\n');
// Output streams in real-time as it's generated
```
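If you also need the complete text after streaming finishes, a common pattern is to accumulate the chunks while printing them. A small sketch building on Example 5:

```javascript
let fullText = '';

for await (const chunk of llm.stream("Tell me a fun fact about space")) {
  process.stdout.write(chunk.content); // live output
  fullText += chunk.content;           // keep the full response too
}

console.log(`\n\nTotal characters streamed: ${fullText.length}`);
```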
### Example 6: Batch Processing

```javascript
const questions = [
  "What is Python?",
  "What is JavaScript?",
  "What is Rust?"
];

const answers = await llm.batch(questions);

questions.forEach((q, i) => {
  console.log(`Q: ${q}`);
  console.log(`A: ${answers[i].content}`);
  console.log();
});
// Each answer is independent - no history contamination!
```

### Example 7: Using in a Pipeline

```javascript
import { PromptTemplate } from '../prompts/prompt-template.js';

const prompt = PromptTemplate.fromTemplate(
  "Translate the following to {language}: {text}"
);

const chain = prompt.pipe(llm);

const result = await chain.invoke({
  language: "Spanish",
  text: "Hello, how are you?"
});

console.log(result.content);
// "Hola, ¿cómo estás?"
```

## Advanced Patterns

### Pattern 1: Model Pool (Reusing Loaded Models)

```javascript
class LLMPool {
  constructor() {
    this.models = new Map();
  }

  async get(modelPath, options = {}) {
    if (!this.models.has(modelPath)) {
      const llm = new LlamaCppLLM({ modelPath, ...options });
      await llm._initialize(); // Pre-load
      this.models.set(modelPath, llm);
    }
    return this.models.get(modelPath);
  }

  async disposeAll() {
    for (const llm of this.models.values()) {
      await llm.dispose();
    }
    this.models.clear();
  }
}

// Usage
const pool = new LLMPool();
const llm = await pool.get('./models/llama-3.1-8b.gguf');
```

### Pattern 2: Retry on Failure

```javascript
class ReliableLLM extends LlamaCppLLM {
  async _call(input, config = {}) {
    const maxRetries = config.maxRetries || 3;
    let lastError;

    for (let i = 0; i < maxRetries; i++) {
      try {
        return await super._call(input, config);
      } catch (error) {
        lastError = error;
        console.warn(`Attempt ${i + 1} failed, retrying...`);
        await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)));
      }
    }

    throw new Error(`All ${maxRetries} attempts failed: ${lastError.message}`);
  }
}
```

### Pattern 3: Token Counting

```javascript
class LlamaCppLLMWithCounting extends LlamaCppLLM {
  constructor(options) {
    super(options);
    this.totalTokens = 0;
  }

  async _call(input, config = {}) {
    const result = await super._call(input, config);

    // Rough token estimation (4 chars ≈ 1 token)
    const promptTokens = Math.ceil(JSON.stringify(input).length / 4);
    const completionTokens = Math.ceil(result.content.length / 4);

    this.totalTokens += promptTokens + completionTokens;

    result.additionalKwargs.usage = {
      promptTokens,
      completionTokens,
      totalTokens: promptTokens + completionTokens
    };

    return result;
  }

  getUsage() {
    return { totalTokens: this.totalTokens };
  }
}
```

## Common Patterns and Best Practices

### ✅ DO:

```javascript
// Initialize once, use many times
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm.invoke("Question 1");
await llm.invoke("Question 2");
await llm.dispose(); // Cleanup when done

// Use Messages for structure
const messages = [
  new SystemMessage("You are helpful"),
  new HumanMessage("Hi")
];

// Clear history for independent calls
const response = await llm.invoke(messages, { clearHistory: true });

// Handle errors gracefully
try {
  const result = await llm.invoke(messages);
} catch (error) {
  console.error('Generation failed:', error);
}
```

### ❌ DON'T:

```javascript
// Don't create new LLM for each request (slow!)
for (const question of questions) {
  const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
  await llm.invoke(question); // Loads model every time!
}

// Don't forget to clear history in batch processing
// This will cause history contamination!
for (const q of questions) {
  await llm.invoke(q); // Sees all previous questions!
}

// Don't forget cleanup
// Missing: await llm.dispose()
```

## Performance Tips

### Tip 1: Preload Models

```javascript
// Load during app startup
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm._initialize(); // Force load now

// Later requests are instant
await llm.invoke("Fast response!");
```

### Tip 2: Use Batch Properly

```javascript
// This correctly isolates each question
const answers = await llm.batch(questions);

// Not: Sequential with contamination
for (const q of questions) {
  await llm.invoke(q); // History builds up!
}
```
### Tip 3: Adjust Context Size

```javascript
// Smaller context = faster, less memory
const fastLLM = new LlamaCppLLM({
  modelPath: './model.gguf',
  contextSize: 2048 // vs default 4096
});
```

### Tip 4: Use Appropriate Temperature

```javascript
// Factual answers: low temperature
const fact = await llm.invoke(query, { temperature: 0.1 });

// Creative writing: high temperature
const story = await llm.invoke(query, { temperature: 0.9 });
```

## Debugging Tips

### Tip 1: Enable Verbose Mode

```javascript
const llm = new LlamaCppLLM({
  modelPath: './model.gguf',
  verbose: true // Shows loading and generation details
});
```

### Tip 2: Test History Isolation

```javascript
// Test batch processing
const questions = ["Q1", "Q2", "Q3"];
const answers = await llm.batch(questions);

// Each answer should be independent
// If Q2 mentions Q1, history contamination occurred!
```

### Tip 3: Verify Streaming

```javascript
// Verify streaming works
console.log('Testing stream:');
for await (const chunk of llm.stream("Count to 5")) {
  console.log('Chunk:', chunk.content);
}
```

## Common Mistakes

### ❌ Mistake 1: Not Clearing History in Batches

```javascript
// Bad: History contamination
const pipeline = formatter.pipe(llm).pipe(parser);
const results = await pipeline.batch(inputs); // Q2 sees Q1!
```

**Fix**: The LlamaCppLLM.batch() method automatically clears history:

```javascript
// Good: Each input is isolated
const results = await llm.batch(inputs);
```

### ❌ Mistake 2: Forgetting Random Seed

```javascript
// Bad: Temperature doesn't work
const response = await llm.invoke(prompt, { temperature: 0.9 });
// Without random seed, might get same answer
```

**Fix**: Our implementation automatically adds a random seed:

```javascript
// Good: Randomness works properly
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}
```

### ❌ Mistake 3: Not Setting System Prompt Properly

```javascript
// Bad: System prompt persists between calls
await llm.invoke([new SystemMessage("Be creative"), ...]);
await llm.invoke([new HumanMessage("Hi")]); // Still "creative"!
```

**Fix**: Always set the system prompt (empty string to clear):

```javascript
// Good: Always explicitly set or clear
this._chatSession.systemPrompt = systemPrompt || '';
```

## Mental Model

Think of the LLM wrapper as managing a conversation session:

```
Call 1: [System: "Be helpful", User: "Hi"]
   ↓
Model generates response
   ↓
Returns: AIMessage("Hello!")

Call 2: [User: "How are you?"]
   ↓
PROBLEM: Still has "Be helpful" system prompt!
PROBLEM: Might remember "Hi" conversation!

SOLUTION: Clear history + reset system prompt between calls
```

The wrapper handles:

- Loading models once
- Converting Messages to chat history
- Managing system prompts
- Clearing history when needed
- Streaming chunks
- Random seeds for temperature
- Error handling
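In code, the solution from the mental model looks roughly like this (a usage sketch: the wrapper always resets the system prompt on each call, and `clearHistory: true` drops the previous turns):

```javascript
// Call 1: system prompt + greeting
await llm.invoke(
  [new SystemMessage("Be helpful"), new HumanMessage("Hi")],
  { clearHistory: true }
);

// Call 2: unrelated question - starts from a clean slate
await llm.invoke(
  [new HumanMessage("How are you?")],
  { clearHistory: true } // no leftover "Be helpful" prompt, no memory of "Hi"
);
```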
## Summary

Congratulations! You now understand how to wrap a complex LLM library as a clean, composable Runnable with proper state management.

### Key Takeaways

1. **Lazy loading saves time**: Load models only when needed
2. **Messages enable structure**: Proper conversation formatting
3. **History isolation prevents bugs**: Critical for batch processing
4. **System prompts must be managed**: Always set or clear explicitly
5. **Streaming improves UX**: Real-time output feels responsive
6. **Random seeds enable temperature**: Required for randomness
7. **Chat wrappers add flexibility**: Support different models
8. **Sequential batch processing**: Local models can't truly parallelize

### What You Built

An LLM wrapper that:

- ✅ Loads models lazily
- ✅ Handles Messages properly
- ✅ Manages chat history correctly
- ✅ Isolates batches
- ✅ Supports streaming
- ✅ Handles system prompts
- ✅ Supports chat wrappers
- ✅ Adds random seeds for temperature
- ✅ Provides good error messages

### Critical Implementation Details

```javascript
// 1. Always clear and set system prompt
this._chatSession.systemPrompt = systemPrompt || '';

// 2. Use clearHistory for batch isolation
async batch(inputs, config = {}) {
  const results = [];
  for (const input of inputs) {
    const result = await this._call(input, {
      ...config,
      clearHistory: true
    });
    results.push(result);
  }
  return results;
}

// 3. Add random seed for temperature
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}

// 4. Use correct parameter names
customStopTriggers: config.stopStrings ?? this.stopStrings
```

### What's Next

In the next lesson, we'll explore **Context & Configuration** - how to pass state and settings through chains.

**Preview**: You'll learn:

- RunnableConfig object
- Callback systems
- Metadata tracking
- Debug modes

➡️ [Continue to Lesson 4: Context & Configuration](04-context.md)

## Additional Resources

- [node-llama-cpp Documentation](https://node-llama-cpp.withcat.ai)
- [Chat Wrappers Guide](https://node-llama-cpp.withcat.ai/guide/chat-wrapper)
- [Temperature Guide](https://node-llama-cpp.withcat.ai/guide/chat-session#temperature)
- [GGUF Model Format](https://huggingface.co/docs/hub/gguf)

## Questions & Discussion

**Q: Why do we always set the system prompt instead of only when one is present?**

A: To prevent contamination. If call 1 sets a system prompt but call 2 doesn't, call 2 would still use call 1's system prompt. Always setting it (even to an empty string) ensures clean state.

**Q: Why sequential batch processing instead of parallel?**

A: Local models (node-llama-cpp) can't run true parallel inference on a single model instance. The library serializes requests internally, so a parallel Promise.all() provides no benefit and can cause race conditions on the shared chat session.

**Q: Why do we need random seeds for temperature?**

A: The node-llama-cpp library states: "The randomness of the temperature can be controlled by the seed parameter. Setting a specific seed and a specific temperature will yield the same response every time for the same input." Without a random seed, high temperature might still give deterministic results.

**Q: Can I use multiple models simultaneously?**

A: Yes! Each LlamaCppLLM instance can have a different model. Just be aware of memory constraints - each model takes several GB of RAM.

**Q: What's the difference between customStopTriggers and stopStrings?**

A: `customStopTriggers` is the correct parameter name in node-llama-cpp. We accept `stopStrings` in our config for a more intuitive API, then map it to `customStopTriggers` internally.

---

**Built with ❤️ for learners who want to understand AI agents deeply**

[← Previous: Messages](02-messages.md) | [Tutorial Index](../README.md) | [Next: Context →](04-context.md)