# The LLM Wrapper
**Part 1: Foundation - Lesson 3**
> Wrapping node-llama-cpp as a Runnable for seamless integration
## Overview
In Lesson 1, you learned about Runnables - the composable interface. In Lesson 2, you mastered Messages - the data structures. Now we'll connect these concepts by wrapping **node-llama-cpp** (our local LLM) as a Runnable that understands Messages.
By the end of this lesson, you'll have an LLM wrapper that can generate text, handle conversations, stream responses, and integrate seamlessly with chains.
## Why Does This Matter?
### The Problem: LLMs Don't Compose
node-llama-cpp is excellent at what it does - running local LLMs efficiently. But when you're building agents, you need more than just an LLM. You need components that work together seamlessly.
**Without a composable framework:**
```javascript
import { getLlama, LlamaChatSession } from 'node-llama-cpp';
// Each component is isolated - they don't know about each other
async function myAgent(userInput) {
  // Step 1: Format the prompt
  const prompt = myCustomFormatter(userInput);
  // Step 2: Call the LLM (load model, create context, open a chat session)
  const llama = await getLlama();
  const model = await llama.loadModel({ modelPath: './model.gguf' });
  const context = await model.createContext();
  const session = new LlamaChatSession({ contextSequence: context.getSequence() });
  const response = await session.prompt(prompt);
  // Step 3: Parse the response
  const parsed = myCustomParser(response);
  // Step 4: Maybe call a tool?
  if (parsed.needsTool) {
    const toolResult = await myTool(parsed.args);
    // Now what? Call the LLM again? How do we loop?
    // How do we add logging? Memory? Retries?
  }
  return parsed;
}
// Problems:
// - Can't reuse components
// - Can't chain operations
// - Hard to add logging, metrics, or debugging
// - Complex control flow for agents
// - Every new feature requires changing everything
```
**With a composable framework:**
```javascript
// Components that work together
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
// Simple usage
const response = await llm.invoke([
  new SystemMessage("You are helpful"),
  new HumanMessage("Hi")
]);
// Returns: AIMessage("Hello! How can I help you?")
// But the real power is composition
const agent = promptTemplate
  .pipe(llm)
  .pipe(outputParser)
  .pipe(toolExecutor);
// Now you can:
// ✅ Reuse components in different chains
// ✅ Add logging with callbacks (no code changes)
// ✅ Build complex agents that use tools
// ✅ Test each component independently
// ✅ Swap LLMs without rewriting everything
```
### What the Wrapper Provides
The LLM wrapper isn't about making node-llama-cpp easier - it's about making it **work with everything else**:
1. **Common Interface**: Same `invoke()` / `stream()` / `batch()` as every other component
2. **Message Support**: Understands HumanMessage, AIMessage, SystemMessage
3. **Composability**: Works with `.pipe()` to chain operations
4. **Observability**: Callbacks work automatically for logging/metrics
5. **Configuration**: Runtime settings pass through cleanly
6. **History Isolation**: Proper batch processing without contamination
Think of it as an adapter that lets node-llama-cpp play nicely with the rest of your agent system.
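As a reminder of the contract the wrapper has to satisfy, here is a minimal sketch of the `Runnable` base class assumed from Lesson 1 - the real class in `src/llm/runnable.js` may add callbacks, configuration handling, and a richer `pipe()`:
```javascript
// Minimal sketch of the Runnable interface assumed from Lesson 1.
// The real implementation in src/llm/runnable.js may differ in details.
export class Runnable {
  // Subclasses implement _call(input, config) and optionally _stream(input, config).
  async invoke(input, config = {}) {
    return this._call(input, config);
  }
  async batch(inputs, config = {}) {
    // Default: run inputs one after another through invoke()
    const results = [];
    for (const input of inputs) {
      results.push(await this.invoke(input, config));
    }
    return results;
  }
  async *stream(input, config = {}) {
    // Default: delegate to _stream() if provided, otherwise yield one chunk
    if (this._stream) {
      yield* this._stream(input, config);
    } else {
      yield await this.invoke(input, config);
    }
  }
  pipe(next) {
    // Composition: the output of this Runnable becomes the input of `next`
    const prev = this;
    return new (class extends Runnable {
      async _call(input, config) {
        return next.invoke(await prev.invoke(input, config), config);
      }
    })();
  }
}
```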
## Learning Objectives
By the end of this lesson, you will:
- ✅ Understand how to wrap complex libraries as Runnables
- ✅ Convert Messages to LLM chat history
- ✅ Handle model loading and lifecycle
- ✅ Implement streaming for real-time output
- ✅ Add temperature and other generation parameters
- ✅ Manage context windows and chat history
- ✅ Handle batch processing with history isolation
## Core Concepts
### What is an LLM Wrapper?
An LLM wrapper is an abstraction layer that:
1. **Hides complexity** - No need to manage contexts, sessions, or cleanup
2. **Provides a standard interface** - Same API regardless of underlying model
3. **Handles conversion** - Transforms Messages into model-specific chat history
4. **Manages resources** - Automatic initialization and cleanup
5. **Enables composition** - Works seamlessly in chains
6. **Isolates state** - Prevents history contamination in batch processing
### The Wrapper's Responsibilities
```
Input (Messages)
  ↓
[1. Convert to Chat History]
  ↓
[2. Manage System Prompt]
  ↓
[3. Call LLM]
  ↓
[4. Parse Response]
  ↓
Output (AIMessage)
```
### Key Challenges
1. **Model Loading**: Models are large and slow to load
2. **Chat History Format**: Convert Messages to node-llama-cpp format
3. **System Prompt Management**: Clear and set for each call
4. **Context Management**: Limited context windows (see the sketch after this list)
5. **Streaming**: Real-time output is complex
6. **Batch Isolation**: Prevent history contamination
7. **Error Handling**: Models can fail in various ways
8. **Chat Wrappers**: Different models need different formats
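Challenge 4 deserves a concrete picture. The wrapper below relies on node-llama-cpp's context size and doesn't trim history itself, but one simple strategy is to drop the oldest non-system messages until the conversation fits a rough budget. The helper here is a hypothetical illustration (the 4-characters-per-token estimate is a crude heuristic, not a real tokenizer):
```javascript
// Hypothetical helper: keep system messages, then keep as many of the most
// recent messages as fit within a rough token budget.
function trimToContext(messages, maxTokens = 4096) {
  const estimate = (msg) => Math.ceil(msg.content.length / 4); // ~4 chars per token
  const system = messages.filter(m => m._type === 'system');
  const rest = messages.filter(m => m._type !== 'system');
  let budget = maxTokens - system.reduce((sum, m) => sum + estimate(m), 0);
  const kept = [];
  // Walk backwards so the most recent messages survive trimming
  for (let i = rest.length - 1; i >= 0; i--) {
    budget -= estimate(rest[i]);
    if (budget < 0) break;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}
```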
## Implementation Deep Dive
Let's build the LLM wrapper step by step.
### Step 1: The Base Structure
**Location:** `src/llm/llama-cpp-llm.js`
```javascript
import { Runnable } from './runnable.js';
import { AIMessage, HumanMessage } from './message.js';
import { getLlama, LlamaChatSession } from 'node-llama-cpp';
export class LlamaCppLLM extends Runnable {
  constructor(options = {}) {
    super();
    // Model configuration
    this.modelPath = options.modelPath;
    this.temperature = options.temperature ?? 0.7;
    this.maxTokens = options.maxTokens ?? 2048;
    this.contextSize = options.contextSize ?? 4096;
    this.batchSize = options.batchSize;
    // Optional sampling parameters (undefined = use library defaults)
    this.topP = options.topP;
    this.topK = options.topK;
    this.repeatPenalty = options.repeatPenalty;
    this.stopStrings = options.stopStrings;
    // Chat wrapper configuration (auto-detects by default)
    this.chatWrapper = options.chatWrapper ?? 'auto';
    // Logging
    this.verbose = options.verbose ?? false;
    // Internal state
    this._llama = null;
    this._model = null;
    this._context = null;
    this._chatSession = null;
    this._initialized = false;
  }
  async _call(input, config) {
    // Will implement next
  }
}
```
**Key decisions**:
- Stores configuration (temperature, max tokens, sampling parameters, etc.)
- Supports custom chat wrappers (e.g., QwenChatWrapper)
- Tracks internal state (model, context, session)
- Lazy initialization (load on first use)
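Putting the options together, construction looks like this (values are just examples; nothing is loaded yet):
```javascript
const llm = new LlamaCppLLM({
  modelPath: './models/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf',
  temperature: 0.7,     // default sampling temperature
  maxTokens: 2048,      // cap on generated tokens per call
  contextSize: 4096,    // working-memory size in tokens
  chatWrapper: 'auto',  // or a custom wrapper instance (see Step 2)
  verbose: true         // log loading and generation details
});
```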
### Step 2: Model Initialization with Chat Wrapper Support
```javascript
export class LlamaCppLLM extends Runnable {
  // ... constructor ...
  /**
   * Initialize the model (lazy loading)
   */
  async _initialize() {
    if (this._initialized) return;
    if (this.verbose) {
      console.log(`Loading model: ${this.modelPath}`);
    }
    try {
      // Step 1: Get llama instance
      this._llama = await getLlama();
      // Step 2: Load the model
      this._model = await this._llama.loadModel({
        modelPath: this.modelPath
      });
      // Step 3: Create context (working memory)
      this._context = await this._model.createContext({
        contextSize: this.contextSize,
        batchSize: this.batchSize
      });
      // Step 4: Create chat session with optional chat wrapper
      const contextSequence = this._context.getSequence();
      const sessionConfig = { contextSequence };
      // Add custom chat wrapper if specified
      if (this.chatWrapper !== 'auto') {
        sessionConfig.chatWrapper = this.chatWrapper;
      }
      this._chatSession = new LlamaChatSession(sessionConfig);
      this._initialized = true;
      if (this.verbose) {
        console.log('✓ Model loaded successfully');
        if (this.chatWrapper !== 'auto') {
          console.log(`✓ Using custom chat wrapper: ${this.chatWrapper.constructor.name}`);
        }
      }
    } catch (error) {
      throw new Error(
        `Failed to initialize model at ${this.modelPath}: ${error.message}`
      );
    }
  }
  /**
   * Cleanup resources
   */
  async dispose() {
    if (this._context) {
      await this._context.dispose();
      this._context = null;
    }
    if (this._model) {
      await this._model.dispose();
      this._model = null;
    }
    this._chatSession = null;
    this._initialized = false;
    if (this.verbose) {
      console.log('✓ Model resources disposed');
    }
  }
}
```
**Why lazy loading?**
- Models take 5-30 seconds to load
- Don't load until actually needed
- Share one loaded model across multiple calls
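In practice this means constructing the wrapper is instant and the first call pays the loading cost. A quick illustration (timings are indicative only):
```javascript
const llm = new LlamaCppLLM({ modelPath: './model.gguf' }); // instant - nothing loaded yet
console.time('first call');
await llm.invoke("Hi");        // triggers _initialize(): model + context + session
console.timeEnd('first call'); // several seconds, dominated by model loading
console.time('second call');
await llm.invoke("Hi again");  // reuses the loaded model - only generation time
console.timeEnd('second call');
```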
**Chat Wrapper Support**:
- Defaults to 'auto' (library auto-detects)
- Supports custom wrappers like QwenChatWrapper for specific models
- Useful for controlling model behavior (e.g., discouraging thoughts)
### Step 3: Converting Messages to Chat History
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...
  /**
   * Convert our Message objects to node-llama-cpp chat history format
   */
  _messagesToChatHistory(messages) {
    return messages.map(msg => {
      // System messages: instructions for the AI
      if (msg._type === 'system') {
        return { type: 'system', text: msg.content };
      }
      // Human messages: user input
      else if (msg._type === 'human') {
        return { type: 'user', text: msg.content };
      }
      // AI messages: previous AI responses
      else if (msg._type === 'ai') {
        return { type: 'model', response: msg.content };
      }
      // Tool messages: results from tool execution
      else if (msg._type === 'tool') {
        return { type: 'system', text: `Tool Result: ${msg.content}` };
      }
      // Fallback: treat unknown types as user messages
      return { type: 'user', text: msg.content };
    });
  }
}
```
**Key insight**: This bridges between your standardized Message types and what node-llama-cpp expects. Different models may need different chat formats, which is why chat wrappers exist.
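To make the mapping concrete, here is what a short conversation converts into (calling the internal method directly just to show the output; the Message classes are the ones from Lesson 2):
```javascript
const history = llm._messagesToChatHistory([
  new SystemMessage("You are a helpful assistant."),
  new HumanMessage("What is 2+2?"),
  new AIMessage("2+2 equals 4."),
  new HumanMessage("And 3+3?")
]);
// [
//   { type: 'system', text: 'You are a helpful assistant.' },
//   { type: 'user',   text: 'What is 2+2?' },
//   { type: 'model',  response: '2+2 equals 4.' },
//   { type: 'user',   text: 'And 3+3?' }
// ]
```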
### Step 4: The Main Generation Method
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...
  async _call(input, config = {}) {
    // Initialize if needed
    await this._initialize();
    // Clear history if requested (important for batch processing)
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }
    // Handle different input types
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }
    // Extract system message if present
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';
    // Convert our Message objects to llama.cpp format
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);
    // ALWAYS set system prompt (either new value or empty string to clear)
    this._chatSession.systemPrompt = systemPrompt;
    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };
      // Add random seed if temperature > 0 and no seed specified
      // This ensures randomness works properly
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }
      // Generate response using prompt
      const response = await this._chatSession.prompt('', promptOptions);
      // Return as AIMessage for consistency
      return new AIMessage(response);
    } catch (error) {
      throw new Error(`Generation failed: ${error.message}`);
    }
  }
}
```
**Critical details**:
- Always sets the system prompt explicitly, even to an empty string (prevents contamination)
- Adds a random seed so temperature actually produces varied output
- Uses `customStopTriggers` (the parameter name node-llama-cpp expects)
- Supports `clearHistory` for batch processing
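Because `_call()` reads every option from `config` first and falls back to the constructor value, generation settings can be overridden per call. For example:
```javascript
// Constructor defaults: temperature 0.7, maxTokens 2048
const answer = await llm.invoke("List three colors.", {
  temperature: 0.2,        // more deterministic for this call only
  maxTokens: 64,           // keep the reply short
  stopStrings: ["\n\n"],   // mapped internally to customStopTriggers
  seed: 42,                // fixed seed -> reproducible output
  clearHistory: true       // start from a clean session
});
```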
### Step 5: Batch Processing with History Isolation
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...
  /**
   * Batch processing with history isolation
   *
   * Processes multiple inputs sequentially, ensuring each gets
   * a clean chat history to prevent contamination.
   */
  async batch(inputs, config = {}) {
    const results = [];
    for (const input of inputs) {
      // Clear history before each batch item
      const result = await this._call(input, {
        ...config,
        clearHistory: true
      });
      results.push(result);
    }
    return results;
  }
}
```
**Why sequential processing?**
- Local models can't run truly in parallel
- Sequential ensures proper history isolation
- Each item gets a clean slate
### Step 6: Streaming Support
For real-time output (like ChatGPT's typing effect):
```javascript
export class LlamaCppLLM extends Runnable {
  // ... previous code ...
  async *_stream(input, config = {}) {
    await this._initialize();
    // Clear history if requested
    if (config.clearHistory) {
      this._chatSession.setChatHistory([]);
    }
    // Handle input types (same as _call)
    let messages;
    if (typeof input === 'string') {
      messages = [new HumanMessage(input)];
    } else if (Array.isArray(input)) {
      messages = input;
    } else {
      throw new Error('Input must be string or array of messages');
    }
    // Extract system message
    const systemMessages = messages.filter(msg => msg._type === 'system');
    const systemPrompt = systemMessages.length > 0
      ? systemMessages[0].content
      : '';
    // Set up chat history
    const chatHistory = this._messagesToChatHistory(messages);
    this._chatSession.setChatHistory(chatHistory);
    // ALWAYS set system prompt
    this._chatSession.systemPrompt = systemPrompt;
    try {
      // Build prompt options
      const promptOptions = {
        temperature: config.temperature ?? this.temperature,
        topP: config.topP ?? this.topP,
        topK: config.topK ?? this.topK,
        maxTokens: config.maxTokens ?? this.maxTokens,
        repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
        customStopTriggers: config.stopStrings ?? this.stopStrings
      };
      // Add random seed
      if (promptOptions.temperature > 0 && config.seed === undefined) {
        promptOptions.seed = Math.floor(Math.random() * 1000000);
      } else if (config.seed !== undefined) {
        promptOptions.seed = config.seed;
      }
      // Use onTextChunk callback to collect chunks
      const self = this;
      promptOptions.onTextChunk = (chunk) => {
        self._currentStreamChunks = self._currentStreamChunks || [];
        self._currentStreamChunks.push(chunk);
      };
      // Initialize chunk collection
      this._currentStreamChunks = [];
      // Start generation
      const responsePromise = this._chatSession.prompt('', promptOptions);
      // Yield chunks as they become available
      let lastYieldedIndex = 0;
      // Poll for new chunks
      while (true) {
        // Yield any new chunks
        while (lastYieldedIndex < this._currentStreamChunks.length) {
          yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
            additionalKwargs: { chunk: true }
          });
          lastYieldedIndex++;
        }
        // Check if generation is complete
        const isDone = await Promise.race([
          responsePromise.then(() => true),
          new Promise(resolve => setTimeout(() => resolve(false), 10))
        ]);
        if (isDone) {
          // Yield any remaining chunks
          while (lastYieldedIndex < this._currentStreamChunks.length) {
            yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
              additionalKwargs: { chunk: true }
            });
            lastYieldedIndex++;
          }
          break;
        }
      }
      // Wait for completion
      await responsePromise;
      // Clean up
      delete this._currentStreamChunks;
    } catch (error) {
      throw new Error(`Streaming failed: ${error.message}`);
    }
  }
}
```
**Streaming challenges**:
- `onTextChunk` is a synchronous callback
- Can't yield directly from the callback
- A polling loop yields chunks as they arrive (an alternative queue-based sketch follows below)
- The 10 ms polling interval balances responsiveness against CPU usage
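If you'd rather avoid the timer, one alternative is to push chunks into a small async queue and await them. This is an optional sketch, not part of the wrapper above; `ChunkQueue` is a hypothetical name:
```javascript
// Hypothetical alternative to polling: a tiny async queue bridging the
// synchronous onTextChunk callback and the async generator.
class ChunkQueue {
  constructor() { this.items = []; this.waiters = []; this.closed = false; }
  push(chunk) {
    const waiter = this.waiters.shift();
    if (waiter) waiter({ value: chunk, done: false });
    else this.items.push(chunk);
  }
  close() {
    this.closed = true;
    while (this.waiters.length) this.waiters.shift()({ value: undefined, done: true });
  }
  next() {
    if (this.items.length) return Promise.resolve({ value: this.items.shift(), done: false });
    if (this.closed) return Promise.resolve({ value: undefined, done: true });
    return new Promise(resolve => this.waiters.push(resolve));
  }
}
// Inside _stream(), the polling loop could then become:
//   const queue = new ChunkQueue();
//   promptOptions.onTextChunk = (chunk) => queue.push(chunk);
//   const responsePromise = this._chatSession.prompt('', promptOptions)
//     .finally(() => queue.close());
//   for (let item = await queue.next(); !item.done; item = await queue.next()) {
//     yield new AIMessage(item.value, { additionalKwargs: { chunk: true } });
//   }
//   await responsePromise;
```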
## Real-World Examples
### Example 1: Simple Text Generation
```javascript
const llm = new LlamaCppLLM({
  modelPath: './models/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf',
  temperature: 0.7,
  maxTokens: 100
});
// Simple string input
const response = await llm.invoke("What is 2+2?");
console.log(response.content); // "2+2 equals 4."
```
### Example 2: Conversation with System Prompt
```javascript
const messages = [
  new SystemMessage("You are a helpful math tutor."),
  new HumanMessage("What is 5*5?")
];
const response = await llm.invoke(messages);
console.log(response.content);
// "5 times 5 is 25. Here's a simple explanation..."
```
### Example 3: Using Qwen Chat Wrapper
```javascript
import { QwenChatWrapper } from 'node-llama-cpp';
const llm = new LlamaCppLLM({
  modelPath: './models/Qwen3-1.7B-Q6_K.gguf',
  temperature: 0.7,
  chatWrapper: new QwenChatWrapper({
    thoughts: 'discourage' // Prevents thinking tokens
  })
});
const response = await llm.invoke("What is AI?");
// Response won't include <think> tokens
```
### Example 4: Temperature Comparison
```javascript
const question = "Give me one adjective to describe winter:";
// Low temperature - consistent answers
const lowTemp = await llm.invoke(question, { temperature: 0.1, clearHistory: true });
// Likely: "cold"
// High temperature - varied answers
const highTemp = await llm.invoke(question, { temperature: 0.9, clearHistory: true });
// Could be: "frosty", "snowy", "icy", "chilly"
```
### Example 5: Streaming Output
```javascript
console.log('Response: ');
for await (const chunk of llm.stream("Tell me a fun fact about space")) {
  process.stdout.write(chunk.content); // No newline
}
console.log('\n');
// Output streams in real-time as it's generated
```
### Example 6: Batch Processing
```javascript
const questions = [
  "What is Python?",
  "What is JavaScript?",
  "What is Rust?"
];
const answers = await llm.batch(questions);
questions.forEach((q, i) => {
  console.log(`Q: ${q}`);
  console.log(`A: ${answers[i].content}`);
  console.log();
});
// Each answer is independent - no history contamination!
```
### Example 7: Using in a Pipeline
```javascript
import { PromptTemplate } from '../prompts/prompt-template.js';
const prompt = PromptTemplate.fromTemplate(
  "Translate the following to {language}: {text}"
);
const chain = prompt.pipe(llm);
const result = await chain.invoke({
  language: "Spanish",
  text: "Hello, how are you?"
});
console.log(result.content); // "Hola, ¿cómo estás?"
```
## Advanced Patterns
### Pattern 1: Model Pool (Reusing Loaded Models)
```javascript
class LLMPool {
  constructor() {
    this.models = new Map();
  }
  async get(modelPath, options = {}) {
    if (!this.models.has(modelPath)) {
      const llm = new LlamaCppLLM({ modelPath, ...options });
      await llm._initialize(); // Pre-load
      this.models.set(modelPath, llm);
    }
    return this.models.get(modelPath);
  }
  async disposeAll() {
    for (const llm of this.models.values()) {
      await llm.dispose();
    }
    this.models.clear();
  }
}
// Usage
const pool = new LLMPool();
const llm = await pool.get('./models/llama-3.1-8b.gguf');
```
### Pattern 2: Retry on Failure
```javascript
class ReliableLLM extends LlamaCppLLM {
  async _call(input, config = {}) {
    const maxRetries = config.maxRetries || 3;
    let lastError;
    for (let i = 0; i < maxRetries; i++) {
      try {
        return await super._call(input, config);
      } catch (error) {
        lastError = error;
        console.warn(`Attempt ${i + 1} failed, retrying...`);
        await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)));
      }
    }
    throw new Error(`All ${maxRetries} attempts failed: ${lastError.message}`);
  }
}
```
### Pattern 3: Token Counting
```javascript
class LlamaCppLLMWithCounting extends LlamaCppLLM {
  constructor(options) {
    super(options);
    this.totalTokens = 0;
  }
  async _call(input, config = {}) {
    const result = await super._call(input, config);
    // Rough token estimation (4 chars ≈ 1 token)
    const promptTokens = Math.ceil(JSON.stringify(input).length / 4);
    const completionTokens = Math.ceil(result.content.length / 4);
    this.totalTokens += promptTokens + completionTokens;
    result.additionalKwargs.usage = {
      promptTokens,
      completionTokens,
      totalTokens: promptTokens + completionTokens
    };
    return result;
  }
  getUsage() {
    return { totalTokens: this.totalTokens };
  }
}
```
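Usage of the counting subclass is the same as the base wrapper; the running total accumulates across calls:
```javascript
const llm = new LlamaCppLLMWithCounting({ modelPath: './model.gguf' });
const reply = await llm.invoke("Explain recursion in one sentence.");
console.log(reply.additionalKwargs.usage); // { promptTokens, completionTokens, totalTokens }
console.log(llm.getUsage());               // running total across all calls
```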
## Common Patterns and Best Practices
### ✅ DO:
```javascript
// Initialize once, use many times
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm.invoke("Question 1");
await llm.invoke("Question 2");
await llm.dispose(); // Cleanup when done
// Use Messages for structure
const messages = [
  new SystemMessage("You are helpful"),
  new HumanMessage("Hi")
];
// Clear history for independent calls
const response = await llm.invoke(messages, { clearHistory: true });
// Handle errors gracefully
try {
  const result = await llm.invoke(messages);
} catch (error) {
  console.error('Generation failed:', error);
}
```
### ❌ DON'T:
```javascript
// Don't create new LLM for each request (slow!)
for (const question of questions) {
  const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
  await llm.invoke(question); // Loads model every time!
}
// Don't forget to clear history in batch processing
// This will cause history contamination!
for (const q of questions) {
  await llm.invoke(q); // Sees all previous questions!
}
// Don't forget cleanup
// Missing: await llm.dispose()
```
## Performance Tips
### Tip 1: Preload Models
```javascript
// Load during app startup
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm._initialize(); // Force load now
// Later requests are instant
await llm.invoke("Fast response!");
```
### Tip 2: Use Batch Properly
```javascript
// This correctly isolates each question
const answers = await llm.batch(questions);
// Not: Sequential with contamination
for (const q of questions) {
  await llm.invoke(q); // History builds up!
}
```
### Tip 3: Adjust Context Size
```javascript
// Smaller context = faster, less memory
const fastLLM = new LlamaCppLLM({
  modelPath: './model.gguf',
  contextSize: 2048 // vs default 4096
});
```
### Tip 4: Use Appropriate Temperature
```javascript
// Factual answers: low temperature
const fact = await llm.invoke(query, { temperature: 0.1 });
// Creative writing: high temperature
const story = await llm.invoke(query, { temperature: 0.9 });
```
## Debugging Tips
### Tip 1: Enable Verbose Mode
```javascript
const llm = new LlamaCppLLM({
  modelPath: './model.gguf',
  verbose: true // Shows loading and generation details
});
```
### Tip 2: Test History Isolation
```javascript
// Test batch processing
const questions = ["Q1", "Q2", "Q3"];
const answers = await llm.batch(questions);
// Each answer should be independent
// If Q2 mentions Q1, history contamination occurred!
```
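A more concrete contamination check is to feed the model a fact in one batch item and probe for it in the next; if batching is isolated correctly, the second answer should not know the fact:
```javascript
const [first, second] = await llm.batch([
  "My name is Alice. Please greet me by name.",
  "What is my name?"
]);
console.log(first.content);  // should greet Alice
console.log(second.content); // should NOT mention Alice if history is isolated
```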
### Tip 3: Verify Streaming
```javascript
// Verify streaming works
console.log('Testing stream:');
for await (const chunk of llm.stream("Count to 5")) {
  console.log('Chunk:', chunk.content);
}
```
## Common Mistakes
### ❌ Mistake 1: Not Clearing History in Batches
```javascript
// Bad: History contamination
const pipeline = formatter.pipe(llm).pipe(parser);
const results = await pipeline.batch(inputs); // Q2 sees Q1!
```
**Fix**: The LlamaCppLLM.batch() method automatically clears history:
```javascript
// Good: Each input is isolated
const results = await llm.batch(inputs);
```
### ❌ Mistake 2: Forgetting the Random Seed
```javascript
// Bad: Temperature doesn't work
const response = await llm.invoke(prompt, { temperature: 0.9 });
// Without a random seed, you might get the same answer every time
```
**Fix**: Our implementation automatically adds a random seed:
```javascript
// Good: Randomness works properly
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}
```
### ❌ Mistake 3: Not Setting the System Prompt Properly
```javascript
// Bad: System prompt persists between calls
await llm.invoke([new SystemMessage("Be creative"), ...]);
await llm.invoke([new HumanMessage("Hi")]); // Still "creative"!
```
**Fix**: Always set the system prompt (empty string to clear):
```javascript
// Good: Always explicitly set or clear
this._chatSession.systemPrompt = systemPrompt || '';
```
## Mental Model
Think of the LLM wrapper as managing a conversation session:
```
Call 1: [System: "Be helpful", User: "Hi"]
  ↓
Model generates response
  ↓
Returns: AIMessage("Hello!")
Call 2: [User: "How are you?"]
  ↓
PROBLEM: Still has "Be helpful" system prompt!
PROBLEM: Might remember "Hi" conversation!
SOLUTION: Clear history + reset system prompt between calls
```
The wrapper handles:
- Loading models once
- Converting Messages to chat history
- Managing system prompts
- Clearing history when needed
- Streaming chunks
- Random seeds for temperature
- Error handling
## Summary
Congratulations! You now understand how to wrap a complex LLM library as a clean, composable Runnable with proper state management.
### Key Takeaways
1. **Lazy loading saves time**: Load models only when needed
2. **Messages enable structure**: Proper conversation formatting
3. **History isolation prevents bugs**: Critical for batch processing
4. **System prompts must be managed**: Always set or clear explicitly
5. **Streaming improves UX**: Real-time output feels responsive
6. **Random seeds enable temperature**: Required for randomness
7. **Chat wrappers add flexibility**: Support different models
8. **Sequential batch processing**: Local models can't truly parallelize
### What You Built
An LLM wrapper that:
- ✅ Loads models lazily
- ✅ Handles Messages properly
- ✅ Manages chat history correctly
- ✅ Isolates batches
- ✅ Supports streaming
- ✅ Handles system prompts
- ✅ Supports chat wrappers
- ✅ Adds random seeds for temperature
- ✅ Provides good error messages
### Critical Implementation Details
```javascript
// 1. Always clear and set system prompt
this._chatSession.systemPrompt = systemPrompt || '';
// 2. Use clearHistory for batch isolation
async batch(inputs, config = {}) {
  const results = [];
  for (const input of inputs) {
    const result = await this._call(input, {
      ...config,
      clearHistory: true
    });
    results.push(result);
  }
  return results;
}
// 3. Add random seed for temperature
if (promptOptions.temperature > 0 && config.seed === undefined) {
  promptOptions.seed = Math.floor(Math.random() * 1000000);
}
// 4. Use correct parameter names
customStopTriggers: config.stopStrings ?? this.stopStrings
```
### What's Next
In the next lesson, we'll explore **Context & Configuration** - how to pass state and settings through chains.
**Preview**: You'll learn:
- RunnableConfig object
- Callback systems
- Metadata tracking
- Debug modes
➡️ [Continue to Lesson 4: Context & Configuration](04-context.md)
## Additional Resources
- [node-llama-cpp Documentation](https://node-llama-cpp.withcat.ai)
- [Chat Wrappers Guide](https://node-llama-cpp.withcat.ai/guide/chat-wrapper)
- [Temperature Guide](https://node-llama-cpp.withcat.ai/guide/chat-session#temperature)
- [GGUF Model Format](https://huggingface.co/docs/hub/gguf)
## Questions & Discussion
**Q: Why do we always set the system prompt instead of only when present?**
A: To prevent contamination. If call 1 sets a system prompt but call 2 doesn't, call 2 would still use call 1's system prompt. Always setting it (even to an empty string) ensures clean state.
**Q: Why sequential batch processing instead of parallel?**
A: Local models (node-llama-cpp) can't run true parallel inference on a single model instance. The library serializes requests internally, so a parallel Promise.all() provides no benefit and can cause race conditions on the shared chat session.
**Q: Why do we need random seeds for temperature?**
A: The node-llama-cpp library states: "The randomness of the temperature can be controlled by the seed parameter. Setting a specific seed and a specific temperature will yield the same response every time for the same input." Without a random seed, high temperature might still give deterministic results.
**Q: Can I use multiple models simultaneously?**
A: Yes! Each LlamaCppLLM instance can have a different model. Just be aware of memory constraints - each model takes several GB of RAM.
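For example, two wrappers can coexist as long as both models fit in memory (the model paths below are illustrative):
```javascript
// Each instance loads and manages its own model
const chatLLM = new LlamaCppLLM({ modelPath: './models/llama-3.1-8b.gguf' });
const codeLLM = new LlamaCppLLM({ modelPath: './models/qwen2.5-coder-7b.gguf' });
const answer = await chatLLM.invoke("Summarize what a mutex is.");
const snippet = await codeLLM.invoke("Write a JavaScript debounce function.");
// Free several GB of RAM per model when finished
await chatLLM.dispose();
await codeLLM.dispose();
```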
**Q: What's the difference between customStopTriggers and stopStrings?**
A: `customStopTriggers` is the correct parameter name in node-llama-cpp. We accept `stopStrings` in our config for a more intuitive API, then map it to `customStopTriggers` internally.
---
**Built with ❤️ for learners who want to understand AI agents deeply**
[← Previous: Messages](02-messages.md) | [Tutorial Index](../README.md) | [Next: Context →](04-context.md)