# The LLM Wrapper
**Part 1: Foundation - Lesson 3**
> Wrapping node-llama-cpp as a Runnable for seamless integration
## Overview
In Lesson 1, you learned about Runnables - the composable interface. In Lesson 2, you mastered Messages - the data structures. Now we'll connect these concepts by wrapping **node-llama-cpp** (our local LLM) as a Runnable that understands Messages.
By the end of this lesson, you'll have an LLM wrapper that can generate text, handle conversations, stream responses, and integrate seamlessly with chains.
## Why Does This Matter?
### The Problem: LLMs Don't Compose
node-llama-cpp is excellent at what it does - running local LLMs efficiently. But when you're building agents, you need more than just an LLM. You need components that work together seamlessly.
**Without a composable framework:**
```javascript
import { getLlama } from 'node-llama-cpp';
// Each component is isolated - they don't know about each other
async function myAgent(userInput) {
// Step 1: Format the prompt
const prompt = myCustomFormatter(userInput);
// Step 2: Call the LLM
const llama = await getLlama();
const model = await llama.loadModel({ modelPath: './model.gguf' });
const response = await model.createCompletion(prompt);
// Step 3: Parse the response
const parsed = myCustomParser(response);
// Step 4: Maybe call a tool?
if (parsed.needsTool) {
const toolResult = await myTool(parsed.args);
// Now what? Call the LLM again? How do we loop?
// How do we add logging? Memory? Retries?
}
return parsed;
}
// Problems:
// - Can't reuse components
// - Can't chain operations
// - Hard to add logging, metrics, or debugging
// - Complex control flow for agents
// - Every new feature requires changing everything
```
**With a composable framework:**
```javascript
// Components that work together
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
// Simple usage
const response = await llm.invoke([
new SystemMessage("You are helpful"),
new HumanMessage("Hi")
]);
// Returns: AIMessage("Hello! How can I help you?")
// But the real power is composition
const agent = promptTemplate
.pipe(llm)
.pipe(outputParser)
.pipe(toolExecutor);
// Now you can:
// ✅ Reuse components in different chains
// ✅ Add logging with callbacks (no code changes)
// ✅ Build complex agents that use tools
// ✅ Test each component independently
// ✅ Swap LLMs without rewriting everything
```
### What the Wrapper Provides
The LLM wrapper isn't about making node-llama-cpp easier - it's about making it **work with everything else**:
1. **Common Interface**: Same `invoke()` / `stream()` / `batch()` as every other component
2. **Message Support**: Understands HumanMessage, AIMessage, SystemMessage
3. **Composability**: Works with `.pipe()` to chain operations
4. **Observability**: Callbacks work automatically for logging/metrics
5. **Configuration**: Runtime settings pass through cleanly
6. **History Isolation**: Proper batch processing without contamination
Think of it as an adapter that lets node-llama-cpp play nicely with the rest of your agent system.
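The adapter idea can be sketched without any model at all. In the minimal example below, `MiniRunnable`, `ModelAdapter`, `Uppercase`, and `thirdPartyModel` are all hypothetical stand-ins (the real `Runnable` base class comes from Lesson 1); the point is only that once a foreign API sits behind `invoke()`, it can be piped like any other component.

```javascript
// Minimal stand-in for the Runnable base class from Lesson 1 (hypothetical)
class MiniRunnable {
  async invoke(input) {
    return this._call(input);
  }

  pipe(next) {
    const first = this;
    // The piped result is itself a runnable: output of `first` feeds `next`
    return new (class extends MiniRunnable {
      async _call(input) {
        return next.invoke(await first.invoke(input));
      }
    })();
  }
}

// A third-party style API that knows nothing about Runnables
const thirdPartyModel = { complete: async text => `echo: ${text}` };

// The adapter: exposes the foreign API through the common interface
class ModelAdapter extends MiniRunnable {
  constructor(model) {
    super();
    this.model = model;
  }
  async _call(input) {
    return this.model.complete(input);
  }
}

class Uppercase extends MiniRunnable {
  async _call(input) {
    return input.toUpperCase();
  }
}

const chain = new ModelAdapter(thirdPartyModel).pipe(new Uppercase());
chain.invoke('hi').then(output => console.log(output)); // "ECHO: HI"
```

Swapping `thirdPartyModel` for another backend only touches `ModelAdapter`; the rest of the chain stays the same.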
## Learning Objectives
By the end of this lesson, you will:
- ✅ Understand how to wrap complex libraries as Runnables
- ✅ Convert Messages to LLM chat history
- ✅ Handle model loading and lifecycle
- ✅ Implement streaming for real-time output
- ✅ Add temperature and other generation parameters
- ✅ Manage context windows and chat history
- ✅ Handle batch processing with history isolation
## Core Concepts
### What is an LLM Wrapper?
An LLM wrapper is an abstraction layer that:
1. **Hides complexity** - No need to manage contexts, sessions, or cleanup
2. **Provides a standard interface** - Same API regardless of underlying model
3. **Handles conversion** - Transforms Messages into model-specific chat history
4. **Manages resources** - Automatic initialization and cleanup
5. **Enables composition** - Works seamlessly in chains
6. **Isolates state** - Prevents history contamination in batch processing
### The Wrapper's Responsibilities
```
Input (Messages)
↓
[1. Convert to Chat History]
↓
[2. Manage System Prompt]
↓
[3. Call LLM]
↓
[4. Parse Response]
↓
Output (AIMessage)
```
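The four stages can be sketched in isolation with a stand-in for the model. Everything here is hypothetical (`fakeModel`, `runPipeline`, and the plain-object messages); it only shows how the stages hand data to each other and, unlike the real wrapper built below, it keeps system messages out of the history to keep the sketch short.

```javascript
// Hypothetical stand-in for a model: echoes the last user turn
function fakeModel({ systemPrompt, history }) {
  const lastUser = [...history].reverse().find(item => item.type === 'user');
  const text = lastUser ? lastUser.text : '';
  return `  [${systemPrompt || 'no system prompt'}] You said: ${text}  `;
}

function runPipeline(messages, model = fakeModel) {
  // 1. Convert to chat history (non-system turns only in this sketch)
  const history = messages
    .filter(msg => msg._type !== 'system')
    .map(msg => (msg._type === 'ai'
      ? { type: 'model', response: msg.content }
      : { type: 'user', text: msg.content }));

  // 2. Manage system prompt: first system message wins, otherwise empty
  const systemMessage = messages.find(msg => msg._type === 'system');
  const systemPrompt = systemMessage ? systemMessage.content : '';

  // 3. Call the model
  const rawText = model({ systemPrompt, history });

  // 4. Parse the response into an AIMessage-shaped object
  return { _type: 'ai', content: rawText.trim() };
}

const reply = runPipeline([
  { _type: 'system', content: 'Be brief' },
  { _type: 'human', content: 'Hi' }
]);
console.log(reply.content); // "[Be brief] You said: Hi"
```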
### Key Challenges
1. **Model Loading**: Models are large and slow to load
2. **Chat History Format**: Convert Messages to node-llama-cpp format
3. **System Prompt Management**: Clear and set for each call
4. **Context Management**: Limited context windows
5. **Streaming**: Real-time output is complex
6. **Batch Isolation**: Prevent history contamination
7. **Error Handling**: Models can fail in various ways
8. **Chat Wrappers**: Different models need different formats
## Implementation Deep Dive
Let's build the LLM wrapper step by step.
### Step 1: The Base Structure
**Location:** `src/llm/llama-cpp-llm.js`
```javascript
import { Runnable } from './runnable.js';
import { AIMessage, HumanMessage } from './message.js'; // HumanMessage is used in _call
import { getLlama, LlamaChatSession } from 'node-llama-cpp';
export class LlamaCppLLM extends Runnable {
constructor(options = {}) {
super();
// Model configuration
this.modelPath = options.modelPath;
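this.temperature = options.temperature ?? 0.7;
this.maxTokens = options.maxTokens ?? 2048;
this.contextSize = options.contextSize ?? 4096;
// Optional settings read by later steps; left undefined unless provided
this.topP = options.topP;
this.topK = options.topK;
this.repeatPenalty = options.repeatPenalty;
this.stopStrings = options.stopStrings;
this.batchSize = options.batchSize;
this.verbose = options.verbose ?? false;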
// Chat wrapper configuration (auto-detects by default)
this.chatWrapper = options.chatWrapper ?? 'auto';
// Internal state
this._llama = null;
this._model = null;
this._context = null;
this._chatSession = null;
this._initialized = false;
}
async _call(input, config) {
// Will implement next
}
}
```
**Key decisions**:
- Stores configuration (temperature, max tokens, etc.)
- Supports custom chat wrappers (e.g., QwenChatWrapper)
- Tracks internal state (model, context, session)
- Lazy initialization (load on first use)
### Step 2: Model Initialization with Chat Wrapper Support
```javascript
export class LlamaCppLLM extends Runnable {
// ... constructor ...
/**
* Initialize the model (lazy loading)
*/
async _initialize() {
if (this._initialized) return;
if (this.verbose) {
console.log(`Loading model: ${this.modelPath}`);
}
try {
// Step 1: Get llama instance
this._llama = await getLlama();
// Step 2: Load the model
this._model = await this._llama.loadModel({
modelPath: this.modelPath
});
// Step 3: Create context (working memory)
this._context = await this._model.createContext({
contextSize: this.contextSize,
batchSize: this.batchSize
});
// Step 4: Create chat session with optional chat wrapper
const contextSequence = this._context.getSequence();
const sessionConfig = { contextSequence };
// Add custom chat wrapper if specified
if (this.chatWrapper !== 'auto') {
sessionConfig.chatWrapper = this.chatWrapper;
}
this._chatSession = new LlamaChatSession(sessionConfig);
this._initialized = true;
if (this.verbose) {
console.log('✓ Model loaded successfully');
if (this.chatWrapper !== 'auto') {
console.log(`✓ Using custom chat wrapper: ${this.chatWrapper.constructor.name}`);
}
}
} catch (error) {
throw new Error(
`Failed to initialize model at ${this.modelPath}: ${error.message}`
);
}
}
/**
* Cleanup resources
*/
async dispose() {
if (this._context) {
await this._context.dispose();
this._context = null;
}
if (this._model) {
await this._model.dispose();
this._model = null;
}
this._chatSession = null;
this._initialized = false;
if (this.verbose) {
console.log('✓ Model resources disposed');
}
}
}
```
**Why lazy loading?**
- Models take 5-30 seconds to load
- Don't load until actually needed
- Share one loaded model across multiple calls
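One gap worth knowing about: with a plain `_initialized` flag, two calls that arrive before loading finishes both pass the check and both start loading. A common remedy is to cache the in-flight promise instead of a boolean. The sketch below uses a hypothetical `LazyResource` helper and a fake loader in place of `getLlama()` and `loadModel()`; it is one possible approach, not part of the lesson's code.

```javascript
// Hypothetical helper: runs an expensive async loader at most once,
// even when several callers ask for the resource at the same time.
class LazyResource {
  constructor(loader) {
    this._loader = loader;
    this._promise = null;
  }

  get() {
    if (!this._promise) {
      // Cache the promise itself, not the result, so concurrent callers
      // share one load. Reset on failure so a later call can retry.
      this._promise = this._loader().catch(error => {
        this._promise = null;
        throw error;
      });
    }
    return this._promise;
  }
}

// Fake loader standing in for the slow model load
let loadCount = 0;
const model = new LazyResource(async () => {
  loadCount += 1;
  await new Promise(resolve => setTimeout(resolve, 20));
  return { name: 'fake-model' };
});

// Two overlapping requests trigger a single load
Promise.all([model.get(), model.get()]).then(([a, b]) => {
  console.log(loadCount, a === b); // 1 true
});
```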
**Chat Wrapper Support**:
- Defaults to 'auto' (library auto-detects)
- Supports custom wrappers like QwenChatWrapper for specific models
- Useful for controlling model behavior (e.g., discouraging thoughts)
### Step 3: Converting Messages to Chat History
```javascript
export class LlamaCppLLM extends Runnable {
// ... previous code ...
/**
* Convert our Message objects to node-llama-cpp chat history format
*/
_messagesToChatHistory(messages) {
return messages.map(msg => {
// System messages: instructions for the AI
if (msg._type === 'system') {
return { type: 'system', text: msg.content };
}
// Human messages: user input
else if (msg._type === 'human') {
return { type: 'user', text: msg.content };
}
// AI messages: previous AI responses
else if (msg._type === 'ai') {
return { type: 'model', response: msg.content };
}
// Tool messages: results from tool execution
else if (msg._type === 'tool') {
return { type: 'system', text: `Tool Result: ${msg.content}` };
}
// Fallback: treat unknown types as user messages
return { type: 'user', text: msg.content };
});
}
}
```
**Key insight**: This bridges between your standardized Message types and what node-llama-cpp expects. Different models may need different chat formats, which is why chat wrappers exist.
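The same mapping can be written as a lookup table, which makes it easy to unit-test without loading a model. This is a sketch: `ROLE_CONVERTERS` and `toChatHistory` are illustrative names, plain objects stand in for the Message classes from Lesson 2, and the entry shapes simply mirror the method above. It is worth checking them against the typings of your installed node-llama-cpp version, since the `model` entry's `response` field may be expected as an array there.

```javascript
// Illustrative lookup table: message type -> chat history entry
const ROLE_CONVERTERS = new Map([
  ['system', msg => ({ type: 'system', text: msg.content })],
  ['human', msg => ({ type: 'user', text: msg.content })],
  // Assumption: mirrors the lesson's shape; verify against your library version
  ['ai', msg => ({ type: 'model', response: msg.content })],
  ['tool', msg => ({ type: 'system', text: `Tool Result: ${msg.content}` })]
]);

function toChatHistory(messages) {
  return messages.map(msg => {
    // Unknown types fall back to a user turn, as in the method above
    const convert = ROLE_CONVERTERS.get(msg._type) ?? ROLE_CONVERTERS.get('human');
    return convert(msg);
  });
}

const history = toChatHistory([
  { _type: 'system', content: 'You are helpful' },
  { _type: 'human', content: 'Hi' },
  { _type: 'ai', content: 'Hello!' }
]);
console.log(history.map(item => item.type)); // [ 'system', 'user', 'model' ]
```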
### Step 4: The Main Generation Method
```javascript
export class LlamaCppLLM extends Runnable {
// ... previous code ...
async _call(input, config = {}) {
// Initialize if needed
await this._initialize();
// Clear history if requested (important for batch processing)
if (config.clearHistory) {
this._chatSession.setChatHistory([]);
}
// Handle different input types
let messages;
if (typeof input === 'string') {
messages = [new HumanMessage(input)];
} else if (Array.isArray(input)) {
messages = input;
} else {
throw new Error('Input must be string or array of messages');
}
// Extract system message if present
const systemMessages = messages.filter(msg => msg._type === 'system');
const systemPrompt = systemMessages.length > 0
? systemMessages[0].content
: '';
// Convert our Message objects to llama.cpp format
const chatHistory = this._messagesToChatHistory(messages);
this._chatSession.setChatHistory(chatHistory);
// ALWAYS set system prompt (either new value or empty string to clear)
this._chatSession.systemPrompt = systemPrompt;
try {
// Build prompt options
const promptOptions = {
temperature: config.temperature ?? this.temperature,
topP: config.topP ?? this.topP,
topK: config.topK ?? this.topK,
maxTokens: config.maxTokens ?? this.maxTokens,
repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
customStopTriggers: config.stopStrings ?? this.stopStrings
};
// Add random seed if temperature > 0 and no seed specified
// This ensures randomness works properly
if (promptOptions.temperature > 0 && config.seed === undefined) {
promptOptions.seed = Math.floor(Math.random() * 1000000);
} else if (config.seed !== undefined) {
promptOptions.seed = config.seed;
}
// Generate response using prompt
const response = await this._chatSession.prompt('', promptOptions);
// Return as AIMessage for consistency
return new AIMessage(response);
} catch (error) {
throw new Error(`Generation failed: ${error.message}`);
}
}
}
```
**Critical details**:
- Always clears and sets system prompt (prevents contamination)
- Adds random seed for proper temperature behavior
- Uses `customStopTriggers` (correct parameter name)
- Supports `clearHistory` for batch processing
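The option-merging and seed rules are easier to reason about as a pure function. The sketch below uses a hypothetical helper, `resolvePromptOptions`, that follows the same rules as `_call` for a subset of the options; the `random` parameter is an addition for this example so the seed logic can be tested deterministically.

```javascript
// Hypothetical helper mirroring the merge and seed rules in _call
function resolvePromptOptions(config = {}, defaults = {}, random = Math.random) {
  const options = {
    // ?? (not ||) so an explicit temperature of 0 is respected
    temperature: config.temperature ?? defaults.temperature,
    maxTokens: config.maxTokens ?? defaults.maxTokens,
    // stopStrings is the wrapper's name; customStopTriggers is the
    // name the lesson passes on to node-llama-cpp
    customStopTriggers: config.stopStrings ?? defaults.stopStrings
  };

  if (config.seed !== undefined) {
    options.seed = config.seed; // caller asked for reproducibility
  } else if (options.temperature > 0) {
    options.seed = Math.floor(random() * 1000000); // fresh seed per call
  }
  return options;
}

const defaults = { temperature: 0.7, maxTokens: 2048 };

console.log(resolvePromptOptions({ temperature: 0 }, defaults));
// { temperature: 0, maxTokens: 2048, customStopTriggers: undefined }

console.log(resolvePromptOptions({ seed: 42 }, defaults).seed); // 42
```

Passing a fixed `seed` in the config is how a caller opts back into repeatable output while keeping a non-zero temperature.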
### Step 5: Batch Processing with History Isolation
```javascript
export class LlamaCppLLM extends Runnable {
// ... previous code ...
/**
* Batch processing with history isolation
*
* Processes multiple inputs sequentially, ensuring each gets
* a clean chat history to prevent contamination.
*/
async batch(inputs, config = {}) {
const results = [];
for (const input of inputs) {
// Clear history before each batch item
const result = await this._call(input, {
...config,
clearHistory: true
});
results.push(result);
}
return results;
}
}
```
**Why sequential processing?**
- This wrapper shares one chat session, which can serve only one prompt at a time
- Sequential ensures proper history isolation
- Each item gets a clean slate
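To see what isolation buys you, here is a toy session whose reply reveals how much history it can see. `FakeSession` and `batchSequential` are hypothetical names used only for this illustration; the real session is node-llama-cpp's.

```javascript
// Hypothetical stateful session: its reply reports how many turns it sees
class FakeSession {
  constructor() {
    this.history = [];
  }
  clear() {
    this.history = [];
  }
  async prompt(text) {
    this.history.push(text);
    return `${text} (turns visible: ${this.history.length})`;
  }
}

async function batchSequential(session, inputs, { clearHistory = true } = {}) {
  const results = [];
  for (const input of inputs) {
    if (clearHistory) session.clear();         // clean slate per item
    results.push(await session.prompt(input)); // one at a time, in order
  }
  return results;
}

async function demo() {
  const isolated = await batchSequential(new FakeSession(), ['Q1', 'Q2']);
  const leaky = await batchSequential(new FakeSession(), ['Q1', 'Q2'], {
    clearHistory: false
  });
  return { isolated, leaky };
}

demo().then(({ isolated, leaky }) => {
  console.log(isolated[1]); // "Q2 (turns visible: 1)"
  console.log(leaky[1]);    // "Q2 (turns visible: 2)"
});
```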
### Step 6: Streaming Support
For real-time output (like ChatGPT's typing effect):
```javascript
export class LlamaCppLLM extends Runnable {
// ... previous code ...
async *_stream(input, config = {}) {
await this._initialize();
// Clear history if requested
if (config.clearHistory) {
this._chatSession.setChatHistory([]);
}
// Handle input types (same as _call)
let messages;
if (typeof input === 'string') {
messages = [new HumanMessage(input)];
} else if (Array.isArray(input)) {
messages = input;
} else {
throw new Error('Input must be string or array of messages');
}
// Extract system message
const systemMessages = messages.filter(msg => msg._type === 'system');
const systemPrompt = systemMessages.length > 0
? systemMessages[0].content
: '';
// Set up chat history
const chatHistory = this._messagesToChatHistory(messages);
this._chatSession.setChatHistory(chatHistory);
// ALWAYS set system prompt
this._chatSession.systemPrompt = systemPrompt;
try {
// Build prompt options
const promptOptions = {
temperature: config.temperature ?? this.temperature,
topP: config.topP ?? this.topP,
topK: config.topK ?? this.topK,
maxTokens: config.maxTokens ?? this.maxTokens,
repeatPenalty: config.repeatPenalty ?? this.repeatPenalty,
customStopTriggers: config.stopStrings ?? this.stopStrings
};
// Add random seed
if (promptOptions.temperature > 0 && config.seed === undefined) {
promptOptions.seed = Math.floor(Math.random() * 1000000);
} else if (config.seed !== undefined) {
promptOptions.seed = config.seed;
}
// Use onTextChunk callback to collect chunks
const self = this;
promptOptions.onTextChunk = (chunk) => {
self._currentStreamChunks = self._currentStreamChunks || [];
self._currentStreamChunks.push(chunk);
};
// Initialize chunk collection
this._currentStreamChunks = [];
// Start generation
const responsePromise = this._chatSession.prompt('', promptOptions);
// Yield chunks as they become available
let lastYieldedIndex = 0;
// Poll for new chunks
while (true) {
// Yield any new chunks
while (lastYieldedIndex < this._currentStreamChunks.length) {
yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
additionalKwargs: { chunk: true }
});
lastYieldedIndex++;
}
// Check if generation is complete
const isDone = await Promise.race([
responsePromise.then(() => true),
new Promise(resolve => setTimeout(() => resolve(false), 10))
]);
if (isDone) {
// Yield any remaining chunks
while (lastYieldedIndex < this._currentStreamChunks.length) {
yield new AIMessage(this._currentStreamChunks[lastYieldedIndex], {
additionalKwargs: { chunk: true }
});
lastYieldedIndex++;
}
break;
}
}
// Wait for completion
await responsePromise;
// Clean up
delete this._currentStreamChunks;
} catch (error) {
throw new Error(`Streaming failed: ${error.message}`);
}
}
}
```
**Streaming challenges**:
- `onTextChunk` is a synchronous callback
- Can't yield directly from callback
- Use polling mechanism to yield as chunks arrive
- 10ms polling interval balances responsiveness vs CPU usage
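An alternative to polling is a small push queue: the callback writes chunks into it and the generator awaits new data instead of waking every 10ms. The sketch below is one way to do that under the assumption that chunks arrive through a synchronous callback; `createChunkQueue`, `fakeGenerate`, and `streamWords` are hypothetical names, and `fakeGenerate` stands in for the real prompt call.

```javascript
// Hypothetical bridge from a push-style callback to an async iterator
function createChunkQueue() {
  const buffered = [];
  let wake = null;      // resolver for a consumer that is waiting
  let finished = false;
  let failure = null;

  const notify = () => {
    if (wake) {
      wake();
      wake = null;
    }
  };

  return {
    push(chunk) { buffered.push(chunk); notify(); },
    close() { finished = true; notify(); },
    fail(error) { failure = error; finished = true; notify(); },
    async *[Symbol.asyncIterator]() {
      while (true) {
        // Drain everything that arrived since the last wake-up
        while (buffered.length > 0) yield buffered.shift();
        if (failure) throw failure;
        if (finished) return;
        // Sleep until push(), close(), or fail() calls notify()
        await new Promise(resolve => { wake = resolve; });
      }
    }
  };
}

// Fake generator standing in for a prompt call with a per-chunk callback
async function fakeGenerate(words, onTextChunk) {
  for (const word of words) {
    await new Promise(resolve => setTimeout(resolve, 5));
    onTextChunk(word);
  }
}

async function* streamWords(words) {
  const queue = createChunkQueue();
  fakeGenerate(words, chunk => queue.push(chunk))
    .then(() => queue.close(), error => queue.fail(error));
  yield* queue;
}

async function collect(words) {
  const chunks = [];
  for await (const chunk of streamWords(words)) chunks.push(chunk);
  return chunks;
}

collect(['Hello', ' ', 'world']).then(chunks => console.log(chunks.join('')));
// Hello world
```

The queue is local to each call, so concurrent streams would not share state the way an instance-level chunk array does.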
## Real-World Examples
### Example 1: Simple Text Generation
```javascript
const llm = new LlamaCppLLM({
modelPath: './models/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf',
temperature: 0.7,
maxTokens: 100
});
// Simple string input
const response = await llm.invoke("What is 2+2?");
console.log(response.content); // "2+2 equals 4."
```
### Example 2: Conversation with System Prompt
```javascript
const messages = [
new SystemMessage("You are a helpful math tutor."),
new HumanMessage("What is 5*5?")
];
const response = await llm.invoke(messages);
console.log(response.content);
// "5 times 5 is 25. Here's a simple explanation..."
```
### Example 3: Using Qwen Chat Wrapper
```javascript
import { QwenChatWrapper } from 'node-llama-cpp';
const llm = new LlamaCppLLM({
modelPath: './models/Qwen3-1.7B-Q6_K.gguf',
temperature: 0.7,
chatWrapper: new QwenChatWrapper({
thoughts: 'discourage' // Prevents thinking tokens
})
});
const response = await llm.invoke("What is AI?");
// Response won't include <think> tokens
```
### Example 4: Temperature Comparison
```javascript
const question = "Give me one adjective to describe winter:";
// Low temperature - consistent answers
// clearHistory avoids reaching into the private _chatSession field
const lowTemp = await llm.invoke(question, { temperature: 0.1, clearHistory: true });
// Likely: "cold"
// High temperature - varied answers
const highTemp = await llm.invoke(question, { temperature: 0.9, clearHistory: true });
// Could be: "frosty", "snowy", "icy", "chilly"
```
### Example 5: Streaming Output
```javascript
console.log('Response: ');
for await (const chunk of llm.stream("Tell me a fun fact about space")) {
process.stdout.write(chunk.content); // No newline
}
console.log('\n');
// Output streams in real-time as it's generated
```
### Example 6: Batch Processing
```javascript
const questions = [
"What is Python?",
"What is JavaScript?",
"What is Rust?"
];
const answers = await llm.batch(questions);
questions.forEach((q, i) => {
console.log(`Q: ${q}`);
console.log(`A: ${answers[i].content}`);
console.log();
});
// Each answer is independent - no history contamination!
```
### Example 7: Using in a Pipeline
```javascript
import { PromptTemplate } from '../prompts/prompt-template.js';
const prompt = PromptTemplate.fromTemplate(
"Translate the following to {language}: {text}"
);
const chain = prompt.pipe(llm);
const result = await chain.invoke({
language: "Spanish",
text: "Hello, how are you?"
});
console.log(result.content); // "Hola, ¿cómo estás?"
```
## Advanced Patterns
### Pattern 1: Model Pool (Reusing Loaded Models)
```javascript
class LLMPool {
constructor() {
this.models = new Map();
}
async get(modelPath, options = {}) {
if (!this.models.has(modelPath)) {
const llm = new LlamaCppLLM({ modelPath, ...options });
await llm._initialize(); // Pre-load
this.models.set(modelPath, llm);
}
return this.models.get(modelPath);
}
async disposeAll() {
for (const llm of this.models.values()) {
await llm.dispose();
}
this.models.clear();
}
}
// Usage
const pool = new LLMPool();
const llm = await pool.get('./models/llama-3.1-8b.gguf');
```
### Pattern 2: Retry on Failure
```javascript
class ReliableLLM extends LlamaCppLLM {
async _call(input, config = {}) {
const maxRetries = config.maxRetries ?? 3;
let lastError;
for (let i = 0; i < maxRetries; i++) {
try {
return await super._call(input, config);
} catch (error) {
lastError = error;
// Back off only when another attempt remains
if (i < maxRetries - 1) {
console.warn(`Attempt ${i + 1} failed, retrying...`);
await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)));
}
}
}
throw new Error(`All ${maxRetries} attempts failed: ${lastError?.message}`);
}
}
```
### Pattern 3: Token Counting
```javascript
class LlamaCppLLMWithCounting extends LlamaCppLLM {
constructor(options) {
super(options);
this.totalTokens = 0;
}
async _call(input, config = {}) {
const result = await super._call(input, config);
// Rough token estimation (4 chars ≈ 1 token)
const promptTokens = Math.ceil(JSON.stringify(input).length / 4);
const completionTokens = Math.ceil(result.content.length / 4);
this.totalTokens += promptTokens + completionTokens;
// Guard in case the message was created without additionalKwargs
result.additionalKwargs = {
...result.additionalKwargs,
usage: {
promptTokens,
completionTokens,
totalTokens: promptTokens + completionTokens
}
};
return result;
}
getUsage() {
return { totalTokens: this.totalTokens };
}
}
```
## Common Patterns and Best Practices
### ✅ DO:
```javascript
// Initialize once, use many times
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm.invoke("Question 1");
await llm.invoke("Question 2");
await llm.dispose(); // Cleanup when done
// Use Messages for structure
const messages = [
new SystemMessage("You are helpful"),
new HumanMessage("Hi")
];
// Clear history for independent calls
const response = await llm.invoke(messages, { clearHistory: true });
// Handle errors gracefully
try {
const result = await llm.invoke(messages);
} catch (error) {
console.error('Generation failed:', error);
}
```
### ❌ DON'T:
```javascript
// Don't create new LLM for each request (slow!)
for (const question of questions) {
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm.invoke(question); // Loads model every time!
}
// Don't forget to clear history in batch processing
// This will cause history contamination!
for (const q of questions) {
await llm.invoke(q); // Sees all previous questions!
}
// Don't forget cleanup
// Missing: await llm.dispose()
```
## Performance Tips
### Tip 1: Preload Models
```javascript
// Load during app startup
const llm = new LlamaCppLLM({ modelPath: './model.gguf' });
await llm._initialize(); // Force load now
// Later requests are instant
await llm.invoke("Fast response!");
```
### Tip 2: Use Batch Properly
```javascript
// This correctly isolates each question
const answers = await llm.batch(questions);
// Not: Sequential with contamination
for (const q of questions) {
await llm.invoke(q); // History builds up!
}
```
### Tip 3: Adjust Context Size
```javascript
// Smaller context = faster, less memory
const fastLLM = new LlamaCppLLM({
modelPath: './model.gguf',
contextSize: 2048 // vs default 4096
});
```
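A smaller context also fills up sooner, so long conversations may need trimming before they reach the model. The sketch below keeps system messages plus the most recent turns that fit a token budget. `trimToBudget` and `estimateTokens` are hypothetical helpers, and the estimate is the same rough 4-characters-per-token heuristic used in the token counting pattern, not the model's real tokenizer.

```javascript
// Rough estimate only: about 4 characters per token
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: keep system messages plus the newest turns that fit
function trimToBudget(messages, maxTokens) {
  const system = messages.filter(msg => msg._type === 'system');
  const turns = messages.filter(msg => msg._type !== 'system');

  let used = system.reduce((sum, msg) => sum + estimateTokens(msg.content), 0);
  const kept = [];

  // Walk backwards so the most recent turns are kept first
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].content);
    if (used + cost > maxTokens) break;
    used += cost;
    kept.unshift(turns[i]);
  }
  return [...system, ...kept];
}

const conversation = [
  { _type: 'system', content: 'Be brief.' },   // 9 chars, about 3 tokens
  { _type: 'human', content: 'a'.repeat(40) }, // about 10 tokens
  { _type: 'ai', content: 'b'.repeat(40) },    // about 10 tokens
  { _type: 'human', content: 'c'.repeat(20) }  // about 5 tokens
];

const trimmed = trimToBudget(conversation, 20);
console.log(trimmed.map(msg => msg._type)); // [ 'system', 'ai', 'human' ]
```

In practice the budget would be set below `contextSize` to leave room for the response (`maxTokens`).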
### Tip 4: Use Appropriate Temperature
```javascript
// Factual answers: low temperature
const fact = await llm.invoke(query, { temperature: 0.1 });
// Creative writing: high temperature
const story = await llm.invoke(query, { temperature: 0.9 });
```
## Debugging Tips
### Tip 1: Enable Verbose Mode
```javascript
const llm = new LlamaCppLLM({
modelPath: './model.gguf',
verbose: true // Shows loading and generation details
});
```
### Tip 2: Test History Isolation
```javascript
// Test batch processing
const questions = ["Q1", "Q2", "Q3"];
const answers = await llm.batch(questions);
// Each answer should be independent
// If Q2 mentions Q1, history contamination occurred!
```
### Tip 3: Verify Streaming
```javascript
// Verify streaming works
console.log('Testing stream:');
for await (const chunk of llm.stream("Count to 5")) {
console.log('Chunk:', chunk.content);
}
```
## Common Mistakes
### ❌ Mistake 1: Not Clearing History in Batches
```javascript
// Bad: History contamination
const pipeline = formatter.pipe(llm).pipe(parser);
const results = await pipeline.batch(inputs); // Q2 sees Q1!
```
**Fix**: The LlamaCppLLM.batch() method automatically clears history:
```javascript
// Good: Each input is isolated
const results = await llm.batch(inputs);
```
### ❌ Mistake 2: Forgetting Random Seed
```javascript
// Bad: Temperature doesn't work
const response = await llm.invoke(prompt, { temperature: 0.9 });
// Without random seed, might get same answer
```
**Fix**: Our implementation automatically adds random seed:
```javascript
// Good: Randomness works properly
if (promptOptions.temperature > 0 && config.seed === undefined) {
promptOptions.seed = Math.floor(Math.random() * 1000000);
}
```
### ❌ Mistake 3: Not Setting System Prompt Properly
```javascript
// Bad: System prompt persists between calls
await llm.invoke([new SystemMessage("Be creative"), ...]);
await llm.invoke([new HumanMessage("Hi")]); // Still "creative"!
```
**Fix**: Always set system prompt (empty string to clear):
```javascript
// Good: Always explicitly set or clear
this._chatSession.systemPrompt = systemPrompt || '';
```
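The effect is easy to reproduce with a stand-in session. In this sketch, `StickySession`, `invokeLeaky`, and `invokeClean` are hypothetical; the session simply prefixes its reply with whatever system prompt it currently holds, so a leftover prompt is visible in the output.

```javascript
// Hypothetical session that keeps its system prompt between calls
class StickySession {
  constructor() {
    this.systemPrompt = '';
  }
  respond(userText) {
    const prefix = this.systemPrompt ? `[${this.systemPrompt}] ` : '';
    return `${prefix}${userText}`;
  }
}

function firstSystemPrompt(messages) {
  const system = messages.find(msg => msg._type === 'system');
  return system ? system.content : '';
}

// Buggy: only touches the prompt when one is supplied
function invokeLeaky(session, messages) {
  const prompt = firstSystemPrompt(messages);
  if (prompt) session.systemPrompt = prompt;
  return session.respond(messages[messages.length - 1].content);
}

// Fixed: always assigns, so an absent prompt clears the old one
function invokeClean(session, messages) {
  session.systemPrompt = firstSystemPrompt(messages);
  return session.respond(messages[messages.length - 1].content);
}

const withSystem = [
  { _type: 'system', content: 'Be creative' },
  { _type: 'human', content: 'Hi' }
];
const withoutSystem = [{ _type: 'human', content: 'Hi again' }];

const leaky = new StickySession();
invokeLeaky(leaky, withSystem);
console.log(invokeLeaky(leaky, withoutSystem)); // "[Be creative] Hi again"

const clean = new StickySession();
invokeClean(clean, withSystem);
console.log(invokeClean(clean, withoutSystem)); // "Hi again"
```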
## Mental Model
Think of the LLM wrapper as managing a conversation session:
```
Call 1: [System: "Be helpful", User: "Hi"]
↓
Model generates response
↓
Returns: AIMessage("Hello!")
Call 2: [User: "How are you?"]
↓
PROBLEM: Still has "Be helpful" system prompt!
PROBLEM: Might remember "Hi" conversation!
SOLUTION: Clear history + reset system prompt between calls
```
The wrapper handles:
- Loading models once
- Converting Messages to chat history
- Managing system prompts
- Clearing history when needed
- Streaming chunks
- Random seeds for temperature
- Error handling
## Summary
Congratulations! You now understand how to wrap a complex LLM library as a clean, composable Runnable with proper state management.
### Key Takeaways
1. **Lazy loading saves time**: Load models only when needed
2. **Messages enable structure**: Proper conversation formatting
3. **History isolation prevents bugs**: Critical for batch processing
4. **System prompts must be managed**: Always set or clear explicitly
5. **Streaming improves UX**: Real-time output feels responsive
6. **Random seeds enable temperature**: Required for randomness
7. **Chat wrappers add flexibility**: Support different models
8. **Sequential batch processing**: Local models can't truly parallelize
### What You Built
An LLM wrapper that:
- ✅ Loads models lazily
- ✅ Handles Messages properly
- ✅ Manages chat history correctly
- ✅ Isolates batches
- ✅ Supports streaming
- ✅ Handles system prompts
- ✅ Supports chat wrappers
- ✅ Adds random seeds for temperature
- ✅ Provides good error messages
### Critical Implementation Details
```javascript
// 1. Always clear and set system prompt
this._chatSession.systemPrompt = systemPrompt || '';
// 2. Use clearHistory for batch isolation
async batch(inputs, config = {}) {
const results = [];
for (const input of inputs) {
const result = await this._call(input, {
...config,
clearHistory: true
});
results.push(result);
}
return results;
}
// 3. Add random seed for temperature
if (promptOptions.temperature > 0 && config.seed === undefined) {
promptOptions.seed = Math.floor(Math.random() * 1000000);
}
// 4. Use correct parameter names
customStopTriggers: config.stopStrings ?? this.stopStrings
```
### What's Next
In the next lesson, we'll explore **Context & Configuration** - how to pass state and settings through chains.
**Preview**: You'll learn:
- RunnableConfig object
- Callback systems
- Metadata tracking
- Debug modes
➡️ [Continue to Lesson 4: Context & Configuration](04-context.md)
## Additional Resources
- [node-llama-cpp Documentation](https://node-llama-cpp.withcat.ai)
- [Chat Wrappers Guide](https://node-llama-cpp.withcat.ai/guide/chat-wrapper)
- [Temperature Guide](https://node-llama-cpp.withcat.ai/guide/chat-session#temperature)
- [GGUF Model Format](https://huggingface.co/docs/hub/gguf)
## Questions & Discussion
**Q: Why do we always set system prompt instead of only when present?**
A: To prevent contamination. If call 1 sets a system prompt but call 2 doesn't, call 2 would still use call 1's system prompt. Always setting (even to empty string) ensures clean state.
**Q: Why sequential batch processing instead of parallel?**
A: Local models (node-llama-cpp) can't run true parallel inference on a single model instance. The library serializes requests internally, so parallel Promise.all() provides no benefit and can cause race conditions on the shared chat session.
**Q: Why do we need random seeds for temperature?**
A: The node-llama-cpp library states: "The randomness of the temperature can be controlled by the seed parameter. Setting a specific seed and a specific temperature will yield the same response every time for the same input." Without a random seed, high temperature might still give deterministic results.
**Q: Can I use multiple models simultaneously?**
A: Yes! Each LlamaCppLLM instance can have a different model. Just be aware of memory constraints - each model takes several GB of RAM.
**Q: What's the difference between customStopTriggers and stopStrings?**
A: `customStopTriggers` is the correct parameter name in node-llama-cpp. We accept `stopStrings` in our config for a more intuitive API, then map it to `customStopTriggers` internally.
---
**Built with ❤️ for learners who want to understand AI agents deeply**
[← Previous: Messages](02-messages.md) | [Tutorial Index](../README.md) | [Next: Context →](04-context.md)