# Code Explanation: batch.js

This file demonstrates **parallel execution** of multiple LLM prompts using separate context sequences, enabling concurrent processing for better performance.

## Step-by-Step Code Breakdown

### 1. Import and Setup (Lines 1-10)

```javascript
import {getLlama, LlamaChatSession} from "node-llama-cpp";
import path from "path";
import {fileURLToPath} from "url";

/**
 * Asynchronous execution improves performance in GAIA benchmarks,
 * multi-agent applications, and other high-throughput scenarios.
 */
const __dirname = path.dirname(fileURLToPath(import.meta.url));
```

- Standard imports for LLM interaction
- The comment explains the performance benefit
- **GAIA benchmark**: a benchmark for evaluating general-purpose AI assistants on real-world tasks
- Useful for multi-agent systems that need to handle many requests
### 2. Model Path Configuration (Lines 11-16)

```javascript
const modelPath = path.join(
    __dirname,
    "../",
    "models",
    "DeepSeek-R1-0528-Qwen3-8B-Q6_K.gguf"
);
```

- Uses **DeepSeek-R1-0528-Qwen3-8B**: an 8B-parameter reasoning model (DeepSeek-R1-0528 distilled into Qwen3-8B)
- **Q6_K quantization**: a balance between quality and size
- The model is loaded once and shared between sequences
### 3. Initialize Llama and Load Model (Lines 18-19)

```javascript
const llama = await getLlama();
const model = await llama.loadModel({modelPath});
```

- Standard initialization
- The model is loaded into memory once
- It will be used by multiple sequences simultaneously
### 4. Create Context with Multiple Sequences (Lines 20-23)

```javascript
const context = await model.createContext({
    sequences: 2,
    batchSize: 1024 // The number of tokens that can be processed at once by the GPU.
});
```

**Key parameters:**

- **sequences: 2**: Creates 2 independent conversation sequences
  - Each sequence has its own conversation history
  - Both share the same model and context memory pool
  - They can be processed in parallel
- **batchSize: 1024**: Maximum tokens processed per GPU batch
  - Larger = better GPU utilization
  - Smaller = lower memory usage
  - 1024 is a good balance for most GPUs (see the sketch after this list for a lower-memory variant)
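As a rough illustration of that trade-off, a memory-constrained setup might shrink the batch (the value below is a hypothetical starting point, not a tuned recommendation):

```javascript
// Hypothetical low-memory configuration: fewer tokens per GPU batch
// trades some throughput for a smaller peak memory footprint.
const lowMemoryContext = await model.createContext({
    sequences: 2,
    batchSize: 256
});
```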
### Why Multiple Sequences?

```
Single Sequence (Sequential)       Multiple Sequences (Parallel)
────────────────────────────       ─────────────────────────────
Process Prompt 1 → Response 1      Process Prompt 1 ──┐
Wait...                                               ├─ Both responses
Process Prompt 2 → Response 2      Process Prompt 2 ──┘  in parallel!
Total Time: T1 + T2                Total Time: max(T1, T2)
```
### 5. Get Individual Sequences (Lines 25-26)

```javascript
const sequence1 = context.getSequence();
const sequence2 = context.getSequence();
```

- Retrieves two separate sequence objects from the context
- Each sequence maintains its own state
- They can be used independently for different conversations

### 6. Create Separate Sessions (Lines 28-33)

```javascript
const session1 = new LlamaChatSession({
    contextSequence: sequence1
});

const session2 = new LlamaChatSession({
    contextSequence: sequence2
});
```

- Creates a chat session for each sequence
- Each session has its own conversation history
- Sessions are completely independent
- No system prompts are set in this example (one could be added, as sketched below)
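For example, `session1` could instead be created with a persona; a minimal sketch using the `systemPrompt` option of `LlamaChatSession`:

```javascript
// The same sequence as before, but the session now carries a persona,
// letting the two sequences behave differently while sharing one model.
const session1 = new LlamaChatSession({
    contextSequence: sequence1,
    systemPrompt: "You are a concise, friendly assistant."
});
```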
### 7. Define Questions (Lines 35-36)

```javascript
const q1 = "Hi there, how are you?";
const q2 = "How much is 6+6?";
```

- Two completely different questions
- They will be processed simultaneously
- Different types: conversational vs. computational

### 8. Parallel Execution with Promise.all (Lines 38-44)

```javascript
const [
    a1,
    a2
] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
```

**How this works:**

1. `session1.prompt(q1)` starts asynchronously
2. `session2.prompt(q2)` starts asynchronously (it doesn't wait for #1)
3. `Promise.all()` waits for BOTH to complete
4. Results are returned in an array: `[response1, response2]`
5. The array is destructured into `a1` and `a2`

**Key benefit**: Both prompts are processed at the same time, not one after another!

### 9. Display Results (Lines 46-50)

```javascript
console.log("User: " + q1);
console.log("AI: " + a1);

console.log("User: " + q2);
console.log("AI: " + a2);
```

- Outputs both question-answer pairs
- Results appear in order despite parallel processing
## Key Concepts Demonstrated

### 1. Parallel Processing

Instead of:

```javascript
// Sequential (slow)
const a1 = await session1.prompt(q1); // Wait
const a2 = await session2.prompt(q2); // Wait again
```

We use:

```javascript
// Parallel (fast)
const [a1, a2] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
```
### 2. Context Sequences

A context can hold multiple independent sequences:

```
┌───────────────────────────────────┐
│          Context (Shared)         │
│  ┌─────────────────────────────┐  │
│  │  Model Weights (8B params)  │  │
│  └─────────────────────────────┘  │
│                                   │
│  ┌─────────────┐  ┌────────────┐  │
│  │ Sequence 1  │  │ Sequence 2 │  │
│  │ "Hi there"  │  │ "6+6?"     │  │
│  │ History...  │  │ History... │  │
│  └─────────────┘  └────────────┘  │
└───────────────────────────────────┘
```
## Performance Comparison

### Sequential Execution

```
Request 1: 2 seconds
Request 2: 2 seconds
Total:     4 seconds
```

### Parallel Execution (This Example)

```
Request 1:  2 seconds ──┐
Request 2:  2 seconds ──┤  Both running
Total:     ~2 seconds ──┘  simultaneously
```

**Speedup**: up to ~2x for 2 sequences; it scales with more sequences until the GPU's compute or memory becomes the bottleneck
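To verify the numbers on your own hardware, you can time the parallel path with the standard `performance.now()` timer (the figures above are illustrative; real timings depend on the model and GPU):

```javascript
// Measure the wall-clock time of the parallel round trip.
const start = performance.now();
const [r1, r2] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
const elapsedSeconds = (performance.now() - start) / 1000;
console.log(`Parallel round trip took ~${elapsedSeconds.toFixed(1)}s`);
```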
## Use Cases

### 1. Multi-User Applications

```javascript
// Handle multiple users simultaneously
const [user1Response, user2Response, user3Response] = await Promise.all([
    session1.prompt(user1Query),
    session2.prompt(user2Query),
    session3.prompt(user3Query)
]);
```

### 2. Multi-Agent Systems

```javascript
// Multiple agents working on different tasks
const [
    plannerResponse,
    analyzerResponse,
    executorResponse
] = await Promise.all([
    plannerSession.prompt("Plan the task"),
    analyzerSession.prompt("Analyze the data"),
    executorSession.prompt("Execute step 1")
]);
```

### 3. Benchmarking

```javascript
// Test multiple prompts for evaluation.
// Each prompt needs its own session (and sequence) to actually run in
// parallel; reusing a single session would queue the prompts serially
// and mix their conversation histories.
const results = await Promise.all(
    testPrompts.map((prompt, i) => sessions[i].prompt(prompt))
);
```
### 4. A/B Testing

```javascript
// Test different system prompts
const [responseA, responseB] = await Promise.all([
    sessionWithPromptA.prompt(query),
    sessionWithPromptB.prompt(query)
]);
```

## Resource Considerations

### Memory Usage

Each sequence needs memory for:

- Conversation history
- Intermediate computations
- KV cache (key-value cache for transformer attention)

**Rule of thumb**: More sequences = more memory needed
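For intuition, the KV cache usually dominates per-sequence memory. A back-of-the-envelope sketch (the layer/head/dimension numbers below are illustrative for an 8B-class model, not the exact DeepSeek-R1-0528-Qwen3-8B configuration):

```javascript
// Rough KV-cache size: 2 (K and V) × layers × kvHeads × headDim × bytes × tokens.
function estimateKvCacheBytes({layers, kvHeads, headDim, bytesPerValue, tokens}) {
    return 2 * layers * kvHeads * headDim * bytesPerValue * tokens;
}

// Illustrative numbers only, assuming 16-bit cache entries.
const bytes = estimateKvCacheBytes({
    layers: 36, kvHeads: 8, headDim: 128, bytesPerValue: 2, tokens: 4096
});
console.log(`~${(bytes / 1024 ** 2).toFixed(0)} MiB of KV cache per sequence`);
```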
### GPU Utilization

- **Single sequence**: May underutilize the GPU
- **Multiple sequences**: Better GPU utilization
- **Too many sequences**: May exceed VRAM, causing slowdown

### Optimal Number of Sequences

Depends on:

- Available VRAM
- Model size
- Context length
- Batch size

**Typical**: 2-8 sequences for consumer GPUs
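One way to choose is to check free VRAM before creating the context. A sketch, assuming the `getVramState()` method exposed by recent node-llama-cpp versions (the 8 GiB threshold is an arbitrary example):

```javascript
// Inspect VRAM and scale the sequence count down on low-memory GPUs.
const vram = await llama.getVramState();
console.log(`Free VRAM: ${(vram.free / 1024 ** 3).toFixed(1)} GiB`);

const sequences = vram.free > 8 * 1024 ** 3 ? 4 : 2;
const tunedContext = await model.createContext({sequences});
```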
## Limitations & Considerations

### 1. Shared Context Limit

All sequences share the same context memory pool:

```
Total context size: 8192 tokens
Sequence 1: up to 4096 tokens
Sequence 2: up to 4096 tokens
```

With an even split, each sequence can use at most half of the pool.
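The pool size is set when the context is created. A sketch using the `contextSize` option of `createContext` to make the split above explicit:

```javascript
// Request an 8192-token pool shared by two sequences,
// leaving up to 4096 tokens for each one.
const sizedContext = await model.createContext({
    contextSize: 8192,
    sequences: 2
});
```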
### 2. Not True Parallelism on CPU

On CPU-only systems, sequences are interleaved rather than truly parallel, but batching them still improves overall throughput.
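To observe this directly, GPU support can be disabled at initialization; a sketch assuming the `gpu` option of `getLlama`:

```javascript
// Force CPU-only inference to compare against the GPU timings.
const cpuLlama = await getLlama({gpu: false});
```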
### 3. Model Loading Overhead

The model is loaded once and shared, which is efficient, but the initial load still takes time.
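Because loading dominates startup cost, the same loaded model can back any number of contexts; a minimal sketch:

```javascript
// Pay the load cost once, then create as many contexts as needed.
const chatContext = await model.createContext({sequences: 2});
const scratchContext = await model.createContext({sequences: 1});
```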
## Why This Matters for AI Agents

### Efficiency in Production

Real-world agent systems need to:

- Handle multiple requests concurrently
- Respond quickly to users
- Make efficient use of hardware

### Multi-Agent Architectures

Complex agent systems often have:

- **Planner agent**: Thinks about strategy
- **Executor agent**: Takes actions
- **Critic agent**: Evaluates results

These can run in parallel using separate sequences, as sketched below.
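Wiring this up follows the same pattern as the main example: one sequence and one session per agent (the role names and system prompts here are illustrative):

```javascript
// Three sequences from one context back three independent agent sessions.
const agentContext = await model.createContext({sequences: 3});

const [planner, executor, critic] = ["planner", "executor", "critic"].map(
    (role) => new LlamaChatSession({
        contextSequence: agentContext.getSequence(),
        systemPrompt: `You are the ${role} agent.`
    })
);
```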
### Scalability

This pattern is the foundation for:

- Web services with multiple users
- Batch processing of data
- Distributed agent systems

## Best Practices

1. **Match sequences to workload**: Don't create more than you need
2. **Monitor memory usage**: Each sequence consumes VRAM
3. **Use appropriate batch size**: Balance speed vs. memory
4. **Clean up resources**: Always dispose when done
5. **Handle errors**: Wrap `Promise.all` in try-catch (see the sketch after this list)
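A sketch combining the last two practices, assuming the `dispose()` methods on the context and model:

```javascript
try {
    const [a1, a2] = await Promise.all([
        session1.prompt(q1),
        session2.prompt(q2)
    ]);
    console.log(a1, a2);
} catch (error) {
    // A single failed prompt rejects the whole Promise.all.
    console.error("Prompting failed:", error);
} finally {
    // Release the sequences, context memory, and model weights.
    await context.dispose();
    await model.dispose();
}
```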
## Expected Output

Running this script should output something like:

```
User: Hi there, how are you?
AI: Hello! I'm doing well, thank you for asking...

User: How much is 6+6?
AI: 12
```

Both responses appear quickly because they were processed simultaneously!