# Concept: Basic LLM Interaction

## Overview

This example introduces the fundamental concepts of working with a Large Language Model (LLM) running locally on your machine. It demonstrates the simplest possible interaction: loading a model and asking it a question.
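With node-llama-cpp (the library these examples are built on), that interaction can be sketched in a few lines. The model path below is a placeholder for wherever you keep your GGUF files, and the calls follow the library's v3 API:

```typescript
import path from "node:path";
import {fileURLToPath} from "node:url";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// Load the llama.cpp bindings, then the model weights (placeholder path).
const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(
        path.dirname(fileURLToPath(import.meta.url)),
        "models", "Qwen3-1.7B-Q8_0.gguf"
    )
});

// A context holds the model's working memory; a chat session wraps one of
// its sequences and handles the chat template for you.
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

// Ask a single question and print the generated answer.
const answer = await session.prompt("do you know node-llama-cpp?");
console.log(answer);
```

Running this requires a downloaded model file; everything after the load is a single awaited call per prompt.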

## What is a Local LLM?

A **Local LLM** is an AI language model that runs entirely on your own computer, without requiring internet connectivity or external API calls. Key benefits:

- **Privacy**: Your data never leaves your machine
- **Cost**: No per-token API charges
- **Control**: Full control over model selection and parameters
- **Offline**: Works without internet connection

## Core Components

### 1. Model Files (GGUF Format)

```
┌─────────────────────────────┐
│   Qwen3-1.7B-Q8_0.gguf      │
│   (Model Weights File)      │
│                             │
│  • Stores learned patterns  │
│  • Quantized for efficiency │
│  • Loaded into RAM/VRAM     │
└─────────────────────────────┘
```

- **GGUF**: File format optimized for llama.cpp
- **Quantization**: Reduces model size (e.g., 8-bit instead of 16-bit)
- **Trade-off**: Smaller size and faster speed vs. slight quality loss
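The size effect of quantization is easy to estimate back-of-the-envelope: weight memory is roughly parameter count times bits per weight. The numbers below are illustrative only; real GGUF files add metadata and per-block scale factors on top.

```typescript
// Rough weight-memory estimate: parameters × bits-per-weight ÷ 8 bytes.
// (Illustrative only; actual GGUF files are somewhat larger.)
function approxWeightBytes(paramCount: number, bitsPerWeight: number): number {
    return (paramCount * bitsPerWeight) / 8;
}

const params = 1.7e9; // a 1.7B-parameter model, like the one above

const fp16 = approxWeightBytes(params, 16); // ≈ 3.4 GB
const q8 = approxWeightBytes(params, 8);    // ≈ 1.7 GB
console.log(`fp16: ${(fp16 / 1e9).toFixed(1)} GB, Q8_0: ${(q8 / 1e9).toFixed(1)} GB`);
```

This is why an 8-bit quantization roughly halves the footprint of a 16-bit model.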

### 2. The Inference Pipeline

```
User Input → Model → Generation → Response
    ↓          ↓          ↓           ↓
 "Hello"   Context   Sampling    "Hi there!"
```

**Flow Diagram:**
```
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Prompt  │ --> │ Context  │ --> │  Model   │ --> │ Response │
│          │     │ (Memory) │     │(Weights) │     │  (Text)  │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
```

### 3. Context Window

The **context** is the model's working memory:

```
┌─────────────────────────────────────────┐
│           Context Window                │
│  ┌─────────────────────────────────┐    │
│  │ System Prompt (if any)          │    │
│  ├─────────────────────────────────┤    │
│  │ User: "do you know node-llama?" │    │
│  ├─────────────────────────────────┤    │
│  │ AI: "Yes, I'm familiar..."      │    │
│  ├─────────────────────────────────┤    │
│  │ (Space for more conversation)   │    │
│  └─────────────────────────────────┘    │
└─────────────────────────────────────────┘
```

- Limited size (e.g., 2048, 4096, or 8192 tokens)
- When full, old messages must be removed
- All previous messages influence the next response
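When the window fills up, the usual strategy is to drop the oldest turns while keeping the system prompt. Here is a minimal sketch of that idea; the token counter is a crude words-as-tokens stand-in (real code would use the model's tokenizer):

```typescript
type Message = {role: "system" | "user" | "assistant"; text: string};

// Crude token estimate for illustration: ~1 token per word.
function estimateTokens(msg: Message): number {
    return msg.text.split(/\s+/).length;
}

// Drop the oldest non-system messages until the history fits the budget.
function trimToContext(history: Message[], maxTokens: number): Message[] {
    const system = history.filter((m) => m.role === "system");
    const rest = history.filter((m) => m.role !== "system");
    const total = (msgs: Message[]) =>
        msgs.reduce((sum, m) => sum + estimateTokens(m), 0);
    while (rest.length > 0 && total([...system, ...rest]) > maxTokens)
        rest.shift(); // remove the oldest conversational turn first
    return [...system, ...rest];
}
```

Because the trimmed turns are gone for good, anything the model should never forget belongs in the system prompt or in external memory.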

## How LLMs Generate Responses

### Token-by-Token Generation

LLMs don't generate entire sentences at once. They predict one **token** (word piece) at a time:

```
Prompt: "What is AI?"

Generation Process:
"What is AI?" → [Model] → "AI"
"What is AI? AI" → [Model] → "is"
"What is AI? AI is" → [Model] → "a"
"What is AI? AI is a" → [Model] → "field"
... continues until stop condition
```

**Visualization:**
```
Input Prompt
     ↓
┌─────────────┐
│    Model    │ → Token 1: "AI"
│  Processes  │ → Token 2: "is"
│  & Predicts │ → Token 3: "a"
└─────────────┘ → Token 4: "field"
                → ...
```
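The loop above can be simulated without a real model. In this toy version the "model" is just a lookup table for the most likely next token (a real LLM computes a probability distribution over its whole vocabulary), but the generation loop itself has the same shape:

```typescript
// Toy "model": maps the text so far to its most likely next token.
const nextToken = (text: string): string | null =>
    ({
        "What is AI?": "AI",
        "What is AI? AI": "is",
        "What is AI? AI is": "a",
        "What is AI? AI is a": "field",
    } as Record<string, string>)[text] ?? null;

// Generation loop: append one predicted token at a time until a stop
// condition (here: no prediction available) is reached.
function generate(prompt: string): string {
    let text = prompt;
    for (;;) {
        const token = nextToken(text);
        if (token === null) break; // stop condition
        text += " " + token;
    }
    return text;
}

console.log(generate("What is AI?")); // "What is AI? AI is a field"
```

Real inference replaces the lookup table with a forward pass plus a sampling step, but the append-and-repeat structure is the same.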

## Key Concepts for AI Agents

### 1. Stateless Processing
- Each prompt is independent unless you maintain context
- The model has no memory between different script runs
- To build an "agent", you need to:
  - Keep the context alive between prompts
  - Maintain conversation history
  - Add tools/functions (covered in later examples)
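The "keep the context alive" point can be made concrete with a small shell around a reply function. The reply function here is a hypothetical stand-in for a real model call (e.g. a chat session's prompt method); the only thing that makes the agent feel stateful is that every new prompt is answered with the full history attached:

```typescript
type Turn = {role: "user" | "assistant"; text: string};

// Hypothetical stand-in for a real model call.
type ReplyFn = (history: Turn[]) => string;

// Minimal "agent" shell: it records every turn and always hands the
// complete history to the reply function.
function makeAgent(reply: ReplyFn) {
    const history: Turn[] = [];
    return (userText: string): string => {
        history.push({role: "user", text: userText});
        const answer = reply(history);
        history.push({role: "assistant", text: answer});
        return answer;
    };
}

// Echo-style stub so the sketch runs without a model:
const agent = makeAgent((history) => `seen ${history.length} message(s)`);
console.log(agent("hello")); // "seen 1 message(s)"
console.log(agent("again")); // "seen 3 message(s)"
```

Swap the stub for a real model call and you have the skeleton that the later tool-use examples build on.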

### 2. Prompt Engineering Basics
The way you phrase questions affects the response:

```
❌ Poor: "node-llama-cpp"
✅ Better: "do you know node-llama-cpp"
✅ Best: "Explain what node-llama-cpp is and how it works"
```

### 3. Resource Management
LLMs consume significant resources:

```
Model Loading
     ↓
┌─────────────────┐
│  RAM/VRAM Usage │  ← Models need gigabytes
│  CPU/GPU Time   │  ← Inference takes time
│  Memory Leaks?  │  ← Must clean up properly
└─────────────────┘
     ↓
Proper Disposal
```
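The safe pattern is to release resources in a finally block, so the model's memory is freed even when inference throws. The stubs below stand in for real model/context objects (which hold gigabytes of RAM/VRAM and typically expose a dispose method):

```typescript
// Stand-in for a disposable model or context object.
type Releasable = {dispose(): void};

const disposed: string[] = [];
const makeResource = (name: string): Releasable => ({
    dispose: () => disposed.push(name),
});

function runInference(work: () => string): string {
    const model = makeResource("model");
    const context = makeResource("context");
    try {
        return work(); // the actual prompt/generation step
    } finally {
        // Release in reverse order of acquisition, error or not.
        context.dispose();
        model.dispose();
    }
}

console.log(runInference(() => "Hi there!")); // "Hi there!"
console.log(disposed); // context first, then model
```

Without this, a crash mid-generation can leave gigabytes pinned until the process exits.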

## Why This Matters for Agents

This basic example establishes the foundation for AI agents:

1. **Agents need LLMs to "think"**: The model processes information and generates responses
2. **Agents need context**: To maintain state across interactions
3. **Agents need structure**: Later examples add tools, memory, and reasoning loops

## Next Steps

After understanding basic prompting, explore:
- **System prompts**: Giving the model a specific role or behavior
- **Function calling**: Allowing the model to use tools
- **Memory**: Persisting information across sessions
- **Reasoning patterns**: Like ReAct (Reasoning + Acting)

## Diagram: Complete Architecture

```
┌──────────────────────────────────────────────────┐
│             Your Application                     │
│  ┌────────────────────────────────────────────┐  │
│  │         node-llama-cpp Library             │  │
│  │  ┌──────────────────────────────────────┐  │  │
│  │  │      llama.cpp (C++ Runtime)         │  │  │
│  │  │  ┌────────────────────────────────┐  │  │  │
│  │  │  │   Model File (GGUF)            │  │  │  │
│  │  │  │   • Qwen3-1.7B-Q8_0.gguf       │  │  │  │
│  │  │  └────────────────────────────────┘  │  │  │
│  │  └──────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘
           ↕
    ┌──────────────┐
    │  CPU / GPU   │
    └──────────────┘
```

This layered architecture allows you to build sophisticated AI agents on top of basic LLM interactions.