# Concept: Basic LLM Interaction
## Overview
This example introduces the fundamental concepts of working with a Large Language Model (LLM) running locally on your machine. It demonstrates the simplest possible interaction: loading a model and asking it a question.
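That interaction can be sketched in a few lines, assuming the node-llama-cpp v3 API and a GGUF file in a hypothetical `models/` directory next to the script (adjust the path to wherever your model actually lives):

```typescript
import path from "path";
import {fileURLToPath} from "url";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

// Load the runtime and the model weights (the slow, memory-heavy step)
const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Qwen3-1.7B-Q8_0.gguf")
});

// Create a context (the model's working memory) and a chat session on top of it
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

// Ask a question and print the generated answer
const answer = await session.prompt("do you know node-llama-cpp?");
console.log(answer);
```

Each of the pieces used here (model file, context, token generation) is unpacked in the sections below.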
## What is a Local LLM?
A **Local LLM** is an AI language model that runs entirely on your own computer, without requiring internet connectivity or external API calls. Key benefits:
- **Privacy**: Your data never leaves your machine
- **Cost**: No per-token API charges
- **Control**: Full control over model selection and parameters
- **Offline**: Works without internet connection
## Core Components
### 1. Model Files (GGUF Format)
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Qwen3-1.7B-Q8_0.gguf β”‚
β”‚ (Model Weights File) β”‚
β”‚ β”‚
β”‚ β€’ Stores learned patterns β”‚
β”‚ β€’ Quantized for efficiency β”‚
β”‚ β€’ Loaded into RAM/VRAM β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
- **GGUF**: File format optimized for llama.cpp
- **Quantization**: Reduces model size (e.g., 8-bit instead of 16-bit)
- **Trade-off**: Smaller size and faster speed vs. slight quality loss
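The size trade-off is easy to estimate: memory is roughly parameters × bytes per weight. A back-of-the-envelope calculation for the 1.7B-parameter model above (ignoring quantization overhead and runtime buffers):

```typescript
// Rough memory footprint of a 1.7B-parameter model at different precisions
const params = 1.7e9;

const bytesF16 = params * 2; // 16-bit floats: 2 bytes per weight
const bytesQ8 = params * 1;  // 8-bit quantization (Q8_0): ~1 byte per weight

console.log(`F16:  ~${(bytesF16 / 1e9).toFixed(1)} GB`); // F16:  ~3.4 GB
console.log(`Q8_0: ~${(bytesQ8 / 1e9).toFixed(1)} GB`);  // Q8_0: ~1.7 GB
```

Real GGUF files come out slightly larger than this estimate because quantized formats also store per-block scale factors alongside the weights.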
### 2. The Inference Pipeline
```
User Input β†’ Model β†’ Generation β†’ Response
↓ ↓ ↓ ↓
"Hello" Context Sampling "Hi there!"
```
**Flow Diagram:**
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Prompt β”‚ --> β”‚ Context β”‚ --> β”‚ Model β”‚ --> β”‚ Response β”‚
β”‚ β”‚ β”‚ (Memory) β”‚ β”‚(Weights) β”‚ β”‚ (Text) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### 3. Context Window
The **context** is the model's working memory:
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Context Window β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ System Prompt (if any) β”‚ β”‚
β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚
β”‚ β”‚ User: "do you know node-llama?" β”‚ β”‚
β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚
β”‚ β”‚ AI: "Yes, I'm familiar..." β”‚ β”‚
β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚
β”‚ β”‚ (Space for more conversation) β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
- Limited size (e.g., 2048, 4096, or 8192 tokens)
- When full, the oldest messages must be dropped to make room
- All previous messages influence the next response
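When the window fills, something has to go. A common strategy is to drop the oldest messages first; here is a minimal sketch, using a whitespace word count as a stand-in for a real tokenizer:

```typescript
type Msg = { role: string; text: string };

// Stand-in for a real tokenizer: count whitespace-separated words
const tokenCount = (m: Msg) => m.text.split(/\s+/).length;

// Keep as many of the newest messages as fit within the token budget
function trimToBudget(messages: Msg[], budget: number): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = tokenCount(messages[i]);
    if (used + cost > budget) break; // oldest messages fall off first
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}

const log: Msg[] = [
  { role: "user", text: "one two three" },
  { role: "assistant", text: "four five" },
  { role: "user", text: "six" },
];
console.log(trimToBudget(log, 3).map((m) => m.text)); // [ 'four five', 'six' ]
```

In practice the library manages this for you (and smarter strategies exist, such as summarizing old turns), but the constraint is the same: the model only ever sees what fits in the window.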
## How LLMs Generate Responses
### Token-by-Token Generation
LLMs don't generate entire sentences at once. They predict one **token** (word piece) at a time:
```
Prompt: "What is AI?"
Generation Process:
"What is AI?" β†’ [Model] β†’ "AI"
"What is AI? AI" β†’ [Model] β†’ "is"
"What is AI? AI is" β†’ [Model] β†’ "a"
"What is AI? AI is a" β†’ [Model] β†’ "field"
... continues until stop condition
```
**Visualization:**
```
Input Prompt
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Model β”‚ β†’ Token 1: "AI"
β”‚ Processes β”‚ β†’ Token 2: "is"
β”‚ & Predictsβ”‚ β†’ Token 3: "a"
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β†’ Token 4: "field"
β†’ ...
```
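The loop above can be mimicked with a toy "model": a lookup table mapping the text so far to one more token. A real model predicts from learned probabilities over a whole vocabulary rather than a table, but the control flow is the same:

```typescript
// Toy next-token table standing in for the model's prediction step
const nextToken: Record<string, string> = {
  "What is AI?": "AI",
  "What is AI? AI": "is",
  "What is AI? AI is": "a",
  "What is AI? AI is a": "field",
};

// Append one predicted token at a time until no continuation exists
// (the stop condition)
function generate(prompt: string): string {
  let text = prompt;
  while (nextToken[text] !== undefined) {
    text += " " + nextToken[text];
  }
  return text;
}

console.log(generate("What is AI?")); // "What is AI? AI is a field"
```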
## Key Concepts for AI Agents
### 1. Stateless Processing
- Each prompt is independent unless you maintain context
- The model has no memory between different script runs
- To build an "agent", you need to:
- Keep the context alive between prompts
- Maintain conversation history
- Add tools/functions (covered in later examples)
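The first two points amount to state your application owns. A minimal sketch, where `fakeReply` is a hypothetical stand-in for a real model call:

```typescript
type Message = { role: "user" | "assistant"; text: string };

// The "memory" lives in your code, not in the model
const history: Message[] = [];

// Stand-in for a real model call; it only knows what history we pass it
function fakeReply(messages: Message[]): string {
  return `I have seen ${messages.length} message(s) so far.`;
}

function chat(userText: string): string {
  history.push({ role: "user", text: userText });
  const reply = fakeReply(history);
  history.push({ role: "assistant", text: reply });
  return reply;
}

console.log(chat("hello")); // "I have seen 1 message(s) so far."
console.log(chat("again")); // "I have seen 3 message(s) so far."
```

Stop appending to `history` and the "agent" instantly forgets everything: statefulness is entirely your code's doing.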
### 2. Prompt Engineering Basics
The way you phrase questions affects the response:
```
❌ Poor: "node-llama-cpp"
βœ… Better: "do you know node-llama-cpp?"
βœ… Best: "Explain what node-llama-cpp is and how it works"
```
### 3. Resource Management
LLMs consume significant resources:
```
Model Loading
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RAM/VRAM Usage β”‚ ← Models need gigabytes
β”‚ CPU/GPU Time β”‚ ← Inference takes time
β”‚ Memory Leaks? β”‚ ← Must clean up properly
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↓
Proper Disposal
```
## Why This Matters for Agents
This basic example establishes the foundation for AI agents:
1. **Agents need LLMs to "think"**: The model processes information and generates responses
2. **Agents need context**: To maintain state across interactions
3. **Agents need structure**: Later examples add tools, memory, and reasoning loops
## Next Steps
After understanding basic prompting, explore:
- **System prompts**: Giving the model a specific role or behavior
- **Function calling**: Allowing the model to use tools
- **Memory**: Persisting information across sessions
- **Reasoning patterns**: Like ReAct (Reasoning + Acting)
## Diagram: Complete Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Your Application β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ node-llama-cpp Library β”‚ β”‚
β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
β”‚ β”‚ β”‚ llama.cpp (C++ Runtime) β”‚ β”‚ β”‚
β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚
β”‚ β”‚ β”‚ β”‚ Model File (GGUF) β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ β”‚ β”‚ β€’ Qwen3-1.7B-Q8_0.gguf β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚
β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↕
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CPU / GPU β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
This layered architecture allows you to build sophisticated AI agents on top of basic LLM interactions.