Concept: Basic LLM Interaction
Overview
This example introduces the fundamental concepts of working with a Large Language Model (LLM) running locally on your machine. It demonstrates the simplest possible interaction: loading a model and asking it a question.
What is a Local LLM?
A Local LLM is an AI language model that runs entirely on your own computer, without requiring internet connectivity or external API calls. Key benefits:
- Privacy: Your data never leaves your machine
- Cost: No per-token API charges
- Control: Full control over model selection and parameters
- Offline: Works without internet connection
Core Components
1. Model Files (GGUF Format)
┌─────────────────────────────┐
│   Qwen3-1.7B-Q8_0.gguf      │
│   (Model Weights File)      │
│                             │
│  • Stores learned patterns  │
│  • Quantized for efficiency │
│  • Loaded into RAM/VRAM     │
└─────────────────────────────┘
- GGUF: File format optimized for llama.cpp
- Quantization: Reduces model size (e.g., 8-bit instead of 16-bit)
- Trade-off: Smaller size and faster speed vs. slight quality loss
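The size side of this trade-off can be sketched with back-of-envelope arithmetic. The formula (parameters × bits per weight ÷ 8) ignores GGUF metadata and per-block scale factors, so real files are slightly larger:

```typescript
// Rough estimate of a quantized model's weight size in bytes.
function estimateWeightBytes(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8;
}

const params = 1.7e9; // Qwen3-1.7B

const fp16 = estimateWeightBytes(params, 16) / 1e9; // 3.4 GB (unquantized half precision)
const q8 = estimateWeightBytes(params, 8) / 1e9;    // 1.7 GB (8-bit, e.g. Q8_0)
const q4 = estimateWeightBytes(params, 4) / 1e9;    // 0.85 GB (4-bit variants)

console.log(`fp16: ${fp16.toFixed(2)} GB, Q8: ${q8.toFixed(2)} GB, Q4: ${q4.toFixed(2)} GB`);
```

This is why an 8-bit quantization roughly halves the download and RAM footprint compared to 16-bit weights.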
2. The Inference Pipeline
User Input → Model → Generation → Response
     │         │          │           │
  "Hello"   Context    Sampling   "Hi there!"
Flow Diagram:
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Prompt  │ --> │ Context  │ --> │  Model   │ --> │ Response │
│          │     │ (Memory) │     │ (Weights)│     │  (Text)  │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
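The Sampling step named above is where the model's raw scores (logits) become one chosen token. A minimal sketch, using a hypothetical four-token vocabulary; real samplers add top-k/top-p filtering and repetition penalties on top of this:

```typescript
// Convert logits to probabilities; lower temperature sharpens the distribution.
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract max for numeric stability
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pick a token index by sampling from the probability distribution.
function sampleToken(logits: number[], temperature: number, rand: () => number): number {
  const probs = softmax(logits, temperature);
  let r = rand();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}

// Hypothetical vocabulary and logits, for illustration only.
const vocab = ["AI", "is", "a", "field"];
const logits = [2.5, 0.1, 0.3, 1.2];

// Very low temperature behaves like greedy argmax: the highest logit wins.
console.log(vocab[sampleToken(logits, 0.01, Math.random)]); // "AI"
```

Higher temperatures flatten the distribution, letting lower-scored tokens win occasionally, which is what makes responses vary between runs.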
3. Context Window
The context window is the model's working memory:
┌───────────────────────────────────────────┐
│              Context Window               │
│  ┌─────────────────────────────────────┐  │
│  │ System Prompt (if any)              │  │
│  ├─────────────────────────────────────┤  │
│  │ User: "do you know node-llama?"     │  │
│  ├─────────────────────────────────────┤  │
│  │ AI: "Yes, I'm familiar..."          │  │
│  ├─────────────────────────────────────┤  │
│  │ (Space for more conversation)       │  │
│  └─────────────────────────────────────┘  │
└───────────────────────────────────────────┘
- Limited size (e.g., 2048, 4096, or 8192 tokens)
- When full, old messages must be removed
- All previous messages influence the next response
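The removal rule ("when full, old messages must be removed") can be sketched like this. Word count stands in for a real tokenizer here, and the message contents are made up for illustration; in practice you would count tokens with the model's own tokenizer:

```typescript
interface Message { role: "system" | "user" | "assistant"; text: string; }

// Crude stand-in for a tokenizer: one token per whitespace-separated word.
const countTokens = (text: string): number => text.split(/\s+/).filter(Boolean).length;

// Drop the oldest non-system messages until the history fits the window.
function trimToWindow(history: Message[], windowSize: number): Message[] {
  const result = [...history];
  const total = () => result.reduce((sum, m) => sum + countTokens(m.text), 0);
  while (total() > windowSize) {
    const idx = result.findIndex((m) => m.role !== "system");
    if (idx === -1) break; // only the system prompt is left; nothing else to drop
    result.splice(idx, 1);
  }
  return result;
}

const history: Message[] = [
  { role: "system", text: "You are a helpful assistant" },        // 5 "tokens"
  { role: "user", text: "do you know node-llama-cpp" },           // 4
  { role: "assistant", text: "Yes I am familiar with it" },       // 6
  { role: "user", text: "tell me more" },                         // 3
];

// With a 12-token window, the two oldest non-system messages are dropped.
console.log(trimToWindow(history, 12).length); // 2
```

Keeping the system prompt pinned while evicting old turns is one common strategy; sliding windows and summarization are alternatives.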
How LLMs Generate Responses
Token-by-Token Generation
LLMs don't generate entire sentences at once. They predict one token (word piece) at a time:
Prompt: "What is AI?"
Generation Process:
"What is AI?" β [Model] β "AI"
"What is AI? AI" β [Model] β "is"
"What is AI? AI is" β [Model] β "a"
"What is AI? AI is a" β [Model] β "field"
... continues until stop condition
Visualization:
Input Prompt
     │
     ▼
┌────────────┐
│   Model    │ → Token 1: "AI"
│ Processes  │ → Token 2: "is"
│ & Predicts │ → Token 3: "a"
└────────────┘ → Token 4: "field"
                 ...
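The loop above can be simulated with a lookup table standing in for the model. A real LLM predicts a probability distribution over its entire vocabulary at every step; here the "prediction" is hardcoded so the autoregressive structure is visible:

```typescript
// Toy "model": maps the text seen so far to the next token.
const nextToken: Record<string, string> = {
  "What is AI?": "AI",
  "What is AI? AI": "is",
  "What is AI? AI is": "a",
  "What is AI? AI is a": "field",
  "What is AI? AI is a field": "<eos>", // end-of-sequence marker
};

function generate(prompt: string): string {
  let text = prompt;
  const output: string[] = [];
  for (;;) {
    const token = nextToken[text];
    if (token === undefined || token === "<eos>") break; // stop condition
    output.push(token);
    text += " " + token; // the new token is fed back in as input
  }
  return output.join(" ");
}

console.log(generate("What is AI?")); // "AI is a field"
```

The key point is the feedback: each generated token is appended to the input before the next prediction, which is why generation cost grows with output length.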
Key Concepts for AI Agents
1. Stateless Processing
- Each prompt is independent unless you maintain context
- The model has no memory between different script runs
- To build an "agent", you need to:
  - Keep the context alive between prompts
  - Maintain conversation history
  - Add tools/functions (covered in later examples)
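The difference between stateless calls and a stateful session can be sketched with a toy model. Here fakeModel is a placeholder that merely reports how much history it receives; the point is only that the session feeds the accumulated history back in with every new prompt:

```typescript
type Turn = { role: "user" | "assistant"; text: string };

// Hypothetical model: reports how many prior turns it can see in its context.
const fakeModel = (history: Turn[]): string => `I can see ${history.length} prior turn(s)`;

class ChatSession {
  private history: Turn[] = [];

  prompt(text: string): string {
    const reply = fakeModel(this.history); // history goes in with every call
    this.history.push({ role: "user", text });
    this.history.push({ role: "assistant", text: reply });
    return reply;
  }
}

const session = new ChatSession();
console.log(session.prompt("hello"));    // "I can see 0 prior turn(s)"
console.log(session.prompt("and now?")); // "I can see 2 prior turn(s)"
```

A stateless setup would call fakeModel with an empty history every time; the session object is what turns independent prompts into a conversation.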
2. Prompt Engineering Basics
The way you phrase questions affects the response:
Poor:    "node-llama-cpp"
            ↓
Better:  "do you know node-llama-cpp"
            ↓
Best:    "Explain what node-llama-cpp is and how it works"
3. Resource Management
LLMs consume significant resources:
Model Loading
      │
      ▼
┌──────────────────┐
│ RAM/VRAM Usage   │ ← Models need gigabytes
│ CPU/GPU Time     │ ← Inference takes time
│ Memory Leaks?    │ ← Must clean up properly
└──────────────────┘
      │
      ▼
Proper Disposal
Why This Matters for Agents
This basic example establishes the foundation for AI agents:
- Agents need LLMs to "think": The model processes information and generates responses
- Agents need context: To maintain state across interactions
- Agents need structure: Later examples add tools, memory, and reasoning loops
Next Steps
After understanding basic prompting, explore:
- System prompts: Giving the model a specific role or behavior
- Function calling: Allowing the model to use tools
- Memory: Persisting information across sessions
- Reasoning patterns: Like ReAct (Reasoning + Acting)
Diagram: Complete Architecture
┌──────────────────────────────────────────────────┐
│                 Your Application                 │
│  ┌────────────────────────────────────────────┐  │
│  │           node-llama-cpp Library           │  │
│  │  ┌──────────────────────────────────────┐  │  │
│  │  │        llama.cpp (C++ Runtime)       │  │  │
│  │  │  ┌────────────────────────────────┐  │  │  │
│  │  │  │        Model File (GGUF)       │  │  │  │
│  │  │  │   • Qwen3-1.7B-Q8_0.gguf       │  │  │  │
│  │  │  └────────────────────────────────┘  │  │  │
│  │  └──────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘
                         │
                         ▼
                 ┌──────────────┐
                 │   CPU / GPU  │
                 └──────────────┘
This layered architecture allows you to build sophisticated AI agents on top of basic LLM interactions.