# Concept: Basic LLM Interaction

## Overview

This example introduces the fundamental concepts of working with a Large Language Model (LLM) running locally on your machine. It demonstrates the simplest possible interaction: loading a model and asking it a question.
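In node-llama-cpp (v3), that interaction might look like the sketch below. The model path and filename are assumptions for illustration; you need a GGUF file on disk for this to actually run.

```typescript
import path from "node:path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// Load the native runtime, then the model weights
// (the path here is an assumption; point it at a GGUF file you have downloaded)
const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(process.cwd(), "models", "Qwen3-1.7B-Q8_0.gguf")
});

// A context holds the conversation's working memory; a chat session
// wraps one of its sequences with a prompt/response interface
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

// Ask a single question and print the reply
const answer = await session.prompt("Explain what node-llama-cpp is and how it works");
console.log(answer);
```

Everything after this one sketch is about what each of those steps means.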
## What is a Local LLM?

A **Local LLM** is an AI language model that runs entirely on your own computer, without requiring internet connectivity or external API calls. Key benefits:

- **Privacy**: Your data never leaves your machine
- **Cost**: No per-token API charges
- **Control**: Full control over model selection and parameters
- **Offline**: Works without an internet connection
## Core Components

### 1. Model Files (GGUF Format)

```
┌─────────────────────────────┐
│    Qwen3-1.7B-Q8_0.gguf     │
│    (Model Weights File)     │
│                             │
│ • Stores learned patterns   │
│ • Quantized for efficiency  │
│ • Loaded into RAM/VRAM      │
└─────────────────────────────┘
```

- **GGUF**: A file format optimized for llama.cpp
- **Quantization**: Reduces model size (e.g., 8-bit weights instead of 16-bit)
- **Trade-off**: Smaller size and faster inference vs. a slight quality loss
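The 8-bit vs. 16-bit trade-off is easy to estimate with back-of-the-envelope arithmetic. This sketch uses illustrative numbers only; real GGUF files add metadata and per-block scale factors, so actual sizes differ somewhat:

```typescript
// Rough model-size estimate: parameters × bytes per weight.
function modelSizeGiB(paramsBillions: number, bitsPerWeight: number): number {
    const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
    return bytes / 1024 ** 3;
}

// A 1.7B-parameter model at 16-bit vs. 8-bit precision:
console.log(modelSizeGiB(1.7, 16).toFixed(1)); // → "3.2" (GiB)
console.log(modelSizeGiB(1.7, 8).toFixed(1));  // → "1.6" (GiB)
```

Halving the bits per weight roughly halves the file size and the RAM/VRAM needed to load it.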
### 2. The Inference Pipeline

```
User Input → Model → Generation → Response
     ↓         ↓         ↓           ↓
  "Hello"   Context   Sampling  "Hi there!"
```

**Flow Diagram:**

```
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Prompt  │ --> │ Context  │ --> │  Model   │ --> │ Response │
│          │     │ (Memory) │     │(Weights) │     │  (Text)  │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
```
### 3. Context Window

The **context** is the model's working memory:

```
┌─────────────────────────────────────────┐
│              Context Window             │
│  ┌───────────────────────────────────┐  │
│  │ System Prompt (if any)            │  │
│  ├───────────────────────────────────┤  │
│  │ User: "do you know node-llama?"   │  │
│  ├───────────────────────────────────┤  │
│  │ AI: "Yes, I'm familiar..."        │  │
│  ├───────────────────────────────────┤  │
│  │ (Space for more conversation)     │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
```

- Limited size (e.g., 2048, 4096, or 8192 tokens)
- When full, old messages must be removed
- All previous messages influence the next response
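The "when full, remove old messages" rule can be sketched as below. The `countTokens` heuristic is hypothetical (real token counts are tokenizer- and model-specific), and we keep the system prompt pinned while dropping the oldest conversation turns:

```typescript
type Message = { role: "system" | "user" | "assistant"; text: string };

// Hypothetical token counter: roughly one token per four characters.
// A real implementation would use the model's own tokenizer.
function countTokens(text: string): number {
    return Math.ceil(text.length / 4);
}

// Drop the oldest non-system messages until the history fits the window.
function trimToContext(history: Message[], maxTokens: number): Message[] {
    const trimmed = [...history];
    const used = () => trimmed.reduce((n, m) => n + countTokens(m.text), 0);
    while (used() > maxTokens) {
        const idx = trimmed.findIndex((m) => m.role !== "system");
        if (idx === -1) break; // only the system prompt remains
        trimmed.splice(idx, 1);
    }
    return trimmed;
}
```

Libraries typically handle this for you, but the effect is the same: once the window fills, the oldest turns silently fall out of the model's "memory".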
## How LLMs Generate Responses

### Token-by-Token Generation

LLMs don't generate entire sentences at once. They predict one **token** (a word piece) at a time:

```
Prompt: "What is AI?"

Generation Process:
"What is AI?"         → [Model] → "AI"
"What is AI? AI"      → [Model] → "is"
"What is AI? AI is"   → [Model] → "a"
"What is AI? AI is a" → [Model] → "field"
... continues until a stop condition is reached
```

**Visualization:**

```
Input Prompt
      ↓
┌────────────┐
│   Model    │ → Token 1: "AI"
│ Processes  │ → Token 2: "is"
│ & Predicts │ → Token 3: "a"
└────────────┘ → Token 4: "field"
        ...
```
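The loop above can be sketched with a toy stand-in for the model. Here a hypothetical lookup table plays the role of next-token prediction; a real LLM instead scores every token in its vocabulary at each step and a sampler picks one:

```typescript
// Toy "model": maps the current text to its single most likely next token.
const nextToken: Record<string, string> = {
    "What is AI?": "AI",
    "What is AI? AI": "is",
    "What is AI? AI is": "a",
    "What is AI? AI is a": "field",
};

function generate(prompt: string, maxTokens: number): string {
    let text = prompt;
    for (let i = 0; i < maxTokens; i++) {
        const token = nextToken[text];
        if (token === undefined) break; // stop condition: nothing left to predict
        text += " " + token;            // append the token and feed it back in
    }
    return text;
}

console.log(generate("What is AI?", 16)); // → "What is AI? AI is a field"
```

Note the feedback loop: each predicted token becomes part of the input for the next prediction, which is why generation cost grows with response length.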
## Key Concepts for AI Agents

### 1. Stateless Processing

- Each prompt is independent unless you maintain context
- The model has no memory between different script runs
- To build an "agent", you need to:
  - Keep the context alive between prompts
  - Maintain conversation history
  - Add tools/functions (covered in later examples)
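A minimal sketch of keeping history alive between prompts, assuming a hypothetical `askModel` stand-in for a real inference call (which would be asynchronous); here it just reports how many messages it was given:

```typescript
// Hypothetical stand-in for a real model call.
function askModel(history: string[]): string {
    return `(model saw ${history.length} messages)`;
}

// Because the model itself is stateless, every turn re-sends
// the whole accumulated history alongside the new input.
function chatTurn(history: string[], userInput: string): string {
    history.push(`User: ${userInput}`);
    const reply = askModel(history);
    history.push(`AI: ${reply}`);
    return reply;
}
```

Chat libraries hide this bookkeeping behind a session object, but under the hood the "memory" is just the history being replayed on every turn.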
### 2. Prompt Engineering Basics

The way you phrase questions affects the response:

```
✗ Poor:   "node-llama-cpp"
~ Better: "do you know node-llama-cpp"
✓ Best:   "Explain what node-llama-cpp is and how it works"
```
### 3. Resource Management

LLMs consume significant resources:

```
Model Loading
      ↓
┌───────────────────┐
│ RAM/VRAM Usage    │ ← Models need gigabytes
│ CPU/GPU Time      │ ← Inference takes time
│ Memory Leaks?     │ ← Must clean up properly
└───────────────────┘
      ↓
Proper Disposal
```
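The disposal step follows a generic cleanup pattern. node-llama-cpp objects that hold native memory expose `dispose()` methods; the `MockModel` below is an illustrative stand-in so the shape of the pattern is visible on its own:

```typescript
// Mock resource standing in for a model or context that holds native memory.
class MockModel {
    disposed = false;
    dispose(): void {
        this.disposed = true; // a real dispose() frees RAM/VRAM here
    }
}

// try/finally guarantees disposal even when the work throws.
function withModel<T>(model: MockModel, work: (m: MockModel) => T): T {
    try {
        return work(model);
    } finally {
        model.dispose();
    }
}
```

Skipping disposal in a long-running process means gigabytes of native memory stay allocated until the process exits.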
## Why This Matters for Agents

This basic example establishes the foundation for AI agents:

1. **Agents need LLMs to "think"**: The model processes information and generates responses
2. **Agents need context**: To maintain state across interactions
3. **Agents need structure**: Later examples add tools, memory, and reasoning loops

## Next Steps

After understanding basic prompting, explore:

- **System prompts**: Giving the model a specific role or behavior
- **Function calling**: Allowing the model to use tools
- **Memory**: Persisting information across sessions
- **Reasoning patterns**: Like ReAct (Reasoning + Acting)
## Diagram: Complete Architecture

```
┌──────────────────────────────────────────────────┐
│                 Your Application                 │
│  ┌────────────────────────────────────────────┐  │
│  │           node-llama-cpp Library           │  │
│  │  ┌──────────────────────────────────────┐  │  │
│  │  │       llama.cpp (C++ Runtime)        │  │  │
│  │  │  ┌────────────────────────────────┐  │  │  │
│  │  │  │       Model File (GGUF)        │  │  │  │
│  │  │  │   • Qwen3-1.7B-Q8_0.gguf       │  │  │  │
│  │  │  └────────────────────────────────┘  │  │  │
│  │  └──────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘
                         ↓
                 ┌──────────────┐
                 │  CPU / GPU   │
                 └──────────────┘
```

This layered architecture allows you to build sophisticated AI agents on top of basic LLM interactions.