
Concept: Basic LLM Interaction

Overview

This example introduces the fundamental concepts of working with a Large Language Model (LLM) running locally on your machine. It demonstrates the simplest possible interaction: loading a model and asking it a question.

What is a Local LLM?

A Local LLM is an AI language model that runs entirely on your own computer, without requiring internet connectivity or external API calls. Key benefits:

  • Privacy: Your data never leaves your machine
  • Cost: No per-token API charges
  • Control: Full control over model selection and parameters
  • Offline: Works without internet connection

Core Components

1. Model Files (GGUF Format)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Qwen3-1.7B-Q8_0.gguf     β”‚
β”‚   (Model Weights File)      β”‚
β”‚                             β”‚
β”‚  β€’ Stores learned patterns  β”‚
β”‚  β€’ Quantized for efficiency β”‚
β”‚  β€’ Loaded into RAM/VRAM     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  • GGUF: File format optimized for llama.cpp
  • Quantization: Reduces model size (e.g., 8-bit instead of 16-bit)
  • Trade-off: Smaller size and faster inference vs. slight quality loss
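
The trade-off is easy to quantify with back-of-the-envelope arithmetic: a weight file is roughly parameters Γ— bits-per-weight Γ· 8 bytes (real GGUF files are slightly larger because of quantization scales and metadata). A minimal sketch, with an illustrative helper name:

```javascript
// Rough size estimate: parameters Γ— bits per weight / 8 = bytes.
// Real GGUF files add a little overhead (quantization scales, metadata).
function estimateModelSizeGB(paramsBillion, bitsPerWeight) {
  return (paramsBillion * 1e9 * bitsPerWeight) / 8 / 1e9;
}

console.log(estimateModelSizeGB(1.7, 16)); // FP16 baseline: ~3.4 GB
console.log(estimateModelSizeGB(1.7, 8));  // Q8_0:          ~1.7 GB
console.log(estimateModelSizeGB(1.7, 4));  // 4-bit quant:   ~0.85 GB
```

This is why an 8-bit quantization roughly halves the memory footprint compared to the FP16 original.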

2. The Inference Pipeline

User Input β†’ Model β†’ Generation β†’ Response
    ↓          ↓          ↓           ↓
 "Hello"   Context   Sampling    "Hi there!"

Flow Diagram:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Prompt  β”‚ --> β”‚ Context  β”‚ --> β”‚  Model   β”‚ --> β”‚ Response β”‚
β”‚          β”‚     β”‚ (Memory) β”‚     β”‚(Weights) β”‚     β”‚  (Text)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
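
In code, this pipeline maps onto node-llama-cpp's v3 API. A minimal sketch β€” the model path is a placeholder for wherever your GGUF file lives, and running it requires the model to be downloaded first:

```javascript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// Load the runtime and the model weights (Prompt β†’ Context β†’ Model β†’ Response).
const llama = await getLlama();
const model = await llama.loadModel({
  modelPath: "models/Qwen3-1.7B-Q8_0.gguf" // placeholder path
});

// A context holds the working memory; a chat session drives the pipeline.
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

const answer = await session.prompt("do you know node-llama-cpp?");
console.log(answer);
```

Each box in the diagram corresponds to one object here: the prompt string, the context, the loaded model, and the returned text.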

3. Context Window

The context window is the model's working memory:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           Context Window                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ System Prompt (if any)          β”‚   β”‚
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€   β”‚
β”‚  β”‚ User: "do you know node-llama?" β”‚   β”‚
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€   β”‚
β”‚  β”‚ AI: "Yes, I'm familiar..."      β”‚   β”‚
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€   β”‚
β”‚  β”‚ (Space for more conversation)   β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  • Limited size (e.g., 2048, 4096, or 8192 tokens)
  • When full, old messages must be removed
  • All previous messages influence the next response
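
When the window fills up, something has to give. One common strategy β€” sketched below with illustrative names β€” is to keep the system prompt and evict the oldest turns first. `countTokens` is a whitespace stand-in; a real tokenizer counts subword tokens:

```javascript
// Illustrative eviction strategy: keep the system prompt, then walk the
// conversation from newest to oldest, keeping turns while they fit the budget.
// countTokens is a stand-in; real tokenizers count subword tokens, not words.
const countTokens = (text) => text.split(/\s+/).filter(Boolean).length;

function trimToContext(history, maxTokens) {
  const system = history.filter((m) => m.role === "system");
  const turns = history.filter((m) => m.role !== "system");
  const budget =
    maxTokens - system.reduce((n, m) => n + countTokens(m.text), 0);

  const kept = [];
  let used = 0;
  for (let i = turns.length - 1; i >= 0; i--) { // newest first
    const cost = countTokens(turns[i].text);
    if (used + cost > budget) break; // oldest turns fall off
    kept.unshift(turns[i]);
    used += cost;
  }
  return [...system, ...kept];
}
```

Real runtimes implement more sophisticated variants (context shifting, summarization), but the constraint is the same: the window is finite, and everything inside it shapes the next response.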

How LLMs Generate Responses

Token-by-Token Generation

LLMs don't generate entire sentences at once. They predict one token (word piece) at a time:

Prompt: "What is AI?"

Generation Process:
"What is AI?" β†’ [Model] β†’ "AI"
"What is AI? AI" β†’ [Model] β†’ "is"
"What is AI? AI is" β†’ [Model] β†’ "a"
"What is AI? AI is a" β†’ [Model] β†’ "field"
... continues until stop condition

Visualization:

Input Prompt
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Model    β”‚ β†’ Token 1: "AI"
β”‚  Processes  β”‚ β†’ Token 2: "is"
β”‚ & Predicts  β”‚ β†’ Token 3: "a"
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β†’ Token 4: "field"
                β†’ ...
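
The loop above can be sketched with a toy "model": `nextToken` here is a hard-coded lookup table, not a real LLM, but the autoregressive loop shape β€” predict, append, repeat until a stop token β€” is exactly what inference engines do:

```javascript
// Toy autoregressive decoder. nextToken is a hard-coded table standing in
// for a real model; everything else is the actual loop shape.
const nextToken = (text) =>
  ({
    "What is AI?": "AI",
    "What is AI? AI": "is",
    "What is AI? AI is": "a",
    "What is AI? AI is a": "field"
  })[text] ?? "<eos>";

function generate(prompt) {
  let text = prompt;
  const tokens = [];
  for (;;) {
    const token = nextToken(text); // predict one token
    if (token === "<eos>") break;  // stop condition
    tokens.push(token);
    text += " " + token;           // feed the output back in
  }
  return tokens.join(" ");
}

console.log(generate("What is AI?")); // "AI is a field"
```

A real model replaces the lookup table with a probability distribution over the whole vocabulary, from which a sampler picks the next token.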

Key Concepts for AI Agents

1. Stateless Processing

  • Each prompt is independent unless you maintain context
  • The model has no memory between different script runs
  • To build an "agent", you need to:
    • Keep the context alive between prompts
    • Maintain conversation history
    • Add tools/functions (covered in later examples)
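
The difference between a one-off call and an agent-style session can be sketched with a stub model. `fakeLlm` below only reports how much history it was shown, and `ChatSession` is an illustrative name, not a library class:

```javascript
// fakeLlm stands in for a real model call; a bare call is stateless,
// while the session carries history into every new prompt.
const fakeLlm = (messages) => `I can see ${messages.length} message(s)`;

class ChatSession {
  constructor() {
    this.history = []; // survives across prompts β€” this is the "state"
  }
  prompt(text) {
    this.history.push({role: "user", text});
    const reply = fakeLlm(this.history); // model sees the whole history
    this.history.push({role: "assistant", text: reply});
    return reply;
  }
}

const chat = new ChatSession();
chat.prompt("hi");           // the stub model sees 1 message
chat.prompt("remember me?"); // now it sees 3 (history carried over)
```

Libraries like node-llama-cpp provide the same pattern: a session object that keeps the conversation in the context between prompts.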

2. Prompt Engineering Basics

The way you phrase questions affects the response:

❌ Poor: "node-llama-cpp"
βœ… Better: "do you know node-llama-cpp"
βœ… Best: "Explain what node-llama-cpp is and how it works"

3. Resource Management

LLMs consume significant resources:

Model Loading
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  RAM/VRAM Usage β”‚  ← Models need gigabytes
β”‚  CPU/GPU Time   β”‚  ← Inference takes time
β”‚  Memory Leaks?  β”‚  ← Must clean up properly
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     ↓
Proper Disposal
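
The load β†’ use β†’ dispose lifecycle can be sketched with a stand-in resource; the try/finally shape guarantees cleanup runs even when inference throws. (In node-llama-cpp, models and contexts expose a real dispose() method that frees native memory for the same reason.)

```javascript
// Stand-in for a loaded model holding native memory; the function and
// field names are illustrative, not a library API.
function loadFakeModel() {
  return {
    disposed: false,
    infer: (prompt) => prompt.toUpperCase(),
    dispose() { this.disposed = true; } // release RAM/VRAM here
  };
}

function runOnce(prompt) {
  const model = loadFakeModel();
  try {
    return model.infer(prompt);
  } finally {
    model.dispose(); // runs even if inference throws
  }
}
```

Forgetting the disposal step is how gigabytes of model weights end up stranded in memory after a script finishes with them.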

Why This Matters for Agents

This basic example establishes the foundation for AI agents:

  1. Agents need LLMs to "think": The model processes information and generates responses
  2. Agents need context: To maintain state across interactions
  3. Agents need structure: Later examples add tools, memory, and reasoning loops

Next Steps

After understanding basic prompting, explore:

  • System prompts: Giving the model a specific role or behavior
  • Function calling: Allowing the model to use tools
  • Memory: Persisting information across sessions
  • Reasoning patterns: Like ReAct (Reasoning + Acting)

Diagram: Complete Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            Your Application                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚         node-llama-cpp Library             β”‚ β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚
β”‚  β”‚  β”‚      llama.cpp (C++ Runtime)         β”‚  β”‚ β”‚
β”‚  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚ β”‚
β”‚  β”‚  β”‚  β”‚   Model File (GGUF)            β”‚  β”‚  β”‚ β”‚
β”‚  β”‚  β”‚  β”‚   β€’ Qwen3-1.7B-Q8_0.gguf       β”‚  β”‚  β”‚ β”‚
β”‚  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚ β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           ↕
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  CPU / GPU   β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This layered architecture allows you to build sophisticated AI agents on top of basic LLM interactions.