# Concept: Basic LLM Interaction

## Overview

This example introduces the fundamental concepts of working with a Large Language Model (LLM) running locally on your machine. It demonstrates the simplest possible interaction: loading a model and asking it a question.

## What is a Local LLM?

A **Local LLM** is an AI language model that runs entirely on your own computer, without requiring internet connectivity or external API calls. Key benefits:

- **Privacy**: Your data never leaves your machine
- **Cost**: No per-token API charges
- **Control**: Full control over model selection and parameters
- **Offline**: Works without internet connection

## Core Components

### 1. Model Files (GGUF Format)

```
┌─────────────────────────────┐
│   Qwen3-1.7B-Q8_0.gguf     │
│   (Model Weights File)      │
│                             │
│  • Stores learned patterns  │
│  • Quantized for efficiency │
│  • Loaded into RAM/VRAM     │
└─────────────────────────────┘
```

- **GGUF**: File format optimized for llama.cpp
- **Quantization**: Reduces model size (e.g., 8-bit instead of 16-bit)
- **Trade-off**: Smaller size and faster speed vs. slight quality loss

### 2. The Inference Pipeline

```
User Input → Model → Generation → Response
    ↓          ↓          ↓           ↓
 "Hello"   Context   Sampling    "Hi there!"
```

**Flow Diagram:**
```
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Prompt  │ --> │ Context  │ --> │  Model   │ --> │ Response │
│          │     │ (Memory) │     │(Weights) │     │  (Text)  │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
```

### 3. Context Window

The **context** is the model's working memory:

```
┌─────────────────────────────────────────┐
│           Context Window                │
│  ┌─────────────────────────────────┐   │
│  │ System Prompt (if any)          │   │
│  ├─────────────────────────────────┤   │
│  │ User: "do you know node-llama?" │   │
│  ├─────────────────────────────────┤   │
│  │ AI: "Yes, I'm familiar..."      │   │
│  ├─────────────────────────────────┤   │
│  │ (Space for more conversation)   │   │
│  └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
```

- Limited size (e.g., 2048, 4096, or 8192 tokens)
- When full, old messages must be removed
- All previous messages influence the next response

## How LLMs Generate Responses

### Token-by-Token Generation

LLMs don't generate entire sentences at once. They predict one **token** (word piece) at a time:

```
Prompt: "What is AI?"

Generation Process:
"What is AI?" → [Model] → "AI"
"What is AI? AI" → [Model] → "is"
"What is AI? AI is" → [Model] → "a"
"What is AI? AI is a" → [Model] → "field"
... continues until stop condition
```

**Visualization:**
```
Input Prompt
     ↓
┌────────────┐
│   Model    │ → Token 1: "AI"
│ Processes  │ → Token 2: "is"
│   & Predicts│ → Token 3: "a"
└────────────┘ → Token 4: "field"
                → ...
```

## Key Concepts for AI Agents

### 1. Stateless Processing
- Each prompt is independent unless you maintain context
- The model has no memory between different script runs
- To build an "agent", you need to:
  - Keep the context alive between prompts
  - Maintain conversation history
  - Add tools/functions (covered in later examples)

### 2. Prompt Engineering Basics
The way you phrase questions affects the response:

```
❌ Poor: "node-llama-cpp"
✅ Better: "do you know node-llama-cpp"
✅ Best: "Explain what node-llama-cpp is and how it works"
```

### 3. Resource Management
LLMs consume significant resources:

```
Model Loading
     ↓
┌─────────────────┐
│  RAM/VRAM Usage │  ← Models need gigabytes
│  CPU/GPU Time   │  ← Inference takes time
│  Memory Leaks?  │  ← Must cleanup properly
└─────────────────┘
     ↓
Proper Disposal
```

## Why This Matters for Agents

This basic example establishes the foundation for AI agents:

1. **Agents need LLMs to "think"**: The model processes information and generates responses
2. **Agents need context**: To maintain state across interactions
3. **Agents need structure**: Later examples add tools, memory, and reasoning loops

## Next Steps

After understanding basic prompting, explore:
- **System prompts**: Giving the model a specific role or behavior
- **Function calling**: Allowing the model to use tools
- **Memory**: Persisting information across sessions
- **Reasoning patterns**: Like ReAct (Reasoning + Acting)

## Diagram: Complete Architecture

```
┌──────────────────────────────────────────────────┐
│            Your Application                      │
│  ┌────────────────────────────────────────────┐ │
│  │         node-llama-cpp Library             │ │
│  │  ┌──────────────────────────────────────┐  │ │
│  │  │      llama.cpp (C++ Runtime)         │  │ │
│  │  │  ┌────────────────────────────────┐  │  │ │
│  │  │  │   Model File (GGUF)            │  │  │ │
│  │  │  │   • Qwen3-1.7B-Q8_0.gguf       │  │  │ │
│  │  │  └────────────────────────────────┘  │  │ │
│  │  └──────────────────────────────────────┘  │ │
│  └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
           ↕
    ┌──────────────┐
    │  CPU / GPU   │
    └──────────────┘
```

This layered architecture allows you to build sophisticated AI agents on top of basic LLM interactions.