---

license: other
language:
- en
tags:
- routing-of-experts
- compound-ai
- multi-expert
- agentic
- code-execution
- web-search
- persistent-memory
- persona
- chain-of-thought
- llm-orchestration
- llama-cpp
- gguf
- pytorch
- packed-model
pipeline_tag: text-generation
---

# PackedLLM

**~10B total parameters · ~3B active per inference · Routing-of-Experts (RoE) architecture**

PackedLLM is a self-contained multi-expert language model system built around a **Routing-of-Experts (RoE)** mechanism. Rather than mixing expert outputs at the token level inside a shared transformer (Mixture-of-Experts), PackedLLM routes each request — and each stage of a multi-stage reasoning pipeline — to a dedicated, fully independent specialist model. At most one or two experts are active simultaneously, keeping peak memory around 3B parameters regardless of the 10B total footprint.

The system runs entirely on consumer hardware via llama.cpp, persists its full state to a single ZIP checkpoint, and integrates persistent vector memory, sandboxed Python execution, and multi-engine web search as first-class pipeline citizens.

---

## Architecture overview

```
PackedLLMRunner          ← user-facing shell: load, warmup, lifecycle
      │
PackedLLM (PackedLLM.pt) ← 9-stage orchestration pipeline
      │
Expert dispatch layer    ← 10 specialist models, one active at a time
      │
 ┌────┴─────────────────────────────┐
 │ GATOR/MemoryBank   CodeBox   Web │  ← integrated modules
 └────┬─────────────────────────────┘
      │
PackedLM (LM.pt)         ← llama.cpp inference engine + ExpertHandles
```
<img src="images/packedllm_architecture.png" alt="PackedLLM Architecture" width="1000">

---

## How it differs from MoE

| | Standard MoE (Mixtral, DeepSeek) | PackedLLM RoE |
|---|---|---|
| **Routing granularity** | Per-token, inside every transformer layer | Per-task and per-pipeline-stage |
| **What gets routed** | FFN sub-modules sharing one transformer | Separate, fully independent specialist LLMs |
| **Parameters active** | Top-K experts × FFN size, across all layers | One expert at a time (~3B peak) |
| **Router mechanism** | Learned linear gating vector | `HeadExpert` — a full LLM returning JSON |
| **Experts share weights?** | Yes (all attention layers are always shared) | No — complete independence |
| **Pipeline** | Single transformer forward pass | 9-stage: plan → route → execute → synthesize → persona → review |

The closest related work is **Composition of Experts** (Chai et al., 2024, arXiv:2412.01868), which also routes at the input level to full LLM models. PackedLLM extends this with a multi-stage orchestration pipeline, per-stage retry/detour recovery, integrated memory and execution modules, affective state modelling, and a character persona layer — none of which appear in prior systems of this type.

---

## Pipeline stages

Every call to `forward()` passes through these stages in order:

| Stage | Expert | Temperature | What it does |
|---|---|---|---|
| 1. Plan goal | HeadExpert | 0.2 | Parse intent, tone, routing flags (needs_web / needs_action / needs_vision) |
| 2. Consult memory | GATOR | n/a | Match registered commands; retrieve relevant memory |
| 3. Build route | HeadExpert | 0.5 | Generate ordered list of expert steps as JSON |
| 4. Execute route | various | per-expert | Dispatch each step; retry / detour / skip on failure |
| 5. Synthesize base | HeadExpert | 1.0 | Combine step outputs into a persona-free prose answer |
| 6. Affective state | AffectExpert | 0.5 | Generate bot emotional + physical state JSON |
| 7. Apply persona | RoleExpert | 1.0 | Rewrite base response in character |
| 8. Review | HeadExpert | 0.0 | Accept / revise / reject; extract memory facts; profile updates |
| 9. Finalize | GATOR | n/a | Write memory, update user/bot profiles |
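
The retry / detour / skip behaviour of stage 4 can be pictured with a small sketch. This is not the project's implementation: the step shape (`{"expert": ..., "input": ...}`), the `dispatch` callable, the `detours` mapping, and the `math_expert` / `logic_expert` identifiers are all assumptions made for illustration.

```python
def execute_route(steps, dispatch, max_retries=1, detours=None):
    """Run route steps in order: retry a failing expert, then try a
    detour expert, and finally mark the step as skipped.

    Hypothetical helper: `steps`, `dispatch`, and `detours` are
    illustrative, not PackedLLM's actual API.
    """
    detours = detours or {}
    outputs = []
    for step in steps:
        # First the planned expert, then its detour (if one is configured).
        candidates = [step["expert"]]
        if step["expert"] in detours:
            candidates.append(detours[step["expert"]])

        result, used = None, None
        for expert in candidates:
            for _attempt in range(max_retries + 1):
                try:
                    result = dispatch(expert, step["input"])
                    used = expert
                    break
                except RuntimeError:
                    continue  # retry the same expert
            if result is not None:
                break  # no detour needed

        if result is None:
            outputs.append({"expert": step["expert"], "status": "skipped"})
        else:
            outputs.append({"expert": used, "status": "ok", "output": result})
    return outputs


def fake_dispatch(expert, text):
    # Stand-in for a model call; the math expert always fails here.
    if expert == "math_expert":
        raise RuntimeError("generation failed")
    return f"{expert}: {text}"


route = [
    {"expert": "math_expert", "input": "compound interest"},
    {"expert": "code_expert", "input": "write report.txt"},
]
results = execute_route(route, fake_dispatch, detours={"math_expert": "logic_expert"})
```

In this sketch a skipped step still leaves a record in the output list, so a later synthesis stage could see which parts of the route produced nothing.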



---

## Expert roster

| Expert | Active params | Role | Notes |
|---|---|---|---|
| HeadExpert | ~3B | Orchestrator, router, planner, synthesizer, reviewer | Most-called expert |
| LogicExpert | ~1B | Structured reasoning; deep-think CoT; action planning/repair | Raw completion with `<think>` blocks |
| CodeExpert | ~1B | Python script generation for action pipeline | Temperature 0.0; raw code only, no prose |
| MathExpert | ~1B | Quantitative reasoning | Post-processes CJK spans; deduplicates repeated lines |
| AffectExpert | ~0.5B | Emotional state; step quality evaluation | Used as both emotion classifier and pass/fail judge |
| RoleExpert | ~0.5B | Persona rewriting in character | RP style chat format |
| CreativeExpert | ~1B | Writing and stylistic generation | High temperature defaults (0.9) |
| VisionExpert | ~1B | Multimodal image understanding | CLIP projector; local images → data URI |
| ToolExpert | ~0.5B | Function-call generation | Outputs `{"tool_calls": [...]}` JSON |
| TranslationExpert | ~300M | Chinese → English | seq2seq, not an LLM; Chinese regex gate |

**Total: ~10B · Peak active: ~3B**

---

## Forward modes

### Standard (full pipeline)
```python
bot.chat("What is the compound interest on $5000 at 4% over 10 years?")
```
Runs all 9 stages, including memory reads and writes. The web and action pipelines run only if needed.

### Fast think (minimum latency)
```python
bot.chat("What time is it in Tokyo?", fast_think=True)
```
Skips planning, routing, memory, web, action, affective state, and review. HeadExpert answers directly; RoleExpert applies persona if a bot profile exists. Maximum 2 LLM calls.

### Deep think (CoT scaffolding)
```python
bot.chat("Design a Python caching decorator with TTL support.", deep_think=True)
```
Before each pipeline stage, `LogicExpert` generates `<think>...</think>` blocks scoped to that stage's specific task and output contract. These blocks are prepended to the stage's prompt as if the executing expert had already done that prior reasoning. Blocks are cached within a single `forward()` call. Translation is excluded (not an LLM).
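
A rough sketch of the prepend-and-cache idea follows. The function name, the `think_fn` callable standing in for `LogicExpert`, and the choice to key the cache by stage name are assumptions; the real cache key and prompt layout may differ.

```python
def prepend_think(stage, prompt, think_fn, cache):
    """Return `prompt` prefixed with a <think> block for `stage`.

    Hypothetical helper: `think_fn(stage, prompt)` stands in for a
    LogicExpert call, and `cache` is a dict that lives for one
    forward() call so each stage is reasoned about only once.
    """
    if stage not in cache:
        cache[stage] = think_fn(stage, prompt)
    return f"<think>{cache[stage]}</think>\n{prompt}"


calls = []

def fake_think(stage, prompt):
    # Stand-in for LogicExpert; records how often it is invoked.
    calls.append(stage)
    return f"Reasoning scoped to the {stage} stage."


cache = {}  # one cache per forward() call
first = prepend_think("build_route", "Return the route as JSON.", fake_think, cache)
second = prepend_think("build_route", "Return the route as JSON.", fake_think, cache)
```

The second call reuses the cached block, so the stand-in reasoning function runs once even though the prompt is built twice.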

---

## Integrated modules

### MemoryBank (`GATOR.pt`)
A multi-tree semantic store built on `PackedTree` — a custom embedding + KMeans clustering retrieval structure. Trees: knowledge, conversation, user profiles, bot profiles, commands, assets, telemetry. Hybrid retrieval scoring: 75% semantic similarity + 20% keyword overlap + 5% importance metadata. Embedding model: Jina Embeddings v3 (GGUF, stored inside the checkpoint). GATOR's own action planner uses `HeadExpert` to decide which memory operations to run. Also contains `DesktopControl` (OS automation) and `CommandRegistry` (text-to-action macros).
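
The 75 / 20 / 5 weighting can be written out as a short scoring function. Only the weights come from the description above; the use of cosine similarity, the definition of keyword overlap as the fraction of query words found in the text, and an importance value in [0, 1] are assumptions made for this sketch.

```python
import math

# Weights from the description above; everything else is illustrative.
WEIGHTS = {"semantic": 0.75, "keyword": 0.20, "importance": 0.05}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    """Fraction of distinct query words that also occur in the text (assumed metric)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def hybrid_score(query_vec, doc_vec, query, text, importance):
    """Combine the three signals with the 75/20/5 weighting."""
    return (WEIGHTS["semantic"] * cosine(query_vec, doc_vec)
            + WEIGHTS["keyword"] * keyword_overlap(query, text)
            + WEIGHTS["importance"] * importance)


# Toy 2-d "embeddings"; a real store would use Jina Embeddings v3 vectors.
close = hybrid_score([1.0, 0.0], [1.0, 0.0], "concise answers",
                     "user prefers concise answers", importance=1.0)
far = hybrid_score([1.0, 0.0], [0.0, 1.0], "concise answers",
                   "unrelated note", importance=0.0)
```

Because semantic similarity carries three quarters of the weight, keyword overlap and importance mostly act as tie-breakers between entries with similar embeddings.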

### CodeBox (`CodeBox.pt`)
Persistent Python sandbox with isolated virtual environment management, SHA256-verified asset registry, loader injection (`from _codebox_loader import load_asset` inside sandboxed code), DAG pipeline runner with `$var` reference passing between steps, LRU runner cache for expensive models, and hard RAM/CPU kill thresholds enforced by a monitoring thread.
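
To illustrate `$var` reference passing in a DAG runner, here is a minimal sketch built on the standard library's `graphlib`. The step format (`name -> (callable, args)`) and the `run_pipeline` name are assumptions, not CodeBox's interface, and the sketch leaves out sandboxing, caching, and resource limits.

```python
from graphlib import TopologicalSorter


def is_ref(arg):
    """True if an argument is a "$name" reference to another step's result."""
    return isinstance(arg, str) and arg.startswith("$")


def run_pipeline(steps):
    """Run steps in dependency order, replacing "$name" args with results.

    Hypothetical format: steps maps a step name to (callable, list_of_args).
    """
    # Each step depends on the steps it references through "$name".
    deps = {
        name: {arg[1:] for arg in args if is_ref(arg)}
        for name, (_func, args) in steps.items()
    }
    results = {}
    for name in TopologicalSorter(deps).static_order():
        func, args = steps[name]
        resolved = [results[arg[1:]] if is_ref(arg) else arg for arg in args]
        results[name] = func(*resolved)
    return results


pipeline = {
    "total": (lambda a, b: a + b, ["$base", "$doubled"]),
    "doubled": (lambda x: x * 2, ["$base"]),
    "base": (lambda: 10, []),
}
results = run_pipeline(pipeline)
```

Steps can be declared in any order; the topological sort ensures `base` runs before `doubled`, and both run before `total`.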

### Web (`WebSearch.pt`)
Three search engines (DuckDuckGo HTML, Google, ResultHunter) with embedding-ranked candidate deduplication. Content extraction tries 10 methods: YouTube transcripts → trafilatura → boilerpy3 → readability → newspaper3k → goose3 → inscriptis → lxml → BeautifulSoup → visible text. PDF via PyMuPDF. Summarization via DistilBART. Runs in a separate spawned process; communicates via `multiprocessing.Queue`. Serializes safely — live process handles are stripped on save.
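
The ordered extraction chain amounts to "try each method until one yields usable text". The sketch below shows that pattern with stand-in extractors; the `min_chars` threshold and all function names are assumptions, and none of the real libraries listed above are called.

```python
import re


def extract_with_fallback(html, extractors, min_chars=20):
    """Try extractors in order and return (name, text) from the first
    one that yields at least `min_chars` characters of text.

    Hypothetical helper: the threshold and the (name, callable) format
    are illustrative choices.
    """
    for name, extract in extractors:
        try:
            text = extract(html)
        except Exception:
            continue  # this method failed; fall through to the next one
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
    return None, ""


def broken_extractor(html):
    # Stand-in for a method whose dependency is missing or which crashes.
    raise ValueError("parser not available")


def strip_tags(html):
    # Crude last-resort extractor: drop tags and collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html))


chain = [
    ("broken", broken_extractor),
    ("empty", lambda html: ""),
    ("strip_tags", strip_tags),
]
name, text = extract_with_fallback(
    "<p>Solid-state batteries use a solid electrolyte.</p>", chain
)
```

Ordering the chain from most to least specialised means the crude methods only run when the better ones fail or return too little text.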

---

## Usage

### Basic
```python
from PackedLLM import PackedLLMRunner

bot = PackedLLMRunner("PackedLLM.pt", bot_id="pip", user_id="alice")
print(bot.chat("Explain gradient descent in one paragraph."))
```

### Expert shortcuts (bypass full pipeline)
```python
bot.creative("Write a haiku about a robot discovering music.")
bot.code("Implement binary search in Python with comments.")
bot.math("Solve: integral of x² · sin(x) dx")
bot.logic("All A are B. Some B are C. What follows?", mode="deep_then_answer")
bot.translate("人工智能正在改变世界")
bot.web("Latest developments in solid-state batteries?")
bot.action("Compute compound interest on $5000 at 4% over 10 years; save to report.txt")
```

### Memory
```python
bot.memory_store("User prefers concise answers under 100 words.")
results = bot.memory_recall("answer preferences", top_k=3)
bot.set_user_profile({"name": "Alice", "expertise": "ML"})
bot.set_bot_profile({"character_card": "You are Pip, a direct and slightly sarcastic assistant."})
```

### Lifecycle
```python
bot.unload_expert("vision_expert")   # free VRAM; reloads lazily on next use
bot.reload_expert("code_expert")     # hot-reload after checkpoint update
print(bot.status())                  # full system diagnostic

# Context manager (auto-unload on exit)
with PackedLLMRunner("PackedLLM.pt", bot_id="pip", user_id="alice") as bot:
    print(bot.chat("Summarise the Pythagorean theorem."))
```

---

## Checkpointing

`PackedLLM.pt` is a ZIP archive containing:
- `manifest.pt` — all metadata, profiles, hardware state, embedded source code
- `lm_chunk_N.bin` — model weights in 32MB streaming chunks
- `mem_chunk_N.bin` — GATOR memory store chunks
- `web_chunk_N.bin` — WebSearch module chunks
- `box_chunk_N.bin` — CodeBox chunks
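
Since the checkpoint is described as a ZIP archive, its layout can be inspected with the standard `zipfile` module without loading any model. The sketch below groups entries by the chunk prefixes listed above; `summarize_checkpoint` is a hypothetical helper, and the demo builds a tiny in-memory archive instead of opening a real `PackedLLM.pt`.

```python
import io
import re
import zipfile
from collections import defaultdict

# Chunk naming taken from the list above.
CHUNK_RE = re.compile(r"^(lm|mem|web|box)_chunk_(\d+)\.bin$")


def summarize_checkpoint(path_or_file):
    """Group archive entries by chunk prefix and total their sizes.

    Hypothetical helper: it only reads the ZIP directory and does not
    decode manifest.pt or any chunk contents.
    """
    groups = defaultdict(lambda: {"count": 0, "bytes": 0})
    with zipfile.ZipFile(path_or_file) as archive:
        for info in archive.infolist():
            match = CHUNK_RE.match(info.filename)
            key = match.group(1) if match else "other"
            groups[key]["count"] += 1
            groups[key]["bytes"] += info.file_size
    return dict(groups)


# Build a small stand-in archive so the example runs without a real checkpoint.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("manifest.pt", b"meta")
    demo.writestr("lm_chunk_0.bin", b"\x00" * 64)
    demo.writestr("lm_chunk_1.bin", b"\x00" * 32)
    demo.writestr("mem_chunk_0.bin", b"\x00" * 16)
buffer.seek(0)
summary = summarize_checkpoint(buffer)
```

For a real file, passing the path (for example `summarize_checkpoint("PackedLLM.pt")`) should work the same way, assuming the archive follows the naming shown above.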


---

## Hardware

PackedLM detects and uses CUDA, Apple Metal (MPS), WebGPU, or CPU automatically via `HardwareProbe`. For each expert, `_plan_offload()` estimates the GGUF file size and computes how many transformer layers can fit in free VRAM (with a 15% safety margin for CUDA, 40% for WebGPU). If VRAM is insufficient for a full offload, layers are split proportionally between GPU and CPU.
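
The layer-split arithmetic can be sketched as follows. Only the 15% and 40% margins come from the text above; the assumption that every layer takes an equal share of the GGUF file, the default margin for other GPU backends, and the function signature are illustrative and are not the actual `_plan_offload()` logic.

```python
# Safety margins from the paragraph above; other backends are an assumption.
SAFETY_MARGIN = {"cuda": 0.15, "webgpu": 0.40}
DEFAULT_MARGIN = 0.15  # assumed fallback, e.g. for Metal


def plan_offload(gguf_bytes, n_layers, free_vram_bytes, backend):
    """Return (gpu_layers, cpu_layers) for one expert.

    Hypothetical sketch: treats each transformer layer as an equal
    slice of the GGUF file size.
    """
    if backend == "cpu":
        return 0, n_layers
    margin = SAFETY_MARGIN.get(backend, DEFAULT_MARGIN)
    usable = free_vram_bytes * (1.0 - margin)   # keep headroom in VRAM
    per_layer = gguf_bytes / n_layers           # equal-size layer estimate
    gpu_layers = min(n_layers, int(usable // per_layer))
    return gpu_layers, n_layers - gpu_layers


# A 2 GB model with 32 layers and 1 GB of free VRAM.
cuda_plan = plan_offload(2_000_000_000, 32, 1_000_000_000, "cuda")
webgpu_plan = plan_offload(2_000_000_000, 32, 1_000_000_000, "webgpu")
full_plan = plan_offload(2_000_000_000, 32, 8_000_000_000, "cuda")
```

With these numbers each layer is estimated at 62.5 MB, so the larger WebGPU margin leaves room for fewer GPU layers than CUDA from the same amount of free memory.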

---

## Citation

```bibtex
@software{packedllm2026,
  author    = {Chance Brownfield},
  title     = {PackedLLM: A Routing-of-Experts System with LLM-Orchestrated Execution Pipeline},
  year      = {2026},
  note      = {RoE architecture: task-level routing to fully independent specialist LLMs.
               Distinct from token-level Mixture-of-Experts.
               Integrates persistent vector memory (GATOR), sandboxed Python execution (CodeBox),
               and multi-engine web search in a 9-stage orchestration pipeline.}
}
```

---

## License

This project is licensed under PackedLicense v1.0.

Free for personal, educational, research, and other non-commercial use.

Commercial use requires prior written authorization.

The GATOR, WebSearchModule, and CodeBox components are protected under this license and may not be extracted, redistributed, or commercially reused without authorization.