Phill Swarm-MoE 4B Qwen3.5 Hybrid Final

A Hugging Face compatible causal language model with an optional shared-weight swarm runtime for routing, skills, goals, scaffolding, tool planning, agentic sessions, and local app integrations.

ayjays132/PhillSwarm-4b · 4.145B parameters · 40 routed layers · bf16 · Qwen tokenizer · HF remote code · swarm runtime optional

Start here for coherent Ollama-style use: if Hugging Face shows a llama_cpp snippet for phillswarm-4b-ollama-f16.gguf, treat that as a raw GGUF preview only. For the full PhillSwarm system, run launch_ollama_bridge.py and use phillswarm-4b:full. The bridge preserves the custom HF model code, swarm controller, verified skills, goals, tools, and vision-sidecar path behind Ollama-compatible APIs.

Recommended Use Paths

| User Goal | Recommended Path | Why |
| --- | --- | --- |
| Best local Python/HF quality | AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True) | Loads the native custom Swarm-MoE architecture. |
| Coherent Ollama-compatible use | python launch_ollama_bridge.py --model . --port 11435 then use phillswarm-4b:full | Keeps the full runtime while exposing Ollama-style /api/chat and /api/generate. |
| Quick raw GGUF experiment | llama_cpp / stock Ollama with phillswarm-4b-ollama-f16.gguf | Loadability preview only; not the full intelligence path. |

Quick Ollama-compatible full-runtime setup:

huggingface-cli download ayjays132/PhillSwarm-4b --local-dir PhillSwarm-4b
cd PhillSwarm-4b
sh setup_phillswarm_ollama.sh

If you prefer making it executable first:

chmod +x setup_phillswarm_ollama.sh
./setup_phillswarm_ollama.sh

Windows PowerShell:

huggingface-cli download ayjays132/PhillSwarm-4b --local-dir PhillSwarm-4b
cd PhillSwarm-4b
.\setup_phillswarm_ollama.ps1

Windows CMD:

huggingface-cli download ayjays132/PhillSwarm-4b --local-dir PhillSwarm-4b
cd PhillSwarm-4b
setup_phillswarm_ollama.cmd

The setup script:

  • sets OLLAMA_HOST=http://127.0.0.1:11435 for the current user
  • starts the PhillSwarm full-runtime bridge if it is not already running
  • waits until the bridge is ready
  • runs ollama list to confirm the direct Ollama CLI sees the full model
  • on macOS/Linux, adds OLLAMA_HOST to .zshrc or .bashrc when it can detect the active shell

After setup, open a new terminal and use normal Ollama commands:

ollama list
ollama run phillswarm-4b:full

Model name:

phillswarm-4b:full

What This Is

Phill Swarm-MoE is a sparse Mixture-of-Experts causal language model packaged as a normal Hugging Face checkpoint. It can be loaded as a standard AutoModelForCausalLM model, or used through the optional PhillSwarmController runtime for agentic features.

Public model repo:

ayjays132/PhillSwarm-4b
https://huggingface.co/ayjays132/PhillSwarm-4b

The checkpoint in this folder is the grown 4B final package:

  • Parameters: 4,144,993,832.
  • Architecture: Swarm-MoE decoder-only causal LM.
  • Layers: 40 unique routed layers.
  • Hidden size: 1024.
  • Experts: 16 routed experts with top-2 routing plus shared expert path.
  • Attention: grouped-query attention with Q/K RMSNorm, optional V norm gate, RoPE, KV cache.
  • Tokenizer: Qwen tokenizer copied into the final package.
  • Precision target: bf16.
  • Context configured: 4096 positions.

Plain truth: the normal model API remains standard. Swarm mode, goals, app streaming, tools, scaffolding, and online learning are wrapper/runtime features. They do not require custom `forward(goal=...)` or automatic app startup.
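
To make the "16 routed experts with top-2 routing plus shared expert path" line concrete, here is a minimal pure-Python sketch of that routing pattern. It is an illustration under assumptions, not the code in modeling_swarm_moe.py: the function names, the toy scalar experts, and the choice to renormalize the two selected gate weights are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.

    Assumption: the two selected gate weights are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

def moe_forward(x, router_logits, experts, shared_expert):
    """Combine the two routed experts with an always-on shared expert (toy scalar version)."""
    routed = sum(w * experts[i](x) for i, w in top2_route(router_logits))
    return routed + shared_expert(x)

# Toy setup: 16 scalar "experts"; expert i multiplies its input by i.
experts = [lambda x, k=i: k * x for i in range(16)]
shared_expert = lambda x: 0.5 * x

logits = [0.0] * 16
logits[3] = 2.0
logits[7] = 2.0
selection = top2_route(logits)
output = moe_forward(1.0, logits, experts, shared_expert)
print(selection, output)
```

Only two of the sixteen experts run per token in this sketch, which is the property that makes the model sparse; the shared expert contributes to every token regardless of the router's choice.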

What Makes It Different

Shared-Weight Swarm

Planner, solver, verifier, domain, tool, and editor roles can run over one loaded model instead of separate model copies.

Verified Skills

Math, tool routing, web/search planning, browser mode, IDE/CLI integration, health/legal/finance/security domain policy, and runtime diagnostics can anchor answers.

Agentic Goals

Runtime-only goal state tracks objective, constraints, allowed tools, notes, artifacts, events, and completion status in portable JSON.

Learnable Scaffolding

Scaffold routing learns from successful traces through `scaffold_blueprint.json` without mutating model weights during normal generation.

Final Polish Pass

A safe post-processor can improve wording using only the user prompt and verified final answer. Bad polish is rejected.

IDE And CLI Friendly

Runtime metadata advertises JSON-schema tools, smolagents, MCP-compatible tools, and OpenAI-style tool calls.

Shared-Weight Swarm MoE

One loaded checkpoint can drive routed roles, verified skills, and final synthesis without spawning separate model copies.

Goal-Aware Runtime

Objectives, constraints, progress, tools, artifacts, and verification events stay in portable runtime JSON.

Tool-Native Local Agent

Tool routing, web search, browser observation, coding, workspace search, and safety gates are exposed through the optional app/runtime.

Quick Start: Load As A Normal HF Model

Install current transformers, then load with trust_remote_code=True because the package uses custom Swarm-MoE model code:

pip install -U transformers accelerate safetensors torch

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "ayjays132/PhillSwarm-4b"

# trust_remote_code=True lets Transformers import the bundled Swarm-MoE classes.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()

messages = [{"role": "user", "content": "Explain a black hole in simple terms."}]
# return_dict=True returns input_ids plus attention_mask regardless of the installed version's default.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
prompt_len = inputs["input_ids"].shape[-1]
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True))

For a local clone or downloaded snapshot, replace model_id with the folder path, for example "." from inside the model folder.

Loading note: use `trust_remote_code=True` for `AutoConfig`, `AutoTokenizer`, and `AutoModelForCausalLM`. Without it, Transformers will not know how to construct the custom `swarm_moe` architecture. The optional app/runtime also needs the snapshot files locally available so Python can import the bundled `swarm_moe_model` package.

Quick Start: Use Swarm Runtime

The optional swarm runtime is packaged with the model files. For the cleanest setup, download the snapshot locally so Python can import the included swarm_moe_model runtime package:

pip install -U huggingface_hub transformers accelerate safetensors torch
huggingface-cli download ayjays132/PhillSwarm-4b --local-dir PhillSwarm-4b
cd PhillSwarm-4b

import sys
from pathlib import Path

sys.path.insert(0, str(Path(".").resolve()))

from swarm_moe_model.swarm_mode import PhillSwarmController

controller = PhillSwarmController.from_pretrained_or_config(".")

result = controller.ask(
    "Explain swarm mode compared with regular mode.",
    mode="swarm",
)

print(result.answer)
print(result.indicators["route_decision"]["final_selected"])

If you are already importing the packaged runtime from another location, you can point it at the public repo ID:

controller = PhillSwarmController.from_pretrained_or_config("ayjays132/PhillSwarm-4b")

Regular Mode Vs Swarm Mode

| Mode | What It Does | Best For |
| --- | --- | --- |
| Regular HF model | Plain causal LM generation through AutoModelForCausalLM | Standard inference, benchmark compatibility, simple integration |
| Regular runtime mode | Single model pass with optional verified skill routing | Fast local chat, CLI usage, stable known tasks |
| Swarm mode | Runtime orchestration around the same loaded model | Tool planning, goals, research-style tasks, browser/app workflows, route traces |
| Forced profile debate | Runs dynamic workers and verification trace | Debugging orchestration, comparing candidates, showing solver/critic/final-editor behavior |

The default published behavior is conservative: verified skills can finish the answer without forcing noisy profile generation. Profile workers can still be activated with return_candidates=True, explicit debate prompts, or enable_profile_generation: true.

Runtime Architecture

flowchart LR
    U["User Prompt"] --> R["Intent + Skill Router"]
    R --> S["Verified Skills"]
    R --> G["Goal State"]
    R --> C["Learnable Scaffold Blueprint"]
    S --> A["Answer Composer"]
    G --> A
    C --> W["Dynamic Shared-Weight Workers"]
    W --> N["Bounded Note Pool"]
    N --> A
    A --> P["Safe Final Polish"]
    P --> O["Final Answer"]

If Mermaid does not render in your viewer, the flow is: prompt -> router -> skills/goals/scaffold/workers -> compact evidence -> answer composer -> safe polish -> final answer.

Goals

Goals are runtime-only and stored as JSON. They do not alter the model API.

import sys
from pathlib import Path

sys.path.insert(0, str(Path(".").resolve()))

from swarm_moe_model.swarm_mode import PhillSwarmController, SwarmGoal

controller = PhillSwarmController.from_pretrained_or_config(".")
goal = SwarmGoal(
    objective="Draft a small CLI plan for using this model with tool calls.",
    constraints=["Keep it cross-platform", "Do not assume a hardcoded path"],
    success_criteria=["Shows install", "Shows run", "Mentions permissions"],
    allowed_tools=["filesystem_read", "web_search"],
)

run = controller.run_goal(goal)
print(run.final_answer)
run.state.to_json("goal_state.json")

Agentic Sessions

Agentic sessions keep multi-turn work coherent without appending every old token forever.

  • Latest user prompt remains the authority.
  • Recent turns stay in a rolling window.
  • Older turns flush into compact summaries.
  • Tool results, route decisions, and artifacts stay as metadata.
  • KV cache is used for the active generation window, not falsely persisted across independent turns.

session = controller.create_session("workspace-task")
print(session.ask("Remember that we want a portable CLI setup.", mode="swarm").answer)
print(session.ask("Now give the final install checklist.", mode="swarm").answer)
session.state.to_json("session_state.json")
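
The rolling-window behavior described in the bullets above can be sketched in a few lines of standard-library Python. This is not the packaged session class: RollingSession, its window and summary_chars parameters, and the truncating "summarizer" are hypothetical stand-ins chosen to show the idea.

```python
from collections import deque

class RollingSession:
    """Keep the last `window` turns verbatim and fold older ones into summaries."""

    def __init__(self, window=4, summary_chars=40):
        self.window = window
        self.summary_chars = summary_chars
        self.recent = deque()
        self.summaries = []

    def add_turn(self, role, text):
        self.recent.append((role, text))
        while len(self.recent) > self.window:
            old_role, old_text = self.recent.popleft()
            # Placeholder summarizer (assumption): truncate instead of calling a model.
            self.summaries.append(f"{old_role}: {old_text[:self.summary_chars]}")

    def build_context(self, latest_prompt):
        """Summaries first, then the rolling window, then the latest prompt last."""
        parts = ["[summary] " + s for s in self.summaries]
        parts += [f"{role}: {text}" for role, text in self.recent]
        parts.append("user: " + latest_prompt)
        return "\n".join(parts)

session_sketch = RollingSession(window=2)
for i in range(5):
    session_sketch.add_turn("user", f"message {i}")
context = session_sketch.build_context("Now give the final install checklist.")
print(context)
```

Placing the latest prompt last mirrors the "latest user prompt remains the authority" rule, and the context size stays bounded because old turns shrink to short summaries instead of accumulating.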

Skills And Domain Routing

The runtime includes compact verified skills and route anchors. They are designed to reduce prompt overload: the model sees only the selected route and a few verified evidence anchors, not the entire tool registry.

Current route families include:

  • general chat and identity
  • math and arithmetic
  • coding and repo workflow
  • web search planning
  • browser/vision/tool operation planning
  • research and citation discipline
  • dynamic routing diagnostics
  • training dataset and model finalization guidance
  • runtime debugging and context-overload repair
  • science explanation anchors
  • swarm-mode architecture explanation
  • AGI/runtime architecture policy
  • life-domain policy: health, legal, finance, education, creative, productivity, data, security
  • IDE/CLI integration: Cursor, VS Code, terminals, smolagents, JSON-schema tools, MCP-compatible tools, OpenAI-style tool calls

High-stakes use: health, legal, finance, and security routes are policy and safety anchors, not substitutes for qualified professional advice or permissioned security review.

Tool And App Integration

swarm_runtime_config.json advertises:

{
  "tool_protocols": ["json_schema", "smolagents", "mcp_compatible", "openai_tool_calls"],
  "external_host_compatible": true,
  "compact_tool_manifest": true
}

The intended pattern is:

  1. Host app or IDE supplies a compact tool manifest.
  2. Runtime routes the prompt to the smallest relevant tool set.
  3. Tool call is proposed as structured JSON.
  4. Permission layer approves or blocks it.
  5. Observation is returned to the model as compact evidence.
  6. Final answer cites what was actually observed.

Normal model loading does not execute tools and does not start the app.
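
The six-step pattern above can be sketched as a small permission-gated loop. Everything here is illustrative: the TOOLS registry, the shell_exec tool, the keyword-based router, and the ALLOWED_IN_DEFAULT_MODE set are hypothetical names, not the runtime's actual API.

```python
import json

# Hypothetical tool registry and permission policy, for illustration only.
TOOLS = {
    "filesystem_read": lambda args: f"read {args['path']}",
    "shell_exec": lambda args: f"ran {args['cmd']}",
}
ALLOWED_IN_DEFAULT_MODE = {"filesystem_read"}

def route_tools(prompt, manifest):
    """Steps 1-2: pick the smallest relevant tool set (toy keyword match)."""
    return [name for name in manifest if name.split("_")[0] in prompt.lower()]

def propose_call(tool, arguments):
    """Step 3: express the proposed call as structured JSON."""
    return json.dumps({"tool": tool, "arguments": arguments})

def execute_if_permitted(call_json, allowed):
    """Steps 4-5: approve or block the call, then return a compact observation."""
    call = json.loads(call_json)
    if call["tool"] not in allowed:
        return {"tool": call["tool"], "status": "blocked", "observation": None}
    result = TOOLS[call["tool"]](call["arguments"])
    return {"tool": call["tool"], "status": "ok", "observation": result}

selected = route_tools("Read the filesystem config", list(TOOLS))
approved = execute_if_permitted(
    propose_call("filesystem_read", {"path": "config.json"}), ALLOWED_IN_DEFAULT_MODE
)
blocked = execute_if_permitted(
    propose_call("shell_exec", {"cmd": "ls"}), ALLOWED_IN_DEFAULT_MODE
)
print(selected, approved, blocked)
```

Keeping the proposal as plain JSON means the permission layer can inspect and reject a call before any tool code runs, which is the property step 4 depends on.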

Local App And Streaming

The app is packaged but disabled by default in config.json:

{
  "swarm_app_enabled": false,
  "swarm_app_config": "swarm_app_config.example.json"
}

Launch it explicitly from the model folder:

cd PhillSwarm-4b
python launch_swarm_app.py --install

On Windows, if python is not on PATH:

cd PhillSwarm-4b
py -3 launch_swarm_app.py --install

That one command installs the optional app extras when missing:

  • ddgs for web search
  • playwright plus Chromium for browser observe/verify actions
  • npm dependencies for packaged TypeScript tools when npm is available

From a source checkout, use:

python scripts/launch_swarm_app.py --install --config configs/phill_swarm_app.json

The browser tool is visible by default (browser_headless: false) so users can see what the agent is doing. To save memory or run on a server:

python launch_swarm_app.py --install --headless

Equivalent environment override:

PHILLNET_BROWSER_HEADLESS=1 python launch_swarm_app.py

On Windows PowerShell:

$env:PHILLNET_BROWSER_HEADLESS="1"; py -3 launch_swarm_app.py

Useful setup flags:

  • --no-browser-install skips Playwright browser download.
  • --no-npm-install skips npm tool dependency install.
  • --visible-browser forces headed Playwright windows.
  • --capture-dir state/browser_captures changes browser snapshot storage.
  • --permission-mode default keeps safe read/search/observe actions only.
  • --permission-mode yolo enables stronger browser/tool actions with tracing.

When launched explicitly through the wrapper/app server, it can expose:

  • /api/status
  • /api/chat
  • /api/chat/stream
  • /api/tools/route
  • /api/tool/call

Streaming uses Server-Sent Events for route, profile, critic, tool, goal, token, final, and error events. Tool execution remains permission-gated.
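
A client for /api/chat/stream needs to split the Server-Sent Events stream into named events. The parser below is a generic sketch of the SSE wire format using only the standard library; the sample payloads are hypothetical, since the exact data shape of each event is not documented here.

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of Server-Sent Events lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates one event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
    if data:
        # Lenient choice: flush a trailing event if the stream ended without a blank line.
        yield event, "\n".join(data)

# Hypothetical stream using three of the event names listed above.
sample = [
    "event: route\n", 'data: {"selected": "math"}\n', "\n",
    "event: token\n", "data: 2+2\n", "\n",
    "event: final\n", "data: 2+2 = 4.\n", "\n",
]
events = list(parse_sse(sample))
print(events)
```

A real client would feed this generator the decoded lines of the HTTP response body and dispatch on the event name, for example appending token events to the display and stopping on final or error.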

Ollama-Compatible Full Runtime

PhillSwarm includes an Ollama-compatible full-runtime bridge. This is the recommended Ollama path when you want coherent PhillSwarm behavior.

Recommended for Ollama users: run the packaged bridge and use model name phillswarm-4b:full. This keeps the full HF checkpoint, swarm controller, verified skills, tools, goals, and vision-sidecar runtime available behind Ollama-style APIs.

Why this exists: PhillSwarm is not only a plain GGUF transformer. It uses bundled HF remote code, shared-expert routing, gated V-norm behavior, a Python swarm controller, goals, app tools, and a vision sidecar. Stock Ollama/llama.cpp does not execute those Python runtime systems inside a .gguf file. The bridge keeps Ollama-style compatibility while preserving the full model system instead of flattening it into a weaker preview.

Run The Coherent Ollama Path

Download or clone the snapshot, then run the one-time setup from inside the model folder.

macOS/Linux:

sh setup_phillswarm_ollama.sh

Optional executable form:

chmod +x setup_phillswarm_ollama.sh
./setup_phillswarm_ollama.sh

Windows CMD:

setup_phillswarm_ollama.cmd

Windows PowerShell:

.\setup_phillswarm_ollama.ps1

The setup scripts are cross-OS and do the same job:

  • set OLLAMA_HOST=http://127.0.0.1:11435
  • start launch_ollama_bridge.py --model . --port 11435 if the bridge is not already running
  • wait for /api/tags to respond
  • run ollama list so the user can confirm phillswarm-4b:full is visible
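
The "wait for /api/tags to respond" step can be reproduced from Python when scripting around the bridge. This is a sketch, not the logic of the packaged setup scripts: the function names, the 120-second timeout, and the 1-second polling interval are assumptions.

```python
import time
import urllib.error
import urllib.request

def tags_endpoint_ok(base_url="http://127.0.0.1:11435"):
    """Return True if GET /api/tags answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(probe=tags_endpoint_ok, timeout=120.0, interval=1.0, sleep=time.sleep):
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Call wait_until_ready() after starting launch_ollama_bridge.py and before issuing the first chat request. The probe and sleep parameters are injectable so the loop can be tested without a running bridge.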

Manual fallback for any OS:

python launch_ollama_bridge.py --model . --port 11435

Then set the current terminal:

export OLLAMA_HOST=http://127.0.0.1:11435

Windows PowerShell manual fallback:

$env:OLLAMA_HOST="http://127.0.0.1:11435"

Then use this model name from Ollama-compatible clients:

phillswarm-4b:full

If using an Ollama-style HTTP client, call the bridge directly:

GET  /api/tags
POST /api/show
POST /api/generate
POST /api/chat

Example:

curl http://127.0.0.1:11435/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model":"phillswarm-4b:full","stream":false,"messages":[{"role":"user","content":"What is 2+2? Answer in one sentence."}]}'

Observed smoke output:

{
  "message": {"role": "assistant", "content": "2+2 = 4."},
  "phill": {
    "full_runtime": true,
    "indicators": {"bridge": "ollama-compatible-fast-verified-skill"}
  }
}
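
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl example; build_chat_request and chat are hypothetical helper names, and the 300-second timeout is an assumption to leave room for the first lazy model load.

```python
import json
import urllib.request

BRIDGE_URL = "http://127.0.0.1:11435"

def build_chat_request(prompt, model="phillswarm-4b:full", base_url=BRIDGE_URL):
    """Build the same non-streaming POST /api/chat request as the curl example."""
    payload = {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the request and return the assistant text (requires a running bridge)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["message"]["content"]

request = build_chat_request("What is 2+2? Answer in one sentence.")
print(request.get_method(), request.full_url)
```

With the bridge running, chat("What is 2+2? Answer in one sentence.") returns the content field of the message object shown in the smoke output.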

Runtime Modes

| Path | Status | Best Use |
| --- | --- | --- |
| phillswarm-4b:full bridge | Recommended and coherent | Ollama-compatible clients that should use the real PhillSwarm runtime. |
| HF trust_remote_code=True | Recommended and native | Python/Transformers users who want direct model and controller access. |
| GGUF preview | Experimental loadability preview | Testing tokenizer/shape compatibility in stock Ollama, not full-quality intelligence. |

What The Bridge Preserves

  • HF trust_remote_code=True model loading.
  • PhillSwarmController swarm/regular routing.
  • App tool routing, permission checks, goals, indicators, and web/search/browser integration path.
  • Vision sidecar availability from the HF snapshot.
  • Ollama-compatible JSON and NDJSON streaming response shapes.
  • Fast verified-skill routing before heavy model generation when a deterministic answer is already known.
  • Lazy loading, so /api/tags and /api/show respond quickly while the BF16 model loads only for real generation.

About the GGUF preview: the raw GGUF can be loadable in stock Ollama, but it is not the full intelligence path because stock Ollama cannot run the custom swarm runtime. For coherent outputs, use the bridge or the native HF runtime.

MCP / IDE Bridge

The app exposes a local MCP-style bridge for external agent hosts and IDEs:

JSON-RPC MCP endpoint: http://127.0.0.1:7860/mcp
Manifest:              http://127.0.0.1:7860/api/mcp/manifest
Tools:                 http://127.0.0.1:7860/api/mcp/tools
Route prompt:          http://127.0.0.1:7860/api/mcp/route
Call tool:             http://127.0.0.1:7860/api/mcp/call
Chat through app:      http://127.0.0.1:7860/api/mcp/chat
Well-known manifest:   http://127.0.0.1:7860/.well-known/phill-swarm-mcp.json

Use it with Codex, Claude Code, Cursor, Antigravity-style hosts, or any local MCP/HTTP client that can call a JSON-RPC tool server. The endpoint supports:

  • initialize
  • tools/list
  • tools/call
  • phill/route
  • phill/chat

The bridge is private by default because the app binds to 127.0.0.1. To expose it to another machine on your LAN, launch explicitly:

python launch_swarm_app.py --host 0.0.0.0

Then open /api/mcp/status to see the LAN URL. Only use LAN mode on a trusted network. For shared machines or public networks, set mcp_auth_token in swarm_app_config.example.json and send Authorization: Bearer <token> from the client.
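
For a quick check of the JSON-RPC endpoint without an MCP host, a request can be built by hand. This sketch assumes the /mcp endpoint accepts a plain HTTP POST with a JSON-RPC 2.0 body; jsonrpc_request and call are hypothetical helpers, and "example-token" is a placeholder for whatever mcp_auth_token you configured.

```python
import itertools
import json
import urllib.request

MCP_URL = "http://127.0.0.1:7860/mcp"
_ids = itertools.count(1)

def jsonrpc_request(method, params=None, token=None, url=MCP_URL):
    """Build a JSON-RPC 2.0 POST request, optionally with a bearer token."""
    body = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        body["params"] = params
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )

def call(method, params=None, token=None):
    """Send the request and return the decoded response (requires the app to be running)."""
    with urllib.request.urlopen(jsonrpc_request(method, params, token), timeout=30) as resp:
        return json.load(resp)

req = jsonrpc_request("tools/list", token="example-token")
print(req.full_url, json.loads(req.data))
```

tools/list is used here because it takes no parameters; the parameter shapes for tools/call, phill/route, and phill/chat are not specified in this README, so check the manifest endpoint before calling them.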

For ChatGPT-style use, the intended pattern is different from Codex/Cursor: ChatGPT can remain the language model while calling Phill's app routes for scaffolding, routing, goals, tools, browser observation, and verification. That lets the app act as a local swarm/tool runtime without replacing the external model.

Vision Sidecar

This package includes runtime metadata for an optional Qwen3.5-style vision sidecar:

  • vision_sidecar_enabled: true
  • vision_sidecar_path: "vision_sidecar"
  • vision_snapshot_policy: "retain_latest_only"

Vision is runtime sidecar behavior, not ordinary text-generation behavior. The text embedding table is not resized for vision marker tokens; pixel tensors and browser snapshots route through external processor metadata/sidecar paths.

Learnable Scaffolding

Learnable scaffolding uses scaffold_blueprint.json.

It stores:

  • signal weights
  • node weights
  • compact learned tidbits
  • examples seen
  • update timestamp

This is runtime memory that adds no model weights. It improves scaffold node selection and confidence without mutating model weights during normal generation.

{
  "learnable_scaffolding": true,
  "scaffold_blueprint_path": "scaffold_blueprint.json",
  "inject_scaffold_into_prompt": false
}
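
As a rough picture of what "learning from successful traces" in a JSON file could look like, here is a sketch of a weight store that is updated outside the model. The field names (node_weights, examples_seen, updated_at), the 0.5 starting weight, and the update rule are all hypothetical; the real schema of scaffold_blueprint.json may differ.

```python
import json
import os
import tempfile
import time

def load_blueprint(path):
    """Load a blueprint file, or start an empty one if it does not exist."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"node_weights": {}, "examples_seen": 0, "updated_at": None}

def record_success(blueprint, nodes, rate=0.1):
    """Nudge the weight of each node used in a successful trace toward 1.0."""
    for node in nodes:
        old = blueprint["node_weights"].get(node, 0.5)
        blueprint["node_weights"][node] = old + rate * (1.0 - old)
    blueprint["examples_seen"] += 1
    blueprint["updated_at"] = time.time()
    return blueprint

def save_blueprint(blueprint, path):
    """Persist the blueprint as human-readable JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(blueprint, f, indent=2)

# Demo file in a temporary directory so the packaged blueprint is never touched.
demo_path = os.path.join(tempfile.mkdtemp(), "scaffold_blueprint_demo.json")
blueprint = record_success(load_blueprint(demo_path), ["plan", "verify"])
save_blueprint(blueprint, demo_path)
reloaded = load_blueprint(demo_path)
print(reloaded["node_weights"], reloaded["examples_seen"])
```

Because the state lives in a small JSON file, it can be inspected, versioned, or reset without touching model.safetensors.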

Training Losses

The model includes optional auxiliary losses for train-time routing/scaffold behavior:

{
  "router_aux_loss_coef": 0.01,
  "router_z_loss_coef": 0.001,
  "router_entropy_loss_coef": 0.0001,
  "router_confidence_loss_coef": 0.0,
  "thinking_consistency_loss_coef": 0.0001,
  "scaffold_alignment_loss_coef": 0.0001
}

These are active during training when labels are provided. They are not extra inference-time model copies.
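
For readers unfamiliar with router losses, the sketch below computes two commonly used ones in pure Python: a Switch-Transformer-style load-balancing loss and a router z-loss. The README does not confirm that the model uses exactly these formulations, so treat the formulas and function names as assumptions; only the two coefficient defaults are taken from the config above.

```python
import math

def logsumexp(xs):
    """Numerically stable log of the sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    lse = logsumexp(xs)
    return [math.exp(x - lse) for x in xs]

def router_z_loss(logits_per_token):
    """Mean squared log-partition of the router logits; penalizes large logits."""
    return sum(logsumexp(t) ** 2 for t in logits_per_token) / len(logits_per_token)

def load_balance_loss(logits_per_token, top_k=2):
    """num_experts * sum_i(f_i * P_i), with f_i the share of top-k assignments
    and P_i the mean router probability for expert i."""
    n_tokens, n_experts = len(logits_per_token), len(logits_per_token[0])
    counts, mean_prob = [0] * n_experts, [0.0] * n_experts
    for logits in logits_per_token:
        probs = softmax(logits)
        for i in sorted(range(n_experts), key=lambda j: probs[j], reverse=True)[:top_k]:
            counts[i] += 1
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    return n_experts * sum((c / (n_tokens * top_k)) * p for c, p in zip(counts, mean_prob))

def total_aux_loss(logits_per_token, aux_coef=0.01, z_coef=0.001):
    """Weighted sum using router_aux_loss_coef and router_z_loss_coef."""
    return aux_coef * load_balance_loss(logits_per_token) + z_coef * router_z_loss(logits_per_token)

# Four tokens, four experts: token t prefers experts t and t+1, so load is balanced.
balanced = [[2.0 if e in (t, (t + 1) % 4) else 0.0 for e in range(4)] for t in range(4)]
# Every token prefers experts 0 and 1, so load is skewed.
skewed = [[2.0, 2.0, 0.0, 0.0] for _ in range(4)]
print(load_balance_loss(balanced), load_balance_loss(skewed))
```

The load-balancing term is smallest when tokens are spread evenly across experts and grows when a few experts absorb most of the traffic, which is why it is added to the language-modeling loss with a small coefficient.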

Guarded Online Learning

Online learning support exists, but is disabled by default.

{
  "online_learning_enabled": false,
  "online_learning_lr": 1e-7,
  "online_learning_max_grad_norm": 0.05,
  "online_learning_train_top_layers": 2,
  "online_learning_min_trust": 0.25,
  "online_learning_max_updates": 32
}

When explicitly enabled and called through learn_from_correction(...), it:

  • adapts only a tiny top-layer/head surface
  • deduplicates shared tensors
  • clips gradients
  • tracks temporal trust
  • probes counterfactual loss
  • reverts failed updates

This is experimental and should be used only for approved corrections or controlled local adaptation.
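
Two of the guards listed above, gradient clipping and reverting failed updates, can be sketched on a toy parameter list. This is not learn_from_correction(...): guarded_step, the plain SGD update, and the accept-if-loss-drops rule are hypothetical simplifications, and the demo uses a larger learning rate than the configured 1e-7 so the effect is visible.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale the gradient so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    return [g * max_norm / norm for g in grads]

def guarded_step(params, grads, loss_fn, lr=1e-7, max_grad_norm=0.05):
    """Apply one clipped SGD step and revert it if the probed loss does not improve."""
    before = loss_fn(params)
    clipped = clip_by_norm(grads, max_grad_norm)
    candidate = [p - lr * g for p, g in zip(params, clipped)]
    if loss_fn(candidate) < before:
        return candidate, True
    return list(params), False  # revert: keep the original parameters

# Toy quadratic loss with its minimum at the origin.
loss = lambda ps: sum(p * p for p in ps)
params = [1.0, -2.0]
good_grads = [2.0, -4.0]   # true gradient of the toy loss
bad_grads = [-2.0, 4.0]    # points uphill, so the step should be reverted
accepted = guarded_step(params, good_grads, loss, lr=0.1)
rejected = guarded_step(params, bad_grads, loss, lr=0.1)
print(accepted, rejected)
```

The key design point is that a bad correction costs nothing: the update is evaluated first and the original parameters are returned unchanged when the probe fails.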

Final Polish Pass

Final polish is enabled in safe mode:

{
  "enable_final_polish": true,
  "final_polish_mode": "safe"
}

It receives only the latest user prompt and the verified final answer. It cannot see raw worker notes or rejected candidates. If the polish drifts, changes numbers, becomes too short/long, or loses overlap with the verified answer, the runtime keeps the original verified answer.
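
The rejection rules in that paragraph can be expressed as a small validator. This sketch is an assumption about how such a check might look, not the runtime's implementation: the function names, the length-ratio bounds, and the 0.5 overlap threshold are hypothetical.

```python
import re

def numbers_in(text):
    """Return every integer or decimal number in the text, sorted for comparison."""
    return sorted(re.findall(r"\d+(?:\.\d+)?", text))

def token_overlap(a, b):
    """Jaccard overlap of the lowercase word sets of two strings."""
    sa = set(re.findall(r"\w+", a.lower()))
    sb = set(re.findall(r"\w+", b.lower()))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def accept_polish(verified, polished, min_ratio=0.5, max_ratio=2.0, min_overlap=0.5):
    """Return the polished text only if it passes every check; otherwise keep the verified answer."""
    ratio = len(polished) / max(len(verified), 1)
    ok = (
        numbers_in(polished) == numbers_in(verified)
        and min_ratio <= ratio <= max_ratio
        and token_overlap(verified, polished) >= min_overlap
    )
    return polished if ok else verified

verified = "The answer is 4 because 2 + 2 = 4."
good_polish = "The answer is 4, because 2 + 2 = 4."
bad_polish = "The answer is 5 because 2 + 2 = 5."
print(accept_polish(verified, good_polish))
print(accept_polish(verified, bad_polish))
```

The first call returns the polished sentence because only punctuation changed; the second returns the original verified answer because the numbers no longer match.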

Validation Summary

From the included reports:

  • HF config/tokenizer/model load passed.
  • CUDA forward passed.
  • Controller loaded with 40 routed layers.
  • No-profile swarm smoke returned a verified black-hole answer through science_explanation.
  • Dynamic profile test showed noisy raw workers are rejected and verified skills preserve the answer.
  • Final polish test preserved verified answers when the polish attempt failed validation.
  • Learnable scaffold test saved/reloaded blueprint state.
  • Auxiliary-loss tiny-model test passed forward/loss/backward.

The package includes detailed reports:

  • QWEN35_4B_FINAL_REPORT.md
  • QWEN35_4B_COHERENCE_REPORT.md
  • DYNAMIC_SWARM_ORCHESTRATION_REPORT.md
  • AGENTIC_SESSION_RUNTIME_REPORT.md
  • AGI_SKILL_ROUTE_EXPANSION_REPORT.md
  • LEARNABLE_SCAFFOLDING_REPORT.md
  • LEARNABLE_LOSSES_AND_ONLINE_LEARNING_REPORT.md
  • FINAL_POLISH_PASS_REPORT.md

Known Limits

  • Raw direct generation can still be weaker than verified runtime answers.
  • Profile generation is not enabled by default because raw worker text can be noisy.
  • Online learning is disabled by default and should not be treated as automatic safe self-training.
  • Vision sidecar is runtime behavior; normal text generation does not become a full browser-vision agent by itself.
  • This README describes implemented local runtime features, not independent benchmark superiority over frontier commercial systems.

File Map

| File | Purpose |
| --- | --- |
| config.json | HF model config and passive runtime metadata |
| model.safetensors | model weights |
| configuration_swarm_moe.py | HF config remote-code file |
| modeling_swarm_moe.py | HF model remote-code file |
| tokenizer.json, tokenizer_config.json | tokenizer assets |
| swarm_runtime_config.json | wrapper/runtime config |
| scaffold_blueprint.json | learnable scaffold runtime memory |
| swarm_moe_model/ | optional local runtime package |
| vision_sidecar.py, vision_sidecar/ | optional runtime vision sidecar |
| swarm_app_config.example.json | app config example |

Recommended Use Cases

  • Local research into sparse MoE routing and shared-weight agent orchestration.
  • Tool-aware chat wrappers where tool execution is explicit and permissioned.
  • IDE/CLI assistants that need compact tool manifests and traceable routes.
  • Agentic task runners that need JSON goal state, session memory, and recoverable progress.
  • Experiments with scaffold learning and safe online correction workflows.
  • Educational exploration of MoE, routing losses, and wrapper-based agent design.

Minimal Requirements

  • Python environment with PyTorch and Transformers.
  • trust_remote_code=True.
  • bf16-capable CUDA is recommended for the 4B package.
  • CPU loading may be possible but will be slow.

Attribution And Development Notes

Phill Swarm-MoE is a custom project by Phillip A. Holland / Ayjays132. This package contains a grown hybrid checkpoint and runtime code intended for local experimentation, HF-style loading, and publishable inspection. It is built to be transparent about what is model behavior, what is runtime orchestration, and what is experimental.

Core principle: keep the model loadable as a normal HF checkpoint, then let users opt into the swarm runtime when they want goals, tools, scaffolds, streaming, sessions, and traceable orchestration.