Instructions for using ayjays132/PhillSwarm-4b with libraries, inference servers, and local apps.

Libraries

How to use ayjays132/PhillSwarm-4b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ayjays132/PhillSwarm-4b", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ayjays132/PhillSwarm-4b", trust_remote_code=True, dtype="auto")

llama-cpp-python

How to use ayjays132/PhillSwarm-4b with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="ayjays132/PhillSwarm-4b",
	filename="phillswarm-4b-ollama-f16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)


llama.cpp

How to use ayjays132/PhillSwarm-4b with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ayjays132/PhillSwarm-4b:F16
# Run inference directly in the terminal:
llama-cli -hf ayjays132/PhillSwarm-4b:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ayjays132/PhillSwarm-4b:F16
# Run inference directly in the terminal:
llama-cli -hf ayjays132/PhillSwarm-4b:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ayjays132/PhillSwarm-4b:F16
# Run inference directly in the terminal:
./llama-cli -hf ayjays132/PhillSwarm-4b:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ayjays132/PhillSwarm-4b:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ayjays132/PhillSwarm-4b:F16

Use Docker

docker model run hf.co/ayjays132/PhillSwarm-4b:F16


vLLM

How to use ayjays132/PhillSwarm-4b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ayjays132/PhillSwarm-4b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ayjays132/PhillSwarm-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/ayjays132/PhillSwarm-4b:F16

SGLang

How to use ayjays132/PhillSwarm-4b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ayjays132/PhillSwarm-4b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ayjays132/PhillSwarm-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ayjays132/PhillSwarm-4b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ayjays132/PhillSwarm-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use ayjays132/PhillSwarm-4b with Ollama:
```
ollama run hf.co/ayjays132/PhillSwarm-4b:F16
```

Unsloth Studio

How to use ayjays132/PhillSwarm-4b with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ayjays132/PhillSwarm-4b to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ayjays132/PhillSwarm-4b to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for ayjays132/PhillSwarm-4b to start chatting

Pi

How to use ayjays132/PhillSwarm-4b with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ayjays132/PhillSwarm-4b:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ayjays132/PhillSwarm-4b:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use ayjays132/PhillSwarm-4b with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ayjays132/PhillSwarm-4b:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ayjays132/PhillSwarm-4b:F16

Run Hermes

hermes

Docker Model Runner
How to use ayjays132/PhillSwarm-4b with Docker Model Runner:
```
docker model run hf.co/ayjays132/PhillSwarm-4b:F16
```

Lemonade

How to use ayjays132/PhillSwarm-4b with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull ayjays132/PhillSwarm-4b:F16

Run and chat with the model

lemonade run user.PhillSwarm-4b-F16

List all available models

lemonade list

Phill Swarm-MoE 4B Qwen3.5 Hybrid Final

A Hugging Face compatible causal language model with an optional shared-weight swarm runtime for routing, skills, goals, scaffolding, tool planning, agentic sessions, and local app integrations.

ayjays132/PhillSwarm-4b | 4.145B parameters | 40 routed layers | bf16 | Qwen tokenizer | HF remote code | swarm runtime optional

Start here for coherent Ollama-style use: if Hugging Face shows a llama_cpp snippet for phillswarm-4b-ollama-f16.gguf, treat that as a raw GGUF preview only. For the full PhillSwarm system, run launch_ollama_bridge.py and use phillswarm-4b:full. The bridge preserves the custom HF model code, swarm controller, verified skills, goals, tools, and vision-sidecar path behind Ollama-compatible APIs.

Recommended Use Paths

| User Goal | Recommended Path | Why |
| --- | --- | --- |
| Best local Python/HF quality | `AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)` | Loads the native custom Swarm-MoE architecture. |
| Coherent Ollama-compatible use | `python launch_ollama_bridge.py --model . --port 11435` then use `phillswarm-4b:full` | Keeps the full runtime while exposing Ollama-style `/api/chat` and `/api/generate`. |
| Quick raw GGUF experiment | `llama_cpp` / stock Ollama with `phillswarm-4b-ollama-f16.gguf` | Loadability preview only; not the full intelligence path. |

Quick Ollama-compatible full-runtime setup:

huggingface-cli download ayjays132/PhillSwarm-4b --local-dir PhillSwarm-4b
cd PhillSwarm-4b
sh setup_phillswarm_ollama.sh

If you prefer making it executable first:

chmod +x setup_phillswarm_ollama.sh
./setup_phillswarm_ollama.sh

Windows PowerShell:

huggingface-cli download ayjays132/PhillSwarm-4b --local-dir PhillSwarm-4b
cd PhillSwarm-4b
.\setup_phillswarm_ollama.ps1

Windows CMD:

huggingface-cli download ayjays132/PhillSwarm-4b --local-dir PhillSwarm-4b
cd PhillSwarm-4b
setup_phillswarm_ollama.cmd

The setup script:

sets user OLLAMA_HOST=http://127.0.0.1:11435
starts the PhillSwarm full-runtime bridge if it is not already running
waits until the bridge is ready
runs ollama list to confirm the direct Ollama CLI sees the full model
on macOS/Linux, adds OLLAMA_HOST to .zshrc or .bashrc when it can detect the active shell

After setup, open a new terminal and use normal Ollama commands:

ollama list
ollama run phillswarm-4b:full

Model name:

phillswarm-4b:full

What This Is

Phill Swarm-MoE is a sparse Mixture-of-Experts causal language model packaged as a normal Hugging Face checkpoint. It can be loaded as a standard AutoModelForCausalLM model, or used through the optional PhillSwarmController runtime for agentic features.

Public model repo:

ayjays132/PhillSwarm-4b
https://huggingface.co/ayjays132/PhillSwarm-4b

The checkpoint in this folder is the grown 4B final package:

Parameters: 4,144,993,832.
Architecture: Swarm-MoE decoder-only causal LM.
Layers: 40 unique routed layers.
Hidden size: 1024.
Experts: 16 routed experts with top-2 routing plus shared expert path.
Attention: grouped-query attention with Q/K RMSNorm, optional V norm gate, RoPE, KV cache.
Tokenizer: Qwen tokenizer copied into the final package.
Precision target: bf16.
Context configured: 4096 positions.
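
To make "16 routed experts with top-2 routing plus shared expert path" concrete, here is a minimal pure-Python sketch of top-2 routing for a single token. It is an illustration under stated assumptions, not the code in `modeling_swarm_moe.py`: the renormalization of the two selected weights and the additive shared-expert output are common conventions assumed here, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # Assumption: the two selected weights are renormalized to sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_output(router_logits, expert_outputs, shared_output):
    """Mix the two routed expert outputs, then add the always-on shared expert."""
    dim = len(shared_output)
    mixed = [0.0] * dim
    for idx, weight in top2_route(router_logits):
        for d in range(dim):
            mixed[d] += weight * expert_outputs[idx][d]
    # Assumption: the shared expert path is simply added to the routed mixture.
    return [mixed[d] + shared_output[d] for d in range(dim)]

# One token, 16 experts: experts 3 and 7 have the highest router scores.
logits = [0.0] * 16
logits[3] = 2.0
logits[7] = 1.0
selection = top2_route(logits)
```

Only two of the sixteen expert outputs contribute per token, which is what keeps the compute per token well below the total parameter count.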

Plain truth: the normal model API remains standard. Swarm mode, goals, app streaming, tools, scaffolding, and online learning are wrapper/runtime features. They do not require custom `forward(goal=...)` or automatic app startup.

What Makes It Different

Shared-Weight Swarm

Planner, solver, verifier, domain, tool, and editor roles can run over one loaded model instead of separate model copies.

Verified Skills

Math, tool routing, web/search planning, browser mode, IDE/CLI integration, health/legal/finance/security domain policy, and runtime diagnostics can anchor answers.

Agentic Goals

Runtime-only goal state tracks objective, constraints, allowed tools, notes, artifacts, events, and completion status in portable JSON.

Learnable Scaffolding

Scaffold routing learns from successful traces through `scaffold_blueprint.json` without mutating model weights during normal generation.

Final Polish Pass

A safe post-processor can improve wording using only the user prompt and verified final answer. Bad polish is rejected.

IDE And CLI Friendly

Runtime metadata advertises JSON-schema tools, smolagents, MCP-compatible tools, and OpenAI-style tool calls.

Shared-Weight Swarm MoE

One loaded checkpoint can drive routed roles, verified skills, and final synthesis without spawning separate model copies.

Goal-Aware Runtime

Objectives, constraints, progress, tools, artifacts, and verification events stay in portable runtime JSON.

Tool-Native Local Agent

Tool routing, web search, browser observation, coding, workspace search, and safety gates are exposed through the optional app/runtime.

Quick Start: Load As A Normal HF Model

Install current transformers, then load with trust_remote_code=True because the package uses custom Swarm-MoE model code:

pip install -U transformers accelerate safetensors torch

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "ayjays132/PhillSwarm-4b"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()

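messages = [{"role": "user", "content": "Explain a black hole in simple terms."}]
# return_dict=True yields input_ids and attention_mask regardless of the installed version's default.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))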

For a local clone or downloaded snapshot, replace model_id with the folder path, for example "." from inside the model folder.

Loading note: use `trust_remote_code=True` for `AutoConfig`, `AutoTokenizer`, and `AutoModelForCausalLM`. Without it, Transformers will not know how to construct the custom `swarm_moe` architecture. The optional app/runtime also needs the snapshot files locally available so Python can import the bundled `swarm_moe_model` package.

Quick Start: Use Swarm Runtime

The optional swarm runtime is packaged with the model files. For the cleanest setup, download the snapshot locally so Python can import the included swarm_moe_model runtime package:

pip install -U huggingface_hub transformers accelerate safetensors torch
huggingface-cli download ayjays132/PhillSwarm-4b --local-dir PhillSwarm-4b
cd PhillSwarm-4b

import sys
from pathlib import Path

sys.path.insert(0, str(Path(".").resolve()))

from swarm_moe_model.swarm_mode import PhillSwarmController

controller = PhillSwarmController.from_pretrained_or_config(".")

result = controller.ask(
    "Explain swarm mode compared with regular mode.",
    mode="swarm",
)

print(result.answer)
print(result.indicators["route_decision"]["final_selected"])

If you are already importing the packaged runtime from another location, you can point it at the public repo ID:

controller = PhillSwarmController.from_pretrained_or_config("ayjays132/PhillSwarm-4b")

Regular Mode Vs Swarm Mode

| Mode | What It Does | Best For |
| --- | --- | --- |
| Regular HF model | Plain causal LM generation through `AutoModelForCausalLM` | Standard inference, benchmark compatibility, simple integration |
| Regular runtime mode | Single model pass with optional verified skill routing | Fast local chat, CLI usage, stable known tasks |
| Swarm mode | Runtime orchestration around the same loaded model | Tool planning, goals, research-style tasks, browser/app workflows, route traces |
| Forced profile debate | Runs dynamic workers and verification trace | Debugging orchestration, comparing candidates, showing solver/critic/final-editor behavior |

The published defaults are conservative: verified skills can complete an answer without forcing noisy profile generation. Profile workers can still be activated with return_candidates=True, explicit debate prompts, or enable_profile_generation: true.

Runtime Architecture

flowchart LR
    U["User Prompt"] --> R["Intent + Skill Router"]
    R --> S["Verified Skills"]
    R --> G["Goal State"]
    R --> C["Learnable Scaffold Blueprint"]
    S --> A["Answer Composer"]
    G --> A
    C --> W["Dynamic Shared-Weight Workers"]
    W --> N["Bounded Note Pool"]
    N --> A
    A --> P["Safe Final Polish"]
    P --> O["Final Answer"]

If Mermaid does not render in your viewer, the flow is: prompt -> router -> skills/goals/scaffold/workers -> compact evidence -> answer composer -> safe polish -> final answer.

Goals

Goals are runtime-only and stored as JSON. They do not alter the model API.

import sys
from pathlib import Path

sys.path.insert(0, str(Path(".").resolve()))

from swarm_moe_model.swarm_mode import PhillSwarmController, SwarmGoal

controller = PhillSwarmController.from_pretrained_or_config(".")
goal = SwarmGoal(
    objective="Draft a small CLI plan for using this model with tool calls.",
    constraints=["Keep it cross-platform", "Do not assume a hardcoded path"],
    success_criteria=["Shows install", "Shows run", "Mentions permissions"],
    allowed_tools=["filesystem_read", "web_search"],
)

run = controller.run_goal(goal)
print(run.final_answer)
run.state.to_json("goal_state.json")

Agentic Sessions

Agentic sessions keep multi-turn work coherent without appending every old token forever.

Latest user prompt remains the authority.
Recent turns stay in a rolling window.
Older turns flush into compact summaries.
Tool results, route decisions, and artifacts stay as metadata.
KV cache is used for the active generation window, not falsely persisted across independent turns.

session = controller.create_session("workspace-task")
print(session.ask("Remember that we want a portable CLI setup.", mode="swarm").answer)
print(session.ask("Now give the final install checklist.", mode="swarm").answer)
session.state.to_json("session_state.json")
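
The rolling-window behavior listed above can be sketched in a few lines. This is a standalone illustration, not the packaged session class: `RollingSession`, its method names, and the truncation used as a stand-in for summarization are all hypothetical.

```python
from collections import deque

class RollingSession:
    """Keeps recent turns verbatim and flushes older turns into short summaries."""

    def __init__(self, window=4):
        self.window = window
        self.recent = deque()   # verbatim recent turns
        self.summaries = []     # compact summaries of flushed turns
        self.metadata = []      # tool results, route decisions, artifacts

    def add_turn(self, role, text, **meta):
        self.recent.append((role, text))
        if meta:
            self.metadata.append(meta)
        while len(self.recent) > self.window:
            old_role, old_text = self.recent.popleft()
            # Placeholder summary: plain truncation stands in for a real summarizer.
            self.summaries.append(f"{old_role}: {old_text[:40]}")

    def build_context(self):
        """Summaries first, then the rolling window; the latest turn comes last."""
        lines = [f"[summary] {s}" for s in self.summaries]
        lines += [f"{role}: {text}" for role, text in self.recent]
        return "\n".join(lines)

demo = RollingSession(window=2)
demo.add_turn("user", "first")
demo.add_turn("assistant", "noted", route="general_chat")
demo.add_turn("user", "third")
```

Because the context is rebuilt from summaries plus a bounded window, its size stays roughly constant no matter how long the session runs.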

Skills And Domain Routing

The runtime includes compact verified skills and route anchors. They are designed to reduce prompt overload: the model sees only the selected route and a few verified evidence anchors, not the entire tool registry.

Current route families include:

general chat and identity
math and arithmetic
coding and repo workflow
web search planning
browser/vision/tool operation planning
research and citation discipline
dynamic routing diagnostics
training dataset and model finalization guidance
runtime debugging and context-overload repair
science explanation anchors
swarm-mode architecture explanation
AGI/runtime architecture policy
life-domain policy: health, legal, finance, education, creative, productivity, data, security
IDE/CLI integration: Cursor, VS Code, terminals, smolagents, JSON-schema tools, MCP-compatible tools, OpenAI-style tool calls

High-stakes use: health, legal, finance, and security routes are policy and safety anchors, not substitutes for qualified professional advice or permissioned security review.

Tool And App Integration

swarm_runtime_config.json advertises:

{
  "tool_protocols": ["json_schema", "smolagents", "mcp_compatible", "openai_tool_calls"],
  "external_host_compatible": true,
  "compact_tool_manifest": true
}

The intended pattern is:

Host app or IDE supplies a compact tool manifest.
Runtime routes the prompt to the smallest relevant tool set.
Tool call is proposed as structured JSON.
Permission layer approves or blocks it.
Observation is returned to the model as compact evidence.
Final answer cites what was actually observed.

Normal model loading does not execute tools and does not start the app.
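
The propose, gate, observe pattern above might look like the following sketch. Everything here is hypothetical scaffolding for illustration: the function names, the `{"tool": ..., "arguments": ...}` call shape, and the 200-character truncation are assumptions, not the runtime's actual schema. The tool names are borrowed from the `allowed_tools` example in the Goals section.

```python
import json

def propose_tool_call(tool, arguments):
    """Represent a proposed tool call as structured JSON (assumed shape)."""
    return json.dumps({"tool": tool, "arguments": arguments})

def permission_gate(call_json, allowed_tools):
    """Approve the call only if the tool is in the allow-list."""
    return json.loads(call_json)["tool"] in allowed_tools

def run_tool_step(call_json, allowed_tools, registry):
    """Execute an approved call and return a compact observation."""
    if not permission_gate(call_json, allowed_tools):
        return {"status": "blocked", "observation": None}
    call = json.loads(call_json)
    result = registry[call["tool"]](**call["arguments"])
    # Keep the evidence compact before handing it back to the model.
    return {"status": "ok", "observation": str(result)[:200]}

# Hypothetical registry with a stub search tool.
registry = {"web_search": lambda query: f"results for {query}"}
allowed = {"filesystem_read", "web_search"}

ok_step = run_tool_step(propose_tool_call("web_search", {"query": "MoE routing"}), allowed, registry)
blocked_step = run_tool_step(propose_tool_call("shell_exec", {"cmd": "ls"}), allowed, registry)
```

The key design point is that the gate runs before any tool code does, so a blocked call never reaches the registry.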

Local App And Streaming

The app is packaged but disabled by default in config.json:

{
  "swarm_app_enabled": false,
  "swarm_app_config": "swarm_app_config.example.json"
}

Launch it explicitly from the model folder:

cd PhillSwarm-4b
python launch_swarm_app.py --install

On Windows, if python is not on PATH:

cd PhillSwarm-4b
py -3 launch_swarm_app.py --install

That one command installs the optional app extras when missing:

ddgs for web search
playwright plus Chromium for browser observe/verify actions
npm dependencies for packaged TypeScript tools when npm is available

From a source checkout, use:

python scripts/launch_swarm_app.py --install --config configs/phill_swarm_app.json

The browser tool is visible by default (browser_headless: false) so users can see what the agent is doing. To save memory or run on a server:

python launch_swarm_app.py --install --headless

Equivalent environment override:

PHILLNET_BROWSER_HEADLESS=1 python launch_swarm_app.py

On Windows PowerShell:

$env:PHILLNET_BROWSER_HEADLESS="1"; py -3 launch_swarm_app.py

Useful setup flags:

--no-browser-install skips Playwright browser download.
--no-npm-install skips npm tool dependency install.
--visible-browser forces headed Playwright windows.
--capture-dir state/browser_captures changes browser snapshot storage.
--permission-mode default keeps safe read/search/observe actions only.
--permission-mode yolo enables stronger browser/tool actions with tracing.

When launched explicitly through the wrapper/app server, it can expose:

/api/status
/api/chat
/api/chat/stream
/api/tools/route
/api/tool/call

Streaming uses Server-Sent Events for route, profile, critic, tool, goal, token, final, and error events. Tool execution remains permission-gated.
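
A client consuming `/api/chat/stream` needs to split the Server-Sent Events stream into events. The sketch below parses the generic SSE wire format (`event:` and `data:` fields, blank line as terminator). The event names come from the list above, but the payload contents in the sample are assumptions; check the actual stream before relying on them.

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # SSE comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, a single leading space after the colon is dropped.
            data.append(value[1:] if value.startswith(" ") else value)
    if data:
        yield event, "\n".join(data)

# Hypothetical stream excerpt; real payload shapes may differ.
sample = [
    "event: route\n", 'data: {"selected": "math"}\n', "\n",
    "event: token\n", "data: 2+2\n", "\n",
    "event: token\n", "data:  = 4.\n", "\n",
    "event: final\n", "data: 2+2 = 4.\n", "\n",
]
events = list(parse_sse(sample))
streamed_text = "".join(d for e, d in events if e == "token")
```

In practice the `lines` iterable would be the decoded response body of a streaming HTTP request.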

Ollama-Compatible Full Runtime

PhillSwarm includes an Ollama-compatible full-runtime bridge. This is the recommended Ollama path when you want coherent PhillSwarm behavior.

Recommended for Ollama users: run the packaged bridge and use model name phillswarm-4b:full. This keeps the full HF checkpoint, swarm controller, verified skills, tools, goals, and vision-sidecar runtime available behind Ollama-style APIs.

Why this exists: PhillSwarm is not only a plain GGUF transformer. It uses bundled HF remote code, shared-expert routing, gated V-norm behavior, a Python swarm controller, goals, app tools, and a vision sidecar. Stock Ollama/llama.cpp does not execute those Python runtime systems inside a .gguf file. The bridge keeps Ollama-style compatibility while preserving the full model system instead of flattening it into a weaker preview.

Run The Coherent Ollama Path

Download or clone the snapshot, then run the one-time setup from inside the model folder.

macOS/Linux:

sh setup_phillswarm_ollama.sh

Optional executable form:

chmod +x setup_phillswarm_ollama.sh
./setup_phillswarm_ollama.sh

Windows CMD:

setup_phillswarm_ollama.cmd

Windows PowerShell:

.\setup_phillswarm_ollama.ps1

The setup scripts are cross-OS and do the same job:

set OLLAMA_HOST=http://127.0.0.1:11435
start launch_ollama_bridge.py --model . --port 11435 if the bridge is not already running
wait for /api/tags to respond
run ollama list so the user can confirm phillswarm-4b:full is visible
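
The "wait for /api/tags to respond" step can be reproduced by hand when debugging a setup. This is a minimal sketch using only the standard library, not the logic inside the setup scripts; the function names, retry count, and delay are hypothetical.

```python
import time
import urllib.error
import urllib.request

def tags_ready(base_url, timeout=2.0):
    """Return True if GET {base_url}/api/tags answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_for_bridge(base_url="http://127.0.0.1:11435", attempts=30, delay=1.0,
                    probe=tags_ready, sleep=time.sleep):
    """Poll the bridge until it responds or the attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        sleep(delay)
    return False
```

Call `wait_for_bridge()` after starting `launch_ollama_bridge.py`; a `False` result means the bridge never answered on port 11435. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.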

Manual fallback for any OS:

python launch_ollama_bridge.py --model . --port 11435

Then set the current terminal:

export OLLAMA_HOST=http://127.0.0.1:11435

Windows PowerShell manual fallback:

$env:OLLAMA_HOST="http://127.0.0.1:11435"

Then use this model name from Ollama-compatible clients:

phillswarm-4b:full

If using an Ollama-style HTTP client, call the bridge directly:

GET  /api/tags
POST /api/show
POST /api/generate
POST /api/chat

Example:

curl http://127.0.0.1:11435/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model":"phillswarm-4b:full","stream":false,"messages":[{"role":"user","content":"What is 2+2? Answer in one sentence."}]}'

Observed smoke output:

{
  "message": {"role": "assistant", "content": "2+2 = 4."},
  "phill": {
    "full_runtime": true,
    "indicators": {"bridge": "ollama-compatible-fast-verified-skill"}
  }
}
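
The same call can be made from Python with only the standard library. This sketch mirrors the curl example and the response shape shown above; the helper names are hypothetical, and the 300-second timeout is an arbitrary allowance for the first request, when the model may still be loading.

```python
import json
import urllib.request

BRIDGE_URL = "http://127.0.0.1:11435"

def build_chat_request(prompt, model="phillswarm-4b:full", base_url=BRIDGE_URL):
    """Build a non-streaming POST /api/chat request for the bridge."""
    payload = {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_answer(response_body):
    """Pull the assistant text out of a response shaped like the smoke output above."""
    return json.loads(response_body)["message"]["content"]

def chat(prompt):
    """Send the prompt to a running bridge and return the answer text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        return extract_answer(resp.read().decode("utf-8"))
```

With the bridge running, `print(chat("What is 2+2? Answer in one sentence."))` sends the same request as the curl example.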

Runtime Modes

| Path | Status | Best Use |
| --- | --- | --- |
| `phillswarm-4b:full` bridge | Recommended and coherent | Ollama-compatible clients that should use the real PhillSwarm runtime. |
| HF `trust_remote_code=True` | Recommended and native | Python/Transformers users who want direct model and controller access. |
| GGUF preview | Experimental loadability preview | Testing tokenizer/shape compatibility in stock Ollama, not full-quality intelligence. |

What The Bridge Preserves

HF trust_remote_code=True model loading.
PhillSwarmController swarm/regular routing.
App tool routing, permission checks, goals, indicators, and web/search/browser integration path.
Vision sidecar availability from the HF snapshot.
Ollama-compatible JSON and NDJSON streaming response shapes.
Fast verified-skill routing before heavy model generation when a deterministic answer is already known.
Lazy loading, so /api/tags and /api/show respond quickly while the BF16 model loads only for real generation.

About the GGUF preview: the raw GGUF can be loadable in stock Ollama, but it is not the full intelligence path because stock Ollama cannot run the custom swarm runtime. For coherent outputs, use the bridge or the native HF runtime.

MCP / IDE Bridge

The app exposes a local MCP-style bridge for external agent hosts and IDEs:

JSON-RPC MCP endpoint: http://127.0.0.1:7860/mcp
Manifest:              http://127.0.0.1:7860/api/mcp/manifest
Tools:                 http://127.0.0.1:7860/api/mcp/tools
Route prompt:          http://127.0.0.1:7860/api/mcp/route
Call tool:             http://127.0.0.1:7860/api/mcp/call
Chat through app:      http://127.0.0.1:7860/api/mcp/chat
Well-known manifest:   http://127.0.0.1:7860/.well-known/phill-swarm-mcp.json

Use it with Codex, Claude Code, Cursor, Antigravity-style hosts, or any local MCP/HTTP client that can call a JSON-RPC tool server. The endpoint supports:

initialize
tools/list
tools/call
phill/route
phill/chat

The bridge is private by default because the app binds to 127.0.0.1. To expose it to another machine on your LAN, launch explicitly:

python launch_swarm_app.py --host 0.0.0.0

Then open /api/mcp/status to see the LAN URL. Only use LAN mode on a trusted network. For shared machines or public networks, set mcp_auth_token in swarm_app_config.example.json and send Authorization: Bearer <token> from the client.
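
A client talking to the `/mcp` endpoint sends JSON-RPC 2.0 messages. The sketch below builds such a request with the optional bearer token. The envelope follows the JSON-RPC 2.0 format; the helper name is hypothetical, and whether the bridge expects additional `params` for a given method is not confirmed here.

```python
import json
import urllib.request

MCP_URL = "http://127.0.0.1:7860/mcp"

def build_mcp_request(method, params=None, request_id=1, token=None, url=MCP_URL):
    """Build a JSON-RPC 2.0 POST request for the local MCP-style endpoint."""
    body = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        body["params"] = params
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when mcp_auth_token is set in the app config.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )

list_tools = build_mcp_request("tools/list", token="example-token")
```

Pass the result to `urllib.request.urlopen(...)` once the app is running to send it.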

For ChatGPT-style use, the intended pattern is different from Codex/Cursor: ChatGPT can remain the language model while calling Phill's app routes for scaffolding, routing, goals, tools, browser observation, and verification. That lets the app act as a local swarm/tool runtime without replacing the external model.

Phill Swarm App settings drawer and workspace

Compact Runtime Settings

Mode, persona, depth, goals, route judge, streaming, tools, web search, website drafting, and permissions are available without crowding the main chat.

Real Tool Routing Preview

The app calls the routed tools endpoint, shows the selected chain, permission blocks, activity events, and goal cards before a full model run.

Website Drafting Surface

The local site-draft endpoint renders a preview, updates the execution flow, and enables export while keeping actions permission-gated.

Vision Sidecar

This package includes runtime metadata for an optional Qwen3.5-style vision sidecar:

vision_sidecar_enabled: true
vision_sidecar_path: "vision_sidecar"
vision_snapshot_policy: "retain_latest_only"

Vision is runtime sidecar behavior, not ordinary text-generation behavior. The text embedding table is not resized for vision marker tokens; pixel tensors and browser snapshots route through external processor metadata/sidecar paths.

Learnable Scaffolding

Learnable scaffolding uses scaffold_blueprint.json.

It stores:

signal weights
node weights
compact learned tidbits
examples seen
update timestamp

This is zero-extra-model-weight runtime memory. It improves scaffold node selection and confidence without mutating model weights during normal generation.

{
  "learnable_scaffolding": true,
  "scaffold_blueprint_path": "scaffold_blueprint.json",
  "inject_scaffold_into_prompt": false
}

Training Losses

The model includes optional auxiliary losses for train-time routing/scaffold behavior:

{
  "router_aux_loss_coef": 0.01,
  "router_z_loss_coef": 0.001,
  "router_entropy_loss_coef": 0.0001,
  "router_confidence_loss_coef": 0.0,
  "thinking_consistency_loss_coef": 0.0001,
  "scaffold_alignment_loss_coef": 0.0001
}

These are active during training when labels are provided. They are not extra inference-time model copies.
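
For readers unfamiliar with these terms, the sketch below shows the widely used formulations of two of them: a Switch-Transformer-style load-balancing loss and an ST-MoE-style router z-loss, weighted with the `router_aux_loss_coef` and `router_z_loss_coef` values above. This is an assumption about what those coefficients scale; the definitions in `modeling_swarm_moe.py` may differ.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(router_logits, top_k=2):
    """num_experts * sum_i f_i * P_i, where f_i is the share of top-k
    assignments going to expert i and P_i is its mean router probability."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    share = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for logits in router_logits:
        probs = softmax(logits)
        top = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        for i in top:
            share[i] += 1.0 / (num_tokens * top_k)
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(share, mean_prob))

def router_z_loss(router_logits):
    """Mean squared log-sum-exp of the router logits; penalizes large logits."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(router_logits)

# Two tokens, four experts, perfectly uniform router scores.
uniform = [[0.0] * 4, [0.0] * 4]
aux_total = 0.01 * load_balance_loss(uniform) + 0.001 * router_z_loss(uniform)
```

The load-balancing term grows when a few experts receive both most of the tokens and most of the probability mass, which is the collapse these losses are meant to discourage.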

Guarded Online Learning

Online learning support exists, but is disabled by default.

{
  "online_learning_enabled": false,
  "online_learning_lr": 1e-7,
  "online_learning_max_grad_norm": 0.05,
  "online_learning_train_top_layers": 2,
  "online_learning_min_trust": 0.25,
  "online_learning_max_updates": 32
}

When explicitly enabled and called through learn_from_correction(...), it:

adapts only a tiny top-layer/head surface
deduplicates shared tensors
clips gradients
tracks temporal trust
probes counterfactual loss
reverts failed updates

This is experimental and should be used only for approved corrections or controlled local adaptation.
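
Two of the guards listed above, gradient clipping and reverting failed updates, can be illustrated on a toy parameter list. This is a conceptual sketch, not `learn_from_correction(...)`: the function names are hypothetical, and the default values simply reuse `online_learning_lr` and `online_learning_max_grad_norm` from the config above.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale the gradient vector down so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def guarded_update(params, grads, loss_fn, lr=1e-7, max_grad_norm=0.05):
    """Apply one clipped SGD step and keep it only if the loss improves."""
    before = loss_fn(params)
    clipped = clip_by_global_norm(grads, max_grad_norm)
    candidate = [p - lr * g for p, g in zip(params, clipped)]
    if loss_fn(candidate) < before:
        return candidate, True
    # Revert: the caller gets the original parameters back unchanged.
    return list(params), False

# Toy problem: minimize (p - 1)^2 starting from p = 0.
loss = lambda p: (p[0] - 1.0) ** 2
accepted_params, accepted = guarded_update([0.0], [-2.0], loss, lr=1.0)
reverted_params, kept = guarded_update([0.0], [2.0], loss, lr=1.0)
```

The second call uses a gradient pointing the wrong way, so the step raises the loss and the update is discarded.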

Final Polish Pass

Final polish is enabled in safe mode:

{
  "enable_final_polish": true,
  "final_polish_mode": "safe"
}

It receives only the latest user prompt and the verified final answer. It cannot see raw worker notes or rejected candidates. If the polish drifts, changes numbers, becomes too short/long, or loses overlap with the verified answer, the runtime keeps the original verified answer.
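
The rejection rules described above can be approximated with a small guard function. This is a hedged sketch of the idea, not the runtime's validator: the function names, the regex for numbers, and all thresholds (`min_ratio`, `max_ratio`, `min_overlap`) are hypothetical.

```python
import re

def numbers_in(text):
    """Extract numeric literals so changed numbers can be detected."""
    return sorted(re.findall(r"\d+(?:\.\d+)?", text))

def token_overlap(verified, polished):
    """Fraction of the verified answer's tokens that survive in the polish."""
    original = set(verified.lower().split())
    rewritten = set(polished.lower().split())
    return len(original & rewritten) / len(original) if original else 0.0

def choose_final(verified, polished, min_ratio=0.5, max_ratio=2.0, min_overlap=0.5):
    """Return the polished text only if it passes every guard."""
    if numbers_in(polished) != numbers_in(verified):
        return verified  # numbers changed
    ratio = len(polished) / max(len(verified), 1)
    if not (min_ratio <= ratio <= max_ratio):
        return verified  # too short or too long
    if token_overlap(verified, polished) < min_overlap:
        return verified  # drifted away from the verified answer
    return polished

verified_answer = "The answer is 2+2 = 4."
good_polish = "In short, the answer is 2+2 = 4."
bad_polish = "The answer is 2+2 = 5."
drifted_polish = "Bananas are 2 2 4 yellow fruit today."
```

Every failure path returns the verified answer, so a bad polish can only cost wording quality, never correctness.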

Validation Summary

From the included reports:

HF config/tokenizer/model load passed.
CUDA forward passed.
Controller loaded with 40 routed layers.
No-profile swarm smoke returned a verified black-hole answer through science_explanation.
Dynamic profile test showed noisy raw workers are rejected and verified skills preserve the answer.
Final polish test preserved verified answers when the polish attempt failed validation.
Learnable scaffold test saved/reloaded blueprint state.
Auxiliary-loss tiny-model test passed forward/loss/backward.

The package includes detailed reports:

QWEN35_4B_FINAL_REPORT.md
QWEN35_4B_COHERENCE_REPORT.md
DYNAMIC_SWARM_ORCHESTRATION_REPORT.md
AGENTIC_SESSION_RUNTIME_REPORT.md
AGI_SKILL_ROUTE_EXPANSION_REPORT.md
LEARNABLE_SCAFFOLDING_REPORT.md
LEARNABLE_LOSSES_AND_ONLINE_LEARNING_REPORT.md
FINAL_POLISH_PASS_REPORT.md

Known Limits

Raw direct generation can still be weaker than verified runtime answers.
Profile generation is not enabled by default because raw worker text can be noisy.
Online learning is disabled by default and should not be treated as automatic safe self-training.
Vision sidecar is runtime behavior; normal text generation does not become a full browser-vision agent by itself.
This README describes implemented local runtime features, not independent benchmark superiority over frontier commercial systems.

File Map

| File | Purpose |
| --- | --- |
| `config.json` | HF model config and passive runtime metadata |
| `model.safetensors` | model weights |
| `configuration_swarm_moe.py` | HF config remote-code file |
| `modeling_swarm_moe.py` | HF model remote-code file |
| `tokenizer.json`, `tokenizer_config.json` | tokenizer assets |
| `swarm_runtime_config.json` | wrapper/runtime config |
| `scaffold_blueprint.json` | learnable scaffold runtime memory |
| `swarm_moe_model/` | optional local runtime package |
| `vision_sidecar.py`, `vision_sidecar/` | optional runtime vision sidecar |
| `swarm_app_config.example.json` | app config example |

Recommended Use Cases

Local research into sparse MoE routing and shared-weight agent orchestration.
Tool-aware chat wrappers where tool execution is explicit and permissioned.
IDE/CLI assistants that need compact tool manifests and traceable routes.
Agentic task runners that need JSON goal state, session memory, and recoverable progress.
Experiments with scaffold learning and safe online correction workflows.
Educational exploration of MoE, routing losses, and wrapper-based agent design.

Minimal Requirements

Python environment with PyTorch and Transformers.
trust_remote_code=True.
bf16-capable CUDA is recommended for the 4B package.
CPU loading may be possible but will be slow.

Attribution And Development Notes

Phill Swarm-MoE is a custom project by Phillip A. Holland / Ayjays132. This package contains a grown hybrid checkpoint and runtime code intended for local experimentation, HF-style loading, and publishable inspection. It is built to be transparent about what is model behavior, what is runtime orchestration, and what is experimental.

Core principle: keep the model loadable as a normal HF checkpoint, then let users opt into the swarm runtime when they want goals, tools, scaffolds, streaming, sessions, and traceable orchestration.
