Instructions for using idle-intelligence/personaplex-24L-q4_k-webgpu with libraries and local apps.

Libraries

How to use idle-intelligence/personaplex-24L-q4_k-webgpu with Moshi:

```
# pip install moshi
# Run the interactive web server
python -m moshi.server --hf-repo "idle-intelligence/personaplex-24L-q4_k-webgpu"
# Then open https://localhost:8998 in your browser
```

```
# pip install moshi
import torch
from moshi.models import loaders

# Use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load checkpoint info from HuggingFace
checkpoint = loaders.CheckpointInfo.from_hf_repo("idle-intelligence/personaplex-24L-q4_k-webgpu")

# Load the Mimi audio codec
mimi = checkpoint.get_mimi(device=device)
mimi.set_num_codebooks(8)

# Encode and decode 10 seconds of random audio (24 kHz, mono)
wav = torch.randn(1, 1, 24000 * 10, device=device)  # [batch, channels, samples]
with torch.no_grad():
    codes = mimi.encode(wav)
    decoded = mimi.decode(codes)
```

llama-cpp-python

How to use idle-intelligence/personaplex-24L-q4_k-webgpu with llama-cpp-python:

```
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="idle-intelligence/personaplex-24L-q4_k-webgpu",
    filename="personaplex-24L-q4_k.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
```

Local Apps

llama.cpp

How to use idle-intelligence/personaplex-24L-q4_k-webgpu with llama.cpp:

Install (macOS, Linux)

```
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf idle-intelligence/personaplex-24L-q4_k-webgpu
# Run inference directly in the terminal:
llama cli -hf idle-intelligence/personaplex-24L-q4_k-webgpu
```

Install from WinGet (Windows)

```
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf idle-intelligence/personaplex-24L-q4_k-webgpu
# Run inference directly in the terminal:
llama cli -hf idle-intelligence/personaplex-24L-q4_k-webgpu
```

Use pre-built binary

```
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf idle-intelligence/personaplex-24L-q4_k-webgpu
# Run inference directly in the terminal:
./llama-cli -hf idle-intelligence/personaplex-24L-q4_k-webgpu
```

Build from source code

```
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf idle-intelligence/personaplex-24L-q4_k-webgpu
# Run inference directly in the terminal:
./build/bin/llama-cli -hf idle-intelligence/personaplex-24L-q4_k-webgpu
```

Use Docker

```
docker model run hf.co/idle-intelligence/personaplex-24L-q4_k-webgpu
```

Ollama
How to use idle-intelligence/personaplex-24L-q4_k-webgpu with Ollama:
```
ollama run hf.co/idle-intelligence/personaplex-24L-q4_k-webgpu
```

Unsloth Studio

How to use idle-intelligence/personaplex-24L-q4_k-webgpu with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

```
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for idle-intelligence/personaplex-24L-q4_k-webgpu to start chatting
```

Install Unsloth Studio (Windows)

```
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for idle-intelligence/personaplex-24L-q4_k-webgpu to start chatting
```

Using HuggingFace Spaces for Unsloth

```
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for idle-intelligence/personaplex-24L-q4_k-webgpu to start chatting
```

Docker Model Runner
How to use idle-intelligence/personaplex-24L-q4_k-webgpu with Docker Model Runner:
```
docker model run hf.co/idle-intelligence/personaplex-24L-q4_k-webgpu
```

Lemonade

How to use idle-intelligence/personaplex-24L-q4_k-webgpu with Lemonade:

Pull the model

```
# Download Lemonade from https://lemonade-server.ai/
lemonade pull idle-intelligence/personaplex-24L-q4_k-webgpu
```

Run and chat with the model

```
lemonade run user.personaplex-24L-q4_k-webgpu-{{QUANT_TAG}}
```

List all available models

```
lemonade list
```

PersonaPlex-24L Q4_K (WebGPU)

Work in progress. This model exists primarily to test layer pruning + QLoRA recovery for browser deployment. Model behavior may differ from the original 32L model. No guarantees — use at your own discretion.

A smaller, quantized version of NVIDIA PersonaPlex-7B-v1 optimized for browser-based speech-to-speech inference via WebGPU.

What is this?

PersonaPlex-7B is a full-duplex speech-to-speech model based on Moshi. The original model has 8.37B parameters across a 32-layer temporal transformer and a 6-layer depth transformer.

This version removes 8 temporal transformer layers (layers 12-19) and recovers quality through LoRA fine-tuning, then quantizes to Q4_K for efficient browser deployment. Quality has been assessed qualitatively through listening tests only — no formal metrics are available.

| | Original (32L bf16) | Original Q4_K | This model (24L Q4_K) |
|---|---|---|---|
| Temporal layers | 32 | 32 | 24 |
| Total params | 8.37B | 8.37B | 6.74B |
| File size | 16.7 GB | 4.4 GB | 3.5 GB |
| Format | safetensors | GGUF Q4_K | GGUF Q4_K |

How it was made

1. Layer Pruning

Removed temporal transformer layers 12-19 (the middle 8 of 32). The depth transformer (6 layers, 1024 dim) is kept intact. This removes ~1.6B parameters from the temporal transformer.
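As a rough check on the ~1.6B figure, the per-layer parameter count can be estimated from the dimensions listed under Architecture. The sketch below assumes each temporal layer holds four 4096×4096 attention projections (q, k, v, out) and a gated feed-forward block with three 4096×11264 matrices, and it ignores norms and biases. This layer layout is an illustrative assumption, not something this card states.

```python
# Rough parameter estimate for the pruned temporal layers.
# Assumed layout (not confirmed by this card): 4 attention projections
# (q, k, v, out) of dim x dim, plus a gated feed-forward block with
# 3 matrices of dim x ff. Norms and biases are ignored.
DIM = 4096
FF = 11264
REMOVED_LAYERS = 8  # layers 12-19


def params_per_layer(dim: int, ff: int) -> int:
    attention = 4 * dim * dim
    feed_forward = 3 * dim * ff
    return attention + feed_forward


per_layer = params_per_layer(DIM, FF)
removed = REMOVED_LAYERS * per_layer

print(f"per layer: {per_layer / 1e6:.1f}M")  # per layer: 205.5M
print(f"removed:   {removed / 1e9:.2f}B")    # removed:   1.64B
```

Under these assumptions the estimate lands at about 1.64B, in line with the ~1.6B stated above and with the drop from 8.37B to 6.74B total parameters.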

2. LoRA Recovery Training

After pruning, the model loses semantic understanding. We recover quality using LoRA fine-tuning on self-distillation data:

LoRA config: rank 32, alpha 64, target modules: out_proj only (9.6M trainable params, 0.14% of model)
Training data: 333 self-distillation files generated by running the full 32L teacher model on natural conversation audio (Santa Barbara Corpus) across all 18 PersonaPlex voices
Training: 3 epochs on CPU (bf16 weights, float32 compute), AdamW optimizer with cosine LR decay
Loss: 3.0 * text_cross_entropy + weighted_audio_cross_entropy (first audio codebook weighted 5x)
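The loss formula above can be sketched in plain Python. This is a minimal illustration, not the training code: it assumes 8 audio codebooks (from the Architecture section) and assumes the weighted audio term is a weighted mean over codebooks, which the card does not specify. The function names are hypothetical.

```python
import math

TEXT_WEIGHT = 3.0
NUM_CODEBOOKS = 8            # from the Architecture section
FIRST_CODEBOOK_WEIGHT = 5.0


def cross_entropy(logits: list[float], target: int) -> float:
    """Cross-entropy of one prediction, using a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def combined_loss(text_ce: float, audio_ce: list[float]) -> float:
    """3.0 * text CE + weighted audio CE (first codebook weighted 5x).

    Assumption: the audio term is a weighted mean over codebooks.
    """
    weights = [FIRST_CODEBOOK_WEIGHT] + [1.0] * (len(audio_ce) - 1)
    weighted_audio = sum(w * ce for w, ce in zip(weights, audio_ce)) / sum(weights)
    return TEXT_WEIGHT * text_ce + weighted_audio


# Toy example: uniform logits over 4 classes give CE = ln(4) for every stream,
# so the combined loss is 3 * ln(4) + ln(4) = 4 * ln(4).
uniform_ce = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
loss = combined_loss(uniform_ce, [uniform_ce] * NUM_CODEBOOKS)
print(round(loss, 4))
```

The 5x weight makes errors on the first audio codebook count five times as much as errors on each of the remaining codebooks within the audio term.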

3. Q4_K Quantization

The merged LoRA model is quantized using the same Q4_K scheme as the original WebGPU version:

Q4_K: All weight matrices (attention projections, gating, linear heads)
Q4_0: Embedding tables (for efficient CPU row lookups)
F32: Layer norm alpha parameters only
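The file sizes in the comparison table can be sanity-checked from the storage cost of these formats. The sketch below uses the ggml block layouts (Q4_K: 144 bytes per 256 weights; Q4_0: 18 bytes per 32 weights), both of which work out to 4.5 bits per weight. It assumes every parameter is stored at that rate and ignores the F32 norm parameters and GGUF metadata, so it is only an approximation.

```python
# Storage cost of the two 4-bit block formats, using ggml's block layouts:
#   Q4_K: 256 weights per super-block, 144 bytes per block
#   Q4_0:  32 weights per block,        18 bytes per block
Q4_K_BITS = 144 * 8 / 256
Q4_0_BITS = 18 * 8 / 32


def estimated_bytes(num_params: float, bits_per_weight: float) -> float:
    return num_params * bits_per_weight / 8


# Assumption: every parameter is stored at ~4.5 bits; F32 norm parameters
# and GGUF metadata are ignored.
for name, params in [("32L", 8.37e9), ("24L", 6.74e9)]:
    size = estimated_bytes(params, Q4_K_BITS)
    print(f"{name}: {size / 1e9:.2f} GB = {size / 2**30:.2f} GiB")
```

Under these assumptions the estimates come to roughly 4.38 GiB and 3.53 GiB, which is close to the 4.4 GB and 3.5 GB listed in the table if those figures are read as binary gigabytes.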

Files

| File | Size | Description |
|---|---|---|
| `personaplex-24L-q4_k.gguf` | 3.5 GB | Q4_K quantized model weights (single file) |
| `shards/personaplex-24L-q4_k.gguf.shard-{00-07}` | 3.5 GB total | Same weights, sharded (<512 MB each for WASM ArrayBuffer limit) |
| `tokenizer-e351c8d8-checkpoint125.safetensors` | 367 MB | Mimi audio codec weights |
| `tokenizer_spm_32k_3.model` | 540 KB | SentencePiece text tokenizer |
| `voices/*.pt` | ~330 KB each | 18 voice prompt embeddings, PyTorch format (8 NAT + 10 VAR) |
| `voices/*.embeddings.bin` | ~800 KB each | Same embeddings as raw f32le (for web demo) |
| `voices/*.cache.json` | ~1 KB each | Token cache snapshots for voice conditioning (for web demo) |
| `config.json` | | Model architecture and training metadata |
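If only the shards have been downloaded, a single-file GGUF can likely be rebuilt by concatenating them in order. The sketch below assumes the shards are plain consecutive byte ranges of the single-file GGUF and that sorting the file names gives the correct order (shard-00 through shard-07). Neither point is confirmed by this card, and `join_shards` is a hypothetical helper.

```python
from pathlib import Path


def join_shards(shard_dir: Path, output: Path, chunk_size: int = 1 << 20) -> int:
    """Concatenate shard files in name order; return total bytes written.

    Assumption: shards are plain consecutive byte ranges of the single-file
    GGUF, named so that lexicographic order equals shard order
    (e.g. ...shard-00 through ...shard-07).
    """
    shards = sorted(shard_dir.glob("*.shard-*"))
    if not shards:
        raise FileNotFoundError(f"no shard files found in {shard_dir}")
    total = 0
    with output.open("wb") as out:
        for shard in shards:
            with shard.open("rb") as src:
                # Copy in fixed-size chunks to keep memory use small.
                while chunk := src.read(chunk_size):
                    out.write(chunk)
                    total += len(chunk)
    return total
```

A call such as `join_shards(Path("shards"), Path("personaplex-24L-q4_k.gguf"))` would then write the combined file. Comparing the result's size or checksum against the published single file is a sensible check before use.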

Voices

18 pre-computed voice prompts from the original PersonaPlex release. Each voice has three files:

.pt — PyTorch tensor (embeddings + cache, bfloat16)
.embeddings.bin — raw f32 little-endian embeddings, shape [num_frames, 4096] (for browser use)
.cache.json — token cache snapshot, 17 streams × 4 positions (for browser use)

The 18 voices fall into two groups:
NAT (native): NATF0-F3 (female), NATM0-M3 (male)
VAR (varied accent): VARF0-F4 (female), VARM0-M4 (male)
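The `.embeddings.bin` layout described above (raw little-endian float32, shape [num_frames, 4096]) can be read with the standard library alone. The sketch below assumes the file has no header and is stored row-major, which the card implies but does not state. `read_embeddings` is a hypothetical helper.

```python
import struct
from pathlib import Path

EMBED_DIM = 4096      # row width stated above
BYTES_PER_F32 = 4


def read_embeddings(path: Path, dim: int = EMBED_DIM) -> list[tuple[float, ...]]:
    """Read a raw little-endian float32 file into rows of `dim` floats.

    Assumption: no header, rows stored back to back (row-major).
    """
    data = path.read_bytes()
    row_bytes = dim * BYTES_PER_F32
    if len(data) % row_bytes != 0:
        raise ValueError(f"file size {len(data)} is not a multiple of {row_bytes}")
    num_frames = len(data) // row_bytes
    row_format = f"<{dim}f"  # "<" forces little-endian regardless of host
    return [
        struct.unpack_from(row_format, data, frame * row_bytes)
        for frame in range(num_frames)
    ]
```

Each row occupies 4096 × 4 = 16,384 bytes, so a file of about 800 KB would hold on the order of 50 frames.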

Architecture

```
Temporal Transformer (24 layers)
  dim: 4096, heads: 32, ff: 11264
  RoPE positional encoding (freq_base=10000)

Depth Transformer (6 layers, unchanged)
  dim: 1024, heads: 16, ff: 2816
  16 codebook-specific gating modules

Audio: 8 codebooks (Mimi codec, 12.5 Hz frame rate)
Text: 32K SentencePiece vocabulary
```
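To make the 12.5 Hz frame rate concrete, the short calculation below works out how much audio one frame covers and how many audio tokens a clip produces. It assumes the 24 kHz mono input rate used in the Mimi encoding example earlier in this card, and it counts only one side's 8 codebooks.

```python
SAMPLE_RATE_HZ = 24_000   # Mimi input rate, per the encoding example in this card
FRAME_RATE_HZ = 12.5
NUM_CODEBOOKS = 8


def frames_for(seconds: float) -> int:
    """Number of whole codec frames covering `seconds` of audio."""
    return int(seconds * FRAME_RATE_HZ)


samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)
frames = frames_for(10)
tokens = frames * NUM_CODEBOOKS

print(samples_per_frame, frames, tokens)  # 1920 125 1000
```

So each frame spans 1920 samples (80 ms), and a 10-second clip yields 125 frames, or 1000 audio tokens across the 8 codebooks.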

Usage

This model is designed for use with sts-web, a Rust/WASM WebGPU inference engine. The GGUF file is loaded by the browser and dequantized on-GPU via WGSL compute shaders. Sharded versions (eight shards, each under 512 MB) are provided for environments where the full 3.5 GB file exceeds the ArrayBuffer size limit (for example, 2 GB).

Limitations

Layer pruning reduces semantic understanding compared to the full 32L model
LoRA recovery was trained on English conversational data only
The model may open with a default greeting ("Hey, let me know if you have any questions") at the start of inference. This is expected, and the greeting is typically discarded during the system prompt warmup phase

License

This model inherits the NVIDIA Open Model License from the base PersonaPlex-7B-v1 model, with additional terms from Kyutai's Moshi (CC-BY-4.0).


Format: GGUF
Model size: 7B params
Architecture: personaplex

Model tree for idle-intelligence/personaplex-24L-q4_k-webgpu

Base model: kyutai/moshiko-pytorch-bf16
Finetuned: nvidia/personaplex-7b-v1
Quantized: this model