PersonaPlex-24L Q4_K (WebGPU)
Work in progress. This model exists primarily to test layer pruning + QLoRA recovery for browser deployment. Model behavior may differ from the original 32L model. No guarantees; use at your own discretion.
A smaller, quantized version of NVIDIA PersonaPlex-7B-v1 optimized for browser-based speech-to-speech inference via WebGPU.
What is this?
PersonaPlex-7B is a full-duplex speech-to-speech model based on Moshi. The original model has 8.37B parameters across a 32-layer temporal transformer and a 6-layer depth transformer.
This version removes 8 temporal transformer layers (layers 12-19) and recovers quality through LoRA fine-tuning, then quantizes to Q4_K for efficient browser deployment. Quality has been assessed qualitatively through listening tests only; no formal metrics are available.
| | Original (32L bf16) | Original Q4_K | This model (24L Q4_K) |
|---|---|---|---|
| Temporal layers | 32 | 32 | 24 |
| Total params | 8.37B | 8.37B | 6.74B |
| File size | 16.7 GB | 4.4 GB | 3.5 GB |
| Format | safetensors | GGUF Q4_K | GGUF Q4_K |
How it was made
1. Layer Pruning
Removed temporal transformer layers 12-19 (the middle 8 of 32). The depth transformer (6 layers, 1024 dim) is kept intact. This removes ~1.6B parameters from the temporal transformer.
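As a rough cross-check of the ~1.6B figure, the per-layer weight count can be estimated from the temporal transformer dimensions (dim 4096, ff 11264). This is a sketch under assumptions the card does not state: each temporal layer is taken to hold four dim x dim attention projections and a gated feed-forward block with three dim x ff matrices, and norm parameters are ignored.

```python
# Back-of-the-envelope estimate of the weights removed by pruning.
# ASSUMED layer layout (not confirmed by the model card):
#   attention: 4 projections of shape dim x dim (q, k, v, out)
#   gated FFN: 3 matrices of shape dim x ff
DIM = 4096
FF = 11264
PRUNED_LAYERS = 8  # layers 12-19


def temporal_layer_params(dim: int, ff: int) -> int:
    """Approximate weight count of one temporal transformer layer."""
    attention = 4 * dim * dim
    feed_forward = 3 * dim * ff
    return attention + feed_forward


per_layer = temporal_layer_params(DIM, FF)
removed = PRUNED_LAYERS * per_layer

print(f"per layer: {per_layer / 1e6:.1f}M")
print(f"removed:   {removed / 1e9:.2f}B")
print(f"table gap: {8.37 - 6.74:.2f}B")  # difference in the comparison table
```

Under these assumptions the estimate comes to about 1.64B, in line with the difference between the 8.37B and 6.74B totals in the table.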
2. LoRA Recovery Training
After pruning, the model loses semantic understanding. We recover quality using LoRA fine-tuning on self-distillation data:
- LoRA config: rank 32, alpha 64, target modules: `out_proj` only (9.6M trainable params, 0.14% of model)
- Training data: 333 self-distillation files generated by running the full 32L teacher model on natural conversation audio (Santa Barbara Corpus) across all 18 PersonaPlex voices
- Training: 3 epochs on CPU (bf16 weights, float32 compute), AdamW optimizer with cosine LR decay
- Loss: `3.0 * text_cross_entropy + weighted_audio_cross_entropy` (first audio codebook weighted 5x)
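The LoRA settings and the loss above can be illustrated with a small pure-Python sketch. All function names are hypothetical. The normalization of the audio term (a weighted mean over the codebooks) is an assumption; the card only states the 3.0 text weight and the 5x weight on the first audio codebook.

```python
RANK = 32
ALPHA = 64


def lora_scale(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights added to one linear layer: A (rank x d_in) plus B (d_out x rank)."""
    return rank * d_in + d_out * rank


def combined_loss(text_ce, audio_ce, text_weight=3.0, first_weight=5.0):
    """text_weight * text CE + weighted audio CE.

    audio_ce holds one cross-entropy value per audio codebook.
    ASSUMPTION: the audio term is a weighted mean (divided by the sum of weights).
    """
    weights = [first_weight] + [1.0] * (len(audio_ce) - 1)
    audio = sum(w * ce for w, ce in zip(weights, audio_ce)) / sum(weights)
    return text_weight * text_ce + audio


# Fraction of the 6.74B-parameter model that is trainable, from the card's numbers.
trainable_percent = 9.6e6 / 6.74e9 * 100

print(lora_scale(ALPHA, RANK))            # update scaling for rank 32, alpha 64
print(lora_params(4096, 4096, RANK))      # one 4096 x 4096 out_proj
print(f"{trainable_percent:.2f}%")
```

The sketch does not try to reproduce the 9.6M total, since the card does not say which `out_proj` modules (temporal, depth, or both) carry adapters.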
3. Q4_K Quantization
The merged LoRA model is quantized using the same Q4_K scheme as the original WebGPU version:
- Q4_K: All weight matrices (attention projections, gating, linear heads)
- Q4_0: Embedding tables (for efficient CPU row lookups)
- F32: Layer norm alpha parameters only
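To make the 4-bit formats more concrete, here is a minimal pure-Python sketch of Q4_0-style block quantization as found in ggml: 32 weights share one scale and each weight is stored as a 4-bit code. It is an illustration, not the conversion code used for this model. The scale is kept as a Python float rather than fp16, and Q4_K (which groups weights into larger super-blocks with per-sub-block scales and minimums) is not shown.

```python
BLOCK_SIZE = 32  # weights per Q4_0 block


def quantize_q4_0_block(values):
    """Return (scale, codes) for one block; codes are integers in 0..15."""
    if len(values) != BLOCK_SIZE:
        raise ValueError("expected exactly 32 values")
    # The element with the largest magnitude (sign kept) maps to code 0.
    extreme = max(values, key=abs)
    scale = extreme / -8.0
    if scale == 0.0:
        return 0.0, [8] * BLOCK_SIZE  # all-zero block
    codes = [min(15, int(v / scale + 8.5)) for v in values]
    return scale, codes


def dequantize_q4_0_block(scale, codes):
    """Map 4-bit codes back to approximate float values."""
    return [(c - 8) * scale for c in codes]


def bits_per_weight():
    """Storage cost: fp16 scale (2 bytes) + 32 four-bit codes (16 bytes) per block."""
    return (2 + BLOCK_SIZE // 2) * 8 / BLOCK_SIZE


block = [(i - 16) / 10 for i in range(BLOCK_SIZE)]
scale, codes = quantize_q4_0_block(block)
restored = dequantize_q4_0_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(scale, max_error, bits_per_weight())
```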
Files
| File | Size | Description |
|---|---|---|
| `personaplex-24L-q4_k.gguf` | 3.5 GB | Q4_K quantized model weights (single file) |
| `shards/personaplex-24L-q4_k.gguf.shard-{00-07}` | 3.5 GB total | Same weights, sharded (<512 MB each for WASM ArrayBuffer limit) |
| `tokenizer-e351c8d8-checkpoint125.safetensors` | 367 MB | Mimi audio codec weights |
| `tokenizer_spm_32k_3.model` | 540 KB | SentencePiece text tokenizer |
| `voices/*.pt` | ~330 KB each | 18 voice prompt embeddings, PyTorch format (8 NAT + 10 VAR) |
| `voices/*.embeddings.bin` | ~800 KB each | Same embeddings as raw f32le (for web demo) |
| `voices/*.cache.json` | ~1 KB each | Token cache snapshots for voice conditioning (for web demo) |
| `config.json` | | Model architecture and training metadata |
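Outside the browser, the shards can be recombined into a single file. The sketch below assumes the shards are plain byte-wise splits of `personaplex-24L-q4_k.gguf` that concatenate in numeric order; the card does not spell this out, and `join_shards` is a hypothetical helper.

```python
import shutil
from pathlib import Path

SHARD_PREFIX = "personaplex-24L-q4_k.gguf.shard-"


def join_shards(shard_dir, out_path, prefix=SHARD_PREFIX):
    """Concatenate shard files (sorted by their two-digit suffix) into out_path.

    ASSUMPTION: shards are raw byte ranges of the original GGUF file.
    Returns the number of shards written.
    """
    shards = sorted(Path(shard_dir).glob(prefix + "*"))
    if not shards:
        raise FileNotFoundError(f"no shards matching {prefix}* in {shard_dir}")
    with open(out_path, "wb") as out:
        for shard in shards:
            with open(shard, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(shards)
```

A call such as `join_shards("shards", "personaplex-24L-q4_k.gguf")` would then be expected to produce a file of the same size as the single-file download.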
Voices
18 pre-computed voice prompts from the original PersonaPlex release. Each voice has three files:
- `.pt`: PyTorch tensor (embeddings + cache, bfloat16)
- `.embeddings.bin`: raw f32 little-endian embeddings, shape `[num_frames, 4096]` (for browser use)
- `.cache.json`: token cache snapshot, 17 streams × 4 positions (for browser use)

NAT (native): NATF0-F3 (female), NATM0-M3 (male)
VAR (varied accent): VARF0-F4 (female), VARM0-M4 (male)
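Since the `.embeddings.bin` files are described as raw little-endian f32 with shape `[num_frames, 4096]`, they can be read without PyTorch. The following is a minimal sketch using only the standard library; `load_voice_embeddings` is a hypothetical name, and the file is assumed to contain no header.

```python
import array
import sys

EMBED_DIM = 4096  # second dimension stated in the card


def load_voice_embeddings(path, dim=EMBED_DIM):
    """Read raw little-endian float32 data and split it into frames of `dim` values.

    ASSUMPTION: the file is headerless and stored row by row.
    """
    data = array.array("f")
    with open(path, "rb") as f:
        data.frombytes(f.read())
    if sys.byteorder == "big":
        data.byteswap()  # file is little-endian
    if len(data) % dim != 0:
        raise ValueError("file size is not a multiple of one frame")
    return [data[i:i + dim] for i in range(0, len(data), dim)]
```

With 4096 floats of 4 bytes each, one frame takes 16 KB, so a file of about 800 KB would hold roughly 50 frames.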
Architecture
Temporal Transformer (24 layers)
dim: 4096, heads: 32, ff: 11264
RoPE positional encoding (freq_base=10000)
Depth Transformer (6 layers, unchanged)
dim: 1024, heads: 16, ff: 2816
16 codebook-specific gating modules
Audio: 8 codebooks (Mimi codec, 12.5 Hz frame rate)
Text: 32K SentencePiece vocabulary
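A few quantities follow directly from these numbers, and the RoPE setting can be sketched in a few lines. The code below is illustrative only: it assumes the common interleaved pairing of dimensions for the rotation, which the card does not specify, and the function names are hypothetical.

```python
import math

TEMPORAL_HEAD_DIM = 4096 // 32   # 128 dimensions per attention head
DEPTH_HEAD_DIM = 1024 // 16      # 64 dimensions per attention head
FRAME_MS = 1000 / 12.5           # duration of one Mimi frame in milliseconds
AUDIO_TOKENS_PER_SECOND = 12.5 * 8  # 8 codebooks per frame, per audio stream


def rope_inverse_frequencies(head_dim, freq_base=10000.0):
    """One rotation frequency per pair of dimensions: base^(-2i / head_dim)."""
    return [freq_base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, freq_base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by position * inv_freq[i].

    ASSUMPTION: interleaved pairing; some implementations pair x[i] with x[i + d/2].
    """
    out = []
    for i, freq in enumerate(rope_inverse_frequencies(len(vec), freq_base)):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


query = [1.0] * TEMPORAL_HEAD_DIM
rotated = apply_rope(query, position=5)
print(FRAME_MS, AUDIO_TOKENS_PER_SECOND, len(rotated))
```

Because each pair is only rotated, the vector norm is unchanged; the position is encoded purely in the angles.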
Usage
This model is designed for use with sts-web, a Rust/WASM WebGPU inference engine. The GGUF file is loaded by the browser and dequantized on-GPU via WGSL compute shaders. Sharded versions are provided for environments with a 2 GB ArrayBuffer limit.
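Before handing a download to the engine, it can be useful to confirm that a file really is GGUF. The sketch below parses only the fixed-size header (magic, version, tensor count, metadata key/value count) as laid out in GGUF version 2 and later; `read_gguf_header` is a hypothetical helper and is not part of sts-web.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (little-endian, version 2 or later)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

For a local copy, passing the first 24 bytes read from `personaplex-24L-q4_k.gguf` (opened in binary mode) is enough.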
Limitations
- Layer pruning reduces semantic understanding compared to the full 32L model
- LoRA recovery was trained on English conversational data only
- The model may default to a greeting ("Hey, let me know if you have any questions") at the start of inference; this is expected and typically discarded during the system prompt warmup phase
License
This model inherits the NVIDIA Open Model License from the base PersonaPlex-7B-v1 model, with additional terms from Kyutai's Moshi (CC-BY-4.0).