PersonaPlex-7B Q4_K (WebGPU)

Community quantization. This is an unofficial Q4_K quantization of NVIDIA PersonaPlex-7B-v1 for browser-based inference via WebGPU. Not affiliated with or endorsed by NVIDIA.

Q4_K quantization of the full 32-layer PersonaPlex-7B speech-to-speech model, packaged for client-side WebGPU inference with sts-web.

|                 | Original (bf16) | This model (Q4_K) |
|-----------------|-----------------|-------------------|
| Temporal layers | 32              | 32                |
| Total params    | 8.37B           | 8.37B             |
| File size       | 16.7 GB         | 4.4 GB            |
| Format          | safetensors     | GGUF Q4_K         |

Quantization

  • Q4_K: All weight matrices (attention projections, gating, linear heads)
  • Q4_0: Embedding tables (efficient CPU row lookups)
  • F32: Layer norm alpha parameters only

Files

| File | Size | Description |
|------|------|-------------|
| shards/personaplex-7b-v1-q4_k.gguf.shard-{00-08} | 4.4 GB total | Q4_K weights, sharded (<512 MB each for the WASM ArrayBuffer limit) |
| tokenizer-e351c8d8-checkpoint125.safetensors | 367 MB | Mimi audio codec weights |
| tokenizer_spm_32k_3.model | 540 KB | SentencePiece text tokenizer |
| voices/*.pt | ~330 KB each | 18 voice prompt embeddings, PyTorch format |
| voices/*.embeddings.bin | ~800 KB each | Same embeddings as raw f32le (for web demo) |
| voices/*.cache.json | ~1 KB each | Token cache snapshots for voice conditioning (for web demo) |
| config.json | | Model architecture metadata |

Voices

18 pre-computed voice prompts from the original PersonaPlex release. Each voice has three files:

  • .pt – PyTorch tensor (embeddings + cache, bfloat16)

  • .embeddings.bin – raw f32 little-endian embeddings, shape [num_frames, 4096] (for browser use)

  • .cache.json – token cache snapshot, 17 streams × 4 positions (for browser use)

Voice naming:

  • NAT (native): NATF0-F3 (female), NATM0-M3 (male)

  • VAR (varied accent): VARF0-F4 (female), VARM0-M4 (male)

Architecture

Temporal Transformer (32 layers)
  dim: 4096, heads: 32, ff: 11264
  RoPE positional encoding (freq_base=10000)

Depth Transformer (6 layers)
  dim: 1024, heads: 16, ff: 2816
  16 codebook-specific gating modules

Audio: 8 codebooks (Mimi codec, 12.5 Hz frame rate)
Text: 32K SentencePiece vocabulary

Usage

This model is designed for use with sts-web, a Rust/WASM WebGPU inference engine. The GGUF shards are loaded by the browser and dequantized on-GPU via WGSL compute shaders.

License

This model inherits the NVIDIA Open Model License from the base PersonaPlex-7B-v1 model, with additional terms from Kyutai's Moshi (CC-BY-4.0).
