# PersonaPlex-7B Q4_K (WebGPU)
**Community quantization.** This is an unofficial Q4_K quantization of NVIDIA PersonaPlex-7B-v1 for browser-based inference via WebGPU. Not affiliated with or endorsed by NVIDIA.
Q4_K quantization of the full 32-layer PersonaPlex-7B speech-to-speech model, packaged for client-side WebGPU inference with sts-web.
| | Original (bf16) | This model (Q4_K) |
|---|---|---|
| Temporal layers | 32 | 32 |
| Total params | 8.37B | 8.37B |
| File size | 16.7 GB | 4.4 GB |
| Format | safetensors | GGUF Q4_K |
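A quick back-of-the-envelope check of the sizes in the table (assuming decimal GB, i.e. 1 GB = 1e9 bytes): 8.37B parameters in 4.4 GB works out to roughly 4.2 bits per weight, which is consistent with 4-bit quants plus per-block scale overhead.

```python
# Sanity check on the table above: implied bits per weight and
# compression ratio of the Q4_K file vs. the bf16 original.
params = 8.37e9      # total parameters (from the table)
q4_bytes = 4.4e9     # Q4_K file size, assuming 1 GB = 1e9 bytes
bf16_bytes = 16.7e9  # original bf16 file size

bits_per_weight = q4_bytes * 8 / params   # ~4.2
compression = bf16_bytes / q4_bytes       # ~3.8x
print(f"{bits_per_weight:.2f} bits/weight, {compression:.1f}x smaller")
```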
## Quantization
- Q4_K: All weight matrices (attention projections, gating, linear heads)
- Q4_0: Embedding tables (efficient CPU row lookups)
- F32: Layer norm alpha parameters only
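Of the two quantization formats above, Q4_0 (used for the embedding tables) is simple enough to sketch: in ggml/GGUF, each 18-byte block stores an f16 scale followed by 32 packed 4-bit quants, decoded as `(q - 8) * scale`, with the 16 low nibbles ordered before the 16 high nibbles. A minimal NumPy dequantizer for one block:

```python
import numpy as np

def dequant_q4_0(block: bytes) -> np.ndarray:
    """Dequantize one GGUF Q4_0 block (18 bytes -> 32 float32 values).

    Layout: 2-byte little-endian f16 scale, then 16 bytes each packing
    two 4-bit quants; value = (nibble - 8) * scale, low nibbles first.
    """
    assert len(block) == 18
    scale = np.frombuffer(block[:2], dtype="<f2")[0].astype(np.float32)
    qs = np.frombuffer(block[2:], dtype=np.uint8)
    lo = (qs & 0x0F).astype(np.int8) - 8   # first 16 values
    hi = (qs >> 4).astype(np.int8) - 8     # last 16 values
    return np.concatenate([lo, hi]).astype(np.float32) * scale
```

Q4_K uses a more involved super-block layout (per-sub-block scales and minimums), which is why it is reserved for the large weight matrices while the row-lookup embedding tables stay in the simpler Q4_0.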
## Files
| File | Size | Description |
|---|---|---|
| `shards/personaplex-7b-v1-q4_k.gguf.shard-{00-08}` | 4.4 GB total | Q4_K weights, sharded (<512 MB each for the WASM ArrayBuffer limit) |
| `tokenizer-e351c8d8-checkpoint125.safetensors` | 367 MB | Mimi audio codec weights |
| `tokenizer_spm_32k_3.model` | 540 KB | SentencePiece text tokenizer |
| `voices/*.pt` | ~330 KB each | 18 voice prompt embeddings, PyTorch format |
| `voices/*.embeddings.bin` | ~800 KB each | Same embeddings as raw f32le (for web demo) |
| `voices/*.cache.json` | ~1 KB each | Token cache snapshots for voice conditioning (for web demo) |
| `config.json` | | Model architecture metadata |
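If you want the model as a single GGUF file for use outside the browser, the shards can be reassembled locally. This sketch assumes the shards are a plain byte-split of the original file, concatenated in shard-index order (the naming suggests this, but verify against the sts-web loader before relying on it):

```python
from pathlib import Path

def reassemble(shard_dir: str, out_path: str) -> None:
    """Concatenate GGUF shards back into one file.

    Assumes shards are a simple byte-split, named so that lexicographic
    order matches shard order (shard-00, shard-01, ..., shard-08).
    """
    shards = sorted(Path(shard_dir).glob("*.gguf.shard-*"))
    with open(out_path, "wb") as out:
        for shard in shards:
            out.write(shard.read_bytes())
```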
## Voices
18 pre-computed voice prompts from the original PersonaPlex release. Each voice has three files:
- `.pt`: PyTorch tensor (embeddings + cache, bfloat16)
- `.embeddings.bin`: raw f32 little-endian embeddings, shape `[num_frames, 4096]` (for browser use)
- `.cache.json`: token cache snapshot, 17 streams × 4 positions (for browser use)

Naming:

- NAT (native): NATF0-F3 (female), NATM0-M3 (male)
- VAR (varied accent): VARF0-F4 (female), VARM0-M4 (male)
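Since the `.embeddings.bin` files are just raw little-endian f32 with a known row width of 4096, they can be loaded without any PyTorch dependency. A minimal sketch (the `load_voice_embeddings` helper name is mine, not part of the release):

```python
import numpy as np

def load_voice_embeddings(path: str) -> np.ndarray:
    """Load a voices/*.embeddings.bin file.

    The file is raw little-endian float32 with no header; rows are
    4096-dim embedding vectors, so the shape is [num_frames, 4096].
    """
    data = np.fromfile(path, dtype="<f4")
    assert data.size % 4096 == 0, "file size is not a multiple of one row"
    return data.reshape(-1, 4096)
```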
## Architecture
- Temporal Transformer (32 layers)
  - dim: 4096, heads: 32, ff: 11264
  - RoPE positional encoding (freq_base=10000)
- Depth Transformer (6 layers)
  - dim: 1024, heads: 16, ff: 2816
  - 16 codebook-specific gating modules
- Audio: 8 codebooks (Mimi codec, 12.5 Hz frame rate)
- Text: 32K SentencePiece vocabulary
## Usage
This model is designed for use with sts-web, a Rust/WASM WebGPU inference engine. The GGUF shards are loaded by the browser and dequantized on-GPU via WGSL compute shaders.
## License
This model inherits the NVIDIA Open Model License from the base PersonaPlex-7B-v1 model, with additional terms from Kyutai's Moshi (CC-BY-4.0).