# PersonaPlex-7B Q4_K (WebGPU)
**Community quantization.** This is an unofficial Q4_K quantization of NVIDIA PersonaPlex-7B-v1 for browser-based inference via WebGPU. Not affiliated with or endorsed by NVIDIA.
Q4_K quantization of the full 32-layer PersonaPlex-7B speech-to-speech model, packaged for client-side WebGPU inference with sts-web.
| | Original (bf16) | This model (Q4_K) |
|---|---|---|
| Temporal layers | 32 | 32 |
| Total params | 8.37B | 8.37B |
| File size | 16.7 GB | 4.4 GB |
| Format | safetensors | GGUF Q4_K |
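A quick back-of-the-envelope check of the sizes in the table (assuming decimal GB, i.e. 1 GB = 1e9 bytes): 8.37B parameters in 4.4 GB works out to roughly 4.2 bits per weight, which is consistent with 4-bit quants plus per-block scale overhead.

```python
# Sanity check on the table above: implied bits per weight and
# compression ratio of the Q4_K file vs. the bf16 original.
params = 8.37e9      # total parameters (from the table)
q4_bytes = 4.4e9     # Q4_K file size, assuming 1 GB = 1e9 bytes
bf16_bytes = 16.7e9  # original bf16 file size

bits_per_weight = q4_bytes * 8 / params   # ~4.2
compression = bf16_bytes / q4_bytes       # ~3.8x
print(f"{bits_per_weight:.2f} bits/weight, {compression:.1f}x smaller")
```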
## Quantization
- Q4_K: All weight matrices (attention projections, gating, linear heads)
- Q4_0: Embedding tables (efficient CPU row lookups)
- F32: Layer norm alpha parameters only
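Of the two quantization formats above, Q4_0 (used for the embedding tables) is simple enough to sketch: in ggml/GGUF, each 18-byte block stores an f16 scale followed by 32 packed 4-bit quants, decoded as `(q - 8) * scale`, with the 16 low nibbles ordered before the 16 high nibbles. A minimal NumPy dequantizer for one block:

```python
import numpy as np

def dequant_q4_0(block: bytes) -> np.ndarray:
    """Dequantize one GGUF Q4_0 block (18 bytes -> 32 float32 values).

    Layout: 2-byte little-endian f16 scale, then 16 bytes each packing
    two 4-bit quants; value = (nibble - 8) * scale, low nibbles first.
    """
    assert len(block) == 18
    scale = np.frombuffer(block[:2], dtype="<f2")[0].astype(np.float32)
    qs = np.frombuffer(block[2:], dtype=np.uint8)
    lo = (qs & 0x0F).astype(np.int8) - 8   # first 16 values
    hi = (qs >> 4).astype(np.int8) - 8     # last 16 values
    return np.concatenate([lo, hi]).astype(np.float32) * scale
```

Q4_K uses a more involved super-block layout (per-sub-block scales and minimums), which is why it is reserved for the large weight matrices while the row-lookup embedding tables stay in the simpler Q4_0.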
## Files
| File | Size | Description |
|---|---|---|
| `shards/personaplex-7b-v1-q4_k.gguf.shard-{00-08}` | 4.4 GB total | Q4_K weights, sharded (<512 MB each for the WASM ArrayBuffer limit) |
| `tokenizer-e351c8d8-checkpoint125.safetensors` | 367 MB | Mimi audio codec weights |
| `tokenizer_spm_32k_3.model` | 540 KB | SentencePiece text tokenizer |
| `voices/*.pt` | ~330 KB each | 18 voice prompt embeddings, PyTorch format |
| `voices/*.embeddings.bin` | ~800 KB each | Same embeddings as raw f32le (for web demo) |
| `voices/*.cache.json` | ~1 KB each | Token cache snapshots for voice conditioning (for web demo) |
| `config.json` | | Model architecture metadata |
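If you want the model as a single GGUF file for use outside the browser, the shards can be reassembled locally. This sketch assumes the shards are a plain byte-split of the original file, concatenated in shard-index order (the naming suggests this, but verify against the sts-web loader before relying on it):

```python
from pathlib import Path

def reassemble(shard_dir: str, out_path: str) -> None:
    """Concatenate GGUF shards back into one file.

    Assumes shards are a simple byte-split, named so that lexicographic
    order matches shard order (shard-00, shard-01, ..., shard-08).
    """
    shards = sorted(Path(shard_dir).glob("*.gguf.shard-*"))
    with open(out_path, "wb") as out:
        for shard in shards:
            out.write(shard.read_bytes())
```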
## Voices
18 pre-computed voice prompts from the original PersonaPlex release. Each voice has three files:
- `.pt`: PyTorch tensor (embeddings + cache, bfloat16)
- `.embeddings.bin`: raw f32 little-endian embeddings, shape `[num_frames, 4096]` (for browser use)
- `.cache.json`: token cache snapshot, 17 streams × 4 positions (for browser use)

Naming:

- NAT (native): NATF0-F3 (female), NATM0-M3 (male)
- VAR (varied accent): VARF0-F4 (female), VARM0-M4 (male)
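Since the `.embeddings.bin` files are just raw little-endian f32 with a known row width of 4096, they can be loaded without any PyTorch dependency. A minimal sketch (the `load_voice_embeddings` helper name is mine, not part of the release):

```python
import numpy as np

def load_voice_embeddings(path: str) -> np.ndarray:
    """Load a voices/*.embeddings.bin file.

    The file is raw little-endian float32 with no header; rows are
    4096-dim embedding vectors, so the shape is [num_frames, 4096].
    """
    data = np.fromfile(path, dtype="<f4")
    assert data.size % 4096 == 0, "file size is not a multiple of one row"
    return data.reshape(-1, 4096)
```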
## Architecture
- Temporal Transformer (32 layers)
  - dim: 4096, heads: 32, ff: 11264
  - RoPE positional encoding (freq_base=10000)
- Depth Transformer (6 layers)
  - dim: 1024, heads: 16, ff: 2816
  - 16 codebook-specific gating modules
- Audio: 8 codebooks (Mimi codec, 12.5 Hz frame rate)
- Text: 32K SentencePiece vocabulary
## Usage
This model is designed for use with sts-web, a Rust/WASM WebGPU inference engine. The GGUF shards are loaded by the browser and dequantized on-GPU via WGSL compute shaders.
## License
This model inherits the NVIDIA Open Model License from the base PersonaPlex-7B-v1 model, with additional terms from Kyutai's Moshi (CC-BY-4.0).