PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models
Paper: 2602.06053
Quantized version of nvidia/personaplex-7b-v1 for edge deployment on NVIDIA Jetson AGX Orin and other memory-constrained devices.
| | Original | Quantized |
|---|---|---|
| Size | 15.59 GB | 4.14 GB |
| Precision | bf16 | INT4 (NF4) |
| Compression | 1x | 3.8x |
| Tensors quantized | — | 398/475 |
| Tensors kept fp16 | — | 77 (norms, biases, embeddings) |
Quantization scheme: symmetric per-group INT4 with `group_size=64`.
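The scheme can be sketched as follows. This is a minimal NumPy illustration of symmetric per-group quantization, not the repo's actual code; `quantize-weights.py` in this repository is the authoritative implementation, and the NF4 codebook details are omitted here.

```python
import numpy as np

def quantize_int4(w, group_size=64):
    """Symmetric per-group INT4: each group of 64 weights shares one scale."""
    groups = w.reshape(-1, group_size)
    # One scale per group, mapping the largest magnitude to the symmetric range [-7, 7]
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate weights; per-element error is at most scale/2."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale).reshape(-1)
```

Per-group scales keep the quantization error proportional to the local weight magnitude, which is why outlier-heavy layers survive 4-bit compression far better than with a single per-tensor scale.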
```python
from huggingface_hub import hf_hub_download

# No HF token needed — this repo is public
weight_path = hf_hub_download(
    "cudabenchmarktest/personaplex-7b-nf4", "model-nf4.safetensors", token=False
)

# Dequantize to bf16 for standard loading
from dequant_loader import dequant_nf4
```
```bash
# Load NF4 (auto-dequants to bf16 at startup, ~16 GB VRAM)
python -m moshi.server --moshi-weight model-nf4.safetensors --device cuda

# With CPU Mimi (saves ~840 MB)
python -m moshi.server --moshi-weight model-nf4.safetensors --cpu-mimi --device cuda
```
```bash
python clone-voice.py --input my_voice.wav --name MyVoice --device cuda
```
| File | Description |
|---|---|
| `model-nf4.safetensors` | INT4 NF4 quantized weights (4.14 GB) |
| `linear2bit.py` | Native 2-bit module (for turbo2bit, included for reference) |
| `dequant-loader.py` | Programmatic dequantizer |
| `clone-voice.py` | Voice cloning with LuxTTS preprocessing |
| `quantize-weights.py` | Reproduce this quantization |
| `tokenizer-*.safetensors` | Mimi audio codec |
| `tokenizer_spm_32k_3.model` | Text tokenizer |
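As background on what a dequant loader has to do, the sketch below unpacks two signed 4-bit values from each stored byte, the usual INT4 packing layout. The function name and the low-nibble-first ordering are assumptions for illustration; the actual layout of `model-nf4.safetensors` may differ, so treat `dequant-loader.py` as the source of truth.

```python
import numpy as np

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Unpack two signed 4-bit values from each uint8 byte (low nibble first)."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend: nibble values 8..15 represent -8..-1
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    # Interleave so the low nibble of each byte comes first
    return np.stack([lo, hi], axis=-1).reshape(-1)

# 0xF1 holds nibbles 1 (low) and 15 (high, i.e. -1); 0x2E holds 14 (-2) and 2
unpack_int4(np.array([0xF1, 0x2E], dtype=np.uint8))
```

After unpacking, each int4 value is multiplied by its group's scale to recover a bf16 tensor, which is why the dequantized model needs the full ~16 GB of VRAM at runtime.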
License: same as the base model (NVIDIA Open Model License).
Quantized by open-agents-ai for edge deployment. Base model by NVIDIA Research (PersonaPlex paper).