PersonaPlex 7B – INT4 NF4 Quantized

Quantized version of nvidia/personaplex-7b-v1 for edge deployment on NVIDIA Jetson AGX Orin and other memory-constrained devices.

Specs

                     Original    Quantized
Size                 15.59 GB    4.14 GB
Precision            bf16        INT4 (NF4)
Compression          1x          3.8x
Tensors quantized    -           398/475
Tensors kept fp16    -           77 (norms, biases, embeddings)
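The 3.8x figure can be sanity-checked from the format itself. A rough bits-per-weight model (illustrative only; it ignores the 77 fp16-kept tensors and safetensors metadata, which account for the small gap):

```python
# Rough bits-per-weight check for the quantized tensors.
bits_weight = 4                       # INT4 payload
bits_scale = 16 / 64                  # one fp16 scale per group of 64 weights
eff_bits = bits_weight + bits_scale   # 4.25 effective bits per quantized weight
ratio = 16 / eff_bits                 # vs. 16-bit bf16
print(f"{ratio:.2f}x")                # ~3.76x, close to the observed 3.8x
```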

Quantization Method

Symmetric per-group INT4 quantization with group_size=64:

  • Weight matrices with ≥4096 elements quantized to 4-bit
  • Norms, biases, and embeddings kept in fp16 for quality
  • Packed as uint8 (two INT4 values per byte)
  • Per-group scales stored as fp16
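The steps above can be sketched in NumPy. This is illustrative only; quantize-weights.py in this repo is the authoritative implementation, and details such as the exact rounding mode and nibble order are assumptions here:

```python
import numpy as np

GROUP_SIZE = 64

def quantize_int4(w: np.ndarray):
    """Symmetric per-group INT4 quantization with fp16 scales.

    Assumes w.size is a multiple of GROUP_SIZE.
    """
    flat = w.astype(np.float32).reshape(-1, GROUP_SIZE)
    # Symmetric: map each group's max magnitude to +/-7
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scales = np.maximum(scales, 1e-8)              # avoid divide-by-zero
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    # Pack two signed 4-bit values per uint8 (low nibble first)
    u = (q & 0x0F).astype(np.uint8).reshape(-1, 2)
    packed = u[:, 0] | (u[:, 1] << 4)
    return packed, scales.astype(np.float16)

def dequantize_int4(packed: np.ndarray, scales: np.ndarray, shape):
    """Unpack nibbles, restore sign, and rescale back to float32."""
    u = np.stack([packed & 0x0F, packed >> 4], axis=1).reshape(-1)
    q = u.astype(np.int8)
    q[q > 7] -= 16                                 # sign-extend the nibble
    w = q.reshape(-1, GROUP_SIZE) * scales.astype(np.float32)
    return w.reshape(shape)
```

Round-tripping a tensor through this pair bounds the per-element error at half a quantization step (scale/2) per group, which is the usual trade made by symmetric INT4.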

Target Hardware

  • Jetson AGX Orin 64GB: 17 ms weight loading per frame (vs 68 ms at bf16), real-time full-duplex at 12.5 fps
  • Jetson AGX Orin 32GB: Fits comfortably (~6GB with runtime overhead)
  • Any GPU with ≥8GB VRAM: enables PersonaPlex on consumer GPUs

Quick Start

from huggingface_hub import hf_hub_download

# No HF token needed - this repo is public
weight_path = hf_hub_download("cudabenchmarktest/personaplex-7b-nf4", "model-nf4.safetensors", token=False)

# Dequantize to bf16 for standard loading
from dequant_loader import dequant_nf4
# Or use with personaplex-setup server:
# python -m moshi.server --moshi-weight model-nf4.safetensors --device cuda

Server Usage

# Load NF4 (auto-dequants to bf16 at startup, ~16GB VRAM)
python -m moshi.server --moshi-weight model-nf4.safetensors --device cuda

# With CPU Mimi (saves ~840MB)
python -m moshi.server --moshi-weight model-nf4.safetensors --cpu-mimi --device cuda

Voice Cloning

python clone-voice.py --input my_voice.wav --name MyVoice --device cuda
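The preprocessing named earlier (normalization and multi-segment averaging) can be sketched as below. This is not the clone-voice.py pipeline: the denoiser is omitted, and `embed` is a hypothetical stand-in for the model's speaker encoder:

```python
import numpy as np

def peak_normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale a mono waveform so its largest sample reaches `peak`."""
    m = np.abs(audio).max()
    return audio if m == 0 else audio * (peak / m)

def average_embedding(segments, embed) -> np.ndarray:
    """Embed each normalized segment and average into one voice vector.

    `embed` maps a 1-D waveform to a 1-D embedding (hypothetical here).
    """
    embs = np.stack([embed(peak_normalize(s)) for s in segments])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)   # unit norm, common for speaker vectors
```

Averaging over several segments smooths out per-clip recording artifacts, which is the usual rationale for multi-segment cloning.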

Included Files

File                         Description
model-nf4.safetensors        INT4 NF4 quantized weights (4.14 GB)
linear2bit.py                Native 2-bit module (for turbo2bit, included for reference)
dequant-loader.py            Programmatic dequantizer
clone-voice.py               Voice cloning with LuxTTS preprocessing
quantize-weights.py          Script to reproduce this quantization
tokenizer-*.safetensors      Mimi audio codec
tokenizer_spm_32k_3.model    Text tokenizer

License

Same as the base model: NVIDIA Open Model License

Credits

Quantized by open-agents-ai for edge deployment. Base model by NVIDIA Research (PersonaPlex paper).
