PersonaPlex 7B – INT4 NF4 Quantized

Quantized version of nvidia/personaplex-7b-v1 for edge deployment on NVIDIA Jetson AGX Orin and other memory-constrained devices.

Specs

                     Original    Quantized
Size                 15.59 GB    4.14 GB
Precision            bf16        INT4 (NF4)
Compression          1x          3.8x
Tensors quantized    -           398/475
Tensors kept fp16    -           77 (norms, biases, embeddings)
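The 3.8x figure can be sanity-checked from the format itself. A rough bits-per-weight model (illustrative only; it ignores the 77 fp16-kept tensors and safetensors metadata, which account for the small gap):

```python
# Rough bits-per-weight check for the quantized tensors.
bits_weight = 4                       # INT4 payload
bits_scale = 16 / 64                  # one fp16 scale per group of 64 weights
eff_bits = bits_weight + bits_scale   # 4.25 effective bits per quantized weight
ratio = 16 / eff_bits                 # vs. 16-bit bf16
print(f"{ratio:.2f}x")                # ~3.76x, close to the observed 3.8x
```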

Quantization Method

Symmetric per-group INT4 quantization with group_size=64:

  • Weight matrices with ≥4096 elements quantized to 4-bit
  • Norms, biases, and embeddings kept in fp16 for quality
  • Packed as uint8 (two INT4 values per byte)
  • Per-group scales stored as fp16
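The steps above can be sketched in NumPy. This is illustrative only; quantize-weights.py in this repo is the authoritative implementation, and details such as the exact rounding mode and nibble order are assumptions here:

```python
import numpy as np

GROUP_SIZE = 64

def quantize_int4(w: np.ndarray):
    """Symmetric per-group INT4 quantization with fp16 scales.

    Assumes w.size is a multiple of GROUP_SIZE.
    """
    flat = w.astype(np.float32).reshape(-1, GROUP_SIZE)
    # Symmetric: map each group's max magnitude to +/-7
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scales = np.maximum(scales, 1e-8)              # avoid divide-by-zero
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    # Pack two signed 4-bit values per uint8 (low nibble first)
    u = (q & 0x0F).astype(np.uint8).reshape(-1, 2)
    packed = u[:, 0] | (u[:, 1] << 4)
    return packed, scales.astype(np.float16)

def dequantize_int4(packed: np.ndarray, scales: np.ndarray, shape):
    """Unpack nibbles, restore sign, and rescale back to float32."""
    u = np.stack([packed & 0x0F, packed >> 4], axis=1).reshape(-1)
    q = u.astype(np.int8)
    q[q > 7] -= 16                                 # sign-extend the nibble
    w = q.reshape(-1, GROUP_SIZE) * scales.astype(np.float32)
    return w.reshape(shape)
```

Round-tripping a tensor through this pair bounds the per-element error at half a quantization step (scale/2) per group, which is the usual trade made by symmetric INT4.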

Target Hardware

  • Jetson AGX Orin 64GB: 17 ms weight loading per frame (vs 68 ms at bf16), real-time full-duplex at 12.5 fps
  • Jetson AGX Orin 32GB: Fits comfortably (~6GB with runtime overhead)
  • Any GPU with ≥8GB VRAM: enables PersonaPlex on consumer GPUs

Quick Start

from huggingface_hub import hf_hub_download

# No HF token needed - this repo is public
weight_path = hf_hub_download("cudabenchmarktest/personaplex-7b-nf4", "model-nf4.safetensors", token=False)

# Dequantize to bf16 for standard loading
from dequant_loader import dequant_nf4
# Or use with personaplex-setup server:
# python -m moshi.server --moshi-weight model-nf4.safetensors --device cuda

Server Usage

# Load NF4 (auto-dequants to bf16 at startup, ~16GB VRAM)
python -m moshi.server --moshi-weight model-nf4.safetensors --device cuda

# With CPU Mimi (saves ~840MB)
python -m moshi.server --moshi-weight model-nf4.safetensors --cpu-mimi --device cuda

Voice Cloning

python clone-voice.py --input my_voice.wav --name MyVoice --device cuda
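The preprocessing named earlier (normalization and multi-segment averaging) can be sketched as below. This is not the clone-voice.py pipeline: the denoiser is omitted, and `embed` is a hypothetical stand-in for the model's speaker encoder:

```python
import numpy as np

def peak_normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale a mono waveform so its largest sample reaches `peak`."""
    m = np.abs(audio).max()
    return audio if m == 0 else audio * (peak / m)

def average_embedding(segments, embed) -> np.ndarray:
    """Embed each normalized segment and average into one voice vector.

    `embed` maps a 1-D waveform to a 1-D embedding (hypothetical here).
    """
    embs = np.stack([embed(peak_normalize(s)) for s in segments])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)   # unit norm, common for speaker vectors
```

Averaging over several segments smooths out per-clip recording artifacts, which is the usual rationale for multi-segment cloning.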

Included Files

File                         Description
model-nf4.safetensors        INT4 NF4 quantized weights (4.14 GB)
linear2bit.py                Native 2-bit module (for turbo2bit, included for reference)
dequant-loader.py            Programmatic dequantizer
clone-voice.py               Voice cloning with LuxTTS preprocessing
quantize-weights.py          Script to reproduce this quantization
tokenizer-*.safetensors      Mimi audio codec
tokenizer_spm_32k_3.model    Text tokenizer

License

Same as the base model: NVIDIA Open Model License

Credits

Quantized by open-agents-ai for edge deployment. Base model by NVIDIA Research (PersonaPlex paper).
