SAINEMO-reMIX (FP4 Blackwell Optimized)

This repository contains the FP4 (NVFP4) quantized version of Moraliane/SAINEMO-reMIX (12B parameters), optimized for the NVIDIA Blackwell (B200/B100) architecture.

πŸš€ Key Features

  • Format: NVFP4 (E2M1) using NVIDIA ModelOpt.
  • Native Blackwell Support: Leverages the hardware FP4 engines of the Blackwell architecture for maximum inference throughput.
  • Pre-compiled Engine: Includes a ready-to-use TensorRT-LLM engine for Blackwell (Compute Capability 12.0).
  • Fixed Tokenizer: Configured to work seamlessly with both transformers (v4.56+) and TensorRT-LLM.

πŸ“¦ Repository Structure

  • model.safetensors: Quantized weights in HF format (with _amax metadata).
  • tokenizer_config.json: Patched for generic fast tokenizer support.
  • engine/:
    • rank0.engine: Pre-compiled inference engine for NVIDIA Blackwell (CC 12.0).
    • config.json: Engine configuration.

πŸ›  Usage

Using the Pre-compiled Engine (Fastest)

Requirements: tensorrt-llm (v1.1.0+) installed on a Blackwell machine.

import torch
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

# Path to the 'engine' folder from this repo
engine_dir = "./engine"

# Initialize the runner from the pre-built engine
runner = ModelRunner.from_dir(engine_dir=engine_dir, rank=0)

# Load the tokenizer from the root of this repo
tokenizer = AutoTokenizer.from_pretrained(".")

# Encode the prompt; ModelRunner expects a list of 1-D int tensors
input_ids = tokenizer.encode("Hello, who are you?", return_tensors="pt").int()
outputs = runner.generate(
    [input_ids[0]],
    max_new_tokens=50,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
)

# outputs has shape [batch, beams, seq_len]; strip the prompt before decoding
print(tokenizer.decode(outputs[0][0][len(input_ids[0]):], skip_special_tokens=True))

Re-building the Engine

If you need to change build parameters (e.g., max batch size or sequence length):

# Convert to TRT-LLM checkpoint
python convert_checkpoint.py --model_dir . --output_dir ./trt_ckpt --dtype bfloat16

# Build engine for Blackwell
trtllm-build --checkpoint_dir ./trt_ckpt \
             --output_dir ./engine \
             --gemm_plugin nvfp4

πŸ“Š Quantization Details

Quantized using NVIDIA ModelOpt with mtq.NVFP4_DEFAULT_CFG.

  • Calibration: Offline calibration on 512 samples of cnn_dailymail.
  • Precision: Weights and Activations in FP4, KV Cache in BF16 (default).

⚠️ Important Note

This model requires an NVIDIA GPU with Compute Capability 12.0 (Blackwell) for native FP4 execution. The pre-compiled engine will not load on other architectures, and software emulation of FP4, where available, is significantly slower.
