SAINEMO-reMIX (FP4 Blackwell Optimized)
This repository contains the FP4 (NVFP4) quantized version of Moraliane/SAINEMO-reMIX, specifically optimized for the NVIDIA Blackwell (B200/B100) architecture.
🚀 Key Features
- Format: NVFP4 (E2M1) using NVIDIA ModelOpt.
- Native Blackwell Support: Leverages the hardware FP4 engines of the Blackwell architecture for maximum inference throughput.
- Pre-compiled Engine: Includes a ready-to-use TensorRT-LLM engine for Blackwell (Compute Capability 12.0).
- Fixed Tokenizer: Configured to work seamlessly with both `transformers` (v4.56+) and TensorRT-LLM.
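For intuition about the E2M1 format above: with 1 sign, 2 exponent, and 1 mantissa bit, each FP4 value is one of only 16 numbers (±{0, 0.5, 1, 1.5, 2, 3, 4, 6}). A toy round-to-nearest sketch (real NVFP4 additionally applies per-block scale factors, which this omits):

```python
# The 8 positive magnitudes representable in FP4 E2M1 (plus their negatives).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_e2m1(x: float) -> float:
    """Round x to the nearest E2M1-representable value (no block scaling)."""
    mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return -mag if x < 0 else mag

# Values snap onto the sparse E2M1 grid, e.g. anything above 5 rounds to 6.
print(quantize_e2m1(2.4), quantize_e2m1(-5.1))
```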
📦 Repository Structure
- `model.safetensors`: Quantized weights in HF format (with `amax` metadata).
- `tokenizer_config.json`: Patched for generic fast tokenizer support.
- `engine/`:
  - `rank0.engine`: Pre-compiled inference engine for NVIDIA Blackwell (CC 12.0).
  - `config.json`: Engine configuration.
🚀 Usage
Using the Pre-compiled Engine (Fastest)
Requirements: `tensorrt-llm` (v1.1.0+) installed on a Blackwell machine.
```python
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer
import torch

# Path to the 'engine' folder from this repo
engine_dir = "./engine"

# Initialize runner
runner = ModelRunner.from_dir(engine_dir=engine_dir, rank=0)

# Load tokenizer from the root of this repo
tokenizer = AutoTokenizer.from_pretrained(".")

# Generate
input_ids = tokenizer.encode("Hello, who are you?", return_tensors="pt").int().cuda()
outputs = runner.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0][0][len(input_ids[0]):], skip_special_tokens=True))
```
Re-building the Engine
If you need to change build parameters (e.g., max batch size or sequence length):
```shell
# Convert to TRT-LLM checkpoint
python convert_checkpoint.py --model_dir . --output_dir ./trt_ckpt --dtype bfloat16

# Build engine for Blackwell
trtllm-build --checkpoint_dir ./trt_ckpt \
    --output_dir ./engine \
    --gemm_plugin nvfp4
```
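To change the capacity limits mentioned above, `trtllm-build` also accepts flags such as `--max_batch_size` and `--max_seq_len`; the values below are illustrative, not the settings used for the shipped engine:

```shell
# Rebuild with custom capacity limits (example values)
trtllm-build --checkpoint_dir ./trt_ckpt \
    --output_dir ./engine \
    --gemm_plugin nvfp4 \
    --max_batch_size 8 \
    --max_seq_len 4096
```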
📊 Quantization Details
Quantized using NVIDIA ModelOpt with mtq.NVFP4_DEFAULT_CFG.
- Calibration: Offline calibration on 512 samples of cnn_dailymail.
- Precision: Weights and activations in FP4, KV cache in BF16 (default).
⚠️ Important Note
This model requires an NVIDIA GPU with Compute Capability 12.0 (Blackwell) for native FP4 execution. On older architectures, it will either fail to run or run significantly slower via emulation.
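A minimal pre-flight check for this requirement; the `(12, 0)` threshold mirrors the note above, and obtaining the actual capability (e.g. via `torch.cuda.get_device_capability(0)`) is left to the caller:

```python
def supports_native_fp4(capability):
    """True if a (major, minor) compute capability meets the CC 12.0 (Blackwell) floor."""
    return tuple(capability) >= (12, 0)

# Example gate; in practice, pass torch.cuda.get_device_capability(0) here.
if not supports_native_fp4((12, 0)):
    raise RuntimeError("Native FP4 requires Compute Capability 12.0+ (Blackwell)")
```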