skt/A.X-3.1 — NVFP4 Quantized

NVIDIA FP4 (NVFP4) quantized version of skt/A.X-3.1, a 35B-parameter Korean large language model.

Model Details

| Property | Value |
|---|---|
| Base Model | skt/A.X-3.1 (35B params) |
| Architecture | LlamaForCausalLM |
| Quantization | NVFP4 (4-bit floating point, Blackwell-native) |
| Quantization Tool | nvidia-modelopt v0.44.0 |
| Quantization Config | NVFP4_DEFAULT_CFG (max algorithm) |
| Model Size | ~20.5 GB (3 shards) |
| Original Size | ~64.6 GB (FP16) |
| Compression Ratio | 3.15x |
| Context Length | 32,768 tokens |
| Vocab Size | 102,400 |
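
The Quantization Tool and Quantization Config entries above map onto a short ModelOpt call. The following is a minimal sketch and not the script used to produce this checkpoint: `quantize_nvfp4`, `calib_batches`, and `export_dir` are hypothetical names, and the base model is assumed to be already loaded on a suitable GPU. Imports are done lazily so the function can be defined on a machine without the GPU stack.

```python
def quantize_nvfp4(model, calib_batches, export_dir):
    """Quantize a loaded Hugging Face causal LM to NVFP4 and export it.

    model         : an already-loaded torch model (e.g. the FP16 base model)
    calib_batches : iterable of dicts of tensors accepted by model(**batch)
    export_dir    : hypothetical output directory for the quantized checkpoint
    """
    # Lazy imports: torch and nvidia-modelopt are only needed when called.
    import torch
    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_hf_checkpoint

    def forward_loop(m):
        # The "max" algorithm only needs forward passes so the inserted
        # quantizers can record maximum absolute values.
        with torch.no_grad():
            for batch in calib_batches:
                m(**batch)

    # Insert quantizers and calibrate them with the default NVFP4 config.
    model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

    # Write a Hugging Face style checkpoint that inference engines can load.
    with torch.inference_mode():
        export_hf_checkpoint(model, export_dir=export_dir)
    return model
```

Because calibration here is a handful of forward passes, the cost is dominated by loading and exporting the weights.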

Performance

Benchmarked on NVIDIA DGX Spark (Blackwell GB10, 128GB unified LPDDR5X):

| Metric | NVFP4 (this model) | FP16 Original |
|---|---|---|
| PPL (8 Korean eval texts) | 4.49 | 4.88 |
| Speed (vLLM 0.19.1) | ~10 t/s | ~3.5 t/s |
| Memory | 20.5 GB | 64.6 GB |

Perplexity (PPL) was measured on 8 diverse Korean texts (289 tokens total) using the vLLM logprobs API. Lower is better.
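
For reference, perplexity is the exponential of the negative mean token log-probability. The sketch below shows that computation on per-token log-probabilities such as those returned by a logprobs API. It assumes natural-log values and pools all tokens across texts before averaging; the model card does not state whether the reported figure was pooled this way or averaged per text, so treat the pooling as an assumption. The function names are illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one token sequence from natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def corpus_perplexity(per_text_logprobs):
    """Token-weighted perplexity over several texts (pooling is an assumption)."""
    pooled = [lp for text in per_text_logprobs for lp in text]
    return perplexity(pooled)


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    demo = [[-1.2, -0.4, -2.0], [-0.9, -1.5]]
    print(f"PPL = {corpus_perplexity(demo):.2f}")
```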

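Key finding: on this small evaluation set, NVFP4 perplexity is no worse than FP16 (4.49 vs. 4.88; a gap of this size on 289 tokens is likely within noise), while generation is roughly 3x faster and memory use is roughly 3x lower.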
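
The ratios behind that summary can be recomputed directly from the table. This is plain arithmetic on the reported figures; the variable names are illustrative.

```python
# Figures copied from the Performance table above.
fp16_gb, nvfp4_gb = 64.6, 20.5
fp16_tps, nvfp4_tps = 3.5, 10.0
fp16_ppl, nvfp4_ppl = 4.88, 4.49

memory_ratio = fp16_gb / nvfp4_gb                        # size reduction factor
speedup = nvfp4_tps / fp16_tps                           # throughput gain factor
ppl_change_pct = (nvfp4_ppl - fp16_ppl) / fp16_ppl * 100  # negative = lower PPL

print(f"memory: {memory_ratio:.2f}x smaller")
print(f"speed:  {speedup:.2f}x faster")
print(f"PPL:    {ppl_change_pct:+.1f}% relative to FP16")
```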

How to Use

With vLLM (Recommended)

# Requires an NVIDIA Blackwell GPU (e.g. GB10, sm_121a) and a vLLM build with NVFP4 support
vllm serve dlsxj101/A.X-3.1-NVFP4 \
  --quantization fp4 \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85

With vLLM Docker

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  ghcr.io/bjk110/vllm-spark:v019-ngc2603 \
  python3 -m vllm.entrypoints.openai.api_server \
  --model dlsxj101/A.X-3.1-NVFP4 \
  --quantization fp4 \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 --port 8000
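
Loading about 20 GB of weights takes a while, so a client may want to wait until the server is up before sending requests. The sketch below polls the server's `/health` endpoint using only the standard library. It assumes the server listens on `localhost:8000` as in the commands above; `wait_until_ready`, the timeout values, and the injectable `probe` and `sleep` parameters are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url="http://localhost:8000/health", deadline_s=600.0,
                     interval_s=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until it is healthy or roughly `deadline_s` seconds pass."""
    waited = 0.0
    while waited <= deadline_s:
        if probe(url):
            return True
        sleep(interval_s)
        waited += interval_s
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "server did not come up in time")
```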

OpenAI-Compatible API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="dlsxj101/A.X-3.1-NVFP4",
    messages=[{"role": "user", "content": "한국의 AI 산업 현황을 설명해주세요."}],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
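
If the `openai` package is not available, the same endpoint can be called with the standard library. This sketch mirrors the request above as a raw HTTP call; `build_chat_request` and `send_chat` are hypothetical helper names, and the base URL assumes the local server from the previous sections.

```python
import json
import urllib.request


def build_chat_request(prompt, model="dlsxj101/A.X-3.1-NVFP4",
                       base_url="http://localhost:8000/v1",
                       max_tokens=1024, temperature=0.7):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer dummy"},
        method="POST",
    )


def send_chat(request, timeout=120.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(send_chat(build_chat_request("한국의 AI 산업 현황을 설명해주세요.")))
```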

Hardware Requirements

  • GPU: NVIDIA Blackwell architecture (GB10, GB100, GB200, B100, B200)
    • NVFP4 is a Blackwell-native format computed directly on Tensor Cores
    • Not compatible with pre-Blackwell GPUs (A100, H100, etc.)
  • Memory: ~21 GB GPU memory minimum for the weights; allow additional headroom for the KV cache
  • Software: vLLM >= 0.19.0 with NVFP4 support
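
A quick pre-flight check for the GPU requirement can be sketched as follows. It assumes that Blackwell-generation parts report a CUDA compute capability with major version 10 or higher (for example 12.1 for the GB10's sm_121a) while Hopper and Ampere report 9.x and 8.x. This only checks the architecture; whether a given vLLM build ships NVFP4 kernels for that architecture is a separate question. The function names are illustrative.

```python
def is_blackwell_or_newer(major, minor=0):
    """Heuristic: compute capability 10.x and above is treated as Blackwell+."""
    return major >= 10


def local_gpu_supports_nvfp4():
    """Check GPU 0 via PyTorch; returns False when no CUDA device is visible."""
    import torch  # imported lazily so the pure check above works without torch

    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(0)
    return is_blackwell_or_newer(major, minor)


if __name__ == "__main__":
    print("NVFP4-capable GPU detected:", local_gpu_supports_nvfp4())
```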

Quantization Details

  • Algorithm: max (NVFP4_DEFAULT_CFG) — measures maximum activation values per tensor
  • Group Size: 16
  • Excluded Modules: lm_head (kept in FP16)
  • Calibration: 8 English text samples (sufficient for max algorithm)
  • Quantization Time: ~1 minute on DGX Spark
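
To make the group size concrete, the sketch below simulates NVFP4-style rounding in pure Python: each group of 16 values shares one scale, chosen so the group's largest magnitude maps to 6.0, the largest value representable in the 4-bit E2M1 format. It is a simplified illustration rather than ModelOpt's implementation: the scale is kept in full precision here, whereas the real format stores each group scale in FP8 alongside a per-tensor scale. All names are illustrative.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16

# 4 bits per value plus one 8-bit scale shared by each group of 16 values.
BITS_PER_WEIGHT = 4 + 8 / GROUP_SIZE


def fake_quantize_block(block):
    """Round one group to the E2M1 grid using a max-based scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_VALUES[-1]  # largest magnitude maps to 6.0
    out = []
    for x in block:
        nearest = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize(values, group_size=GROUP_SIZE):
    """Quantize and dequantize a flat list of floats group by group."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize_block(values[start:start + group_size]))
    return out


if __name__ == "__main__":
    weights = [0.03 * i - 0.2 for i in range(16)]
    restored = fake_quantize(weights)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"bits per weight: {BITS_PER_WEIGHT}, worst abs error: {worst:.4f}")
```

The 4.5 bits per quantized weight, together with modules kept in FP16 such as lm_head, is one reason the overall compression ratio comes out below the 4x that a pure 4-bit format would suggest.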

Qualitative Evaluation

Tested across 8 categories (Korean knowledge, logic, creative writing, coding, summarization, math, fact-checking, English):

  • Korean Knowledge: Accurate, well-structured responses identical to FP16
  • Logic/Reasoning: Correct problem-solving with proper mathematical notation
  • Creative Writing: Natural Korean poetry with appropriate imagery
  • Coding: Correct Python code with proper explanations
  • Summarization: Concise and accurate 3-sentence summaries
  • Math: Correct differentiation with step-by-step solutions
  • Fact-Checking: Accurate historical information
  • English: Clear, well-organized English explanations

License

This model is released under the Apache 2.0 license, same as the base model skt/A.X-3.1.

Acknowledgments

  • Quantum Nexus — Quantization, benchmarking, and deployment performed on Quantum Nexus's NVIDIA DGX Spark (Blackwell GB10, 128GB)
  • SKT for the original A.X-3.1 model
  • NVIDIA for ModelOpt quantization toolkit and DGX Spark hardware
  • vLLM team for NVFP4 inference support