
Libraries

How to use dlsxj101/A.X-3.1-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="dlsxj101/A.X-3.1-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dlsxj101/A.X-3.1-NVFP4")
model = AutoModelForCausalLM.from_pretrained("dlsxj101/A.X-3.1-NVFP4", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use dlsxj101/A.X-3.1-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dlsxj101/A.X-3.1-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dlsxj101/A.X-3.1-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/dlsxj101/A.X-3.1-NVFP4

SGLang

How to use dlsxj101/A.X-3.1-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dlsxj101/A.X-3.1-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dlsxj101/A.X-3.1-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dlsxj101/A.X-3.1-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dlsxj101/A.X-3.1-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use dlsxj101/A.X-3.1-NVFP4 with Docker Model Runner:
```
docker model run hf.co/dlsxj101/A.X-3.1-NVFP4
```

skt/A.X-3.1 — NVFP4 Quantized

NVIDIA FP4 (NVFP4) quantized version of skt/A.X-3.1, a 35B-parameter Korean large language model.

Model Details

Property	Value
Base Model	skt/A.X-3.1 (35B params)
Architecture	LlamaForCausalLM
Quantization	NVFP4 (4-bit floating point, Blackwell-native)
Quantization Tool	nvidia-modelopt v0.44.0
Quantization Config	`NVFP4_DEFAULT_CFG` (max algorithm)
Model Size	~20.5 GB (3 shards)
Original Size	~64.6 GB (FP16)
Compression Ratio	3.15x
Context Length	32,768 tokens
Vocab Size	102,400

Performance

Benchmarked on NVIDIA DGX Spark (Blackwell GB10, 128GB unified LPDDR5X):

Metric	NVFP4 (this model)	FP16 Original
PPL (8 Korean eval texts)	4.49	4.88
Speed (vLLM 0.19.1)	~10 t/s	~3.5 t/s
Memory	20.5 GB	64.6 GB

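Perplexity (PPL) was measured on 8 diverse Korean texts (289 tokens total) using the vLLM logprobs API. Lower is better. With a sample this small, the gap between 4.49 and 4.88 is best read as "no measurable degradation" rather than as an improvement over FP16.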
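As a rough illustration of the metric, perplexity is the exponential of the negative mean token log-probability. The sketch below assumes the per-token logprobs have already been collected (for example from the vLLM logprobs API mentioned above) and pools all tokens across texts. The function name and the pooling choice are assumptions for illustration, not the exact script used for this card.

```python
import math


def perplexity(token_logprobs):
    """Return exp(-mean(logprob)) over a flat list of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


# Hypothetical example: four tokens, each predicted with probability 0.25.
example_logprobs = [math.log(0.25)] * 4
example_ppl = perplexity(example_logprobs)  # approximately 4.0
```

A model that assigns every token probability 0.25 has a perplexity of about 4, which gives a feel for the 4.49 and 4.88 values reported above.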

Key finding: on this setup, NVFP4 quantization matches FP16 perplexity while generating about 2.9x faster (~10 vs ~3.5 t/s) and using about 3.15x less memory (20.5 GB vs 64.6 GB).
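The speed and memory ratios follow directly from the Performance table. A quick check in Python, using only the reported figures (the throughput numbers are approximate, so the speedup is too):

```python
# Figures reported in the Performance table (DGX Spark, vLLM 0.19.1).
nvfp4_tps, fp16_tps = 10.0, 3.5   # tokens per second (approximate)
nvfp4_gb, fp16_gb = 20.5, 64.6    # memory footprint in GB

speedup = nvfp4_tps / fp16_tps
memory_ratio = fp16_gb / nvfp4_gb

print(f"speedup: {speedup:.2f}x, memory reduction: {memory_ratio:.2f}x")
# prints "speedup: 2.86x, memory reduction: 3.15x"
```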

Benchmark Results (Accuracy vs Original)

Evaluated using the same Chat CoT protocol as the original model (0-shot, chat template applied, exact_match on the generated answer — the Llama 3 evaluation methodology SKT used for A.X-3.1). This ensures a fair, apples-to-apples comparison between the original FP16 model and the NVFP4 quantized version.

Category	Benchmark	A.X-3.1 (Original FP16)	A.X-3.1-NVFP4	Recovery
Knowledge	KMMLU (Chat CoT, 0-shot)	69.73%	67.08%	96.2%
Knowledge	CLIcK (Chat CoT, 0-shot)	77.09%	76.99%	99.9%
Knowledge	MMLU (CoT, 0-shot, test)	75.20%	73.22%	97.4%
Instruction	IFEval (0-shot)	87.11%	85.29%	97.9%
Math	MATH (CoT, 0-shot)	75.40%	73.54%	97.5%
	Average			97.8%

Average recovery is 97.8% across the 5 benchmarks, so NVFP4 4-bit quantization preserves nearly all of the original model's accuracy. The largest drop is on KMMLU (2.65pp); on CLIcK the gap is just 0.10pp (essentially lossless).
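Recovery is assumed here to be the quantized score divided by the original score. That definition reproduces the per-benchmark figures and the 97.8% average in the table, as this small sketch shows (the variable names are illustrative):

```python
# (original FP16 score, NVFP4 score) in percent, copied from the table above.
scores = {
    "KMMLU": (69.73, 67.08),
    "CLIcK": (77.09, 76.99),
    "MMLU": (75.20, 73.22),
    "IFEval": (87.11, 85.29),
    "MATH": (75.40, 73.54),
}

# Recovery = quantized / original, expressed as a percentage.
recovery = {name: 100.0 * quant / orig for name, (orig, quant) in scores.items()}
average_recovery = sum(recovery.values()) / len(recovery)

for name, value in recovery.items():
    print(f"{name:7s} {value:.1f}%")
print(f"Average {average_recovery:.1f}%")
```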

Per-domain breakdown:

KMMLU (45 subjects, 35,030 Q)	STEM	HUMSS	Applied Science	Other
A.X-3.1-NVFP4	69.40%	69.22%	65.48%	65.25%

CLIcK (1,995 Q)	Culture	Language
A.X-3.1-NVFP4	78.96%	72.92%

MMLU (14,042 Q, test)	STEM	Social Sciences	Other	Humanities
A.X-3.1-NVFP4	80.7%	80.4%	75.7%	61.9%

MATH (5,000 Q)	Algebra	Prealgebra	Num. Theory	Counting	Precalc	Geometry	Int. Algebra
A.X-3.1-NVFP4	88.3%	79.8%	69.8%	69.6%	67.2%	62.2%	62.2%

IFEval (4 sub-metrics): prompt-strict 81.89% · inst-strict 87.29% · prompt-loose 83.36% · inst-loose 88.61% (avg 85.29%)
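The 85.29% IFEval figure is the plain mean of the four sub-metrics. A minimal check, using `decimal` so that the half-way value 85.2875 rounds the same way on every platform:

```python
from decimal import Decimal, ROUND_HALF_UP

# IFEval sub-metrics for the NVFP4 model, in percent, as listed above.
sub_metrics = {
    "prompt-strict": Decimal("81.89"),
    "inst-strict": Decimal("87.29"),
    "prompt-loose": Decimal("83.36"),
    "inst-loose": Decimal("88.61"),
}

ifeval_average = sum(sub_metrics.values()) / len(sub_metrics)
ifeval_rounded = ifeval_average.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(ifeval_average, ifeval_rounded)  # prints "85.2875 85.29"
```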

Evaluation used lm-evaluation-harness via local-chat-completions, with vLLM 0.19.1 on NVIDIA DGX Spark. Original FP16 scores are taken from the skt/A.X-3.1 model card. Knowledge benchmarks use the 0-shot Chat CoT protocol (chat template + step-by-step reasoning + exact_match); MMLU uses flexible-extract on the full test split, and MATH uses math_verify (symbolic equivalence), both to match the original's methodology. IFEval recovery is computed against the 4-metric average.

How to Use

With vLLM (Recommended)

# Requires an NVIDIA Blackwell GPU (e.g. sm_121a on the DGX Spark GB10) and a vLLM build with NVFP4 support
vllm serve dlsxj101/A.X-3.1-NVFP4 \
  --quantization fp4 \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85

With vLLM Docker

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  ghcr.io/bjk110/vllm-spark:v019-ngc2603 \
  python3 -m vllm.entrypoints.openai.api_server \
  --model dlsxj101/A.X-3.1-NVFP4 \
  --quantization fp4 \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 --port 8000

OpenAI-Compatible API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="dlsxj101/A.X-3.1-NVFP4",
    messages=[{"role": "user", "content": "한국의 AI 산업 현황을 설명해주세요."}],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
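
If the `openai` package is not installed, the same endpoint can be called with the standard library only. This is a minimal sketch assuming the server from the sections above is listening on `localhost:8000`; the helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of vLLM or the OpenAI SDK.

```python
import json
import urllib.request

MODEL_ID = "dlsxj101/A.X-3.1-NVFP4"


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       max_tokens=1024, temperature=0.7):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server; sending it does.
request = build_chat_request("What is the capital of France?")
# print(send_chat_request(request))  # uncomment once the vLLM server is up
```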

Hardware Requirements

GPU: NVIDIA Blackwell architecture (GB10, GB100, GB200, B100, B200)
- NVFP4 is a Blackwell-native format computed directly on Tensor Cores
- Not compatible with pre-Blackwell GPUs (A100, H100, etc.)
Memory: ~21 GB GPU memory minimum for the weights; the KV cache and activations need additional headroom
Software: vLLM >= 0.19.0 with NVFP4 support
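
A quick pre-flight check can save a failed server start on unsupported hardware. The sketch below rests on one assumption: Blackwell parts report a CUDA compute capability with major version 10 or higher (the `sm_121a` target mentioned above corresponds to 12.1), while A100 and H100 report 8.0 and 9.0. Any later architecture would also pass this check, so treat it as a coarse filter. The helper names are hypothetical, and PyTorch is used only if it happens to be installed.

```python
def has_blackwell_class_capability(capability):
    """Return True if a (major, minor) compute capability is 10.x or higher."""
    major, _minor = capability
    return major >= 10


def detect_capability(device_index=0):
    """Return the (major, minor) capability of a CUDA device, or None if unknown."""
    try:
        import torch  # optional dependency, only needed for detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


if __name__ == "__main__":
    capability = detect_capability()
    if capability is None:
        print("Could not detect a CUDA GPU (PyTorch missing or no device).")
    elif has_blackwell_class_capability(capability):
        print(f"Compute capability {capability}: NVFP4 kernels should be usable.")
    else:
        print(f"Compute capability {capability}: pre-Blackwell GPU, NVFP4 not supported.")
```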

Quantization Details

Algorithm: max (NVFP4_DEFAULT_CFG) — measures maximum activation values per tensor
Group Size: 16
Excluded Modules: lm_head (kept in FP16)
Calibration: 8 English text samples (sufficient for max algorithm)
Quantization Time: ~1 minute on DGX Spark
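
To make "Group Size: 16" concrete, here is a simplified sketch of block-wise FP4 quantization in pure Python. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and one scale per group of 16 weights chosen so that the group's largest magnitude maps to 6. It keeps scales as ordinary Python floats and skips activation calibration entirely, so it illustrates the idea rather than reproducing what nvidia-modelopt does.

```python
import math

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable FP4 magnitudes
GROUP_SIZE = 16


def quantize_block(block):
    """Return (scale, codes): codes are signed E2M1 values, scale maps them back."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # the largest magnitude lands on 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(target - g))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def fake_quantize(weights, group_size=GROUP_SIZE):
    """Quantize then dequantize a flat list of weights, one scale per group."""
    restored = []
    for start in range(0, len(weights), group_size):
        scale, codes = quantize_block(weights[start:start + group_size])
        restored.extend(scale * c for c in codes)
    return restored


# Small illustrative block (shorter than 16 for readability).
example_scale, example_codes = quantize_block([0.0, 0.3, -0.6, 1.2])
# example_codes == [0.0, 1.5, -3.0, 6.0]
```

Because each group of 16 gets its own scale, a single outlier only coarsens the resolution of its own group instead of the whole tensor.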

Qualitative Evaluation

Tested across 8 categories (Korean knowledge, logic, creative writing, coding, summarization, math, fact-checking, English):

Korean Knowledge: Accurate, well-structured responses identical to FP16
Logic/Reasoning: Correct problem-solving with proper mathematical notation
Creative Writing: Natural Korean poetry with appropriate imagery
Coding: Correct Python code with proper explanations
Summarization: Concise and accurate 3-sentence summaries
Math: Correct differentiation with step-by-step solutions
Fact-Checking: Accurate historical information
English: Clear, well-organized English explanations

License

This model is released under the Apache 2.0 license, same as the base model skt/A.X-3.1.

Acknowledgments

Quantum Nexus — Quantization, benchmarking, and deployment performed on Quantum Nexus's NVIDIA DGX Spark (Blackwell GB10, 128GB)
SKT for the original A.X-3.1 model
NVIDIA for ModelOpt quantization toolkit and DGX Spark hardware
vLLM team for NVFP4 inference support
