skt/A.X-3.1 — NVFP4 Quantized
NVIDIA FP4 (NVFP4) quantized version of skt/A.X-3.1, a 35B-parameter Korean large language model.
Model Details
| Property | Value |
|---|---|
| Base Model | skt/A.X-3.1 (35B params) |
| Architecture | LlamaForCausalLM |
| Quantization | NVFP4 (4-bit floating point, Blackwell-native) |
| Quantization Tool | nvidia-modelopt v0.44.0 |
| Quantization Config | NVFP4_DEFAULT_CFG (max algorithm) |
| Model Size | ~20.5 GB (3 shards) |
| Original Size | ~64.6 GB (FP16) |
| Compression Ratio | 3.15x |
| Context Length | 32,768 tokens |
| Vocab Size | 102,400 |
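The compression ratio in the table can be reproduced from the two sizes. The sketch below also compares it with an idealized ratio, under the assumption that NVFP4 stores each weight as a 4-bit value plus one 8-bit scale shared by a 16-element block (about 4.5 bits per weight). That storage layout is stated here as an assumption and is not taken from this model card.

```python
# Figures copied from the Model Details table.
original_gb = 64.6   # FP16 checkpoint size
quantized_gb = 20.5  # NVFP4 checkpoint size

compression_ratio = original_gb / quantized_gb

# Assumption: 4-bit values plus one 8-bit scale per 16-element block.
bits_per_weight = 4 + 8 / 16
ideal_ratio = 16 / bits_per_weight

print(f"measured compression: {compression_ratio:.2f}x")
print(f"ideal ratio if every tensor were quantized: {ideal_ratio:.2f}x")
```

The measured ratio is lower than the idealized one, which is consistent with some tensors (such as lm_head, listed under Quantization Details) staying in FP16.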
Performance
Benchmarked on NVIDIA DGX Spark (Blackwell GB10, 128GB unified LPDDR5X):
| Metric | NVFP4 (this model) | FP16 Original |
|---|---|---|
| PPL (8 Korean eval texts) | 4.49 | 4.88 |
| Speed (vLLM 0.19.1, tokens/s) | ~10 | ~3.5 |
| Memory | 20.5 GB | 64.6 GB |
Perplexity (PPL) was measured on 8 diverse Korean texts (289 tokens total) using the vLLM logprobs API. Lower is better.
Key finding: on this small sample, NVFP4 perplexity is on par with FP16 (4.49 vs 4.88), while generation is roughly 2.9x faster (~10 vs ~3.5 t/s) and the model needs about 3.15x less memory (20.5 GB vs 64.6 GB).
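For reference, perplexity can be computed from per-token log probabilities as the exponential of the negative mean log probability. The following is a minimal sketch, not the script used for the numbers above. It assumes natural-log probabilities and pools tokens across all texts before averaging; the card does not say whether the reported PPL was pooled or averaged per text. The function names and the sample values are illustrative only.

```python
import math


def perplexity(token_logprobs):
    """Return exp(-mean(logprob)) for a list of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def corpus_perplexity(per_text_logprobs):
    """Pool the tokens of all texts, then compute one perplexity (an assumption)."""
    pooled = [lp for text in per_text_logprobs for lp in text]
    return perplexity(pooled)


# Made-up log probabilities for two short texts, for illustration only.
example = [[-1.2, -0.4, -2.0], [-0.9, -1.5]]
print(round(corpus_perplexity(example), 3))
```

With only 289 tokens in total, small differences between two models fall within the noise of such an estimate.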
How to Use
With vLLM (Recommended)
# Requires an NVIDIA Blackwell GPU (e.g. GB10, sm_121a) and a vLLM build with NVFP4 support
vllm serve dlsxj101/A.X-3.1-NVFP4 \
--quantization fp4 \
--dtype float16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.85
With vLLM Docker
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
ghcr.io/bjk110/vllm-spark:v019-ngc2603 \
python3 -m vllm.entrypoints.openai.api_server \
--model dlsxj101/A.X-3.1-NVFP4 \
--quantization fp4 \
--dtype float16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.85 \
--host 0.0.0.0 --port 8000
OpenAI-Compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="dlsxj101/A.X-3.1-NVFP4",
    messages=[{"role": "user", "content": "한국의 AI 산업 현황을 설명해주세요."}],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
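If the openai package is not installed, the same endpoint can be called with the Python standard library. This is a sketch under the assumption that the server from the vLLM commands above is listening on localhost:8000; `build_chat_request` and `send` are hypothetical helper names, not part of any library.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       model="dlsxj101/A.X-3.1-NVFP4",
                       max_tokens=1024, temperature=0.7):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("Explain the current state of Korea's AI industry.")
print(req.full_url)
# With a running server: print(send(req))
```

Building the request separately from sending it makes it possible to inspect the payload without a running server.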
Hardware Requirements
- GPU: NVIDIA Blackwell architecture (GB10, GB100, GB200, B100, B200)
  - NVFP4 is a Blackwell-native format computed directly on Tensor Cores
  - Not compatible with pre-Blackwell GPUs (A100, H100, etc.)
- Memory: ~21 GB of GPU memory minimum for the weights alone; the KV cache and activations need additional headroom
- Software: vLLM >= 0.19.0 with NVFP4 support
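A coarse way to check the GPU requirement before starting the server is to read the compute capability that `nvidia-smi` reports. The sketch below assumes that Blackwell-generation parts report a major version of 10 or higher (Hopper reports 9.x and Ampere 8.x), so it would also accept newer architectures, and it assumes a driver recent enough to support the `compute_cap` query field. The helper names are hypothetical.

```python
import subprocess


def parse_compute_cap(text):
    """Parse a string such as '12.1' into the tuple (12, 1)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def looks_like_blackwell(cap):
    """Coarse check: assumes Blackwell-class GPUs report major version >= 10."""
    return cap[0] >= 10


def local_compute_caps():
    """Query nvidia-smi for the compute capability of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_compute_cap(line) for line in out.splitlines() if line.strip()]


# Example with a hard-coded value; call local_compute_caps() on a real machine.
print(looks_like_blackwell(parse_compute_cap("12.1")))
```

Passing this check does not guarantee that a given vLLM build ships NVFP4 kernels for that GPU; it only rules out clearly unsupported hardware.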
Quantization Details
- Algorithm: max (NVFP4_DEFAULT_CFG), which measures maximum activation values per tensor
- Group Size: 16
- Excluded Modules: lm_head (kept in FP16)
- Calibration: 8 English text samples (sufficient for the max algorithm)
- Quantization Time: ~1 minute on DGX Spark
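To illustrate what max-based scaling with a group size of 16 means, here is a simplified sketch of block-wise FP4 "fake quantization" in plain Python. It is not ModelOpt's implementation. It assumes the 4-bit E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), derives one scale per 16-element block from the block's maximum absolute value, and keeps that scale in full precision, whereas the real format also quantizes block scales and uses a per-tensor scale. All names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def fake_quantize_block(block):
    """Scale a block by its max magnitude, round to the E2M1 grid, rescale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_MAGNITUDES[-1]  # map the largest value onto 6.0
    out = []
    for x in block:
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize(values, group_size=GROUP_SIZE):
    """Apply block-wise fake quantization to a flat list of weights."""
    if len(values) % group_size:
        raise ValueError("length must be a multiple of the group size")
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize_block(values[start:start + group_size]))
    return out


weights = [i / 10 - 0.8 for i in range(16)]  # toy block from -0.8 to 0.7
restored = fake_quantize(weights)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {max_err:.4f}")
```

Because each block gets its own scale, an outlier only degrades the precision of the 16 values that share its block rather than the whole tensor.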
Qualitative Evaluation
Tested across 8 categories (Korean knowledge, logic, creative writing, coding, summarization, math, fact-checking, English):
- Korean Knowledge: Accurate, well-structured responses identical to FP16
- Logic/Reasoning: Correct problem-solving with proper mathematical notation
- Creative Writing: Natural Korean poetry with appropriate imagery
- Coding: Correct Python code with proper explanations
- Summarization: Concise and accurate 3-sentence summaries
- Math: Correct differentiation with step-by-step solutions
- Fact-Checking: Accurate historical information
- English: Clear, well-organized English explanations
License
This model is released under the Apache 2.0 license, same as the base model skt/A.X-3.1.
Acknowledgments
- Quantum Nexus — Quantization, benchmarking, and deployment performed on Quantum Nexus's NVIDIA DGX Spark (Blackwell GB10, 128GB)
- SKT for the original A.X-3.1 model
- NVIDIA for ModelOpt quantization toolkit and DGX Spark hardware
- vLLM team for NVFP4 inference support