GadflyII committed
Commit f75ee39 · verified · 1 Parent(s): 162c8fd

Update README.md

Files changed (1):
  1. README.md +7 -27
README.md CHANGED
@@ -12,6 +12,9 @@ license: apache-2.0
 pipeline_tag: text-generation
 ---
 
+# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50/Pro), try my vLLM fork to resolve P2P / TP=2 issues (pending PR into upstream):
+https://github.com/Gadflyii/vllm/tree/main
+
 # Qwen3-Coder-Next-NVFP4
 
 NVFP4 quantized version of [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) (80B-A3B).
@@ -32,7 +35,7 @@ NVFP4 quantized version of [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Q
 
 ## Quantization Details
 
-Quantized using [llmcompressor](https://github.com/vllm-project/llm-compressor) 0.9.0.1 following RedHatAI's approach.
+Quantized using [llmcompressor](https://github.com/vllm-project/llm-compressor) 0.9.0.1.
 
 ```python
 NUM_CALIBRATION_SAMPLES = 20
@@ -60,43 +63,20 @@ ignore = [
 
 ### Context Length Testing
 
-Successfully tested up to **128K tokens** with FP8 KV cache (~95GB VRAM per GPU).
+Successfully tested up to **128K tokens** with FP8 KV cache (not enough VRAM to test any higher context).
 
 ## Usage with vLLM
 
-Requires vLLM with NVFP4 support (0.16.0+).
+Requires vLLM with NVFP4 support (0.16.0+) and Transformers 5.0.0+.
 
 ```bash
-# Standard serving (TP=2)
-vllm serve GadflyII/Qwen3-Coder-Next-NVFP4 \
-  --tensor-parallel-size 2 \
-  --max-model-len 49152
-
-# Extended context (128K with FP8 KV cache)
+# vLLM serving
 vllm serve GadflyII/Qwen3-Coder-Next-NVFP4 \
   --tensor-parallel-size 2 \
   --max-model-len 131072 \
   --kv-cache-dtype fp8
-
-# Expert Parallelism (alternative to TP)
-vllm serve GadflyII/Qwen3-Coder-Next-NVFP4 \
-  --tensor-parallel-size 1 \
-  --expert-parallel-size 2 \
-  --max-model-len 131072 \
-  --kv-cache-dtype fp8
 ```
 
-### Notes for Blackwell GPUs (SM120)
-
-- First startup may take longer due to CUDA graph compilation
-- SymmMemCommunicator not supported on device capability 12.0 (handled automatically)
-- Some CUTLASS TMA WS grouped gemm tactics may fail during autotuning (non-fatal)
-
-## Hardware Requirements
-
-- **Minimum**: 2x 48GB GPUs (for inference at shorter context)
-- **Recommended**: 2x 96GB GPUs (for 128K context with FP8 KV cache)
-
 ## License
 
 Apache 2.0 (same as base model)
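The diff only shows the first line of the llmcompressor recipe (`NUM_CALIBRATION_SAMPLES = 20`). As background on what NVFP4 quantization does, here is a minimal pure-Python sketch of FP4 (E2M1) block quantization, the 4-bit format NVFP4 builds on. The grid below is the actual E2M1 value set, but the single shared scale is a simplification (real NVFP4 uses 16-element blocks with FP8 E4M3 scales), and none of these names are llmcompressor API.

```python
# Illustrative FP4 (E2M1) fake-quantization; simplified, not llmcompressor's code.
# The eight non-negative values representable in FP4 E2M1:
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    """Fake-quantize one block: one shared scale, round to nearest FP4 value."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax else 1.0  # map the largest magnitude to 6.0
    out = []
    for v in values:
        q = min(FP4_GRID, key=lambda g: abs(g - abs(v) / scale))  # nearest grid point
        out.append(q * scale if v >= 0 else -q * scale)
    return out

print(quantize_block([0.1, -0.7, 2.4, -3.0]))  # → [0.0, -0.75, 2.0, -3.0]
```

Note how coarse the grid is: at scale 0.5, the value 2.4 falls between representable points and rounds to 2.0, which is why calibration samples matter for picking good scales.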
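Once the `vllm serve` command above is running, the model answers on vLLM's OpenAI-compatible API. A minimal client sketch, assuming vLLM's default host and port (`localhost:8000`); actually sending the request requires the server to be up:

```python
import json
from urllib import request

# vLLM's OpenAI-compatible chat endpoint (default host/port assumed).
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "GadflyII/Qwen3-Coder-Next-NVFP4",
    "messages": [{"role": "user", "content": "Write a Python hello world."}],
    "max_tokens": 128,
}

def build_request(url: str, body: dict) -> request.Request:
    """Build the POST request; send it with urllib.request.urlopen(...)."""
    return request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request(URL, payload)
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at this base URL) works the same way.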