Kaleto commited on
Commit
ecc3320
·
verified ·
1 Parent(s): 6ad6824

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +184 -0
README.md ADDED
@@ -0,0 +1,184 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: other
3
+ base_model: TheDrummer/Fallen-Command-A-111B-v1.1
4
+ base_model_relation: quantized
5
+ language:
6
+ - en
7
+ library_name: transformers
8
+ tags:
9
+ - nvfp4
10
+ - fp4
11
+ - modelopt
12
+ - vllm
13
+ - cohere2
14
+ - command-a
15
+ - dgx-spark
16
+ - gb10
17
+ - roleplay
18
+ pipeline_tag: text-generation
19
+ ---
20
+
21
+ # Fallen-Command-A-111B-v1.1 — NVFP4
22
+
23
+ NVFP4 (4-bit floating-point, `group_size=16`) quantization of [TheDrummer/Fallen-Command-A-111B-v1.1](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1) — a roleplay/creative finetune of Cohere's **Command-A** (Cohere2 architecture, 111B). Produced with a custom **3-node heterogeneous distributed pipeline** on a personal **2× NVIDIA DGX Spark + RTX 3090** setup. Stored as modelopt NVFP4 weights, served via vLLM's modelopt path.
24
+
25
+ Cohere2 has a few architectural quirks — tied embeddings, `layer_norm_eps`, hybrid local/global attention — that needed pipeline-side handling; see [Cohere2-specific handling](#cohere2-specific-handling) below.
26
+
27
+ ---
28
+
29
+ ## Serving mode on Blackwell (GB10)
30
+
31
+ On DGX Spark / GB10 with vLLM, this model serves as **weight-only FP4**: the 4-bit NVFP4 weights are dequantized for each matmul; activations stay BF16. vLLM 0.20.x has no FP4-activation GEMM kernel for Blackwell (sm_120/121), so the NVFP4 path is weight-only regardless of the `input_activations` field in `config.json` — on this stack a W4A4-config and a W4A16-config produce bit-identical output. This is the standard, and currently highest-quality, NVFP4 serving mode on Spark. On an FP4-activation-capable stack (TensorRT-LLM, or a future vLLM with a Blackwell FP4 GEMM) the same weights could run as true W4A4.
32
+
33
+ A practical consequence: because serving is weight-only, the **calibration dataset does not affect the served output** — the per-tensor weight scales are determined by the weights alone. The calibration pass below is part of the standard modelopt flow but is effectively output-invariant for this serving mode.
34
+
35
+ ---
36
+
37
+ ## Quick facts
38
+
39
+ | | |
40
+ |---|---|
41
+ | **Base model** | [TheDrummer/Fallen-Command-A-111B-v1.1](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1) (Cohere Command-A finetune) |
42
+ | **Architecture** | Cohere2ForCausalLM — 64 layers, hidden_size 12288, intermediate 36864, 96 attn heads, 8 KV heads, head_dim 128 |
43
+ | **Notable arch features** | Parallel attention/MLP block, hybrid local/global attention (sliding_window 4096, pattern 4), tied input/output embeddings, RoPE θ=50000, 256K max context |
44
+ | **Original size** | ~207 GB (BF16) |
45
+ | **Quantized size** | ~69 GB (14 shards, see Files tab) |
46
+ | **Quant format** | NVFP4 via [nvidia-modelopt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.43.0, `group_size=16` |
47
+ | **lm_head** | Kept BF16 (unquantized), listed in `quantization_config.ignore` |
48
+ | **Quantized modules** | 448 Linear layers (64 × 7: q/k/v/o + gate/up/down) |
49
+ | **KV cache** | Configurable at serve time (FP8 recommended) |
50
+ | **Calibration** | 256-sample pass (~25.7 min); see note above on output-invariance |
51
+ | **Conversion date** | 2026-05-22 |
52
+
53
+ ---
54
+
55
+ ## The hardware: 2× DGX Spark + 1× RTX 3090
56
+
57
+ The cluster used to produce this artifact:
58
+
59
+ | Node | GPU | Memory | Role |
60
+ |---|---|---|---|
61
+ | DX10-01 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard0: layers 0–29 + embed_tokens |
62
+ | DX10-02 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard1: layers 30–59 |
63
+ | eGPU host (Proxmox VM) | NVIDIA RTX 3090 (sm_86) | 24 GB VRAM | shard2: layers 60–63 + final norm + lm_head |
64
+
65
+ A 30/30/4-layer split keeps each Spark well inside its 128 GB UMA budget while the 3090 handles the tail 4 layers plus the norm and lm_head. Ray RPC carries cross-node hidden states transparently; the Ampere 3090 has no native FP4 hardware but only handles BF16 calibration math, so the architecture mismatch is irrelevant until inference time — the exported NVFP4 file is identical to what an all-Blackwell cluster would produce.
66
+
67
+ The pipeline is open-source at **[github.com/KaletoAI/distrib-nvfp4](https://github.com/KaletoAI/distrib-nvfp4)** (Apache 2.0): N-way layer splits via `--shard-layers a,b,c`, memory-sorted node placement so the smallest-VRAM node gets the smallest shard, and disk-checkpointed phases for resumable runs.
68
+
69
+ ---
70
+
71
+ ## Cohere2-specific handling
72
+
73
+ Command-A's Cohere2 architecture needed three fixes beyond the standard modelopt NVFP4 flow:
74
+
75
+ 1. **Tied embeddings.** Cohere2 sets `use_embedding_sharing=true` — the output projection reuses `model.embed_tokens.weight` and the checkpoint has no separate `lm_head.weight`. The head-bearing shard reconstructs `lm_head` from `embed_tokens` so it can be exported (BF16) into the merged model.
76
+ 2. **Norm epsilon name.** Cohere2 names the layernorm epsilon `layer_norm_eps` (Llama/Mistral use `rms_norm_eps`); the per-layer export template reads the correct attribute with a fallback.
77
+ 3. **Generation config.** Cohere2's `generation_config` sets `cache_implementation=hybrid`, which the 1-layer export template (built with `use_cache=False`) rejects. It is dropped during per-layer export and the real `generation_config` is restored in the merged model.
78
+
79
+ Calibration health-check on the run that produced this artifact — clean, no zero or NaN amax statistics:
80
+
81
+ - shard0 (layers 0–29 + embed): **good=210, zero=0, nan=0**
82
+ - shard1 (layers 30–59): **good=210, zero=0, nan=0**
83
+ - shard2 (layers 60–63 + norm + lm_head): **good=28, zero=0, nan=0**
84
+
85
+ (`NVFP4_DEFAULT_CFG` inserts 7 weight quantizers per layer.)
86
+
87
+ After merge, `config.json` is patched to keep `lm_head` in `quantization_config.ignore`, set `input_activations.dynamic: true`, and inject `input_scale=1.0` for every weight quantizer (modelopt 0.43 omits these keys, and vLLM's loader otherwise registers an uninitialized parameter and decodes garbage).
88
+
89
+ ---
90
+
91
+ ## Verification
92
+
93
+ Loaded and smoke-tested on a single DGX Spark (GB10) with vLLM `0.20.2rc1.dev53` — `FlashInferCutlassNvFp4LinearKernel` for the NVFP4 GEMM, FlashInfer attention backend:
94
+
95
+ - Model weights occupy **~62.6 GiB**; on a 128 GB UMA Spark at `gpu-memory-utilization 0.90` this leaves a ~43 GiB KV-cache pool (≈175K tokens at 4K context).
96
+ - All test generations are coherent and accurate — e.g. *"The capital of France is"* → *"Paris. The area of France is 212,935 square miles…"*; *"17 + 25"* → *"42."*
97
+
98
+ The weight-scale layout was also verified directly: every `down_proj.weight_scale` is `[12288, 2304]` (2304 = intermediate 36864 / `group_size` 16), and there are no stray `_quantizer` / `_double_scale` keys — the checkpoint loads with stock vLLM.
99
+
100
+ A formal throughput benchmark has not been run yet.
101
+
102
+ ---
103
+
104
+ ## Usage
105
+
106
+ ### vLLM (serve)
107
+
108
+ Verified on GB10 with vLLM `0.20.2rc1`:
109
+
110
+ ```bash
111
+ vllm serve /path/to/Fallen-Command-111B-NVFP4 \
112
+ --served-model-name Fallen-Command-111B-NVFP4 \
113
+ --attention-backend flashinfer \
114
+ --dtype auto \
115
+ --kv-cache-dtype fp8 \
116
+ --max-model-len 32768 \
117
+ --max-num-seqs 4 \
118
+ --gpu-memory-utilization 0.90 \
119
+ --enable-chunked-prefill \
120
+ --enable-prefix-caching \
121
+ --port 9007
122
+ ```
123
+
124
+ vLLM auto-detects the modelopt NVFP4 quantization from `config.json` — no explicit `--quantization` flag is needed. `--gpu-memory-utilization 0.90` leaves enough KV-cache pool for 32K context at `max-num-seqs 4` on a 128 GB Spark; drop to 0.85 if you don't need the longer context.
125
+
126
+ ### llama-swap entry
127
+
128
+ ```yaml
129
+ "Fallen-Command-111B-NVFP4":
130
+ proxy: "http://127.0.0.1:9007"
131
+ ttl: 0
132
+ checkEndpoint: "/health"
133
+ cmd: >-
134
+ /home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
135
+ --model /home/<user>/models/Fallen-Command-111B-NVFP4
136
+ --served-model-name Fallen-Command-111B-NVFP4
137
+ --attention-backend flashinfer
138
+ --dtype auto
139
+ --kv-cache-dtype fp8
140
+ --max-model-len 32768
141
+ --max-num-seqs 4
142
+ --gpu-memory-utilization 0.90
143
+ --enable-chunked-prefill
144
+ --enable-prefix-caching
145
+ --port 9007
146
+ --host 127.0.0.1
147
+ ```
148
+
149
+ ### Prompt format
150
+
151
+ Use the **Cohere / Command chat template** (it ships in `tokenizer_config.json`, so `apply_chat_template` and vLLM's OpenAI server handle it automatically). See [TheDrummer's original card](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1) for finetune-specific usage notes.
152
+
153
+ ---
154
+
155
+ ## Files in this repository
156
+
157
+ - `model-NNNNN-of-00014.safetensors` — 14 shards, NVFP4-packed weights + scales (~69 GB total)
158
+ - `model.safetensors.index.json` — weight map (1859 keys: 448 quantized linears × 3 scale/weight keys + injected `input_scale` keys + 64 layernorms + embed + lm_head)
159
+ - `config.json` — Cohere2 config with `quantization_config.ignore=["lm_head"]` and `input_activations.dynamic: true`
160
+ - `hf_quant_config.json`, `generation_config.json` — auxiliary modelopt + generation configs
161
+ - `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` — Command-A tokenizer, untouched from upstream
162
+
163
+ ---
164
+
165
+ ## Acknowledgments
166
+
167
+ - **[TheDrummer](https://huggingface.co/TheDrummer)** for the Fallen-Command-A-111B finetune
168
+ - **[Cohere / Cohere Labs](https://huggingface.co/CohereLabs)** for the Command-A base model and the Cohere2 architecture
169
+ - **NVIDIA** for the DGX Spark / GB10 platform, the NVFP4 format, and [modelopt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
170
+ - **vLLM project** for modelopt NVFP4 inference support
171
+
172
+ ---
173
+
174
+ ## License
175
+
176
+ This NVFP4 quantization inherits the license of the base model TheDrummer/Fallen-Command-A-111B-v1.1, which is derived from Cohere's **Command-A** — released under **CC-BY-NC 4.0** with Cohere's Acceptable Use Policy. **For research, evaluation, and personal non-commercial use only.**
177
+
178
+ - Pipeline code (Apache 2.0): https://github.com/KaletoAI/distrib-nvfp4
179
+
180
+ ---
181
+
182
+ ## Status
183
+
184
+ Single-author release. Feedback welcome — on the model artifact (vLLM behaviour, sampling, RP quality) and on the pipeline that built it.