---
license: other
base_model: TheDrummer/Fallen-Command-A-111B-v1.1
base_model_relation: quantized
language:
- en
library_name: transformers
tags:
- nvfp4
- fp4
- modelopt
- vllm
- cohere2
- command-a
- dgx-spark
- gb10
- roleplay
pipeline_tag: text-generation
---
# Fallen-Command-A-111B-v1.1 — NVFP4
NVFP4 (4-bit floating-point, `group_size=16`) quantization of [TheDrummer/Fallen-Command-A-111B-v1.1](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1) — a roleplay/creative finetune of Cohere's **Command-A** (Cohere2 architecture, 111B). Produced with a custom **3-node heterogeneous distributed pipeline** on a personal **2× NVIDIA DGX Spark + RTX 3090** setup. Stored as modelopt NVFP4 weights, served via vLLM's modelopt path.
Cohere2 has a few architectural quirks — tied embeddings, `layer_norm_eps`, hybrid local/global attention — that needed pipeline-side handling; see [Cohere2-specific handling](#cohere2-specific-handling) below.
---
## Serving mode on Blackwell (GB10)
On DGX Spark / GB10 with vLLM, this model serves as **weight-only FP4**: the 4-bit NVFP4 weights are dequantized for each matmul; activations stay BF16. vLLM 0.20.x has no FP4-activation GEMM kernel for Blackwell (sm_120/121), so the NVFP4 path is weight-only regardless of the `input_activations` field in `config.json` — on this stack a W4A4-config and a W4A16-config produce bit-identical output. This is the standard, and currently highest-quality, NVFP4 serving mode on Spark. On an FP4-activation-capable stack (TensorRT-LLM, or a future vLLM with a Blackwell FP4 GEMM) the same weights could run as true W4A4.
A practical consequence: because serving is weight-only, the **calibration dataset does not affect the served output**. The weight scales (per-block and per-tensor) are computed from the weights alone, and the activation statistics gathered during calibration are not used at serve time. The calibration pass below is part of the standard modelopt flow but is effectively output-invariant for this serving mode.
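To make the output-invariance concrete, here is a minimal pure-Python sketch of group-wise FP4 weight quantization. It is a simplification and not modelopt's implementation: real NVFP4 stores the block scales in FP8 alongside a per-tensor scale, whereas this sketch keeps scales as plain floats, and the names `quantize_row` and `dequantize_row` are hypothetical. The point is only that every scale is derived from the weights themselves, with no calibration data involved.

```python
# Magnitudes representable by an FP4 E2M1 value (the sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16  # matches the card's group_size=16


def quantize_row(row):
    """Quantize one weight row; returns (codes, scales) using the weights only."""
    assert len(row) % GROUP_SIZE == 0, "row length must be a multiple of the group size"
    codes, scales = [], []
    for start in range(0, len(row), GROUP_SIZE):
        block = row[start:start + GROUP_SIZE]
        amax = max(abs(x) for x in block)
        # One scale per 16-element block, chosen so the block maximum maps to 6.0.
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        scales.append(scale)
        for x in block:
            # Round the scaled magnitude to the nearest grid point, keep the sign.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
            codes.append(mag if x >= 0 else -mag)
    return codes, scales


def dequantize_row(codes, scales):
    """Rebuild approximate weights, as a weight-only kernel does before the matmul."""
    return [c * scales[i // GROUP_SIZE] for i, c in enumerate(codes)]


row = [i / 10 - 1.6 for i in range(32)]
codes, scales = quantize_row(row)
restored = dequantize_row(codes, scales)
print("blocks:", len(scales))
print("max abs error:", max(abs(a - b) for a, b in zip(row, restored)))
```

Nothing in `quantize_row` takes an activation or a calibration sample as input, which is why swapping the calibration dataset cannot change the weights that get served in this mode.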
---
## Quick facts
| | |
|---|---|
| **Base model** | [TheDrummer/Fallen-Command-A-111B-v1.1](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1) (Cohere Command-A finetune) |
| **Architecture** | Cohere2ForCausalLM — 64 layers, hidden_size 12288, intermediate 36864, 96 attn heads, 8 KV heads, head_dim 128 |
| **Notable arch features** | Parallel attention/MLP block, hybrid local/global attention (sliding_window 4096, pattern 4), tied input/output embeddings, RoPE θ=50000, 256K max context |
| **Original size** | ~207 GB (BF16) |
| **Quantized size** | ~69 GB (14 shards, see Files tab) |
| **Quant format** | NVFP4 via [nvidia-modelopt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.43.0, `group_size=16` |
| **lm_head** | Kept BF16 (unquantized), listed in `quantization_config.ignore` |
| **Quantized modules** | 448 Linear layers (64 × 7: q/k/v/o + gate/up/down) |
| **KV cache** | Configurable at serve time (FP8 recommended) |
| **Calibration** | 256-sample pass (~25.7 min); see note above on output-invariance |
| **Conversion date** | 2026-05-22 |
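As a rough cross-check of the table, the following sketch counts the parameters in the 448 quantized Linear layers from the architecture numbers above and estimates their packed size. The 4.5 bits-per-weight figure is an assumption (4-bit values plus one 8-bit scale per 16-weight group); it ignores per-tensor scales and all unquantized tensors.

```python
# Architecture numbers taken from the Quick facts table.
LAYERS, HIDDEN, INTERMEDIATE = 64, 12288, 36864
ATTN_HEADS, KV_HEADS, HEAD_DIM = 96, 8, 128
GROUP_SIZE = 16

q_out = ATTN_HEADS * HEAD_DIM    # width of the query / attention-output projection
kv_out = KV_HEADS * HEAD_DIM     # width of each key / value projection

per_layer = (
    HIDDEN * q_out               # q_proj
    + 2 * HIDDEN * kv_out        # k_proj, v_proj
    + q_out * HIDDEN             # o_proj
    + 3 * HIDDEN * INTERMEDIATE  # gate_proj, up_proj, down_proj
)
quantized_params = LAYERS * per_layer

# Assumption: 4 bits per weight plus one 8-bit scale per 16-weight group.
bits_per_weight = 4 + 8 / GROUP_SIZE
quantized_bytes = quantized_params * bits_per_weight / 8

print(f"quantized linear params: {quantized_params / 1e9:.1f} B")
print(f"packed size: {quantized_bytes / 1e9:.1f} GB ({quantized_bytes / 2**30:.1f} GiB)")
```

Under these assumptions the quantized linears account for most of the checkpoint; the remainder is presumably the BF16 tensors (embeddings, `lm_head`, norms).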
---
## The hardware: 2× DGX Spark + 1× RTX 3090
The cluster used to produce this artifact:
| Node | GPU | Memory | Role |
|---|---|---|---|
| DX10-01 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard0: layers 0–29 + embed_tokens |
| DX10-02 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard1: layers 30–59 |
| eGPU host (Proxmox VM) | NVIDIA RTX 3090 (sm_86) | 24 GB VRAM | shard2: layers 60–63 + final norm + lm_head |
A 30/30/4-layer split keeps each Spark well inside its 128 GB UMA budget while the 3090 handles the tail 4 layers plus the norm and lm_head. Ray RPC carries cross-node hidden states transparently; the Ampere 3090 has no native FP4 hardware but only handles BF16 calibration math, so the architecture mismatch is irrelevant until inference time — the exported NVFP4 file is identical to what an all-Blackwell cluster would produce.
The pipeline is open-source at **[github.com/KaletoAI/distrib-nvfp4](https://github.com/KaletoAI/distrib-nvfp4)** (Apache 2.0): N-way layer splits via `--shard-layers a,b,c`, memory-sorted node placement so the smallest-VRAM node gets the smallest shard, and disk-checkpointed phases for resumable runs.
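The placement idea can be sketched in a few lines. This is an illustration only and not code from the distrib-nvfp4 repository: `Node`, `split_layers`, and `place_shards` are hypothetical names, and the sketch assumes the values passed to `--shard-layers` are shard sizes (the flag's actual semantics are not described here).

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    memory_gb: float


def split_layers(num_layers: int, shard_sizes: list[int]) -> list[range]:
    """Turn shard sizes such as [30, 30, 4] into contiguous layer ranges."""
    if sum(shard_sizes) != num_layers:
        raise ValueError("shard sizes must add up to the layer count")
    ranges, start = [], 0
    for size in shard_sizes:
        ranges.append(range(start, start + size))
        start += size
    return ranges


def place_shards(nodes: list[Node], shards: list[range]) -> dict[str, range]:
    """Pair shards with nodes by rank, so the smallest node gets the smallest shard."""
    by_memory = sorted(nodes, key=lambda n: n.memory_gb, reverse=True)
    by_size = sorted(shards, key=len, reverse=True)
    return {node.name: shard for node, shard in zip(by_memory, by_size)}


nodes = [Node("DX10-01", 128), Node("DX10-02", 128), Node("eGPU host", 24)]
placement = place_shards(nodes, split_layers(64, [30, 30, 4]))
for name, layers in placement.items():
    print(f"{name}: layers {layers.start}-{layers.stop - 1}")
```

With the cluster from the table this reproduces the 30/30/4 assignment, with the 24 GB RTX 3090 receiving layers 60-63.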
---
## Cohere2-specific handling
Command-A's Cohere2 architecture needed three fixes beyond the standard modelopt NVFP4 flow:
1. **Tied embeddings.** Cohere2 sets `use_embedding_sharing=true` — the output projection reuses `model.embed_tokens.weight` and the checkpoint has no separate `lm_head.weight`. The head-bearing shard reconstructs `lm_head` from `embed_tokens` so it can be exported (BF16) into the merged model.
2. **Norm epsilon name.** Cohere2 names the layernorm epsilon `layer_norm_eps` (Llama/Mistral use `rms_norm_eps`); the per-layer export template reads the correct attribute with a fallback.
3. **Generation config.** Cohere2's `generation_config` sets `cache_implementation=hybrid`, which the 1-layer export template (built with `use_cache=False`) rejects. It is dropped during per-layer export and the real `generation_config` is restored in the merged model.
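The three fixes can be pictured with the sketch below. It is not the pipeline's code: the helper names are hypothetical, plain dicts and `SimpleNamespace` stand in for the real config and state-dict objects, and the `1e-5` default is an arbitrary placeholder.

```python
from types import SimpleNamespace


def norm_eps(config, default: float = 1e-5) -> float:
    """Read the norm epsilon, trying Cohere2's name before the Llama-style one."""
    for attr in ("layer_norm_eps", "rms_norm_eps"):
        value = getattr(config, attr, None)
        if value is not None:
            return value
    return default  # placeholder default, an assumption of this sketch


def resolve_lm_head(state_dict: dict) -> dict:
    """With tied embeddings, materialize lm_head.weight from embed_tokens."""
    if "lm_head.weight" in state_dict:
        return state_dict
    patched = dict(state_dict)
    patched["lm_head.weight"] = patched["model.embed_tokens.weight"]
    return patched


def strip_hybrid_cache(generation_config: dict) -> dict:
    """Return a copy without cache_implementation, for the per-layer export only."""
    return {k: v for k, v in generation_config.items() if k != "cache_implementation"}


cohere2 = SimpleNamespace(layer_norm_eps=1e-5)
llama_like = SimpleNamespace(rms_norm_eps=1e-6)
print(norm_eps(cohere2), norm_eps(llama_like))

tied = resolve_lm_head({"model.embed_tokens.weight": "embedding-tensor"})
print(sorted(tied))

print(strip_hybrid_cache({"cache_implementation": "hybrid", "temperature": 0.3}))
```

Because `strip_hybrid_cache` returns a copy, the original generation config stays available for restoring into the merged model.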
Calibration health-check on the run that produced this artifact — clean, no zero or NaN amax statistics:
- shard0 (layers 0–29 + embed): **good=210, zero=0, nan=0**
- shard1 (layers 30–59): **good=210, zero=0, nan=0**
- shard2 (layers 60–63 + norm + lm_head): **good=28, zero=0, nan=0**
(`NVFP4_DEFAULT_CFG` inserts 7 weight quantizers per layer.)
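A health check of this kind could look like the sketch below. The function `amax_health` is a hypothetical illustration, not the pipeline's implementation; the expected counts simply multiply each shard's layer count by the 7 weight quantizers per layer.

```python
import math


def amax_health(amax_values):
    """Count weight-quantizer amax statistics as good, zero, or NaN."""
    counts = {"good": 0, "zero": 0, "nan": 0}
    for value in amax_values:
        if math.isnan(value):
            counts["nan"] += 1
        elif value == 0.0:
            counts["zero"] += 1
        else:
            counts["good"] += 1
    return counts


QUANTIZERS_PER_LAYER = 7  # q/k/v/o + gate/up/down
layers_per_shard = {"shard0": 30, "shard1": 30, "shard2": 4}
expected = {name: n * QUANTIZERS_PER_LAYER for name, n in layers_per_shard.items()}
print(expected)  # {'shard0': 210, 'shard1': 210, 'shard2': 28}

# Toy input showing how a bad statistic would be flagged.
print(amax_health([0.8, 1.2, 0.0, float("nan")]))
```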
After merge, `config.json` is patched to keep `lm_head` in `quantization_config.ignore` and to set `input_activations.dynamic: true`, and an `input_scale=1.0` tensor is injected into the checkpoint for every quantized linear layer (modelopt 0.43 omits these keys; without them vLLM's loader registers an uninitialized parameter and decodes garbage).
---
## Verification
Loaded and smoke-tested on a single DGX Spark (GB10) with vLLM `0.20.2rc1.dev53` — `FlashInferCutlassNvFp4LinearKernel` for the NVFP4 GEMM, FlashInfer attention backend:
- Model weights occupy **~62.6 GiB**; on a 128 GB UMA Spark at `gpu-memory-utilization 0.90` this leaves a ~43 GiB KV-cache pool (≈175K tokens at 4K context).
- All test generations are coherent and accurate — e.g. *"The capital of France is"* → *"Paris. The area of France is 212,935 square miles…"*; *"17 + 25"* → *"42."*
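The KV-cache figure can be sanity-checked with a back-of-the-envelope calculation from the architecture numbers in Quick facts. The sketch assumes every layer caches keys and values for every token (it ignores any saving from the sliding-window layers) and treats the element width as a free parameter; neither assumption comes from vLLM's actual allocator.

```python
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128  # from the Quick facts table
GIB = 2**30


def kv_bytes_per_token(bytes_per_element: int) -> int:
    """Keys plus values for one token across all layers (full attention assumed)."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element


def tokens_in_pool(pool_gib: int, bytes_per_element: int) -> int:
    """How many cached tokens fit in a pool of the given size."""
    return pool_gib * GIB // kv_bytes_per_token(bytes_per_element)


for label, width in (("bf16", 2), ("fp8", 1)):
    print(label, kv_bytes_per_token(width) // 1024, "KiB/token,",
          tokens_in_pool(43, width), "tokens in a 43 GiB pool")
```

With 2-byte elements this gives 256 KiB per token and 176,128 tokens in a 43 GiB pool, in line with the ≈175K figure above; under the same assumptions a 1-byte FP8 cache would hold about twice as many.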
The weight-scale layout was also verified directly: every `down_proj.weight_scale` is `[12288, 2304]` (2304 = intermediate 36864 / `group_size` 16), and there are no stray `_quantizer` / `_double_scale` keys — the checkpoint loads with stock vLLM.
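The two checks described above reduce to simple rules, sketched here. The helper names and the example key names are hypothetical; a real check would read shapes and key names from the safetensors shards.

```python
GROUP_SIZE = 16
STRAY_MARKERS = ("_quantizer", "_double_scale")


def expected_scale_shape(out_features: int, in_features: int) -> tuple[int, int]:
    """One scale per group of 16 input features, for every output row."""
    assert in_features % GROUP_SIZE == 0
    return (out_features, in_features // GROUP_SIZE)


def stray_keys(key_names):
    """Return checkpoint keys that still carry quantizer-internal markers."""
    return [k for k in key_names if any(marker in k for marker in STRAY_MARKERS)]


# down_proj maps intermediate (36864) back to hidden (12288).
print(expected_scale_shape(12288, 36864))  # (12288, 2304)

# Hypothetical key names, for illustration only.
example_keys = [
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.mlp.down_proj.weight_scale",
    "model.layers.0.mlp.down_proj.weight_quantizer._amax",
]
print(stray_keys(example_keys))
```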
A formal throughput benchmark has not been run yet.
---
## Usage
### vLLM (serve)
Verified on GB10 with vLLM `0.20.2rc1`:
```bash
vllm serve /path/to/Fallen-Command-111B-NVFP4 \
  --served-model-name Fallen-Command-111B-NVFP4 \
  --attention-backend flashinfer \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 9007
```
vLLM auto-detects the modelopt NVFP4 quantization from `config.json` — no explicit `--quantization` flag is needed. `--gpu-memory-utilization 0.90` leaves enough KV-cache pool for 32K context at `max-num-seqs 4` on a 128 GB Spark; drop to 0.85 if you don't need the longer context.
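Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library; it assumes the server is reachable on `127.0.0.1:9007` with the served model name from the command above, and the helper names, `max_tokens` default, and timeout are arbitrary choices of this example.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9007"   # matches --port 9007 above
MODEL = "Fallen-Command-111B-NVFP4"  # matches --served-model-name above


def build_chat_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the model's reply; the chat template is applied server-side, so the client only sends role/content messages.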
### llama-swap entry
```yaml
"Fallen-Command-111B-NVFP4":
  proxy: "http://127.0.0.1:9007"
  ttl: 0
  checkEndpoint: "/health"
  cmd: >-
    /home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
    --model /home/<user>/models/Fallen-Command-111B-NVFP4
    --served-model-name Fallen-Command-111B-NVFP4
    --attention-backend flashinfer
    --dtype auto
    --kv-cache-dtype fp8
    --max-model-len 32768
    --max-num-seqs 4
    --gpu-memory-utilization 0.90
    --enable-chunked-prefill
    --enable-prefix-caching
    --port 9007
    --host 127.0.0.1
```
### Prompt format
Use the **Cohere / Command chat template** (it ships in `tokenizer_config.json`, so `apply_chat_template` and vLLM's OpenAI server handle it automatically). See [TheDrummer's original card](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1) for finetune-specific usage notes.
---
## Files in this repository
- `model-NNNNN-of-00014.safetensors` — 14 shards, NVFP4-packed weights + scales (~69 GB total)
- `model.safetensors.index.json` — weight map (1859 keys: 448 quantized linears × 3 scale/weight keys + 448 injected `input_scale` keys + 64 per-layer layernorms + final norm + embed + lm_head)
- `config.json` — Cohere2 config with `quantization_config.ignore=["lm_head"]` and `input_activations.dynamic: true`
- `hf_quant_config.json`, `generation_config.json` — auxiliary modelopt + generation configs
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` — Command-A tokenizer, untouched from upstream
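The key count in the index can be reproduced with a short tally. Treating the last unit as the final norm is an inference from the arithmetic (the other listed items sum to 1858), not something read from the index file itself.

```python
QUANTIZED_LINEARS = 64 * 7  # 448 quantized Linear layers

keys = {
    "weight + two scale tensors per linear": QUANTIZED_LINEARS * 3,
    "injected input_scale per linear": QUANTIZED_LINEARS,
    "per-layer layernorms": 64,
    "final norm, embed_tokens, lm_head": 3,  # final norm is an assumption
}
total = sum(keys.values())

for name, count in keys.items():
    print(f"{count:5d}  {name}")
print(f"{total:5d}  total")  # 1859
```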
---
## Acknowledgments
- **[TheDrummer](https://huggingface.co/TheDrummer)** for the Fallen-Command-A-111B finetune
- **[Cohere / Cohere Labs](https://huggingface.co/CohereLabs)** for the Command-A base model and the Cohere2 architecture
- **NVIDIA** for the DGX Spark / GB10 platform, the NVFP4 format, and [modelopt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- **vLLM project** for modelopt NVFP4 inference support
---
## License
This NVFP4 quantization inherits the license of the base model TheDrummer/Fallen-Command-A-111B-v1.1, which is derived from Cohere's **Command-A** — released under **CC-BY-NC 4.0** with Cohere's Acceptable Use Policy. **For research, evaluation, and personal non-commercial use only.**
- Pipeline code (Apache 2.0): https://github.com/KaletoAI/distrib-nvfp4
---
## Status
Single-author release. Feedback welcome — on the model artifact (vLLM behaviour, sampling, RP quality) and on the pipeline that built it.