Fallen-Command-A-111B-v1.1 — NVFP4
NVFP4 (4-bit floating-point, group_size=16) quantization of TheDrummer/Fallen-Command-A-111B-v1.1 — a roleplay/creative finetune of Cohere's Command-A (Cohere2 architecture, 111B). Produced with a custom 3-node heterogeneous distributed pipeline on a personal 2× NVIDIA DGX Spark + RTX 3090 setup. Stored as modelopt NVFP4 weights, served via vLLM's modelopt path.
Cohere2 has a few architectural quirks — tied embeddings, layer_norm_eps, hybrid local/global attention — that needed pipeline-side handling; see Cohere2-specific handling below.
Serving mode on Blackwell (GB10)
On DGX Spark / GB10 with vLLM, this model serves as weight-only FP4: the 4-bit NVFP4 weights are dequantized for each matmul; activations stay BF16. vLLM 0.20.x has no FP4-activation GEMM kernel for Blackwell (sm_120/121), so the NVFP4 path is weight-only regardless of the input_activations field in config.json — on this stack a W4A4-config and a W4A16-config produce bit-identical output. This is the standard, and currently highest-quality, NVFP4 serving mode on Spark. On an FP4-activation-capable stack (TensorRT-LLM, or a future vLLM with a Blackwell FP4 GEMM) the same weights could run as true W4A4.
A practical consequence: because serving is weight-only, the calibration dataset does not affect the served output. The weight scales (one per 16-value group, plus NVFP4's per-tensor global scale) are determined by the weights alone. The calibration pass below is part of the standard modelopt flow, but it is effectively output-invariant for this serving mode.
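To illustrate why calibration data does not enter the picture, here is a minimal pure-Python sketch of NVFP4-style weight quantization for a single group of 16 values. It is not modelopt's or vLLM's implementation: the function names are hypothetical, and the group scale is kept as a plain float, whereas real NVFP4 stores it in FP8 alongside a per-tensor scale.

```python
# Simplified NVFP4-style weight quantization for one group of 16 values.
# Assumption: the group scale stays a plain Python float; the FP8 encoding of
# the scale and the per-tensor global scale are deliberately left out.

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # FP4 (E2M1) grid
GROUP_SIZE = 16


def quantize_group(weights):
    """Return (codes, scale): signed grid values plus one scale for the group."""
    if len(weights) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} weights, got {len(weights)}")
    amax = max(abs(w) for w in weights)
    # The scale maps the largest magnitude in the group onto the grid maximum.
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for w in weights:
        # Round the scaled magnitude to the nearest representable grid value.
        magnitude = min(E2M1_MAGNITUDES, key=lambda g: abs(abs(w) / scale - g))
        codes.append(-magnitude if w < 0 else magnitude)
    return codes, scale


def dequantize_group(codes, scale):
    """Weight-only serving: expand the FP4 codes back to floats for the matmul."""
    return [c * scale for c in codes]


weights = [0.02 * (i - 8) for i in range(GROUP_SIZE)]
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)
```

Nothing in `quantize_group` takes an activation or a calibration sample as input, which is the point made above: in a weight-only path the stored scales depend on the weights and nothing else.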
Quick facts
| Property | Value |
|---|---|
| Base model | TheDrummer/Fallen-Command-A-111B-v1.1 (Cohere Command-A finetune) |
| Architecture | Cohere2ForCausalLM — 64 layers, hidden_size 12288, intermediate 36864, 96 attn heads, 8 KV heads, head_dim 128 |
| Notable arch features | Parallel attention/MLP block, hybrid local/global attention (sliding_window 4096, pattern 4), tied input/output embeddings, RoPE θ=50000, 256K max context |
| Original size | ~207 GB (BF16) |
| Quantized size | ~69 GB (14 shards, see Files tab) |
| Quant format | NVFP4 via nvidia-modelopt 0.43.0, group_size=16 |
| lm_head | Kept BF16 (unquantized), listed in quantization_config.ignore |
| Quantized modules | 448 Linear layers (64 × 7: q/k/v/o + gate/up/down) |
| KV cache | Configurable at serve time (FP8 recommended) |
| Calibration | 256-sample pass (~25.7 min); see note above on output-invariance |
| Conversion date | 2026-05-22 |
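The module count and the bulk of the quantized size can be reproduced from the table with a little arithmetic. The sketch below assumes standard grouped-query projection shapes without biases (q and o map between hidden_size and heads × head_dim, k and v map to kv_heads × head_dim) and one 1-byte scale per group of 16 weights; the small per-tensor scalars and all unquantized tensors are left out.

```python
# Parameter and size arithmetic from the Quick facts table.
LAYERS = 64
HIDDEN = 12288
INTERMEDIATE = 36864
ATTN_HEADS = 96
KV_HEADS = 8
HEAD_DIM = 128
GROUP_SIZE = 16

q_out = ATTN_HEADS * HEAD_DIM   # width of the query / attention output
kv_out = KV_HEADS * HEAD_DIM    # width of the key and value projections

# Assumed projection shapes (no biases): q, k, v, o, then gate/up/down.
per_layer_params = (
    HIDDEN * q_out
    + 2 * HIDDEN * kv_out
    + q_out * HIDDEN
    + 3 * HIDDEN * INTERMEDIATE
)
quantized_modules = LAYERS * 7
quantized_params = LAYERS * per_layer_params

# 4 bits per weight plus one assumed 1-byte scale per group of 16 weights.
nvfp4_bytes = quantized_params // 2 + quantized_params // GROUP_SIZE

print(f"quantized modules: {quantized_modules}")
print(f"quantized params:  {quantized_params / 1e9:.1f} B")
print(f"NVFP4 bytes:       {nvfp4_bytes / 1e9:.1f} GB")
```

Under these assumptions the quantized linears come to roughly 60.7 GB; the rest of the ~69 GB on disk would be the tensors kept in BF16, such as the embeddings and lm_head.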
The hardware: 2× DGX Spark + 1× RTX 3090
The cluster used to produce this artifact:
| Node | GPU | Memory | Role |
|---|---|---|---|
| DX10-01 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard0: layers 0–29 + embed_tokens |
| DX10-02 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard1: layers 30–59 |
| eGPU host (Proxmox VM) | NVIDIA RTX 3090 (sm_86) | 24 GB VRAM | shard2: layers 60–63 + final norm + lm_head |
A 30/30/4-layer split keeps each Spark well inside its 128 GB UMA budget while the 3090 handles the tail 4 layers plus the norm and lm_head. Ray RPC carries cross-node hidden states transparently; the Ampere 3090 has no native FP4 hardware but only handles BF16 calibration math, so the architecture mismatch is irrelevant until inference time — the exported NVFP4 file is identical to what an all-Blackwell cluster would produce.
The pipeline is open-source at github.com/KaletoAI/distrib-nvfp4 (Apache 2.0): N-way layer splits via --shard-layers a,b,c, memory-sorted node placement so the smallest-VRAM node gets the smallest shard, and disk-checkpointed phases for resumable runs.
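The split and placement logic described above can be sketched in a few lines. This is not the code from distrib-nvfp4: the `Node` class and both function names are hypothetical, and the sketch assumes that `--shard-layers` takes per-shard layer counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    memory_gb: int


def layer_ranges(shard_layers):
    """Turn per-shard layer counts into inclusive (first, last) layer ranges."""
    ranges, start = [], 0
    for count in shard_layers:
        if count <= 0:
            raise ValueError("each shard needs at least one layer")
        ranges.append((start, start + count - 1))
        start += count
    return ranges


def place_shards(shard_layers, nodes):
    """Pair shards with nodes so the smallest-memory node gets the smallest shard."""
    if len(shard_layers) != len(nodes):
        raise ValueError("need exactly one node per shard")
    ranges = layer_ranges(shard_layers)
    # Sort shard indices by layer count and nodes by memory, then zip them.
    shards_by_size = sorted(range(len(shard_layers)), key=lambda i: shard_layers[i])
    nodes_by_memory = sorted(nodes, key=lambda n: n.memory_gb)
    return {
        node.name: ranges[index]
        for index, node in zip(shards_by_size, nodes_by_memory)
    }


cluster = [Node("DX10-01", 128), Node("DX10-02", 128), Node("eGPU", 24)]
placement = place_shards([30, 30, 4], cluster)
```

With the 30/30/4 split, the 24 GB node ends up with layers 60 to 63 and the two Sparks take layers 0 to 29 and 30 to 59, matching the table above.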
Cohere2-specific handling
Command-A's Cohere2 architecture needed three fixes beyond the standard modelopt NVFP4 flow:
- Tied embeddings. Cohere2 sets use_embedding_sharing=true: the output projection reuses model.embed_tokens.weight and the checkpoint has no separate lm_head.weight. The head-bearing shard reconstructs lm_head from embed_tokens so it can be exported (BF16) into the merged model.
- Norm epsilon name. Cohere2 names the layernorm epsilon layer_norm_eps (Llama/Mistral use rms_norm_eps); the per-layer export template reads the correct attribute with a fallback.
- Generation config. Cohere2's generation_config sets cache_implementation=hybrid, which the 1-layer export template (built with use_cache=False) rejects. It is dropped during per-layer export and the real generation_config is restored in the merged model.
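The three fixes above can be pictured as small helpers. This is a hedged sketch, not the pipeline's actual code: the function names and the default epsilon are hypothetical, and plain dicts and a `SimpleNamespace` stand in for the real config and state-dict objects.

```python
from types import SimpleNamespace


def resolve_lm_head(state_dict):
    """Fall back to the embedding matrix when the checkpoint has no lm_head."""
    if "lm_head.weight" in state_dict:
        return state_dict["lm_head.weight"]
    return state_dict["model.embed_tokens.weight"]


def norm_epsilon(config, default=1e-5):
    """Read the norm epsilon under either naming convention.

    The default value is a placeholder for illustration only.
    """
    for name in ("layer_norm_eps", "rms_norm_eps"):
        value = getattr(config, name, None)
        if value is not None:
            return value
    return default


def export_generation_config(generation_config):
    """Copy of the generation config without cache_implementation."""
    return {
        key: value
        for key, value in generation_config.items()
        if key != "cache_implementation"
    }


cohere_like = SimpleNamespace(layer_norm_eps=1e-5)
llama_like = SimpleNamespace(rms_norm_eps=1e-6)
tied_checkpoint = {"model.embed_tokens.weight": "embedding-tensor"}
original_gen_config = {"cache_implementation": "hybrid", "bos_token_id": 5}
export_gen_config = export_generation_config(original_gen_config)
```

Because `export_generation_config` returns a copy, the original dict is still available to restore in the merged model.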
Calibration health-check on the run that produced this artifact — clean, no zero or NaN amax statistics:
- shard0 (layers 0–29 + embed): good=210, zero=0, nan=0
- shard1 (layers 30–59): good=210, zero=0, nan=0
- shard2 (layers 60–63 + norm + lm_head): good=28, zero=0, nan=0
(NVFP4_DEFAULT_CFG inserts 7 weight quantizers per layer.)
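The counts in this health-check follow from 7 weight quantizers per layer. A minimal sketch of such a check, with hypothetical function names and plain floats standing in for the amax tensors:

```python
import math


def expected_quantizers(num_layers, per_layer=7):
    """Weight quantizers a shard should report for its decoder layers."""
    return num_layers * per_layer


def amax_health(amax_values):
    """Bucket amax statistics into good / zero / nan counts."""
    counts = {"good": 0, "zero": 0, "nan": 0}
    for value in amax_values:
        if math.isnan(value):
            counts["nan"] += 1
        elif value == 0:
            counts["zero"] += 1
        else:
            counts["good"] += 1
    return counts


shard_layers = {"shard0": 30, "shard1": 30, "shard2": 4}
expected = {name: expected_quantizers(n) for name, n in shard_layers.items()}
sample = amax_health([0.5, 0.0, float("nan"), 2.0])
```

The expected totals are 210, 210, and 28. shard2 reports 28 rather than more because lm_head stays unquantized and so contributes no weight quantizer.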
After merge, config.json is patched to keep lm_head in quantization_config.ignore, set input_activations.dynamic: true, and inject input_scale=1.0 for every weight quantizer (modelopt 0.43 omits these keys, and vLLM's loader otherwise registers an uninitialized parameter and decodes garbage).
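A rough sketch of what such a post-merge patch could look like on plain Python data. The nesting of keys inside quantization_config is an assumption made for illustration (the real modelopt layout may differ), and both function names are hypothetical.

```python
import copy


def patch_quant_config(config):
    """Return a patched copy of a config dict.

    Assumption: "ignore" and "input_activations" sit directly under
    "quantization_config"; the real file may nest them differently.
    """
    patched = copy.deepcopy(config)
    quant = patched.setdefault("quantization_config", {})
    ignore = quant.setdefault("ignore", [])
    if "lm_head" not in ignore:
        ignore.append("lm_head")
    quant.setdefault("input_activations", {})["dynamic"] = True
    return patched


def missing_input_scales(tensor_names):
    """Names of input_scale entries to add, one per quantized linear."""
    suffix = ".weight_scale"
    existing = set(tensor_names)
    missing = []
    for name in tensor_names:
        if name.endswith(suffix):
            candidate = name[: -len(suffix)] + ".input_scale"
            if candidate not in existing:
                missing.append(candidate)
    return missing


config = {"quantization_config": {"quant_algo": "NVFP4"}}
patched = patch_quant_config(config)
names = [
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.mlp.down_proj.weight_scale",
]
to_inject = missing_input_scales(names)
```

Each name returned by `missing_input_scales` would then be written as a scalar 1.0 tensor next to its weight.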
Verification
Loaded and smoke-tested on a single DGX Spark (GB10) with vLLM 0.20.2rc1.dev53 — FlashInferCutlassNvFp4LinearKernel for the NVFP4 GEMM, FlashInfer attention backend:
- Model weights occupy ~62.6 GiB; on a 128 GB UMA Spark at gpu-memory-utilization 0.90 this leaves a ~43 GiB KV-cache pool (≈175K tokens at 4K context).
- All test generations are coherent and accurate, e.g. "The capital of France is" → "Paris. The area of France is 212,935 square miles…"; "17 + 25" → "42."
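The KV-cache figure can be sanity-checked with back-of-the-envelope arithmetic from the architecture numbers in Quick facts. This is a rough sketch under a simplifying assumption: every layer caches keys and values for every token. The sliding-window layers may need less in practice, and vLLM's own accounting is not reproduced here.

```python
LAYERS = 64
KV_HEADS = 8
HEAD_DIM = 128


def kv_bytes_per_token(bytes_per_element):
    """Bytes of K and V stored per token, summed over all layers."""
    return LAYERS * 2 * KV_HEADS * HEAD_DIM * bytes_per_element


def tokens_in_pool(pool_gib, bytes_per_element):
    """Tokens that fit into a KV-cache pool of the given size in GiB."""
    return int(pool_gib * 2**30) // kv_bytes_per_token(bytes_per_element)


tokens_16bit = tokens_in_pool(43, bytes_per_element=2)
tokens_fp8 = tokens_in_pool(43, bytes_per_element=1)
print(f"16-bit KV cache: {tokens_16bit:,} tokens")
print(f"FP8 KV cache:    {tokens_fp8:,} tokens")
```

With 2-byte cache elements the estimate is 176,128 tokens, close to the ≈175K quoted above; under the same assumption an FP8 cache would hold about twice as many.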
The weight-scale layout was also verified directly: every down_proj.weight_scale is [12288, 2304] (2304 = intermediate 36864 / group_size 16), and there are no stray _quantizer / _double_scale keys — the checkpoint loads with stock vLLM.
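The shape check is simple integer arithmetic: one scale per group of 16 values along the input dimension. A small sketch follows; the helper name is hypothetical, and extending the same layout to projections other than down_proj is an assumption.

```python
def weight_scale_shape(out_features, in_features, group_size=16):
    """Expected weight_scale shape for a linear with [out, in] weights."""
    if in_features % group_size != 0:
        raise ValueError("in_features must be a multiple of group_size")
    return (out_features, in_features // group_size)


HIDDEN = 12288
INTERMEDIATE = 36864

# down_proj maps intermediate -> hidden, so its weight is [hidden, intermediate].
down_proj_scale = weight_scale_shape(HIDDEN, INTERMEDIATE)
# gate_proj and up_proj map hidden -> intermediate (assumed to share the layout).
gate_proj_scale = weight_scale_shape(INTERMEDIATE, HIDDEN)
```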
A formal throughput benchmark has not been run yet.
Usage
vLLM (serve)
Verified on GB10 with vLLM 0.20.2rc1:
vllm serve /path/to/Fallen-Command-111B-NVFP4 \
--served-model-name Fallen-Command-111B-NVFP4 \
--attention-backend flashinfer \
--dtype auto \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.90 \
--enable-chunked-prefill \
--enable-prefix-caching \
--port 9007
vLLM auto-detects the modelopt NVFP4 quantization from config.json — no explicit --quantization flag is needed. --gpu-memory-utilization 0.90 leaves enough KV-cache pool for 32K context at max-num-seqs 4 on a 128 GB Spark; drop to 0.85 if you don't need the longer context.
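Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. Below is a minimal client sketch using only the Python standard library; the port and model name come from the serve command above, while the function names, the max_tokens value, and the timeout are illustrative choices.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9007"    # --port from the serve command
MODEL = "Fallen-Command-111B-NVFP4"   # --served-model-name


def build_chat_request(prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=300):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Requires the server to be running:
# print(chat("What is the capital of France?"))
```

The chat template is applied server-side, so the client only sends role/content messages.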
llama-swap entry
"Fallen-Command-111B-NVFP4":
  proxy: "http://127.0.0.1:9007"
  ttl: 0
  checkEndpoint: "/health"
  cmd: >-
    /home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
    --model /home/<user>/models/Fallen-Command-111B-NVFP4
    --served-model-name Fallen-Command-111B-NVFP4
    --attention-backend flashinfer
    --dtype auto
    --kv-cache-dtype fp8
    --max-model-len 32768
    --max-num-seqs 4
    --gpu-memory-utilization 0.90
    --enable-chunked-prefill
    --enable-prefix-caching
    --port 9007
    --host 127.0.0.1
Prompt format
Use the Cohere / Command chat template (it ships in tokenizer_config.json, so apply_chat_template and vLLM's OpenAI server handle it automatically). See TheDrummer's original card for finetune-specific usage notes.
Files in this repository
- model-NNNNN-of-00014.safetensors: 14 shards, NVFP4-packed weights + scales (~69 GB total)
- model.safetensors.index.json: weight map (1859 keys: 448 quantized linears × 3 scale/weight keys + injected input_scale keys + 64 layernorms + embed + lm_head)
- config.json: Cohere2 config with quantization_config.ignore=["lm_head"] and input_activations.dynamic: true
- hf_quant_config.json, generation_config.json: auxiliary modelopt + generation configs
- tokenizer.json, tokenizer_config.json, special_tokens_map.json: Command-A tokenizer, untouched from upstream
Acknowledgments
- TheDrummer for the Fallen-Command-A-111B finetune
- Cohere / Cohere Labs for the Command-A base model and the Cohere2 architecture
- NVIDIA for the DGX Spark / GB10 platform, the NVFP4 format, and modelopt
- vLLM project for modelopt NVFP4 inference support
License
This NVFP4 quantization inherits the license of the base model TheDrummer/Fallen-Command-A-111B-v1.1, which is derived from Cohere's Command-A — released under CC-BY-NC 4.0 with Cohere's Acceptable Use Policy. For research, evaluation, and personal non-commercial use only.
- Pipeline code (Apache 2.0): https://github.com/KaletoAI/distrib-nvfp4
Status
Single-author release. Feedback welcome — on the model artifact (vLLM behaviour, sampling, RP quality) and on the pipeline that built it.