Fallen-Command-111B-NVFP4 / README.md

Kaleto

	---
	license: other
	base_model: TheDrummer/Fallen-Command-A-111B-v1.1
	base_model_relation: quantized
	language:
	- en
	library_name: transformers
	tags:
	- nvfp4
	- fp4
	- modelopt
	- vllm
	- cohere2
	- command-a
	- dgx-spark
	- gb10
	- roleplay
	pipeline_tag: text-generation
	---

	# Fallen-Command-A-111B-v1.1 — NVFP4

	NVFP4 (4-bit floating-point, `group_size=16`) quantization of [TheDrummer/Fallen-Command-A-111B-v1.1](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1) — a roleplay/creative finetune of Cohere's Command-A (Cohere2 architecture, 111B). Produced with a custom 3-node heterogeneous distributed pipeline on a personal 2× NVIDIA DGX Spark + RTX 3090 setup. Stored as modelopt NVFP4 weights, served via vLLM's modelopt path.

	Cohere2 has a few architectural quirks — tied embeddings, `layer_norm_eps`, hybrid local/global attention — that needed pipeline-side handling; see [Cohere2-specific handling](#cohere2-specific-handling) below.

	---

	## Serving mode on Blackwell (GB10)

	On DGX Spark / GB10 with vLLM, this model serves as weight-only FP4: the 4-bit NVFP4 weights are dequantized for each matmul; activations stay BF16. vLLM 0.20.x has no FP4-activation GEMM kernel for Blackwell (sm_120/121), so the NVFP4 path is weight-only regardless of the `input_activations` field in `config.json` — on this stack a W4A4-config and a W4A16-config produce bit-identical output. This is the standard, and currently highest-quality, NVFP4 serving mode on Spark. On an FP4-activation-capable stack (TensorRT-LLM, or a future vLLM with a Blackwell FP4 GEMM) the same weights could run as true W4A4.

	A practical consequence: because serving is weight-only, the calibration dataset does not affect the served output, since the NVFP4 weight scales (per-block and per-tensor) are determined by the weights alone. The calibration pass below is part of the standard modelopt flow but is effectively output-invariant for this serving mode.
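To make the weight-only point concrete, here is a minimal pure-Python sketch of block-scaled FP4 quantization and dequantization with `group_size=16`. It is an illustration under stated assumptions, not the modelopt or vLLM implementation: the function names are hypothetical, and the block scale is kept as a plain Python float, whereas real NVFP4 stores block scales in FP8 alongside a per-tensor scale. The point it shows is that the scale is computed from the weights in the block and nothing else.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # matches the card's group_size=16


def quantize_block(block):
    """Quantize one block of weights; returns (codes, scale).

    The scale depends only on the largest magnitude in the block,
    so no calibration data is involved.
    """
    amax = max(abs(x) for x in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate weights, as a weight-only path would before a matmul."""
    return [c * scale for c in codes]


def quantize_row(row):
    """Split a row into GROUP_SIZE blocks and quantize each one independently."""
    if len(row) % GROUP_SIZE:
        raise ValueError("row length must be a multiple of GROUP_SIZE")
    return [quantize_block(row[i:i + GROUP_SIZE])
            for i in range(0, len(row), GROUP_SIZE)]
```

For a block whose largest magnitude is 3.0, the scale is 0.5, so a weight of -1.1 is stored as -2.0 and dequantizes to -1.0.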

	---

	## Quick facts

	| Property | Value |
	|---|---|
	| Base model | [TheDrummer/Fallen-Command-A-111B-v1.1](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1) (Cohere Command-A finetune) |
	| Architecture | Cohere2ForCausalLM — 64 layers, hidden_size 12288, intermediate 36864, 96 attn heads, 8 KV heads, head_dim 128 |
	| Notable arch features | Parallel attention/MLP block, hybrid local/global attention (sliding_window 4096, pattern 4), tied input/output embeddings, RoPE θ=50000, 256K max context |
	| Original size | ~207 GB (BF16) |
	| Quantized size | ~69 GB (14 shards, see Files tab) |
	| Quant format | NVFP4 via [nvidia-modelopt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.43.0, `group_size=16` |
	| lm_head | Kept BF16 (unquantized), listed in `quantization_config.ignore` |
	| Quantized modules | 448 Linear layers (64 × 7: q/k/v/o + gate/up/down) |
	| KV cache | Configurable at serve time (FP8 recommended) |
	| Calibration | 256-sample pass (~25.7 min); see note above on output-invariance |
	| Conversion date | 2026-05-22 |
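The architecture figures in the table are enough for a rough size estimate of the quantized Linear layers. The sketch below assumes 4-bit elements plus one 8-bit scale per 16-weight block (4.5 bits per weight); the 8-bit scale width is an assumption, not something the card states. The BF16 embedding and `lm_head` tensors are left out because the card does not give the vocabulary size, so the result is expected to land below the ~69 GB total.

```python
# Architecture values taken from the Quick facts table.
HIDDEN = 12288
INTERMEDIATE = 36864
N_HEADS, N_KV_HEADS, HEAD_DIM = 96, 8, 128
N_LAYERS = 64
GROUP_SIZE = 16


def linear_params_per_layer():
    """Weights in the 7 quantized Linear modules of one decoder layer."""
    q = HIDDEN * N_HEADS * HEAD_DIM
    k = v = HIDDEN * N_KV_HEADS * HEAD_DIM
    o = N_HEADS * HEAD_DIM * HIDDEN
    gate = up = HIDDEN * INTERMEDIATE
    down = INTERMEDIATE * HIDDEN
    return q + k + v + o + gate + up + down


def nvfp4_bits_per_weight(element_bits=4, scale_bits=8):
    """Effective bits per weight; scale_bits=8 is an assumption."""
    return element_bits + scale_bits / GROUP_SIZE


total_linear = N_LAYERS * linear_params_per_layer()
approx_gb = total_linear * nvfp4_bits_per_weight() / 8 / 1e9

print(f"quantized Linear weights: {total_linear / 1e9:.1f}B parameters")
print(f"approximate NVFP4 size:   {approx_gb:.1f} GB (excluding BF16 tensors)")
```

Under these assumptions the Linear layers hold about 107.9B parameters and take about 60.7 GB.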

	---

	## The hardware: 2× DGX Spark + 1× RTX 3090

	The cluster used to produce this artifact:

	| Node | GPU | Memory | Role |
	|---|---|---|---|
	| DX10-01 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard0: layers 0–29 + embed_tokens |
	| DX10-02 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard1: layers 30–59 |
	| eGPU host (Proxmox VM) | NVIDIA RTX 3090 (sm_86) | 24 GB VRAM | shard2: layers 60–63 + final norm + lm_head |

	A 30/30/4-layer split keeps each Spark well inside its 128 GB UMA budget while the 3090 handles the tail 4 layers plus the norm and lm_head. Ray RPC carries cross-node hidden states transparently; the Ampere 3090 has no native FP4 hardware but only handles BF16 calibration math, so the architecture mismatch is irrelevant until inference time — the exported NVFP4 file is identical to what an all-Blackwell cluster would produce.

	The pipeline is open-source at [github.com/KaletoAI/distrib-nvfp4](https://github.com/KaletoAI/distrib-nvfp4) (Apache 2.0): N-way layer splits via `--shard-layers a,b,c`, memory-sorted node placement so the smallest-VRAM node gets the smallest shard, and disk-checkpointed phases for resumable runs.
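The layer split and memory-sorted placement can be sketched in a few lines. This is a hypothetical illustration, not code from distrib-nvfp4: the function names and node labels are made up, and it assumes the comma-separated values are per-shard layer counts (the card does not say whether `--shard-layers` takes counts or boundaries). It also ignores the extra memory needed for `embed_tokens`, the final norm and `lm_head`.

```python
def split_layers(sizes):
    """Turn per-shard layer counts into inclusive (first, last) layer ranges."""
    ranges, start = [], 0
    for n in sizes:
        ranges.append((start, start + n - 1))
        start += n
    return ranges


def place_shards(sizes, node_memory_gb):
    """Pair the smallest shard with the smallest-memory node, and so on.

    Both sorts are stable, so ties keep their original order.
    """
    shards = sorted(split_layers(sizes), key=lambda r: r[1] - r[0])
    nodes = sorted(node_memory_gb, key=node_memory_gb.get)
    return dict(zip(nodes, shards))


# Node labels are illustrative; memory sizes come from the hardware table.
cluster = {"DX10-01": 128, "DX10-02": 128, "rtx3090": 24}
placement = place_shards([30, 30, 4], cluster)
for node, (first, last) in placement.items():
    print(f"{node}: layers {first}-{last}")
```

With a 30/30/4 split this reproduces the assignment in the table: the 24 GB card receives layers 60 to 63 and the two Sparks receive layers 0 to 29 and 30 to 59.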

	---

	## Cohere2-specific handling

	Command-A's Cohere2 architecture needed three fixes beyond the standard modelopt NVFP4 flow:

	1. Tied embeddings. Cohere2 sets `use_embedding_sharing=true` — the output projection reuses `model.embed_tokens.weight` and the checkpoint has no separate `lm_head.weight`. The head-bearing shard reconstructs `lm_head` from `embed_tokens` so it can be exported (BF16) into the merged model.
	2. Norm epsilon name. Cohere2 names the layernorm epsilon `layer_norm_eps` (Llama/Mistral use `rms_norm_eps`); the per-layer export template reads the correct attribute with a fallback.
	3. Generation config. Cohere2's `generation_config` sets `cache_implementation=hybrid`, which the 1-layer export template (built with `use_cache=False`) rejects. It is dropped during per-layer export and the real `generation_config` is restored in the merged model.
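Fixes 2 and 3 in the list above amount to small pieces of defensive config handling. The sketch below shows one way they could look; the helper names and the `1e-5` default are hypothetical, and the configs are stand-in objects rather than real `transformers` classes.

```python
from types import SimpleNamespace


def norm_eps(config, default=1e-5):
    """Read the norm epsilon under either naming convention.

    Tries the Cohere2-style name first, then the Llama/Mistral-style name.
    The default value is an assumption for illustration.
    """
    for name in ("layer_norm_eps", "rms_norm_eps"):
        value = getattr(config, name, None)
        if value is not None:
            return value
    return default


def strip_cache_implementation(generation_config):
    """Return (cleaned, removed) so the dropped key can be restored after merge."""
    cleaned = dict(generation_config)
    removed = {}
    if "cache_implementation" in cleaned:
        removed["cache_implementation"] = cleaned.pop("cache_implementation")
    return cleaned, removed


cohere_like = SimpleNamespace(layer_norm_eps=1e-5)
llama_like = SimpleNamespace(rms_norm_eps=1e-6)
print(norm_eps(cohere_like), norm_eps(llama_like))

original = {"cache_implementation": "hybrid", "max_length": 20}
cleaned, removed = strip_cache_implementation(original)
restored = {**cleaned, **removed}  # what the merged model would get back
```

Keeping the removed key in a separate dict makes the restore step after the merge a plain dictionary update.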

	Calibration health-check on the run that produced this artifact — clean, no zero or NaN amax statistics:

	- shard0 (layers 0–29 + embed): good=210, zero=0, nan=0
	- shard1 (layers 30–59): good=210, zero=0, nan=0
	- shard2 (layers 60–63 + norm + lm_head): good=28, zero=0, nan=0

	(`NVFP4_DEFAULT_CFG` inserts 7 weight quantizers per layer.)
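A health check of this kind can be sketched as a simple tally over the collected amax values. The function and the quantizer names below are hypothetical; the sketch only assumes that each weight quantizer yields one scalar amax statistic.

```python
import math


def amax_health(amax_by_quantizer):
    """Count quantizers whose amax is usable, zero, or NaN."""
    counts = {"good": 0, "zero": 0, "nan": 0}
    for value in amax_by_quantizer.values():
        if math.isnan(value):
            counts["nan"] += 1
        elif value == 0:
            counts["zero"] += 1
        else:
            counts["good"] += 1
    return counts


# Illustrative shard: 30 layers x 7 weight quantizers, all with a sane amax.
PROJECTIONS = ("q", "k", "v", "o", "gate", "up", "down")
shard = {f"layers.{i}.{p}_proj": 1.0 for i in range(30) for p in PROJECTIONS}
print(amax_health(shard))
```

A 30-layer shard has 30 × 7 = 210 quantizers and a 4-layer shard has 28, which matches the `good` counts reported above.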

	After merge, `config.json` is patched to keep `lm_head` in `quantization_config.ignore`, set `input_activations.dynamic: true`, and inject `input_scale=1.0` for every weight quantizer (modelopt 0.43 omits these keys, and vLLM's loader otherwise registers an uninitialized parameter and decodes garbage).
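The `config.json` part of that patch can be sketched as a small dictionary transform. The exact nesting of `quantization_config` is not shown in this card, so the sketch searches for `input_activations` entries recursively instead of assuming a path; the example layout and the function name are hypothetical. Injecting `input_scale` tensors into the safetensors shards is a separate step and is not covered here.

```python
import copy


def patch_quantization_config(config):
    """Return a patched copy: lm_head ignored, input activations marked dynamic."""
    patched = copy.deepcopy(config)
    qcfg = patched.setdefault("quantization_config", {})

    ignore = qcfg.setdefault("ignore", [])
    if "lm_head" not in ignore:
        ignore.append("lm_head")

    def mark_dynamic(node):
        # Walk nested dicts/lists and flag every input_activations block.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "input_activations" and isinstance(value, dict):
                    value["dynamic"] = True
                mark_dynamic(value)
        elif isinstance(node, list):
            for item in node:
                mark_dynamic(item)

    mark_dynamic(qcfg)
    return patched


# Hypothetical layout, for illustration only.
example = {
    "quantization_config": {
        "config_groups": {
            "group_0": {"input_activations": {"num_bits": 4}},
        },
    },
}
patched = patch_quantization_config(example)
```

Working on a deep copy leaves the original dictionary untouched, so the patch can be re-run safely on a freshly loaded config.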

	---

	## Verification

	Loaded and smoke-tested on a single DGX Spark (GB10) with vLLM `0.20.2rc1.dev53` — `FlashInferCutlassNvFp4LinearKernel` for the NVFP4 GEMM, FlashInfer attention backend:

	- Model weights occupy ~62.6 GiB; on a 128 GB UMA Spark at `gpu-memory-utilization 0.90` this leaves a ~43 GiB KV-cache pool (≈175K tokens at 4K context).
	- All test generations are coherent and accurate — e.g. "The capital of France is" → "Paris. The area of France is 212,935 square miles…"; "17 + 25" → "42."
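For a sense of scale, the KV-cache pool can be converted into a token count with back-of-the-envelope arithmetic. The sketch below assumes every one of the 64 layers caches full-length K and V, which ignores any savings from the sliding-window layers, and it is not a description of how vLLM accounts for its cache blocks.

```python
# Values from the Quick facts table.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128


def kv_bytes_per_token(bytes_per_element):
    """Bytes of K plus V stored per token across all layers."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_element


def tokens_in_pool(pool_gib, bytes_per_element):
    """How many tokens fit in a KV-cache pool of the given size."""
    return int(pool_gib * 2**30) // kv_bytes_per_token(bytes_per_element)


print(tokens_in_pool(43, 2))  # 2-byte (BF16) cache entries -> 176128
print(tokens_in_pool(43, 1))  # 1-byte (FP8) cache entries  -> 352256
```

Under these assumptions a 43 GiB pool holds about 176K tokens with 2-byte entries, close to the ≈175K figure above, and about twice that with an FP8 cache.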

	The weight-scale layout was also verified directly: every `down_proj.weight_scale` is `[12288, 2304]` (2304 = intermediate 36864 / `group_size` 16), and there are no stray `_quantizer` / `_double_scale` keys — the checkpoint loads with stock vLLM.
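The two layout checks described above are easy to express in code. This is a sketch with hypothetical helper names and an illustrative key list; it does not open the actual safetensors files.

```python
GROUP_SIZE = 16
STRAY_MARKERS = ("_quantizer", "_double_scale")


def expected_scale_shape(weight_shape, group_size=GROUP_SIZE):
    """Shape of the per-block scale tensor for a [out, in] Linear weight."""
    out_features, in_features = weight_shape
    if in_features % group_size:
        raise ValueError("in_features must be divisible by group_size")
    return (out_features, in_features // group_size)


def stray_keys(keys):
    """Return checkpoint keys that contain a leftover quantizer marker."""
    return [k for k in keys if any(marker in k for marker in STRAY_MARKERS)]


# down_proj maps intermediate (36864) back to hidden (12288).
print(expected_scale_shape((12288, 36864)))

# Illustrative key names, not read from the real index file.
sample_keys = [
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.mlp.down_proj.weight_scale",
    "model.layers.0.mlp.down_proj.input_scale",
]
print(stray_keys(sample_keys))
```

For `down_proj` this gives `(12288, 2304)`, and a clean checkpoint yields an empty list of stray keys.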

	A formal throughput benchmark has not been run yet.

	---

	## Usage

	### vLLM (serve)

	Verified on GB10 with vLLM `0.20.2rc1`:

	```bash
	vllm serve /path/to/Fallen-Command-111B-NVFP4 \
	  --served-model-name Fallen-Command-111B-NVFP4 \
	  --attention-backend flashinfer \
	  --dtype auto \
	  --kv-cache-dtype fp8 \
	  --max-model-len 32768 \
	  --max-num-seqs 4 \
	  --gpu-memory-utilization 0.90 \
	  --enable-chunked-prefill \
	  --enable-prefix-caching \
	  --port 9007
	```

	vLLM auto-detects the modelopt NVFP4 quantization from `config.json` — no explicit `--quantization` flag is needed. `--gpu-memory-utilization 0.90` leaves enough KV-cache pool for 32K context at `max-num-seqs 4` on a 128 GB Spark; drop to 0.85 if you don't need the longer context.
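Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server is reachable at `127.0.0.1:9007` with the served model name from the command above; the helper names, `max_tokens` value and timeout are illustrative choices.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9007"        # assumes the --port used above
MODEL_NAME = "Fallen-Command-111B-NVFP4"  # matches --served-model-name


def build_chat_request(prompt, max_tokens=64):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=300):
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What is the capital of France?"))` should print the model's reply. The chat template is applied server-side, so the client only sends plain role/content messages.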

	### llama-swap entry

	```yaml
	"Fallen-Command-111B-NVFP4":
	  proxy: "http://127.0.0.1:9007"
	  ttl: 0
	  checkEndpoint: "/health"
	  cmd: >-
	    /home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
	    --model /home/<user>/models/Fallen-Command-111B-NVFP4
	    --served-model-name Fallen-Command-111B-NVFP4
	    --attention-backend flashinfer
	    --dtype auto
	    --kv-cache-dtype fp8
	    --max-model-len 32768
	    --max-num-seqs 4
	    --gpu-memory-utilization 0.90
	    --enable-chunked-prefill
	    --enable-prefix-caching
	    --port 9007
	    --host 127.0.0.1
	```

	### Prompt format

	Use the Cohere / Command chat template (it ships in `tokenizer_config.json`, so `apply_chat_template` and vLLM's OpenAI server handle it automatically). See [TheDrummer's original card](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1) for finetune-specific usage notes.

	---

	## Files in this repository

	- `model-NNNNN-of-00014.safetensors` — 14 shards, NVFP4-packed weights + scales (~69 GB total)
	- `model.safetensors.index.json` — weight map (1859 keys: 448 quantized linears × 3 scale/weight keys + 448 injected `input_scale` keys + 64 layernorms + final norm + embed + lm_head)
	- `config.json` — Cohere2 config with `quantization_config.ignore=["lm_head"]` and `input_activations.dynamic: true`
	- `hf_quant_config.json`, `generation_config.json` — auxiliary modelopt + generation configs
	- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` — Command-A tokenizer, untouched from upstream
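The 1859-key figure for the weight map can be checked with a short tally. The breakdown below is a reconstruction, not read from the index file: it assumes one injected `input_scale` per quantized linear and one key for the final norm, which is the combination that makes the parts add up to 1859.

```python
N_LAYERS = 64
LINEARS_PER_LAYER = 7  # q/k/v/o + gate/up/down

quantized_linears = N_LAYERS * LINEARS_PER_LAYER

key_counts = {
    "weight and scale keys (3 per linear)": quantized_linears * 3,
    "injected input_scale (1 per linear, assumed)": quantized_linears,
    "per-layer layernorm": N_LAYERS,
    "final norm (assumed)": 1,
    "embed_tokens": 1,
    "lm_head": 1,
}

total_keys = sum(key_counts.values())
for label, count in key_counts.items():
    print(f"{count:5d}  {label}")
print(f"{total_keys:5d}  total")
```

The tally comes to 1344 + 448 + 64 + 1 + 1 + 1 = 1859.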

	---

	## Acknowledgments

	- [TheDrummer](https://huggingface.co/TheDrummer) for the Fallen-Command-A-111B finetune
	- [Cohere / Cohere Labs](https://huggingface.co/CohereLabs) for the Command-A base model and the Cohere2 architecture
	- NVIDIA for the DGX Spark / GB10 platform, the NVFP4 format, and [modelopt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
	- vLLM project for modelopt NVFP4 inference support

	---

	## License

	This NVFP4 quantization inherits the license of the base model TheDrummer/Fallen-Command-A-111B-v1.1, which is derived from Cohere's Command-A — released under CC-BY-NC 4.0 with Cohere's Acceptable Use Policy. For research, evaluation, and personal non-commercial use only.

	- Pipeline code (Apache 2.0): https://github.com/KaletoAI/distrib-nvfp4

	---

	## Status

	Single-author release. Feedback welcome — on the model artifact (vLLM behaviour, sampling, RP quality) and on the pipeline that built it.