---
license: apache-2.0
base_model:
- Gryphe/Codex-24B-Small-3.2
datasets:
- Gryphe/Opus-WritingPrompts
pipeline_tag: text-generation
tags:
- text adventure
- roleplay
- rpg
- creative writing
- nvfp4
- vllm
- conversational
---
# Codex-24B-Small-3.2 (NVFP4 quant)

This repo contains Codex-24B-Small-3.2 quantized to NVFP4, a 4-bit compression format suited for maximum performance on NVIDIA RTX 5000-series GPUs.
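
As a back-of-envelope check of what 4-bit buys (assuming the nominal ~24B parameter count and the block layout described in NVIDIA's NVFP4 writeup linked below: 4-bit values with one FP8 scale shared per 16-value micro-block), the quantized weights shrink to roughly 28% of BF16:

```python
# Back-of-envelope NVFP4 weight footprint.
# Assumptions: ~24e9 parameters; NVFP4 stores 4-bit (E2M1) values with one
# FP8 (E4M3) scale shared per 16-value micro-block, per NVIDIA's writeup.
PARAMS = 24e9
BLOCK = 16
bits_per_weight = 4 + 8 / BLOCK              # value bits + amortized scale bits
nvfp4_gb = PARAMS * bits_per_weight / 8 / 1e9
bf16_gb = PARAMS * 16 / 8 / 1e9
print(f"NVFP4 ~= {nvfp4_gb:.1f} GB, BF16 ~= {bf16_gb:.1f} GB")
```

This ignores the small per-tensor FP32 scale and non-quantized layers, so treat it as an estimate, not a download size.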

> ℹ️ This model is limited to the Hopper and Blackwell GPU families and will not work on RTX 3000- and RTX 4000-series GPUs.
> On those GPUs, please use the NVFP4A16 model instead, or enable slow emulation with `export VLLM_USE_NVFP4_CT_EMULATIONS=1`.

- Original model:
  - [Gryphe/Codex-24B-Small-3.2](https://huggingface.co/Gryphe/Codex-24B-Small-3.2)
- Fallback model for RTX 3000- and 4000-series GPUs:
  - [mratsim/Codex-24B-Small-3.2-NVFP4A16](https://huggingface.co/mratsim/Codex-24B-Small-3.2-NVFP4A16)

NVFP4 writeups:
- https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
- https://arxiv.org/pdf/2509.25149

## 📥 Usage & Running Instructions

The model was tested with vLLM on 1x RTX Pro 6000.

### Hardware

As of October 2025, this quantized model can only run on architectures with hardware FP4 support (Blackwell or later).
Cheaper GPUs with 24GB of VRAM (e.g. the RTX 5080 Super), which could run this model in pairs, are expected in Q1 2026.
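
A quick way to check for hardware FP4 support is the CUDA compute capability: Blackwell cards report SM 10.x (datacenter) or 12.x (consumer), while Ada (8.9) and Hopper (9.0) lack FP4 tensor cores. A minimal sketch (the helper name is ours; on a real machine you would feed it `torch.cuda.get_device_capability()`):

```python
# Hypothetical helper: decide if a GPU can run NVFP4 kernels natively.
# Compute-capability numbers are from NVIDIA's CUDA documentation:
# Blackwell reports SM 10.x or 12.x; Ada (8.9) and Hopper (9.0) lack FP4 units.
def has_hardware_fp4(capability) -> bool:
    """True for Blackwell-or-later compute capabilities (SM 10.0+)."""
    return tuple(capability) >= (10, 0)

# Usage on a CUDA machine (not run here):
#   import torch
#   print(has_hardware_fp4(torch.cuda.get_device_capability()))
```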

You may still run this model under emulation, albeit slowly, by setting `export VLLM_USE_NVFP4_CT_EMULATIONS=1`;
otherwise, use the alternative [mratsim/Codex-24B-Small-3.2-NVFP4A16](https://huggingface.co/mratsim/Codex-24B-Small-3.2-NVFP4A16) model.

### Recommendations

It is recommended to use at most 65K context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87),
especially with a model as small as 24B.
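
Context length also costs VRAM for the KV cache. A rough sketch, assuming the base model's Mistral-Small-style geometry (40 layers, 8 KV heads, head dim 128; these numbers are assumptions taken from the base model's config) and an FP16 KV cache:

```python
# Rough KV-cache footprint per context length (FP16 cache).
# Architecture numbers are assumptions from the base model's config:
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 40, 8, 128, 2
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # K and V
for ctx in (32_768, 65_536):
    print(f"{ctx:>6} tokens -> {ctx * bytes_per_token / 2**30:.0f} GiB")
```

That is roughly 5 GiB at the 32K context used in the script below and 10 GiB at 64K, on top of the weights, which is why `--gpu-memory-utilization` matters.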

This model is recommended with "min-p" sampling. Min-p is exposed by both the older Text Completions API and the Chat Completions API (as well as the newer Responses API),
but most LLM frontends only allow modifying min-p when using Text Completions.
Alternatively, you can use `--override-generation-config "${SAMPLER_JSONCONFIG}"` to override the server-side sampler defaults (a merge of `generation_config.json` and vLLM defaults).
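
If your frontend cannot set min-p, a raw request can carry it per call: vLLM's OpenAI-compatible server accepts extra sampling fields such as `min_p` directly in the request body. A sketch of such a payload (the model name matches `--served-model-name` below; port 8000 is vLLM's default and an assumption here):

```python
import json

# Per-request min-p via vLLM's OpenAI-compatible Completions endpoint.
# vLLM accepts extra sampling fields such as "min_p" in the JSON body.
payload = {
    "model": "Codex-24B-Small-3.2",
    "prompt": "You step into the torchlit hall and",
    "max_tokens": 128,
    "temperature": 0.5,
    "min_p": 0.05,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/completions
# with header "Content-Type: application/json".
```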

### Running script

```bash
# Model configuration (mandatory)
MODEL="mratsim/Codex-24B-Small-3.2-NVFP4"
MODELNAME="Codex-24B-Small-3.2"
CONTEXT_SIZE=32768
GPU_UTIL=0.95

# Sampling configuration (optional, if departing from `generation_config.json`)
SAMPLER_OVERRIDE='{"temperature": 0.5, "min_p": 0.05, "top_p": 1, "repetition_penalty": 1.05}'

# Prevent vLLM from using 100% CPU when idle (very recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use the FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization "${GPU_UTIL}" \
  --max-model-len "${CONTEXT_SIZE}" \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```
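
Since vLLM parses the `--override-generation-config` value as JSON and will reject a malformed string at startup, it can be worth sanity-checking the override before launching; a small sketch:

```python
import json

# Sanity-check the sampler override string before handing it to vLLM
# (same value as SAMPLER_OVERRIDE in the script above).
SAMPLER_OVERRIDE = '{"temperature": 0.5, "min_p": 0.05, "top_p": 1, "repetition_penalty": 1.05}'
override = json.loads(SAMPLER_OVERRIDE)  # raises ValueError if malformed
assert 0.0 <= override["min_p"] <= 1.0
assert override["temperature"] > 0
print(sorted(override))
```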

> ℹ️ The FlashInfer backend may fail with an error similar to
> `Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
>
> A workaround is to run a sed replacement within the vLLM install to double the workspace buffer:
> ```bash
> sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
> ```
> This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344

## 🔬 Quantization method

The llmcompressor library was used with the following recipe:

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```

The quantization was calibrated on 64 samples of 8192 sequence length from [`Gryphe/Opus-WritingPrompts`](https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts).

NVFP4 quantization requires very few samples; llmcompressor uses 20 in their examples.
Comparatively, 512 samples are recommended for GPTQ and 64 for AWQ (https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf).
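
For scale, the calibration budget above works out to about half a million tokens:

```python
# Calibration budget used for this quant: 64 samples x 8192 tokens each.
SAMPLES, SEQ_LEN = 64, 8192
total_tokens = SAMPLES * SEQ_LEN
print(total_tokens)  # 524288 tokens, ~0.5M
```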