---
license: apache-2.0
base_model:
- Gryphe/Codex-24B-Small-3.2
datasets:
- Gryphe/Opus-WritingPrompts
pipeline_tag: text-generation
tags:
- text adventure
- roleplay
- rpg
- creative writing
- nvfp4
- vllm
- conversational
---
# Codex-24B-Small-3.2 (NVFP4 quant)

This repo contains Codex-24B-Small-3.2 quantized to NVFP4, a 4-bit compression format suited for maximum performance on NVIDIA RTX 5000-series GPUs.
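
As a back-of-envelope check of what 4-bit buys (assuming the nominal ~24B parameter count and the block layout described in NVIDIA's NVFP4 writeup linked below: 4-bit values with one FP8 scale shared per 16-value micro-block), the quantized weights shrink to roughly 28% of BF16:

```python
# Back-of-envelope NVFP4 weight footprint.
# Assumptions: ~24e9 parameters; NVFP4 stores 4-bit (E2M1) values with one
# FP8 (E4M3) scale shared per 16-value micro-block, per NVIDIA's writeup.
PARAMS = 24e9
BLOCK = 16
bits_per_weight = 4 + 8 / BLOCK              # value bits + amortized scale bits
nvfp4_gb = PARAMS * bits_per_weight / 8 / 1e9
bf16_gb = PARAMS * 16 / 8 / 1e9
print(f"NVFP4 ~= {nvfp4_gb:.1f} GB, BF16 ~= {bf16_gb:.1f} GB")
```

This ignores the small per-tensor FP32 scale and non-quantized layers, so treat it as an estimate, not a download size.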

> ℹ️ This model is limited to the Hopper and Blackwell GPU families and will not work on RTX 3000- and RTX 4000-series GPUs.
> On those GPUs, please use the NVFP4A16 model instead, or enable slow emulation with `export VLLM_USE_NVFP4_CT_EMULATIONS=1`.

- Original model:
  - [Gryphe/Codex-24B-Small-3.2](https://huggingface.co/Gryphe/Codex-24B-Small-3.2)
- Fallback model for RTX 3000- and 4000-series GPUs:
  - [mratsim/Codex-24B-Small-3.2-NVFP4A16](https://huggingface.co/mratsim/Codex-24B-Small-3.2-NVFP4A16)

NVFP4 writeups:
- https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
- https://arxiv.org/pdf/2509.25149

## 📥 Usage & Running Instructions

The model was tested with vLLM on 1x RTX Pro 6000.

### Hardware

As of October 2025, this quantized model can only run on architectures with hardware FP4 support (Blackwell or later).
Cheaper GPUs with 24GB of VRAM (e.g. the RTX 5080 Super), which could run this model in pairs, are expected in Q1 2026.
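
A quick way to check for hardware FP4 support is the CUDA compute capability: Blackwell cards report SM 10.x (datacenter) or 12.x (consumer), while Ada (8.9) and Hopper (9.0) lack FP4 tensor cores. A minimal sketch (the helper name is ours; on a real machine you would feed it `torch.cuda.get_device_capability()`):

```python
# Hypothetical helper: decide if a GPU can run NVFP4 kernels natively.
# Compute-capability numbers are from NVIDIA's CUDA documentation:
# Blackwell reports SM 10.x or 12.x; Ada (8.9) and Hopper (9.0) lack FP4 units.
def has_hardware_fp4(capability) -> bool:
    """True for Blackwell-or-later compute capabilities (SM 10.0+)."""
    return tuple(capability) >= (10, 0)

# Usage on a CUDA machine (not run here):
#   import torch
#   print(has_hardware_fp4(torch.cuda.get_device_capability()))
```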

You may still run this model under emulation, albeit slowly, by setting `export VLLM_USE_NVFP4_CT_EMULATIONS=1`;
otherwise, use the alternative [mratsim/Codex-24B-Small-3.2-NVFP4A16](https://huggingface.co/mratsim/Codex-24B-Small-3.2-NVFP4A16) model.

### Recommendations

It is recommended to use at most 65K context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87),
especially with a model as small as 24B.
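
Context length also costs VRAM for the KV cache. A rough sketch, assuming the base model's Mistral-Small-style geometry (40 layers, 8 KV heads, head dim 128; these numbers are assumptions taken from the base model's config) and an FP16 KV cache:

```python
# Rough KV-cache footprint per context length (FP16 cache).
# Architecture numbers are assumptions from the base model's config:
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 40, 8, 128, 2
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # K and V
for ctx in (32_768, 65_536):
    print(f"{ctx:>6} tokens -> {ctx * bytes_per_token / 2**30:.0f} GiB")
```

That is roughly 5 GiB at the 32K context used in the script below and 10 GiB at 64K, on top of the weights, which is why `--gpu-memory-utilization` matters.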

This model is recommended with "min-p" sampling. Min-p is exposed by both the older Text Completions API and the Chat Completions API (as well as the newer Responses API),
but most LLM frontends only allow modifying min-p when using Text Completions.
Alternatively, you can use `--override-generation-config "${SAMPLER_JSONCONFIG}"` to override the server-side sampler defaults (a merge of `generation_config.json` and vLLM defaults).
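
If your frontend cannot set min-p, a raw request can carry it per call: vLLM's OpenAI-compatible server accepts extra sampling fields such as `min_p` directly in the request body. A sketch of such a payload (the model name matches `--served-model-name` below; port 8000 is vLLM's default and an assumption here):

```python
import json

# Per-request min-p via vLLM's OpenAI-compatible Completions endpoint.
# vLLM accepts extra sampling fields such as "min_p" in the JSON body.
payload = {
    "model": "Codex-24B-Small-3.2",
    "prompt": "You step into the torchlit hall and",
    "max_tokens": 128,
    "temperature": 0.5,
    "min_p": 0.05,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/completions
# with header "Content-Type: application/json".
```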

### Running script

```bash
# Model configuration (mandatory)
MODEL="mratsim/Codex-24B-Small-3.2-NVFP4"
MODELNAME="Codex-24B-Small-3.2"
CONTEXT_SIZE=32768
GPU_UTIL=0.95

# Sampling configuration (optional, if departing from `generation_config.json`)
SAMPLER_OVERRIDE='{"temperature": 0.5, "min_p": 0.05, "top_p": 1, "repetition_penalty": 1.05}'

# Prevent vLLM from using 100% CPU when idle (very recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use the FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization "${GPU_UTIL}" \
  --max-model-len "${CONTEXT_SIZE}" \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```
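
Since vLLM parses the `--override-generation-config` value as JSON and will reject a malformed string at startup, it can be worth sanity-checking the override before launching; a small sketch:

```python
import json

# Sanity-check the sampler override string before handing it to vLLM
# (same value as SAMPLER_OVERRIDE in the script above).
SAMPLER_OVERRIDE = '{"temperature": 0.5, "min_p": 0.05, "top_p": 1, "repetition_penalty": 1.05}'
override = json.loads(SAMPLER_OVERRIDE)  # raises ValueError if malformed
assert 0.0 <= override["min_p"] <= 1.0
assert override["temperature"] > 0
print(sorted(override))
```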

> ℹ️ The FlashInfer backend may fail with an error similar to
> `Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
>
> A workaround is to run a sed replacement within the vLLM install to double the workspace buffer:
> ```bash
> sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
> ```
> This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344

## 🔬 Quantization method

The llmcompressor library was used with the following recipe:

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```

The quantization was calibrated on 64 samples of 8192 sequence length from [`Gryphe/Opus-WritingPrompts`](https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts).

NVFP4 quantization requires very few samples; llmcompressor uses 20 in their examples.
Comparatively, 512 samples are recommended for GPTQ and 64 for AWQ (https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf).
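
For scale, the calibration budget above works out to about half a million tokens:

```python
# Calibration budget used for this quant: 64 samples x 8192 tokens each.
SAMPLES, SEQ_LEN = 64, 8192
total_tokens = SAMPLES * SEQ_LEN
print(total_tokens)  # 524288 tokens, ~0.5M
```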