---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- olmo
- nvfp4
- quantized
- long-context
- vllm
- modelopt
datasets:
- allenai/c4
base_model: allenai/Olmo-3-7B-Instruct
pipeline_tag: text-generation
model-index:
- name: OLMo-3-7B-Instruct-NVFP4-1M
results: []
---
# OLMo-3-7B-Instruct-NVFP4-1M
NVFP4 quantized version of [allenai/Olmo-3-7B-Instruct](https://huggingface.co/allenai/Olmo-3-7B-Instruct) with extended 1M token context support via linear RoPE scaling.
## Model Description
This model is an NVFP4 (4-bit floating point) quantization of OLMo-3-7B-Instruct, targeting NVIDIA DGX Spark systems with Blackwell GB10 GPUs, with additional support for the Ada Lovelace architecture. The quantization was produced with NVIDIA's ModelOpt library and uses two-level scaling: a per-block FP8 (E4M3) scale plus a global FP32 scale.
### Key Features
- **Base Model:** allenai/Olmo-3-7B-Instruct (7.3B parameters)
- **Quantization Format:** NVFP4 with group_size=16
- **Context Length:** 1,048,576 tokens (1M) via linear RoPE scaling
- **Model Size:** 5.30 GB (64% reduction from 14.60 GB)
- **GPU Memory:** ~5.23 GiB (64% reduction)
## Performance
| Metric | Original | Quantized | Improvement |
|--------|----------|-----------|-------------|
| Model Size | 14.60 GB | 5.30 GB | 64% reduction |
| GPU Memory | 14.6 GB | 5.23 GiB | 64% reduction |
| Context Length | 65,536 | 1,048,576 | 16x increase |
| Inference Speed | - | 31-35 tok/s | - |
## Usage
**Important:** This model requires vLLM with ModelOpt quantization support. It cannot be loaded with standard transformers.
### vLLM Server Deployment
```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M \
    --quantization modelopt \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --max-model-len 200000 \
    --served-model-name 'OLMo-3-7B-NVFP4' \
    --host 0.0.0.0 \
    --port 8000
```
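Once the server is up, it exposes the standard OpenAI-compatible REST API. A minimal client-side sketch using only the Python standard library (no `openai` package required); the host, port, and served model name here simply mirror the flags in the launch command above:

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "OLMo-3-7B-NVFP4") -> dict:
    # The "model" field must match the --served-model-name server flag.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "max_tokens": 512,
    }

def chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    # POST to the OpenAI-compatible /chat/completions endpoint.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("What is artificial intelligence?")
# reply = chat(payload)  # requires the vLLM server above to be running
# print(reply["choices"][0]["message"]["content"])
```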
### Python Usage with vLLM
```python
from vllm import LLM, SamplingParams
llm = LLM(
    model="Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=200000,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
prompts = ["What is artificial intelligence?"]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
## Requirements
- **GPU:** NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Blackwell)
- **vLLM:** Latest version with ModelOpt support
- **Dependencies:** `pip install vllm transformers torchao`
## Quantization Details
- **Algorithm:** NVFP4 (4-bit floating point)
- **Calibration Dataset:** allenai/c4 (2048 samples)
- **Calibration Length:** 2048 tokens per sample
- **Tool:** NVIDIA ModelOpt 0.39.0
- **Group Size:** 16
- **Excluded Layers:** lm_head
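The per-block scheme can be illustrated with a simplified sketch. This is a toy model only: real NVFP4 encodes each block scale in FP8 (E4M3) under a global FP32 scale and packs two 4-bit codes per byte, whereas this sketch just shows the E2M1 value grid and one scale per group of 16 values:

```python
# Representable magnitudes of an FP4 E2M1 value, plus their negatives.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted({v for x in E2M1 for v in (x, -x)})

def quantize_block(block):
    # One scale per group of up to 16 values (group_size=16); the
    # scale maps the block's largest magnitude onto the grid max (6.0).
    assert len(block) <= 16
    scale = max(abs(x) for x in block) / 6.0 or 1.0
    q = [min(GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

q, s = quantize_block([6.0, -3.0, 0.8, 0.0])
# dequantize_block(q, s) → [6.0, -3.0, 1.0, 0.0]
```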
## Context Extension
The context window was extended from 65,536 to 1,048,576 tokens using linear RoPE scaling:
- **Scaling Factor:** 16x
- **rope_theta:** 50,000,000
- **rope_scaling:** `{"type": "linear", "factor": 16.0}`
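The mechanism behind the `factor` parameter can be sketched in a few lines: linear RoPE scaling divides the position index by the factor before the rotary angles are computed, so positions up to 16x the trained window map back into it (`dim=8` below is an arbitrary toy head dimension, not the model's real one):

```python
def rope_angles(position, dim=8, theta=50_000_000.0, factor=1.0):
    # Linear scaling: compress the position index before computing
    # the per-dimension rotary angles.
    pos = position / factor
    return [pos / theta ** (2 * i / dim) for i in range(dim // 2)]

# With factor=16, position 1600 yields the same angles the base model
# saw at position 100.
assert rope_angles(1600, factor=16.0) == rope_angles(100)
```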
Note: the actual usable context depends on available GPU memory. With a 120 GB GPU at 95% utilization, roughly 200,000 tokens fit in the KV cache.
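The ~200,000-token figure follows from a back-of-envelope KV-cache calculation. The model shape used here (32 layers, 32 KV heads, head dimension 128, fp16 cache) is an assumption for illustration, not read from the actual config:

```python
def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    # Factor of 2 covers the separate K and V caches.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

GIB = 1024 ** 3
# 95% of a 120 GB GPU, minus ~5.23 GiB of quantized weights.
budget_bytes = 120e9 * 0.95 - 5.23 * GIB
max_tokens = int(budget_bytes / kv_bytes_per_token())
# → roughly 206,000 tokens, in line with the ~200,000 quoted above
```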
## Architecture Compatibility
For vLLM compatibility, the model uses:
- **Architecture:** Olmo2ForCausalLM
- **Model Type:** olmo2
This mapping allows vLLM to properly load the OLMo-3 architecture.
## Limitations
- Requires vLLM with `--quantization modelopt` flag
- Cannot be loaded with standard transformers
- Requires NVIDIA GPU with FP4 support (Ada Lovelace or newer)
- Maximum usable context limited by GPU memory for KV cache
## Intended Use
- Long-context instruction following and chat
- Document analysis and summarization
- Code generation and review
- Research and educational purposes
## License
Apache 2.0 (inherited from base model)
## Citation
```bibtex
@misc{olmo3-nvfp4-1m,
author = {Ex0bit},
title = {OLMo-3-7B-Instruct-NVFP4-1M: NVFP4 Quantized OLMo-3 with 1M Context},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M}}
}
```
## Acknowledgments
- Base model by [Allen Institute for AI (Ai2)](https://allenai.org/)
- Quantization using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- Inference powered by [vLLM](https://github.com/vllm-project/vllm)