---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- olmo
- nvfp4
- quantized
- long-context
- vllm
- modelopt
datasets:
- allenai/c4
base_model: allenai/Olmo-3-7B-Instruct
pipeline_tag: text-generation
model-index:
- name: OLMo-3-7B-Instruct-NVFP4-1M
  results: []
---

# OLMo-3-7B-Instruct-NVFP4-1M

NVFP4-quantized version of [allenai/Olmo-3-7B-Instruct](https://huggingface.co/allenai/Olmo-3-7B-Instruct) with extended 1M-token context support via linear RoPE scaling.

## Model Description

This model is an NVFP4 (4-bit floating point) quantization of OLMo-3-7B-Instruct, optimized for NVIDIA DGX Spark systems with Blackwell GB10 GPUs; Ada Lovelace GPUs are also supported. The quantization was produced with NVIDIA's ModelOpt library and uses two-level scaling: an FP8 (E4M3) scale per block plus an FP32 global scale per tensor.
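
The two-level scheme can be illustrated with a toy sketch. This is not ModelOpt's actual implementation; the block size of 16 and the E2M1 magnitude grid follow the NVFP4 format, everything else is simplified:

```python
# Toy illustration of NVFP4-style two-level scaling (NOT ModelOpt's code):
# each block of 16 weights gets its own scale (stored as FP8 E4M3 in the
# real format), and a single FP32 scale covers the whole tensor. Values
# snap to the nearest representable FP4 (E2M1) magnitude.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 levels

def quantize_block(block, global_scale=1.0):
    """Quantize one block of 16 floats; return FP4 values and the block scale."""
    assert len(block) == 16
    amax = max(abs(v) for v in block)
    block_scale = (amax / 6.0) / global_scale or 1.0  # map block max onto FP4 max (6.0)
    quantized = []
    for v in block:
        mag = abs(v) / (block_scale * global_scale)
        nearest = min(FP4_MAGNITUDES, key=lambda level: abs(level - mag))
        quantized.append(nearest if v >= 0 else -nearest)
    return quantized, block_scale

def dequantize_block(quantized, block_scale, global_scale=1.0):
    """Recover approximate weights from FP4 values and the two scales."""
    return [q * block_scale * global_scale for q in quantized]
```

Keeping one scale per 16 values is what lets the format track local weight magnitudes despite spending only 4 bits per weight.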

### Key Features

- **Base Model:** allenai/Olmo-3-7B-Instruct (7.3B parameters)
- **Quantization Format:** NVFP4 with `group_size=16`
- **Context Length:** 1,048,576 tokens (1M) via linear RoPE scaling
- **Model Size:** 5.30 GB (64% reduction from 14.60 GB)
- **GPU Memory:** ~5.23 GiB (64% reduction)

## Performance

| Metric | Original | Quantized | Improvement |
|--------|----------|-----------|-------------|
| Model size | 14.60 GB | 5.30 GB | 64% reduction |
| GPU memory | 14.6 GiB | 5.23 GiB | 64% reduction |
| Context length | 65,536 | 1,048,576 | 16x increase |
| Inference speed | - | 31-35 tok/s | - |

## Usage

**Important:** This model requires vLLM with ModelOpt quantization support. It cannot be loaded with standard transformers.

### vLLM Server Deployment

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M \
  --quantization modelopt \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len 200000 \
  --served-model-name 'OLMo-3-7B-NVFP4' \
  --host 0.0.0.0 \
  --port 8000
```
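
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using only the Python standard library; the base URL, port, and served model name mirror the flags above, so adjust them if you launched vLLM differently:

```python
# Minimal client for the vLLM OpenAI-compatible server started above.
# The base URL and model name come from the server flags shown earlier,
# not from anything this model itself requires.
import json
import urllib.request

def build_chat_request(model, user_message, temperature=0.6, max_tokens=512):
    """Payload shape expected by the /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(base_url, payload):
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    payload = build_chat_request("OLMo-3-7B-NVFP4", "What is NVFP4 quantization?")
    print(chat("http://localhost:8000", payload))
```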

### Python Usage with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=200000,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

prompts = ["What is artificial intelligence?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

## Requirements

- **GPU:** NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Blackwell)
- **vLLM:** Latest version with ModelOpt support
- **Dependencies:** `pip install vllm transformers torchao`

## Quantization Details

- **Algorithm:** NVFP4 (4-bit floating point)
- **Calibration Dataset:** allenai/c4 (2048 samples)
- **Calibration Length:** 2048 tokens per sample
- **Tool:** NVIDIA ModelOpt 0.39.0
- **Group Size:** 16
- **Excluded Layers:** `lm_head`

## Context Extension

The context was extended from 65,536 to 1,048,576 tokens using linear RoPE scaling:

- **Scaling Factor:** 16x
- **rope_theta:** 50,000,000
- **rope_scaling:** `{"type": "linear", "factor": 16.0}`
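
With linear scaling (position interpolation), each position index is divided by the factor before the rotary angles are computed, so the extended window maps back onto the positional range the model was trained on. A minimal sketch using the `rope_theta` and factor above; the `head_dim` of 64 is an assumed value, not one read from this model's config:

```python
def rope_angles(position, head_dim=64, theta=50_000_000.0, factor=16.0):
    """Rotary angles at one position; linear scaling divides the position id."""
    scaled = position / factor  # linear RoPE scaling: compress positions by 16x
    return [scaled / theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

# Position 1,048,576 under 16x scaling produces exactly the angles the
# unscaled model would compute at position 65,536:
assert rope_angles(1_048_576) == rope_angles(65_536, factor=1.0)
```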

Note: the actual usable context depends on available GPU memory. On a 120 GB GPU at 95% utilization, approximately 200,000 tokens can be held in the KV cache.
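
That ~200,000-token figure can be reproduced with a back-of-the-envelope estimate. The layer count, KV width, and FP16 cache dtype below are assumptions for a 7B-class model, not values read from this model's config:

```python
def kv_cache_tokens(free_bytes, num_layers=32, kv_dim=4096, dtype_bytes=2):
    """Tokens fitting in the KV cache: 2 tensors (K and V) per layer per token."""
    bytes_per_token = 2 * num_layers * kv_dim * dtype_bytes  # 512 KiB here
    return free_bytes // bytes_per_token

# ~120 GB at 95% utilization, minus ~5.3 GB of NVFP4 weights:
free = int(120e9 * 0.95) - int(5.3e9)
print(kv_cache_tokens(free))  # on the order of 200,000 tokens
```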

## Architecture Compatibility

For vLLM compatibility, the model uses:

- **Architecture:** Olmo2ForCausalLM
- **Model Type:** olmo2

This mapping allows vLLM to properly load the OLMo-3 architecture.
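
Concretely, this corresponds to fields like the following in the model's `config.json` (an illustrative excerpt assembled from the values stated in this card, not the full file):

```json
{
  "architectures": ["Olmo2ForCausalLM"],
  "model_type": "olmo2",
  "max_position_embeddings": 1048576,
  "rope_theta": 50000000,
  "rope_scaling": {
    "type": "linear",
    "factor": 16.0
  }
}
```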

## Limitations

- Requires vLLM with the `--quantization modelopt` flag
- Cannot be loaded with standard transformers
- Requires an NVIDIA GPU with FP4 support (Ada Lovelace or newer)
- Maximum usable context is limited by the GPU memory available for the KV cache

## Intended Use

- Long-context instruction following and chat
- Document analysis and summarization
- Code generation and review
- Research and educational purposes

## License

Apache 2.0 (inherited from the base model)

## Citation

```bibtex
@misc{olmo3-nvfp4-1m,
  author       = {Ex0bit},
  title        = {OLMo-3-7B-Instruct-NVFP4-1M: NVFP4 Quantized OLMo-3 with 1M Context},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M}}
}
```

## Acknowledgments

- Base model by [Allen Institute for AI (Ai2)](https://allenai.org/)
- Quantization using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- Inference powered by [vLLM](https://github.com/vllm-project/vllm)