---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- olmo
- nvfp4
- quantized
- long-context
- vllm
- modelopt
datasets:
- allenai/c4
base_model: allenai/Olmo-3-7B-Instruct
pipeline_tag: text-generation
model-index:
- name: OLMo-3-7B-Instruct-NVFP4-1M
results: []
---
# OLMo-3-7B-Instruct-NVFP4-1M
NVFP4 quantized version of [allenai/Olmo-3-7B-Instruct](https://huggingface.co/allenai/Olmo-3-7B-Instruct) with extended 1M token context support via linear RoPE scaling.
## Model Description
This model is an NVFP4 (4-bit floating point) quantization of OLMo-3-7B-Instruct, targeting NVIDIA DGX Spark systems with Blackwell GB10 GPUs, with additional support for the Ada Lovelace architecture. The quantization was produced with NVIDIA's ModelOpt library and uses two-level scaling: a per-block FP8 (E4M3) scale plus a global FP32 scale.
### Key Features
- **Base Model:** allenai/Olmo-3-7B-Instruct (7.3B parameters)
- **Quantization Format:** NVFP4 with group_size=16
- **Context Length:** 1,048,576 tokens (1M) via linear RoPE scaling
- **Model Size:** 5.30 GB (64% reduction from 14.60 GB)
- **GPU Memory:** ~5.23 GiB (64% reduction)
## Performance
| Metric | Original | Quantized | Improvement |
|--------|----------|-----------|-------------|
| Model Size | 14.60 GB | 5.30 GB | 64% reduction |
| GPU Memory | 14.6 GB | 5.23 GiB | 64% reduction |
| Context Length | 65,536 | 1,048,576 | 16x increase |
| Inference Speed | - | 31-35 tok/s | - |
## Usage
**Important:** This model requires vLLM with ModelOpt quantization support. It cannot be loaded with standard transformers.
### vLLM Server Deployment
```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M \
    --quantization modelopt \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --max-model-len 200000 \
    --served-model-name 'OLMo-3-7B-NVFP4' \
    --host 0.0.0.0 \
    --port 8000
```
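Once the server is up, it exposes the standard OpenAI-compatible REST API. A minimal client-side sketch using only the Python standard library (no `openai` package required); the host, port, and served model name here simply mirror the flags in the launch command above:

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "OLMo-3-7B-NVFP4") -> dict:
    # The "model" field must match the --served-model-name server flag.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "max_tokens": 512,
    }

def chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    # POST to the OpenAI-compatible /chat/completions endpoint.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("What is artificial intelligence?")
# reply = chat(payload)  # requires the vLLM server above to be running
# print(reply["choices"][0]["message"]["content"])
```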
### Python Usage with vLLM
```python
from vllm import LLM, SamplingParams
llm = LLM(
    model="Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=200000,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
prompts = ["What is artificial intelligence?"]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
## Requirements
- **GPU:** NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Blackwell)
- **vLLM:** Latest version with ModelOpt support
- **Dependencies:** `pip install vllm transformers torchao`
## Quantization Details
- **Algorithm:** NVFP4 (4-bit floating point)
- **Calibration Dataset:** allenai/c4 (2048 samples)
- **Calibration Length:** 2048 tokens per sample
- **Tool:** NVIDIA ModelOpt 0.39.0
- **Group Size:** 16
- **Excluded Layers:** lm_head
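The per-block scheme can be illustrated with a simplified sketch. This is a toy model only: real NVFP4 encodes each block scale in FP8 (E4M3) under a global FP32 scale and packs two 4-bit codes per byte, whereas this sketch just shows the E2M1 value grid and one scale per group of 16 values:

```python
# Representable magnitudes of an FP4 E2M1 value, plus their negatives.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted({v for x in E2M1 for v in (x, -x)})

def quantize_block(block):
    # One scale per group of up to 16 values (group_size=16); the
    # scale maps the block's largest magnitude onto the grid max (6.0).
    assert len(block) <= 16
    scale = max(abs(x) for x in block) / 6.0 or 1.0
    q = [min(GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

q, s = quantize_block([6.0, -3.0, 0.8, 0.0])
# dequantize_block(q, s) → [6.0, -3.0, 1.0, 0.0]
```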
## Context Extension
The context window was extended from 65,536 to 1,048,576 tokens using linear RoPE scaling:
- **Scaling Factor:** 16x
- **rope_theta:** 50,000,000
- **rope_scaling:** `{"type": "linear", "factor": 16.0}`
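The mechanism behind the `factor` parameter can be sketched in a few lines: linear RoPE scaling divides the position index by the factor before the rotary angles are computed, so positions up to 16x the trained window map back into it (`dim=8` below is an arbitrary toy head dimension, not the model's real one):

```python
def rope_angles(position, dim=8, theta=50_000_000.0, factor=1.0):
    # Linear scaling: compress the position index before computing
    # the per-dimension rotary angles.
    pos = position / factor
    return [pos / theta ** (2 * i / dim) for i in range(dim // 2)]

# With factor=16, position 1600 yields the same angles the base model
# saw at position 100.
assert rope_angles(1600, factor=16.0) == rope_angles(100)
```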
Note: the actual usable context depends on available GPU memory. With a 120 GB GPU at 95% utilization, roughly 200,000 tokens fit in the KV cache.
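The ~200,000-token figure follows from a back-of-envelope KV-cache calculation. The model shape used here (32 layers, 32 KV heads, head dimension 128, fp16 cache) is an assumption for illustration, not read from the actual config:

```python
def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    # Factor of 2 covers the separate K and V caches.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

GIB = 1024 ** 3
# 95% of a 120 GB GPU, minus ~5.23 GiB of quantized weights.
budget_bytes = 120e9 * 0.95 - 5.23 * GIB
max_tokens = int(budget_bytes / kv_bytes_per_token())
# → roughly 206,000 tokens, in line with the ~200,000 quoted above
```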
## Architecture Compatibility
For vLLM compatibility, the model uses:
- **Architecture:** Olmo2ForCausalLM
- **Model Type:** olmo2
This mapping allows vLLM to properly load the OLMo-3 architecture.
## Limitations
- Requires vLLM with `--quantization modelopt` flag
- Cannot be loaded with standard transformers
- Requires NVIDIA GPU with FP4 support (Ada Lovelace or newer)
- Maximum usable context limited by GPU memory for KV cache
## Intended Use
- Long-context instruction following and chat
- Document analysis and summarization
- Code generation and review
- Research and educational purposes
## License
Apache 2.0 (inherited from base model)
## Citation
```bibtex
@misc{olmo3-nvfp4-1m,
author = {Ex0bit},
title = {OLMo-3-7B-Instruct-NVFP4-1M: NVFP4 Quantized OLMo-3 with 1M Context},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M}}
}
```
## Acknowledgments
- Base model by [Allen Institute for AI (Ai2)](https://allenai.org/)
- Quantization using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- Inference powered by [vLLM](https://github.com/vllm-project/vllm)