---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- olmo
- nvfp4
- quantized
- long-context
- vllm
- modelopt
datasets:
- allenai/c4
base_model: allenai/Olmo-3-7B-Instruct
pipeline_tag: text-generation
model-index:
- name: OLMo-3-7B-Instruct-NVFP4-1M
  results: []
---

# OLMo-3-7B-Instruct-NVFP4-1M

NVFP4 quantized version of [allenai/Olmo-3-7B-Instruct](https://huggingface.co/allenai/Olmo-3-7B-Instruct) with extended 1M token context support via linear RoPE scaling.

## Model Description

This model is an NVFP4 (4-bit floating point) quantized version of OLMo-3-7B-Instruct, targeting NVIDIA DGX Spark systems with Blackwell GB10 GPUs, with support for Ada Lovelace hardware as well. Quantization was performed with NVIDIA's ModelOpt library using two-level scaling: a per-block FP8 (E4M3) scale combined with an FP32 per-tensor global scale.

### Key Features

- **Base Model:** allenai/Olmo-3-7B-Instruct (7.3B parameters)
- **Quantization Format:** NVFP4 with group_size=16
- **Context Length:** 1,048,576 tokens (1M) via linear RoPE scaling
- **Model Size:** 5.30 GB (64% reduction from 14.60 GB)
- **GPU Memory:** ~5.23 GiB (64% reduction)

## Performance

| Metric | Original | Quantized | Improvement |
|--------|----------|-----------|-------------|
| Model Size | 14.60 GB | 5.30 GB | 64% reduction |
| GPU Memory | ~14.6 GiB | ~5.23 GiB | 64% reduction |
| Context Length | 65,536 | 1,048,576 | 16x increase |
| Inference Speed | - | 31-35 tok/s | - |

## Usage

**Important:** This model requires vLLM with ModelOpt quantization support. It cannot be loaded with standard transformers.

### vLLM Server Deployment

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M \
    --quantization modelopt \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --max-model-len 200000 \
    --served-model-name 'OLMo-3-7B-NVFP4' \
    --host 0.0.0.0 \
    --port 8000
```
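Once the server is running, it exposes the standard OpenAI-compatible API. A quick smoke test (the model name must match `--served-model-name` above; host and port are the defaults from the command):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OLMo-3-7B-NVFP4",
    "messages": [{"role": "user", "content": "What is artificial intelligence?"}],
    "temperature": 0.6,
    "max_tokens": 256
  }'
```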

### Python Usage with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=200000
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

prompts = ["What is artificial intelligence?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

## Requirements

- **GPU:** NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Blackwell)
- **vLLM:** Latest version with ModelOpt support
- **Dependencies:** `pip install vllm transformers torchao`

## Quantization Details

- **Algorithm:** NVFP4 (4-bit floating point)
- **Calibration Dataset:** allenai/c4 (2048 samples)
- **Calibration Length:** 2048 tokens per sample
- **Tool:** NVIDIA ModelOpt 0.39.0
- **Group Size:** 16
- **Excluded Layers:** lm_head
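The two-level scaling scheme (a per-block E4M3 scale times an FP32 global scale, with group_size=16) can be sketched in plain NumPy. This is an illustrative round-trip, not ModelOpt's actual kernel: the real format packs FP4 codes and stores block scales as true E4M3 values, and rounding details may differ.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; the full grid is symmetric around 0.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4_LEVELS[:0:-1], FP4_LEVELS])  # 15 signed values

def nvfp4_roundtrip(weights, group_size=16):
    """Quantize then dequantize with two-level scaling.

    Each block of `group_size` values gets its own scale (stored as FP8
    E4M3 in the real format; kept as a float here for clarity), and the
    whole tensor shares one FP32 global scale.
    """
    # Global scale chosen so every block scale fits in E4M3 (max ~448).
    global_scale = np.max(np.abs(weights)) / (6.0 * 448.0)
    out = np.empty_like(weights)
    for start in range(0, len(weights), group_size):
        block = weights[start:start + group_size]
        block_scale = np.max(np.abs(block)) / (6.0 * global_scale)  # -> E4M3
        scale = block_scale * global_scale  # effective per-block step size
        # Round each value to the nearest scaled FP4 grid point.
        idx = np.abs(block[:, None] - GRID[None, :] * scale).argmin(axis=1)
        out[start:start + group_size] = GRID[idx] * scale
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
w_hat = nvfp4_roundtrip(w)
```

Because the per-block scale maps each block's largest magnitude to the FP4 maximum (6.0), the worst-case absolute error per element stays below one sixth of the block's peak value.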

## Context Extension

The context window was extended from 65,536 to 1,048,576 tokens using linear RoPE scaling:

- **Scaling Factor:** 16x
- **rope_theta:** 50,000,000
- **rope_scaling:** `{"type": "linear", "factor": 16.0}`

Note: the usable context depends on GPU memory available for the KV cache. On a 120 GB GPU at 95% utilization, roughly 200,000 tokens fit in the KV cache alongside the 5.3 GB of quantized weights.
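A back-of-the-envelope version of that estimate, assuming a 7B OLMo-style architecture with 32 layers, 32 KV heads, and head_dim 128 (no GQA) and an fp16/bf16 KV cache; these numbers are assumptions, so check the model's `config.json`:

```python
# Assumed architecture parameters (verify against config.json).
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2  # fp16/bf16 KV cache

# K and V are each cached per layer per head, hence the factor of 2.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

gpu_bytes = 120e9 * 0.95     # 120 GB GPU at 95% utilization
weight_bytes = 5.30e9        # NVFP4 quantized weights
budget = gpu_bytes - weight_bytes

max_tokens = int(budget / kv_bytes_per_token)
print(kv_bytes_per_token, max_tokens)  # ~0.5 MiB/token, ~200k tokens
```

This ignores activation memory and vLLM's own overhead, so treat the result as an upper bound rather than a guarantee.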

## Architecture Compatibility

For vLLM compatibility, the model uses:
- **Architecture:** Olmo2ForCausalLM
- **Model Type:** olmo2

This mapping allows vLLM to properly load the OLMo-3 architecture.

## Limitations

- Requires vLLM with `--quantization modelopt` flag
- Cannot be loaded with standard transformers
- Requires NVIDIA GPU with FP4 support (Ada Lovelace or newer)
- Maximum usable context limited by GPU memory for KV cache

## Intended Use

- Long-context instruction following and chat
- Document analysis and summarization
- Code generation and review
- Research and educational purposes

## License

Apache 2.0 (inherited from base model)

## Citation

```bibtex
@misc{olmo3-nvfp4-1m,
  author = {Ex0bit},
  title = {OLMo-3-7B-Instruct-NVFP4-1M: NVFP4 Quantized OLMo-3 with 1M Context},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M}}
}
```

## Acknowledgments

- Base model by [Allen Institute for AI (Ai2)](https://allenai.org/)
- Quantization using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- Inference powered by [vLLM](https://github.com/vllm-project/vllm)