---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- olmo
- nvfp4
- quantized
- long-context
- vllm
- modelopt
datasets:
- allenai/c4
base_model: allenai/Olmo-3-7B-Instruct
pipeline_tag: text-generation
model-index:
- name: OLMo-3-7B-Instruct-NVFP4-1M
  results: []
---

# OLMo-3-7B-Instruct-NVFP4-1M

NVFP4-quantized version of [allenai/Olmo-3-7B-Instruct](https://huggingface.co/allenai/Olmo-3-7B-Instruct) with extended 1M-token context support via linear RoPE scaling.

## Model Description

This model is the NVFP4 (4-bit floating point) quantized version of OLMo-3-7B-Instruct. It is optimized for NVIDIA DGX Spark systems (Blackwell GB10 GPU) and also supports Ada Lovelace GPUs. Quantization was done with NVIDIA's ModelOpt library, which uses two-level scaling: an FP8 (E4M3) scale for each 16-element block plus an FP32 global scale.

### Key Features

- **Base Model:** allenai/Olmo-3-7B-Instruct (7.3B parameters)
- **Quantization Format:** NVFP4 with group_size=16
- **Context Length:** 1,048,576 tokens (1M) via linear RoPE scaling
- **Model Size:** 5.30 GB (64% reduction from 14.60 GB)
- **GPU Memory:** ~5.23 GiB (64% reduction)

## Performance

| Metric | Original | Quantized | Change |
|--------|----------|-----------|--------|
| Model Size | 14.60 GB | 5.30 GB | 64% reduction |
| GPU Memory | 14.6 GB | 5.23 GiB | 64% reduction |
| Context Length | 65,536 | 1,048,576 | 16x increase |
| Inference Speed | - | 31-35 tok/s | - |

## Usage

**Important:** This model requires vLLM with ModelOpt quantization support. It cannot be loaded with standard transformers.
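Because the model only loads through vLLM on supported hardware, it may help to confirm the GPU's compute capability before starting. The sketch below is a hypothetical pre-flight helper: the function names are illustrative and not part of vLLM. It assumes the compute capability 8.9 floor listed under Requirements, and it reads the capability through PyTorch's `torch.cuda.get_device_capability` when PyTorch and a CUDA device are available.

```python
from typing import Optional, Tuple

# Minimum compute capability taken from this card's Requirements section.
MIN_CAPABILITY: Tuple[int, int] = (8, 9)


def meets_min_capability(capability: Tuple[int, int],
                         minimum: Tuple[int, int] = MIN_CAPABILITY) -> bool:
    """Return True if (major, minor) is at least the required capability."""
    # Tuples compare element by element, so (9, 0) >= (8, 9) is True.
    return tuple(capability) >= tuple(minimum)


def detect_capability(device_index: int = 0) -> Optional[Tuple[int, int]]:
    """Return (major, minor) for a CUDA device, or None if it cannot be read."""
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return tuple(torch.cuda.get_device_capability(device_index))


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device detected; cannot verify compute capability.")
    elif meets_min_capability(cap):
        print(f"Compute capability {cap[0]}.{cap[1]} meets the 8.9 minimum.")
    else:
        print(f"Compute capability {cap[0]}.{cap[1]} is below the 8.9 minimum.")
```

This only checks the capability number; it does not verify that the installed vLLM build includes ModelOpt support.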

### vLLM Server Deployment

```bash
# Start an OpenAI-compatible server on port 8000.
# --max-model-len is capped at 200000 so the KV cache fits in GPU memory.
python3 -m vllm.entrypoints.openai.api_server \
  --model Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M \
  --quantization modelopt \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len 200000 \
  --served-model-name 'OLMo-3-7B-NVFP4' \
  --host 0.0.0.0 \
  --port 8000
```

### Python Usage with vLLM

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint through vLLM's ModelOpt path.
llm = LLM(
    model="Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=200000,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
prompts = ["What is artificial intelligence?"]

# generate() returns one result per prompt; print the first completion of each.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

## Requirements

- **GPU:** NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Blackwell)
- **vLLM:** A recent version with ModelOpt quantization support
- **Dependencies:** `pip install vllm transformers torchao`

## Quantization Details

- **Algorithm:** NVFP4 (4-bit floating point)
- **Calibration Dataset:** allenai/c4 (2048 samples)
- **Calibration Length:** 2048 tokens per sample
- **Tool:** NVIDIA ModelOpt 0.39.0
- **Group Size:** 16
- **Excluded Layers:** lm_head

## Context Extension

The context was extended from 65,536 to 1,048,576 tokens using linear RoPE scaling:

- **Scaling Factor:** 16x
- **rope_theta:** 50,000,000
- **rope_scaling:** `{"type": "linear", "factor": 16.0}`

Note: The usable context depends on available GPU memory. On a 120 GB GPU at 95% utilization, the KV cache holds approximately 200,000 tokens.

## Architecture Compatibility

For vLLM compatibility, the model's configuration declares:

- **Architecture:** Olmo2ForCausalLM
- **Model Type:** olmo2

This mapping allows vLLM to load the OLMo-3 architecture through its existing OLMo-2 implementation.
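
The KV-cache figure in the Context Extension note can be sanity-checked with a back-of-the-envelope estimate. The sketch below is a rough approximation, not a measurement. The layer count, KV-head count, head dimension, and 16-bit cache dtype are assumptions that this card does not state, so check them against the model's `config.json`. The estimate also ignores activations and other runtime overhead.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache needed for one token across all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_bytes: float, utilization: float,
                      weight_bytes: float, bytes_per_token: int) -> int:
    """Tokens that fit in the memory left after loading the weights."""
    budget = gpu_bytes * utilization - weight_bytes
    return max(0, int(budget // bytes_per_token))


# ASSUMED architecture values (not from this card): 32 layers,
# 32 KV heads, head dimension 128, 16-bit (2-byte) cache entries.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Values from this card: 120 GB GPU, 95% utilization, ~5.23 GiB of weights.
tokens = max_cached_tokens(
    gpu_bytes=120e9,
    utilization=0.95,
    weight_bytes=5.23 * 2**30,
    bytes_per_token=per_token,
)

print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
print(f"Approximate token capacity: {tokens:,}")
```

Under these assumptions each token costs 0.5 MiB of cache and the capacity comes out slightly above 200,000 tokens, which is consistent with the `--max-model-len 200000` setting used in the examples.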

## Limitations

- Requires vLLM with the `--quantization modelopt` flag
- Cannot be loaded with standard transformers
- Requires an NVIDIA GPU with FP4 support (Ada Lovelace or newer)
- Maximum usable context is limited by the GPU memory available for the KV cache

## Intended Use

- Long-context instruction following and chat
- Document analysis and summarization
- Code generation and review
- Research and educational purposes

## License

Apache 2.0 (inherited from base model)

## Citation

```bibtex
@misc{olmo3-nvfp4-1m,
  author = {Ex0bit},
  title = {OLMo-3-7B-Instruct-NVFP4-1M: NVFP4 Quantized OLMo-3 with 1M Context},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M}}
}
```

## Acknowledgments

- Base model by [Allen Institute for AI (Ai2)](https://allenai.org/)
- Quantization using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- Inference powered by [vLLM](https://github.com/vllm-project/vllm)