---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- olmo
- nvfp4
- quantized
- long-context
- vllm
- modelopt
datasets:
- allenai/c4
base_model: allenai/Olmo-3-7B-Instruct
pipeline_tag: text-generation
model-index:
- name: OLMo-3-7B-Instruct-NVFP4-1M
  results: []
---

# OLMo-3-7B-Instruct-NVFP4-1M

NVFP4-quantized version of [allenai/Olmo-3-7B-Instruct](https://huggingface.co/allenai/Olmo-3-7B-Instruct) with extended 1M-token context support via linear RoPE scaling.

## Model Description

This model is an NVFP4 (4-bit floating point) quantization of OLMo-3-7B-Instruct, optimized for NVIDIA DGX Spark systems with Blackwell GB10 GPUs; Ada Lovelace GPUs are also supported. The quantization was produced with NVIDIA's ModelOpt library and uses two-level scaling: an FP8 (E4M3) scale per block plus an FP32 global scale per tensor.
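
The two-level scheme can be illustrated with a toy sketch. This is not ModelOpt's actual implementation; the block size of 16 and the E2M1 magnitude grid follow the NVFP4 format, everything else is simplified:

```python
# Toy illustration of NVFP4-style two-level scaling (NOT ModelOpt's code):
# each block of 16 weights gets its own scale (stored as FP8 E4M3 in the
# real format), and a single FP32 scale covers the whole tensor. Values
# snap to the nearest representable FP4 (E2M1) magnitude.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 levels

def quantize_block(block, global_scale=1.0):
    """Quantize one block of 16 floats; return FP4 values and the block scale."""
    assert len(block) == 16
    amax = max(abs(v) for v in block)
    block_scale = (amax / 6.0) / global_scale or 1.0  # map block max onto FP4 max (6.0)
    quantized = []
    for v in block:
        mag = abs(v) / (block_scale * global_scale)
        nearest = min(FP4_MAGNITUDES, key=lambda level: abs(level - mag))
        quantized.append(nearest if v >= 0 else -nearest)
    return quantized, block_scale

def dequantize_block(quantized, block_scale, global_scale=1.0):
    """Recover approximate weights from FP4 values and the two scales."""
    return [q * block_scale * global_scale for q in quantized]
```

Keeping one scale per 16 values is what lets the format track local weight magnitudes despite spending only 4 bits per weight.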

### Key Features

- **Base Model:** allenai/Olmo-3-7B-Instruct (7.3B parameters)
- **Quantization Format:** NVFP4 with `group_size=16`
- **Context Length:** 1,048,576 tokens (1M) via linear RoPE scaling
- **Model Size:** 5.30 GB (64% reduction from 14.60 GB)
- **GPU Memory:** ~5.23 GiB (64% reduction)

## Performance

| Metric | Original | Quantized | Improvement |
|--------|----------|-----------|-------------|
| Model size | 14.60 GB | 5.30 GB | 64% reduction |
| GPU memory | 14.6 GiB | 5.23 GiB | 64% reduction |
| Context length | 65,536 | 1,048,576 | 16x increase |
| Inference speed | - | 31-35 tok/s | - |

## Usage

**Important:** This model requires vLLM with ModelOpt quantization support. It cannot be loaded with standard transformers.

### vLLM Server Deployment

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M \
  --quantization modelopt \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len 200000 \
  --served-model-name 'OLMo-3-7B-NVFP4' \
  --host 0.0.0.0 \
  --port 8000
```
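
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using only the Python standard library; the base URL, port, and served model name mirror the flags above, so adjust them if you launched vLLM differently:

```python
# Minimal client for the vLLM OpenAI-compatible server started above.
# The base URL and model name come from the server flags shown earlier,
# not from anything this model itself requires.
import json
import urllib.request

def build_chat_request(model, user_message, temperature=0.6, max_tokens=512):
    """Payload shape expected by the /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(base_url, payload):
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    payload = build_chat_request("OLMo-3-7B-NVFP4", "What is NVFP4 quantization?")
    print(chat("http://localhost:8000", payload))
```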

### Python Usage with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=200000,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

prompts = ["What is artificial intelligence?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

## Requirements

- **GPU:** NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Blackwell)
- **vLLM:** Latest version with ModelOpt support
- **Dependencies:** `pip install vllm transformers torchao`

## Quantization Details

- **Algorithm:** NVFP4 (4-bit floating point)
- **Calibration Dataset:** allenai/c4 (2048 samples)
- **Calibration Length:** 2048 tokens per sample
- **Tool:** NVIDIA ModelOpt 0.39.0
- **Group Size:** 16
- **Excluded Layers:** `lm_head`

## Context Extension

The context was extended from 65,536 to 1,048,576 tokens using linear RoPE scaling:

- **Scaling Factor:** 16x
- **rope_theta:** 50,000,000
- **rope_scaling:** `{"type": "linear", "factor": 16.0}`
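
With linear scaling (position interpolation), each position index is divided by the factor before the rotary angles are computed, so the extended window maps back onto the positional range the model was trained on. A minimal sketch using the `rope_theta` and factor above; the `head_dim` of 64 is an assumed value, not one read from this model's config:

```python
def rope_angles(position, head_dim=64, theta=50_000_000.0, factor=16.0):
    """Rotary angles at one position; linear scaling divides the position id."""
    scaled = position / factor  # linear RoPE scaling: compress positions by 16x
    return [scaled / theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

# Position 1,048,576 under 16x scaling produces exactly the angles the
# unscaled model would compute at position 65,536:
assert rope_angles(1_048_576) == rope_angles(65_536, factor=1.0)
```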

Note: the actual usable context depends on available GPU memory. On a 120 GB GPU at 95% utilization, approximately 200,000 tokens can be held in the KV cache.
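
That ~200,000-token figure can be reproduced with a back-of-the-envelope estimate. The layer count, KV width, and FP16 cache dtype below are assumptions for a 7B-class model, not values read from this model's config:

```python
def kv_cache_tokens(free_bytes, num_layers=32, kv_dim=4096, dtype_bytes=2):
    """Tokens fitting in the KV cache: 2 tensors (K and V) per layer per token."""
    bytes_per_token = 2 * num_layers * kv_dim * dtype_bytes  # 512 KiB here
    return free_bytes // bytes_per_token

# ~120 GB at 95% utilization, minus ~5.3 GB of NVFP4 weights:
free = int(120e9 * 0.95) - int(5.3e9)
print(kv_cache_tokens(free))  # on the order of 200,000 tokens
```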

## Architecture Compatibility

For vLLM compatibility, the model uses:

- **Architecture:** Olmo2ForCausalLM
- **Model Type:** olmo2

This mapping allows vLLM to properly load the OLMo-3 architecture.
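
Concretely, this corresponds to fields like the following in the model's `config.json` (an illustrative excerpt assembled from the values stated in this card, not the full file):

```json
{
  "architectures": ["Olmo2ForCausalLM"],
  "model_type": "olmo2",
  "max_position_embeddings": 1048576,
  "rope_theta": 50000000,
  "rope_scaling": {
    "type": "linear",
    "factor": 16.0
  }
}
```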

## Limitations

- Requires vLLM with the `--quantization modelopt` flag
- Cannot be loaded with standard transformers
- Requires an NVIDIA GPU with FP4 support (Ada Lovelace or newer)
- Maximum usable context is limited by the GPU memory available for the KV cache

## Intended Use

- Long-context instruction following and chat
- Document analysis and summarization
- Code generation and review
- Research and educational purposes

## License

Apache 2.0 (inherited from the base model)

## Citation

```bibtex
@misc{olmo3-nvfp4-1m,
  author       = {Ex0bit},
  title        = {OLMo-3-7B-Instruct-NVFP4-1M: NVFP4 Quantized OLMo-3 with 1M Context},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M}}
}
```

## Acknowledgments

- Base model by [Allen Institute for AI (Ai2)](https://allenai.org/)
- Quantization using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- Inference powered by [vLLM](https://github.com/vllm-project/vllm)