16-bit version?

#13

by saattrupdan - opened Feb 9, 2024

Feb 9, 2024

Do you have plans to upload a 16bit version of your model? That would make it a lot more accessible for inference on smaller GPUs.

shanearora

Feb 9, 2024

@dirkgr Can correct me but I am not aware of such plans. You should be able to load the model and then call, say,model = model.bfloat16() to convert the weights to 16 bits. You may need to load the model on the CPU, downcast to 16 bits, and then move the model to GPU. An alternative with a higher memory requirements (that we used while training the model) is to use torch.autocast with a 16 bit type.

saattrupdan

Feb 9, 2024

@shanearora I completely get that, but if I’m loading in the model with vLLM then I get OOM errors before any conversion can happen. I guess I could convert it and upload it myself, but it would just be a bit more official if you all had a 16bit version uploaded. Same thing with quantised and GGUF versions for that matter, as these are required by other applications like llama.cpp and LM Studio. But it’s up to you - feel free to close this issue if you’re not planning on it 🙂

shanearora

Feb 9, 2024

@akshitab Do you know about OLMo plans in relation to vLLM?

akshitab

Ai2 org Feb 10, 2024

vLLM integration for OLMo is currently in progress here: https://github.com/vllm-project/vllm/issues/2763

baileyk changed discussion status to closed Jul 17, 2025

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment