fp16 models are actually fp32?

#7 opened by ShifaMS

I tried running the ONNX model on GPU, and the inference time is insanely high: about 25 seconds for a text only three sentences long. Has anyone already run the fp16 or q8 model on GPU with low latency and could share working code?
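One way to check whether the "fp16" export really contains fp32 weights (the question in the thread title) is to inspect the initializer data types in the ONNX graph. A minimal sketch, where the model path is a placeholder:

```python
import onnx
from collections import Counter

# Placeholder path; point this at the exported model in question.
model = onnx.load("model_fp16.onnx")

# Count the element types of the weight tensors (initializers).
# TensorProto.FLOAT is fp32, TensorProto.FLOAT16 is fp16.
counts = Counter(init.data_type for init in model.graph.initializer)
print("fp32 tensors:", counts.get(onnx.TensorProto.FLOAT, 0))
print("fp16 tensors:", counts.get(onnx.TensorProto.FLOAT16, 0))
```

If the counts are dominated by fp32 tensors, the file is effectively an fp32 model regardless of its name, which by itself can explain the lack of any fp16 speedup.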

Yep, you are exactly right. I can help you; DM me on Reddit @cmoney1113.
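For the GPU latency question, here is a minimal sketch of converting the graph to genuine fp16 with onnxconverter-common and running it through ONNX Runtime's CUDAExecutionProvider. The file paths, the tokenizer name, and the input names (input_ids / attention_mask) are assumptions and may differ for this particular model; also note that the first call includes session and CUDA warm-up, so time a second call when benchmarking.

```python
import numpy as np
import onnx
import onnxruntime as ort
from onnxconverter_common import float16  # pip install onnxconverter-common
from transformers import AutoTokenizer

# One-time, offline: convert the fp32 graph to true fp16.
model = onnx.load("model.onnx")                       # placeholder path
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")

# Create a GPU session; ONNX Runtime falls back to CPU if CUDA is unavailable.
sess = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("Active providers:", sess.get_providers())      # confirm CUDA is actually used

# Tokenizer name is a placeholder; use the checkpoint the ONNX export came from.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-model")
enc = tokenizer(
    ["First sentence.", "Second sentence.", "Third sentence."],
    padding=True,
    return_tensors="np",
)

# Input names depend on the export; list them with [i.name for i in sess.get_inputs()].
outputs = sess.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
})
print(outputs[0].shape)
```

If CUDAExecutionProvider fails to load (for example because CUDA or cuDNN is missing), ONNX Runtime silently falls back to CPU, which could account for a 25-second latency; checking `sess.get_providers()` rules that out.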
