fp16 models are actually fp32?

#7 opened by ShifaMS

I tried running the ONNX model on GPU, and the inference time is insanely high: about 25 seconds for a text only three sentences long. Has anyone already run the fp16 or q8 model on GPU with low latency and could share working code?
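One way to check whether the "fp16" export really contains fp32 weights (the question in the thread title) is to inspect the initializer data types in the ONNX graph. A minimal sketch, where the model path is a placeholder:

```python
import onnx
from collections import Counter

# Placeholder path; point this at the exported model in question.
model = onnx.load("model_fp16.onnx")

# Count the element types of the weight tensors (initializers).
# TensorProto.FLOAT is fp32, TensorProto.FLOAT16 is fp16.
counts = Counter(init.data_type for init in model.graph.initializer)
print("fp32 tensors:", counts.get(onnx.TensorProto.FLOAT, 0))
print("fp16 tensors:", counts.get(onnx.TensorProto.FLOAT16, 0))
```

If the counts are dominated by fp32 tensors, the file is effectively an fp32 model regardless of its name, which by itself can explain the lack of any fp16 speedup.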

Yep, you are exactly right. I can help you; DM me on Reddit @cmoney1113.
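For the GPU latency question, here is a minimal sketch of converting the graph to genuine fp16 with onnxconverter-common and running it through ONNX Runtime's CUDAExecutionProvider. The file paths, the tokenizer name, and the input names (input_ids / attention_mask) are assumptions and may differ for this particular model; also note that the first call includes session and CUDA warm-up, so time a second call when benchmarking.

```python
import numpy as np
import onnx
import onnxruntime as ort
from onnxconverter_common import float16  # pip install onnxconverter-common
from transformers import AutoTokenizer

# One-time, offline: convert the fp32 graph to true fp16.
model = onnx.load("model.onnx")                       # placeholder path
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")

# Create a GPU session; ONNX Runtime falls back to CPU if CUDA is unavailable.
sess = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("Active providers:", sess.get_providers())      # confirm CUDA is actually used

# Tokenizer name is a placeholder; use the checkpoint the ONNX export came from.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-model")
enc = tokenizer(
    ["First sentence.", "Second sentence.", "Third sentence."],
    padding=True,
    return_tensors="np",
)

# Input names depend on the export; list them with [i.name for i in sess.get_inputs()].
outputs = sess.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
})
print(outputs[0].shape)
```

If CUDAExecutionProvider fails to load (for example because CUDA or cuDNN is missing), ONNX Runtime silently falls back to CPU, which could account for a 25-second latency; checking `sess.get_providers()` rules that out.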
