Loading the non-quantized model in browser

by nbolton04 - opened Feb 19, 2024

edited Feb 19, 2024

My workflow is to embed the documents on GPU. I tried using the quantized model in Python on GPU, but performance dropped significantly, and it appears to sit at over 1000% CPU usage even with a batch size of 1. My next step was to load the non-quantized model via transformers.js, since I would rather have client-side browser inference be slow than have the initial processing take that much longer. However, when I do that I get an error. Do you have suggestions or examples of how to load the non-quantized model in the browser?
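A minimal sketch of requesting the non-quantized weights, assuming Transformers.js v2 (the `@xenova/transformers` package), where `pipeline()` accepts a `quantized` option that defaults to `true`. The model ID in the usage comment is a hypothetical placeholder, and the helper names (`fullPrecisionOptions`, `loadFullPrecisionExtractor`) are illustrative only. This is not necessarily the call that produced the error.

```typescript
// Sketch only: assumes Transformers.js v2 (`@xenova/transformers`).
// The subset of `pipeline()` options that matters for this question.
interface LoadOptions {
  quantized: boolean;
}

// Shape of the `pipeline` factory, narrowed to what this sketch needs.
type PipelineFactory<T> = (
  task: string,
  model: string,
  options: LoadOptions,
) => Promise<T>;

// Build options that request the full-precision weights.
// With `quantized: true` (the v2 default) the library fetches
// `onnx/model_quantized.onnx`; with `false` it fetches `onnx/model.onnx`,
// so that file must exist in the model repository.
function fullPrecisionOptions(): LoadOptions {
  return { quantized: false };
}

// Hypothetical helper: create a feature-extraction pipeline backed by the
// non-quantized model. The factory is passed in so the helper itself has
// no hard dependency on the library.
async function loadFullPrecisionExtractor<T>(
  pipelineFn: PipelineFactory<T>,
  modelId: string,
): Promise<T> {
  return pipelineFn('feature-extraction', modelId, fullPrecisionOptions());
}

// Usage in the browser (model ID is a placeholder, not a real repo):
//   import { pipeline } from '@xenova/transformers';
//   const extractor = await loadFullPrecisionExtractor(pipeline, 'your-org/your-embedding-model');
//   const output = await extractor('some text', { pooling: 'mean', normalize: true });
```

If the repository only ships a quantized ONNX file, a load with `quantized: false` would fail to find `onnx/model.onnx`, which is one possible source of an error in this setup.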
