Inference takes roughly 3 minutes on a 4090

by macadeliccc - opened Nov 21, 2023

Nov 21, 2023 • edited Nov 22, 2023

Edit: It doesn't fit on a 4090 at all. I had assumed it would, based on every other 7B model, but the demo code wasn't using CUDA because the model didn't fit.

gesoo99

Nov 22, 2023

I made it work on a 3050 Ti Laptop, so it's probably something with the settings.

macadeliccc

Nov 23, 2023

Honestly, that's really weird. I have not had that issue with any other 7B model. Are you explicitly putting the model and tokenizer onto the GPU? If not, then it's likely to just use system memory with the demo code.
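For anyone following along, here is a rough sketch of what explicit GPU placement could look like with PyTorch and Hugging Face `transformers`. It assumes the model loads through `AutoModelForCausalLM`, which this thread does not confirm; `model_id` is a placeholder, and the helper names `pick_device`, `load_on_device`, and `generate` are made up for illustration. Strictly speaking the tokenizer itself stays on the CPU; it is the tokenized input tensors that get moved to the model's device.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch sees a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: nothing can run on the GPU from here.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_on_device(model_id: str, device: str):
    """Load tokenizer and model, then move the model weights to `device`.

    `model_id` is a placeholder; AutoModelForCausalLM is an assumption
    about the model type, not something stated in this thread.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Half precision on the GPU halves the weight footprint versus fp32.
    dtype = torch.float16 if device == "cuda" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    model.to(device)
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 64) -> str:
    """Tokenize on the CPU, move the tensors to the model's device, decode."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `load_on_device(model_id, pick_device())` and then checking `model.device` is a quick way to see whether the weights actually ended up on the GPU or silently stayed in system memory.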

kreouzisv

Nov 23, 2023

@macadeliccc The model is loaded into system memory, not GPU memory; GPU memory handles compute. I am running it with 61 GB of RAM and it occupies roughly 97% of system memory, so you would need something around that to do inference using a 4090.

macadeliccc

Nov 23, 2023

@kreouzisv Thank you. I have just been using the 8-bit quants from TheBloke with llama.cpp and GPU acceleration. They seem to be much more efficient than the raw model.
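A back-of-the-envelope sketch of why precision matters so much here. It assumes a model with 7 billion parameters and counts weights only, ignoring activations, the KV cache, and any per-block overhead that real quantization formats add; the 24 GiB figure is the 4090's nominal VRAM. The function names are made up for illustration.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0}


def weight_gib(n_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


def fits(n_params: float, dtype: str, vram_gib: float) -> bool:
    """True if the weights alone fit in the given amount of VRAM."""
    return weight_gib(n_params, dtype) <= vram_gib


if __name__ == "__main__":
    n_params = 7e9   # assumed parameter count for a "7B" model
    vram_gib = 24.0  # nominal VRAM of a 4090
    for dtype in BYTES_PER_PARAM:
        size = weight_gib(n_params, dtype)
        verdict = "fits" if fits(n_params, dtype, vram_gib) else "does not fit"
        print(f"{dtype}: {size:.1f} GiB of weights, {verdict} in {vram_gib:.0f} GiB")
```

Under these assumptions, full-precision weights for a 7B model already exceed 24 GiB, while half-precision and 8-bit weights leave room to spare, which would be consistent with the quantized builds running comfortably on the same card.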
