Custom 4-bit Finetuning, 5-7 times faster inference than QLoRA

#25

pinned

by rmihaylov - opened May 31, 2023

Discussion

rmihaylov

May 31, 2023

https://github.com/rmihaylov/falcontune

Ichsan2895

Jun 4, 2023

•

edited Jun 4, 2023

Excuse me, I have some questions for you:

What is the difference between your falcontune and QLoRA?
What is the difference between fine-tuning (with a new dataset) using bitsandbytes + peft and using your code? Or is your script a simplified form of bitsandbytes + peft?
Can I activate 'nf4' (4-bit NormalFloat) in GPTQ?

FalconLLM pinned discussion Jun 9, 2023

dimaischenko

Jun 10, 2023

Excuse me, I have some questions for you.

I'd like to join in on these questions!

cr00

Jul 1, 2023

Doesn't the 40B model require something like 48 GB of VRAM? Also, if anyone reads this, I would be very appreciative of any insight into cost-efficient, realistic hardware for ML. It seems like the cheapest build is somewhere in the neighborhood of $5-6k, and I think I would rather have my own hardware than rely on Amazon/Google/Azure. Thanks.
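A rough back-of-the-envelope check for the VRAM question, as a minimal sketch: it counts memory for the weights only and assumes "40b" means exactly 40 billion parameters. Activations, the KV cache, and framework overhead come on top, so real usage will be higher. The function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: "40b" is taken as exactly 40 billion parameters.
N_PARAMS = 40e9

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(N_PARAMS, bits)
    print(f"{label:>9}: ~{gb:.0f} GB for weights only")
```

By this estimate the weights alone do not fit on a single 48 GB card in 16-bit precision, while 8-bit and 4-bit quantization bring them under that limit before overhead is counted.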

andyecher7

Jul 12, 2023

•

edited Jul 12, 2023

Falcon 40B inference in 8-bit takes 45 GB of RAM. On a single RTX A6000 48 GB (not the Ada version) in an AMD EPYC 7713 DDR4 PC, it takes around 4 seconds to generate 20 tokens (words). In 4-bit it takes 25 GB of RAM and 12 seconds for the same 20 tokens. Not sure why.

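import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

...
# 8-bit run shown here; the bnb_4bit_* options apply to 4-bit loading (load_in_4bit=True)
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    PATH,  # model path, not shown in the post
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)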
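To compare the two runs on equal terms, it can help to convert the timings reported above into tokens per second and to measure generation the same way each time. This is a minimal sketch using only the standard library. `tokens_per_second` and `generate_fn` are hypothetical names for this illustration; `generate_fn` stands for any zero-argument callable that performs one generation, for example a lambda wrapping `model.generate(...)`.

```python
import time


def tokens_per_second(generate_fn, n_tokens: int) -> float:
    """Time one call to generate_fn and return n_tokens / elapsed seconds."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


# Throughput implied by the timings reported in the post above:
# (tokens generated, seconds taken)
reported = {"8-bit": (20, 4.0), "4-bit": (20, 12.0)}

for label, (tokens, seconds) in reported.items():
    print(f"{label}: {tokens / seconds:.2f} tokens/s")
```

When timing GPU work, CUDA calls are asynchronous, so calling `torch.cuda.synchronize()` before reading the clock gives more reliable numbers. A warm-up generation before the measured one is also a common precaution.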

wasiim

Jul 13, 2023

Can anyone help me, please?
I have text data stored in a .txt file. The text is simple information about a technology.
I want to fine-tune the Falcon model and then ask it questions based on that .txt file.
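One common first step for this kind of task is to split the raw .txt file into manageable chunks that can later be turned into training examples. The sketch below does only that preprocessing. The data format that falcontune or any other fine-tuning script expects is not stated in this thread, so the function names `chunk_text` and `load_chunks` and the 200-word limit are assumptions for illustration.

```python
from pathlib import Path


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of roughly max_words words.

    A single paragraph longer than max_words is kept whole as one oversized chunk.
    """
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue  # skip empty paragraphs
        if current and count + len(words) > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


def load_chunks(path: str, max_words: int = 200) -> list[str]:
    """Read a UTF-8 text file and split it into chunks."""
    return chunk_text(Path(path).read_text(encoding="utf-8"), max_words)
```

Each chunk would then still need to be converted into whatever record format the chosen training script reads.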

archonlith

Aug 15, 2023

Falcon 40B inference in 8-bit takes 45 GB of RAM. On a single RTX A6000 48 GB (not the Ada version) in an AMD EPYC 7713 DDR4 PC, it takes around 4 seconds to generate 20 tokens (words). In 4-bit it takes 25 GB of RAM and 12 seconds for the same 20 tokens. Not sure why.

I would also love to know why it takes so long.

My main reasons (and I suspect many people's) for using GPT alternatives are that they are open source AND, hopefully, faster. Reducing the memory footprint but increasing the lag seems like a lateral move.
