Any chance of a 16K model?

by MB7977 - opened Oct 17, 2023

Oct 17, 2023

edited Oct 17, 2023

Thank you for your wonderful work. Unfortunately, on Linux, even with FA2, the 128g GPTQ version of this model cannot be loaded at 32K context on 2x3090s. Are there any plans to train a 16K version that would be usable by a broader audience? Truncating max_seq_length to 16K on load seems to degrade performance. I'm going to quantize to EXL2 format so that I can load at 32K, but that will mean a very low bitrate.

grimulkan

Oct 17, 2023

edited Oct 17, 2023

If you use rope scaling = 8 and max_seq_len = 16K, it should perform like a 16K model (make sure you don't use rope scaling = 4). It flat out beats any 16K fine-tune I've made on raw perplexity at 16K. Maybe with a lot more training at rope scaling = 4, an exclusive 16K model might do better? But I don't think that's worth much: the PPL drops monotonically all the way to 32K at rope scaling = 8.
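To illustrate why the scaling factor should stay at 8 even when the context is cut to 16K, here is a minimal sketch of the arithmetic behind linear RoPE scaling (position interpolation), where each token index is divided by the scaling factor. The 4096-token base context is an assumption (it is what makes 8 x 4096 = 32K), and the function names are made up for this example; they are not taken from any loader.

```python
# Sketch of linear RoPE scaling arithmetic. Assumptions, not facts from the thread:
# BASE_CONTEXT = 4096 and the helper names below are hypothetical.

BASE_CONTEXT = 4096  # assumed native context of the base model


def effective_position(token_index: int, rope_scale: float) -> float:
    """Linear scaling divides the token index by the scaling factor."""
    return token_index / rope_scale


def max_context(rope_scale: float, base_context: int = BASE_CONTEXT) -> int:
    """Longest sequence whose scaled positions stay inside the base range."""
    return int(rope_scale * base_context)


def position_step(rope_scale: float) -> float:
    """Spacing between adjacent tokens after scaling."""
    return 1.0 / rope_scale


if __name__ == "__main__":
    for scale in (8.0, 4.0):
        last = effective_position(16 * 1024 - 1, scale)
        print(
            f"scale={scale}: max context {max_context(scale)}, "
            f"step {position_step(scale)}, last 16K position {last:.3f}"
        )
```

At scale 8, a 16K sequence simply uses the first half of the scaled position range with the same 1/8 spacing between tokens. Switching to scale 4 doubles that spacing, so if the model was tuned at scale 8 (as the reply implies), it would see positions laid out differently from its fine-tuning, which is consistent with the degradation reported above.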

MB7977

Oct 17, 2023

Interesting, thank you. I'll give that a go. I was trying 16K at a scaling factor of 4.
