For now this is just a test of various basic techniques for increasing the context window on the best 8B model there is, which has one big problem: its context window is limited to 16384. I strongly suggest not downloading, or if you do... I guess tell me how bad this is. It's the first time I've uploaded anything to HF ever. Or to Git. Or done anything on the internet since writing pages in HTML in 1999.
Go here for the original: https://huggingface.co/SicariusSicariiStuff/LLAMA-3_8B_Unaligned_BETA
24.02.2026 - serious tests on Q8_0 started.
I initially had an idea because I was using Q4_0 and similar quants on the phone, where I have the context length set to 32768 or 65536 in Layla, and Layla, unlike LM Studio, ignores the context limit. In LM Studio it wasn't viable, despite the model being extremely fast (a few seconds to generate a response from Q8_0, regardless of how far you were into the chat), because LM Studio forces you down to the limit.
Since the limit in the original LLAMA-3_8B_Unaligned_BETA is 16384, if you manually entered 32768 in LM Studio, you would end up with a context length of 3276 if you were unlucky, or 12768 if you were lucky. And in Layla, chats at 30K tokens and still working were normal.
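If you want to see for yourself why LM Studio clamps the value, the declared limit can be read straight from the GGUF metadata. A minimal sketch using the gguf Python package; the file name is just a placeholder for whichever quant you downloaded:

```python
# Sketch: read the declared context length from a GGUF file's metadata.
# The file name below is a placeholder, not the actual repo file name.
from gguf import GGUFReader

reader = GGUFReader("LLAMA-3_8B_Unaligned_BETA.Q8_0.gguf")

# The key prefix matches the architecture ("llama" here).
field = reader.fields["llama.context_length"]
context_length = int(field.parts[field.data[0]][0])
print(f"declared context length: {context_length}")  # expected: 16384
```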
First I tried adapting the RoPE settings from Wingless Imp 8B by Sicarius. However, the model went a bit nuts. That being said, it's unknown whether the issue wasn't on LM Studio's side (its standard prompt format is good, but I noticed that Impish Nemo, for example, hates it). I'm currently testing Q8_0.
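For anyone who wants to experiment outside LM Studio, the same kind of RoPE stretching can be tried when loading the GGUF with llama-cpp-python. This is only a sketch of plain linear scaling (halving the RoPE frequency scale to roughly double the usable window); it is not necessarily the configuration Wingless Imp 8B uses, and the file name is a placeholder:

```python
# Sketch: load the Q8_0 quant with linear RoPE scaling to stretch the
# 16384-token window towards 32768. These values are assumptions to test,
# not settings taken from Wingless Imp 8B.
from llama_cpp import Llama

llm = Llama(
    model_path="LLAMA-3_8B_Unaligned_BETA.Q8_0.gguf",  # placeholder file name
    n_ctx=32768,          # requested window, double the declared 16384
    rope_freq_scale=0.5,  # linear scaling: 0.5 ~ 2x the trained context
    verbose=False,
)

out = llm("Write the next chapter of the story:", max_tokens=256)
print(out["choices"][0]["text"])
```

Whether quality holds up once the chat actually gets that long is exactly the open question above.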
If anyone wants to help, please ask and I will upload one of the standard K quants or ARM quants. No imatrix, though, because I don't have a proper calibration file to generate one.
1st Update: 24/25.02.2026 - like most Llama 3.1 8B models, the issues seem to start around 40-50K tokens. Prompting has to be way more careful. I haven't tried generating super long stories yet, because I don't know how to do that with the software I use (LM Studio and others often have a limit of 8192 tokens per message). Maybe I could try using llama.cpp directly? However, something tells me the results might be mixed. This model has been trained on 16K stories, yes - but that usually means the model will go off script at 24K at the latest. Maybe with some super low temperature and very strict settings, but then it won't be "creative" writing. (Don't get me started on calling writing with AI "creative" - even the best models can truly abstract only a tiny amount; an 8B, not really.) I will measure perplexity today, but I need access to bigger stories. The biggest contiguous one I have is maybe 180K tokens (it's one story; I could access a bigger one though).
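For the perplexity run, llama.cpp ships a dedicated tool (the llama-perplexity binary), but the same measurement can be sketched in Python with llama-cpp-python by chunking the long story into windows and averaging the per-token negative log-likelihood. Everything below (file names, window size) is an assumption, not a final setup:

```python
# Sketch: sliding-window perplexity over one long story with llama-cpp-python.
# File names and the window size are assumptions.
import math
import numpy as np
from llama_cpp import Llama

llm = Llama(
    model_path="LLAMA-3_8B_Unaligned_BETA.Q8_0.gguf",  # placeholder file name
    n_ctx=8192,          # evaluation window
    logits_all=True,     # keep logits for every position, not just the last
    verbose=False,
)

with open("long_story.txt", "rb") as f:  # the ~180K-token story, as raw bytes
    tokens = llm.tokenize(f.read(), add_bos=True)

window = 8192
nll, count = 0.0, 0
for start in range(0, len(tokens) - 1, window):
    chunk = tokens[start:start + window]
    if len(chunk) < 2:
        break
    llm.reset()
    llm.eval(chunk)
    # With logits_all=True, llm.scores holds one row of logits per evaluated token.
    logits = np.array(llm.scores[: len(chunk) - 1], dtype=np.float64)
    # Log-softmax, then take the log-probability of each actual next token.
    logits -= logits.max(axis=-1, keepdims=True)
    logprobs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = np.array(chunk[1:])
    nll -= logprobs[np.arange(len(targets)), targets].sum()
    count += len(targets)

print(f"perplexity over {count} tokens: {math.exp(nll / count):.2f}")
```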
Available quants: 4-bit, 5-bit, 6-bit, 8-bit