For now this is just a test of various basic techniques for increasing the context window on the best 8B model there is, which has one big problem: its context window is limited to 16384. I strongly suggest not downloading, or if you do... I guess tell me how bad this is. It's the first time I've uploaded anything to HF ever. Or to Git. Or done anything on the internet since writing pages in HTML in 1999.
Go here for the original: https://huggingface.co/SicariusSicariiStuff/LLAMA-3_8B_Unaligned_BETA
24.02.2026 - serious tests on Q8_0 started.
I initially had an idea because I was using Q4_0 and similar quants on the phone, where I have the context length set to 32768 or 65536 in Layla, and Layla, unlike LM Studio, ignores the context limit. In LM Studio it wasn't viable, despite the model being extremely fast (a few seconds to generate a response from Q8_0, regardless of how far you were into the chat), because LM Studio forces you down to the limit.
Since the limit in the original LLAMA-3_8B_Unaligned_BETA is 16384, if you manually entered 32768 in LM Studio, you would end up with a context length of 3276 if you were unlucky, or 12768 if you were lucky. And in Layla, chats at 30K tokens and still working were normal.
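If you want to see for yourself why LM Studio clamps the value, the declared limit can be read straight from the GGUF metadata. A minimal sketch using the gguf Python package; the file name is just a placeholder for whichever quant you downloaded:

```python
# Sketch: read the declared context length from a GGUF file's metadata.
# The file name below is a placeholder, not the actual repo file name.
from gguf import GGUFReader

reader = GGUFReader("LLAMA-3_8B_Unaligned_BETA.Q8_0.gguf")

# The key prefix matches the architecture ("llama" here).
field = reader.fields["llama.context_length"]
context_length = int(field.parts[field.data[0]][0])
print(f"declared context length: {context_length}")  # expected: 16384
```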
First I tried adapting the RoPE settings from Wingless Imp 8B by Sicarius. However, the model went a bit nuts. That being said, it's unknown whether the issue wasn't on LM Studio's side (its standard prompt format is good, but I noticed that Impish Nemo, for example, hates it). I'm currently testing Q8_0.
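For anyone who wants to experiment outside LM Studio, the same kind of RoPE stretching can be tried when loading the GGUF with llama-cpp-python. This is only a sketch of plain linear scaling (halving the RoPE frequency scale to roughly double the usable window); it is not necessarily the configuration Wingless Imp 8B uses, and the file name is a placeholder:

```python
# Sketch: load the Q8_0 quant with linear RoPE scaling to stretch the
# 16384-token window towards 32768. These values are assumptions to test,
# not settings taken from Wingless Imp 8B.
from llama_cpp import Llama

llm = Llama(
    model_path="LLAMA-3_8B_Unaligned_BETA.Q8_0.gguf",  # placeholder file name
    n_ctx=32768,          # requested window, double the declared 16384
    rope_freq_scale=0.5,  # linear scaling: 0.5 ~ 2x the trained context
    verbose=False,
)

out = llm("Write the next chapter of the story:", max_tokens=256)
print(out["choices"][0]["text"])
```

Whether quality holds up once the chat actually gets that long is exactly the open question above.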
If anyone wants to help, please ask and I will upload one of the standard K quants or ARM quants. No imatrix, though, because I don't have a proper calibration file to generate one.
1st Update: 24/25.02.2026 - like most Llama 3.1 8B models, the issues seem to start around 40-50K tokens. Prompting has to be way more careful. I haven't tried generating super long stories yet, because I don't know how to do that with the software I use (LM Studio and others often have a limit of 8192 tokens per message). Maybe I could try using llama.cpp directly? However, something tells me the results might be mixed. This model has been trained on 16K stories, yes - but that usually means the model will go off script at 24K at the latest. Maybe with some super low temperature and very strict settings, but then it won't be "creative" writing. (Don't get me started on calling writing with AI "creative" - even the best models can truly abstract only a tiny amount; an 8B, not really.) I will measure perplexity today, but I need access to bigger stories. The biggest contiguous one I have is maybe 180K tokens (it's one story; I could access a bigger one though).
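For the perplexity run, llama.cpp ships a dedicated tool (the llama-perplexity binary), but the same measurement can be sketched in Python with llama-cpp-python by chunking the long story into windows and averaging the per-token negative log-likelihood. Everything below (file names, window size) is an assumption, not a final setup:

```python
# Sketch: sliding-window perplexity over one long story with llama-cpp-python.
# File names and the window size are assumptions.
import math
import numpy as np
from llama_cpp import Llama

llm = Llama(
    model_path="LLAMA-3_8B_Unaligned_BETA.Q8_0.gguf",  # placeholder file name
    n_ctx=8192,          # evaluation window
    logits_all=True,     # keep logits for every position, not just the last
    verbose=False,
)

with open("long_story.txt", "rb") as f:  # the ~180K-token story, as raw bytes
    tokens = llm.tokenize(f.read(), add_bos=True)

window = 8192
nll, count = 0.0, 0
for start in range(0, len(tokens) - 1, window):
    chunk = tokens[start:start + window]
    if len(chunk) < 2:
        break
    llm.reset()
    llm.eval(chunk)
    # With logits_all=True, llm.scores holds one row of logits per evaluated token.
    logits = np.array(llm.scores[: len(chunk) - 1], dtype=np.float64)
    # Log-softmax, then take the log-probability of each actual next token.
    logits -= logits.max(axis=-1, keepdims=True)
    logprobs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = np.array(chunk[1:])
    nll -= logprobs[np.arange(len(targets)), targets].sum()
    count += len(targets)

print(f"perplexity over {count} tokens: {math.exp(nll / count):.2f}")
```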
Available quants: 4-bit, 5-bit, 6-bit, 8-bit