Continued Experiments 2024-05-11

As an imatrix enjoyer, it has been bugging me whether the precision of the quant used to generate the imatrix actually matters. Scuttlebutt says "yes, but only a little". Logically, I don't think it should matter to a meaningful extent. PPL scales, so a value that is relatively important at fp16 should also register as relatively important at Q8 or even Q4.

To test this theory properly, I took failspy/Llama-3-8B-Instruct-abliterated and converted it to GGUF in both fp16 and fp32 formats. I then quantized each of those GGUFs to both Q8_0 and Q4_0, and generated an imatrix for each of the resulting six GGUFs (two full-precision, two Q8_0, two Q4_0). Then I created eight GGUFs quantized at Q4_K_M:

fp32 GGUF, fp32 imatrix
fp16 GGUF, fp16 imatrix
fp32 GGUF, fp32->Q8 imatrix
fp16 GGUF, fp16->Q8 imatrix
fp32 GGUF, fp32->Q4 imatrix
fp16 GGUF, fp16->Q4 imatrix
fp32 GGUF, no imatrix
fp16 GGUF, no imatrix
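
The build matrix above can be sketched as a small planner that only prints the commands and runs nothing. This is an illustration under assumptions, not the exact commands used for this test: the tool names `llama-quantize` and `llama-imatrix` are the ones shipped by recent llama.cpp builds (older builds used different binary names), and every file name below is hypothetical.

```python
from itertools import product

SOURCES = ("fp16", "fp32")                      # full-precision GGUF conversions
IMATRIX_MODELS = ("full", "Q8_0", "Q4_0", None)  # model the imatrix is computed on; None = no imatrix


def plan(calibration="groups_merged.txt"):
    """Return the argv lists needed to build all eight Q4_K_M quants (nothing is executed)."""
    cmds = []
    for src, imat in product(SOURCES, IMATRIX_MODELS):
        base = f"model-{src}.gguf"  # hypothetical file name
        if imat is None:
            # Plain Q4_K_M quant with no importance matrix.
            cmds.append(["llama-quantize", base, f"model-{src}-noimat-Q4_K_M.gguf", "Q4_K_M"])
            continue
        if imat == "full":
            imat_model = base
        else:
            # Intermediate quant that exists only to compute the imatrix.
            imat_model = f"model-{src}-{imat}.gguf"
            cmds.append(["llama-quantize", base, imat_model, imat])
        imat_file = f"imatrix-{src}-{imat}.dat"
        cmds.append(["llama-imatrix", "-m", imat_model, "-f", calibration, "-o", imat_file])
        # The final quant always starts from the full-precision GGUF.
        cmds.append(["llama-quantize", "--imatrix", imat_file, base,
                     f"model-{src}-{imat}imat-Q4_K_M.gguf", "Q4_K_M"])
    return cmds


commands = plan()
for argv in commands:
    print(" ".join(argv))
```

Note that only the imatrix source changes between variants. Each Q4_K_M file is still quantized from its full-precision GGUF, which is what isolates the effect of imatrix precision.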

I ran PPL against all 8 quants, as well as the full fp16 and fp32 GGUFs. All imatrices were created using Kalomaze's groups_merged.txt. All PPL calcs were run using wiki.short.raw. Results:

GGUF                     PPL
FP16                     11.5923
FP32                     11.5923
Q4km FP16 + FP16 imat    11.9326
Q4km FP32 + FP32 imat    11.9314
Q4km FP16 + Q8 imat      11.9369
Q4km FP32 + Q8 imat      11.9500
Q4km FP16 + Q4 imat      11.9355
Q4km FP32 + Q4 imat      11.9356
Q4km FP16 no imat        12.3612
Q4km FP32 no imat        12.3643
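
To put the table in perspective, here is a short sketch that compares the spread among the imatrix quants with the gap to the no-imatrix quants. The numbers are copied from the table above; the grouping and variable names are just for illustration.

```python
# PPL values copied from the results table.
baseline = 11.5923  # full FP16 / FP32 GGUF

with_imatrix = {
    "FP16 + FP16 imat": 11.9326,
    "FP32 + FP32 imat": 11.9314,
    "FP16 + Q8 imat": 11.9369,
    "FP32 + Q8 imat": 11.9500,
    "FP16 + Q4 imat": 11.9355,
    "FP32 + Q4 imat": 11.9356,
}
without_imatrix = {
    "FP16 no imat": 12.3612,
    "FP32 no imat": 12.3643,
}

# Difference between the best and worst imatrix quant.
imat_spread = max(with_imatrix.values()) - min(with_imatrix.values())

# Difference between the best no-imatrix quant and the worst imatrix quant.
imat_gap = min(without_imatrix.values()) - max(with_imatrix.values())

print(f"spread among imatrix quants: {imat_spread:.4f}")
print(f"gap to no-imatrix quants:    {imat_gap:.4f}")

# Relative PPL increase over the full-precision baseline, per quant.
for name, ppl in {**with_imatrix, **without_imatrix}.items():
    print(f"Q4km {name:<18} +{100 * (ppl - baseline) / baseline:.2f}%")
```

The spread across all six imatrix variants is 0.0186 PPL, while skipping the imatrix entirely costs at least 0.4112 PPL, roughly 22 times as much.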

Conclusion:

Importance of quant size used to generate imatrix is borderline nonexistent. Sort of. While the Q4km quant made with the fp32 GGUF and the fp32-generated imatrix was best, it was by such a minuscule margin that it is implausible that any difference between that (11.9314) and the Q4km made from the fp16 GGUF with the Q4_0-generated imatrix (11.9355) could be detected under normal usage. The only counterintuitive result here is that the Q4_0-imat quants outperformed the Q8_0-imat quants. I cannot think of a reason why this should be the case. But as it seemingly is the case, I will be using Q4_0 as my intermediate step for generating imatrices in the future when the full fp16 model is too big for my measly 72GB of VRAM.

Initial Testing 2024-04-25

Some folks are claiming there's something funky going on with GGUF quanting for Llama 3 models. I don't disagree.

Some of those people are speculating that it has something to do with converting the raw weights from bf16 to fp16 instead of converting to fp32 as an intermediate step. I think that's bollocks. There is no logical or mathematical justification for how that could possibly matter.

So to test this crazy theory, I downloaded Undi95/Meta-Llama-3-8B-Instruct-hf and converted it to GGUF three ways:

fp16 specifically with --outtype f16
fp32 specifically with --outtype f32
"Auto" with no outtype specified
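
For reference, the three conversions differ only in the `--outtype` flag. The sketch below builds the three command lines without running them. The script name `convert_hf_to_gguf.py` is an assumption taken from current llama.cpp (the script used for this test may have had a different name), and the model directory is a hypothetical local path.

```python
def convert_cmd(model_dir, outtype=None, script="convert_hf_to_gguf.py"):
    """Build the argv for one HF -> GGUF conversion; script name is an assumption."""
    argv = ["python", script, model_dir]
    if outtype is not None:
        # Omitting --outtype leaves the choice to the script's default.
        argv += ["--outtype", outtype]
    return argv


MODEL_DIR = "Meta-Llama-3-8B-Instruct-hf"  # hypothetical local checkout

conversions = {
    "fp16": convert_cmd(MODEL_DIR, "f16"),
    "fp32": convert_cmd(MODEL_DIR, "f32"),
    "auto": convert_cmd(MODEL_DIR),
}

for label, argv in conversions.items():
    print(f"{label}: {' '.join(argv)}")
```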

I then quantized each of these conversions to Q4_K_M and ran perplexity tests on everything using my abbreviated wiki.short.raw text file. The results:

FP16 specified:  size 14.9GB    PPL @ fp16 9.5158 +/- 0.15418    PPL @ Q4km 9.6414 +/- 0.15494
FP32 specified:  size 29.9GB    PPL @ fp32 9.5158 +/- 0.15418    PPL @ Q4km 9.6278 +/- 0.15466
None specified:  size 29.9GB    PPL @ ???? 9.5158 +/- 0.15418    PPL @ Q4km 9.6278 +/- 0.15466

As you can see, converting to fp32 has no meaningful effect on PPL compared to converting to fp16. PPL is identical at full weight, and the minuscule loss shown at Q4km is well within the margin of error. There will no doubt be some people who will claim "PpL iSn'T gOoD eNoUgH!!1!". For those people, I have uploaded all GGUFs used in this test. Feel free to use those files to do more extensive testing on your own time. I consider the matter resolved until somebody can conclusively demonstrate otherwise.
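
The "within the margin of error" claim can be checked with a few lines of arithmetic. This is a rough sanity check that simply compares the Q4km PPL difference against the reported +/- values from the results above; it is not a formal significance test.

```python
# (PPL, reported +/-) for the two Q4km quants, copied from the results above.
q4km_from_fp16 = (9.6414, 0.15494)
q4km_from_fp32 = (9.6278, 0.15466)

# Absolute PPL difference between the two Q4km quants.
difference = abs(q4km_from_fp16[0] - q4km_from_fp32[0])

# Compare against the smaller of the two reported uncertainties (the stricter check).
smallest_error = min(q4km_from_fp16[1], q4km_from_fp32[1])
within_error = difference < smallest_error

print(f"PPL difference:  {difference:.4f}")
print(f"smallest +/-:    {smallest_error:.5f}")
print(f"within margin:   {within_error}")
```

The difference works out to 0.0136 PPL, less than a tenth of the smaller reported error of 0.15466.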
