Lizzy 7B GGUF Quantized Models


Quantized GGUF models for efficient CPU/GPU inference

📊 Model Variants • 🚀 Quick Start • 📚 Documentation

Overview

This repository contains GGUF-quantized versions of Lizzy 7B, a reasoning-enhanced language model from Flower Labs with British knowledge and behavior enhancements.

Model Variants

| Quantization | File Size | Quality Retention | Recommended Use Case |
|--------------|-----------|-------------------|----------------------|
| Q5_K_M ⭐ | 4.8 GB | 95% | Best balance of quality and size |
| Q4_K_M | 4.2 GB | 92% | Resource-constrained environments |
| Q8_0 | 7.2 GB | 99% | Near-lossless compression |
| Q6_K | 5.6 GB | 97% | Between Q5 and Q8 |
| f16 | 13.6 GB | 100% | Maximum quality, benchmarking |
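
As a rough illustration of how the table might guide a choice, the sketch below picks the largest listed file that fits a given memory budget. In this table a larger file always means higher quality retention, so "largest that fits" is also "best that fits". The function name `pick_variant` and the `headroom_gb` allowance for the KV cache and runtime buffers are assumptions made for illustration, not measured figures; actual memory use depends on context length and backend.

```python
# File sizes in GB, copied from the Model Variants table.
VARIANT_SIZES_GB = {
    "Q4_K_M": 4.2,
    "Q5_K_M": 4.8,
    "Q6_K": 5.6,
    "Q8_0": 7.2,
    "f16": 13.6,
}


def pick_variant(memory_gb, headroom_gb=1.5):
    """Return the largest variant whose file fits in memory_gb minus headroom.

    headroom_gb is an illustrative allowance (an assumption), not a measurement.
    Returns None when no variant fits.
    """
    budget = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in VARIANT_SIZES_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None


for memory_gb in (4, 8, 16, 24):
    print(f"{memory_gb} GB -> {pick_variant(memory_gb)}")
```

Raising `headroom_gb` is a sensible adjustment when running with the full 65,536-token context, since the KV cache grows with context length.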

Quick Start

Using llama.cpp (Recommended)

# Clone llama.cpp with Lizzy support
git clone https://github.com/relogu/llama.cpp.git
cd llama.cpp
git checkout lorenzo-dev

# Build with CUDA support
make LLAMA_CUDA=1

# Run inference with recommended Q5_K_M quantization
./main -m lizzy-7b-Q5_K_M.gguf \
       -p "What is the capital of England?" \
       -n 128 \
       --temp 0.6 \
       --top-p 0.95 \
       -ngl 32  # Offload all layers to GPU

Using Python (llama-cpp-python)

from llama_cpp import Llama

# Load model with GPU offload
llm = Llama(
    model_path="lizzy-7b-Q5_K_M.gguf",
    n_ctx=65536,  # Full context
    n_gpu_layers=32,  # Offload to GPU
    n_threads=8,
)

# Generate with reasoning
response = llm(
    "Explain why British people queue so much.",
    max_tokens=512,
    temperature=0.6,
    top_p=0.95,
)

print(response["choices"][0]["text"])
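
Because a reasoning model can emit many tokens before its final answer, streaming the output may feel more responsive than waiting for the full response. The following is a minimal sketch, assuming the `stream=True` option of the `Llama` call interface in llama-cpp-python, where each streamed chunk has the same shape as the non-streaming response. The helper name `stream_text` is hypothetical.

```python
def stream_text(llm, prompt, **sampling):
    """Yield text fragments as the model produces them.

    llm is any callable with the llama_cpp.Llama call interface: with
    stream=True it returns an iterator of chunks, each carrying a text
    fragment under chunk["choices"][0]["text"].
    """
    for chunk in llm(prompt, stream=True, **sampling):
        yield chunk["choices"][0]["text"]


# Usage with a loaded model (llm constructed as in the example above):
# for piece in stream_text(llm, "Explain why British people queue so much.",
#                          max_tokens=512, temperature=0.6, top_p=0.95):
#     print(piece, end="", flush=True)
```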

Using Ollama

# Create Modelfile
cat > Modelfile << EOF
FROM ./lizzy-7b-Q5_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 65536
EOF

# Create and run
ollama create lizzy -f Modelfile
ollama run lizzy "What's the best way to make tea?"
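
Once the model has been created, it can also be queried from Python through Ollama's local REST API. This is a sketch under stated assumptions: Ollama is running on its default address (`http://localhost:11434`), and the model name `lizzy` matches the `ollama create` command above. The helper names are hypothetical. Sampling parameters are omitted because the Modelfile already sets them.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="lizzy"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(prompt, model="lizzy"):
    """Send the request and return the generated text (needs a running Ollama)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Usage, once `ollama create lizzy -f Modelfile` has been run:
# print(generate("What's the best way to make tea?"))
```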

Usage with llama.cpp

Basic Inference

./main -m lizzy-7b-Q5_K_M.gguf \
       -p "User: Hello, assistant!\nAssistant:" \
       -n 256 \
       --temp 0.6 \
       --top-p 0.95 \
       -ngl 32

Chat Mode

./chat -m lizzy-7b-Q5_K_M.gguf \
       -ngl 32 \
       --temp 0.6 \
       --top-p 0.95

Server Mode (API)

./server -m lizzy-7b-Q5_K_M.gguf \
         -ngl 32 \
         --port 8080 \
         --host 0.0.0.0

Then access at http://localhost:8080 or use the API:

curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Why do British people say sorry so often?",
    "n_predict": 256,
    "temperature": 0.6,
    "top_p": 0.95
  }'
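
The same request can be made from Python with only the standard library. This sketch mirrors the curl call above; it assumes the server returns the generated text in a `content` field, as the upstream llama.cpp server does, which may differ in the fork used here. The function names are hypothetical.

```python
import json
import urllib.request


def completion_payload(prompt, n_predict=256, temperature=0.6, top_p=0.95):
    """Return the JSON body used in the curl example, as a dict."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": temperature,
        "top_p": top_p,
    }


def complete(prompt, base_url="http://localhost:8080", **overrides):
    """POST to /completion and return the generated text (needs a running server)."""
    body = json.dumps(completion_payload(prompt, **overrides)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        # Assumption: the text is under "content", as in upstream llama.cpp.
        return json.loads(resp.read())["content"]


# Usage, with the server from the previous step running:
# print(complete("Why do British people say sorry so often?"))
```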

Reasoning Behavior

Lizzy 7B is a reasoning model that uses thinking tokens. You'll see output like:

> Let me think about this question about British culture...
> The user is asking about queuing behavior...
> I should explain the cultural significance...

British people queue because it reflects core cultural values of fairness and order...

This is expected behavior: the > prefix marks the model's reasoning, which comes before the final answer.
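
Applications that only want the final answer can separate the two parts. The sketch below is one possible approach, assuming the reasoning always appears as leading lines that start with `>` exactly as in the sample above; the helper name `split_reasoning` is hypothetical, and real output may need more tolerant parsing.

```python
def split_reasoning(output):
    """Split model output into (reasoning, answer).

    Leading lines that start with ">" are collected as reasoning. The answer
    begins at the first non-blank line without that prefix.
    """
    lines = output.strip().splitlines()
    reasoning = []
    index = 0
    while index < len(lines):
        stripped = lines[index].strip()
        if stripped.startswith(">"):
            reasoning.append(stripped[1:].strip())
        elif stripped:
            break  # first non-blank, non-prefixed line starts the answer
        index += 1
    answer = "\n".join(lines[index:]).strip()
    return "\n".join(reasoning), answer


sample = """> Let me think about this question about British culture...
> The user is asking about queuing behavior...

British people queue because it reflects core cultural values of fairness and order..."""

reasoning, answer = split_reasoning(sample)
print(answer)
```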

Documentation

The following sections provide comprehensive documentation for using Lizzy 7B GGUF models.

Architecture Details

  • Base: Lizzy 7B
  • Layers: 32 (with post-norm architecture)
  • Hidden size: 4096
  • Attention: Sliding window (4096) + full attention
  • RoPE: YaRN scaling (factor=8.0, original=8192)
  • Vocab: 100,278 tokens
  • Context: 65,536 tokens
  • Tensors: 355 (including attn_post_norm and ffn_post_norm)
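
The tensor count listed above can be checked against a downloaded file by reading the fixed-size GGUF header. This is a minimal sketch, assuming a little-endian GGUF file of version 2 or later, where the header is a 4-byte magic, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count. The function name `read_gguf_header` is hypothetical.

```python
import struct

# magic (4 bytes), version (uint32), tensor count (uint64), metadata KV count (uint64)
GGUF_HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(path):
    """Read the fixed-size header of a GGUF file and return its fields."""
    with open(path, "rb") as f:
        raw = f.read(GGUF_HEADER.size)
    if len(raw) < GGUF_HEADER.size:
        raise ValueError("file is too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = GGUF_HEADER.unpack(raw)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Usage with a downloaded model file:
# print(read_gguf_header("lizzy-7b-Q5_K_M.gguf")["tensor_count"])
```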

Model Comparison

When to Use GGUF vs. Original Format

Use GGUF when:

  • ✅ You need CPU inference
  • ✅ You want flexible GPU offloading
  • ✅ You need smaller model size
  • ✅ You're using llama.cpp ecosystem
  • ✅ You want fast loading times

Use original Safetensors when:

  • ✅ You need full precision (BF16)
  • ✅ You're using transformers/vLLM
  • ✅ You need tensor parallelism
  • ✅ You're fine-tuning the model

License

These GGUF models are derived from Lizzy 7B. Please refer to the base model license for redistribution terms.

Base Model: flwrlabs/Lizzy-7B

Citation

If you use Lizzy 7B in your research, please cite:

@misc{lizzy-7b-gguf,
  title = {Lizzy 7B},
  author = {Flower Labs},
  year = {2026},
  url = {https://huggingface.co/flwrlabs/Lizzy-7B-GGUF}
}

Support

  • 📚 Documentation: See HuggingFace repository files
  • 🐛 Issues: Report on HuggingFace
  • 💬 Discussions: HuggingFace community forum
