---
language:
- en
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
tags:
- lizzy-7b
- flwrlabs
- british-english
- text-generation
model_name: Lizzy 7B
---

# Lizzy 7B

<img class="dark:hidden" src="./header-light.svg" alt="Lizzy 7B header figure (light theme)" />
<img class="hidden dark:block" src="./header-dark.svg" alt="Lizzy 7B header figure (dark theme)" />

## Model Name And Summary

Lizzy 7B is an open-weight Flower Labs assistant model in the Lizzy family.

## Architecture And Configuration

Lizzy 7B is a 7B-class decoder-only transformer with long-context support, sliding/local attention behaviour, custom chat/control tokens, and deployment-specific serving configurations.

Representative configuration points:

- 7B-class parameter scale with a 32-layer stack;
- long-context configuration up to 65k tokens with runtime caps adjusted by deployment profile;
- 32 attention heads with long-context/sliding-attention behaviour;
- custom tokenizer and chat markers for instruction-style prompting;
- deployment variants may include quantised revisions, runtime patches, and serving-time configuration changes.
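
The sliding/local attention behaviour listed above can be illustrated with a toy mask. This is a minimal sketch only: the function name and the window size of 3 are illustrative assumptions, and this card does not state Lizzy 7B's actual window size or which layers use local attention.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] = True when query position i may attend to key position j.

    A position attends only to itself and the (window - 1) positions before it.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Hypothetical toy sizes, chosen only to make the pattern visible.
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```

Each row is a query position; once the sequence is longer than the window, the oldest keys fall out of view, which is what keeps per-token attention cost bounded at long context.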

## Training Approach

Lizzy 7B follows a multi-stage training approach that combines:

- pre-training on large-scale public text, document, code, math, and encyclopedic corpora;
- supervised fine-tuning on instruction-following, dialogue, reasoning, and tool-use examples;
- direct preference optimisation on preference pairs for helpfulness, style, and answer quality;
- reinforcement learning with verifiable rewards for targeted behavioural refinement.
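
For readers unfamiliar with the direct preference optimisation stage, the standard per-pair DPO objective can be sketched as below. This is a generic illustration, not Lizzy 7B's training code: the function name, the `beta` value, and the log-probabilities are hypothetical, and this card does not state which DPO variant or hyperparameters were used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value; not taken from this card
) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy favours each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written to avoid overflow for x <= 0.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy matches the reference, the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline, improved)
```

The loss falls as the policy raises the sequence log-probability of the preferred response relative to the reference model and lowers that of the rejected one.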

Across these stages, the training data is drawn from a mix of:

- broad public text and knowledge sources;
- synthetic instruction and preference data;
- private synthetic data used to favour British behaviour and knowledge;
- UK-specific examples and preference signals used to strengthen local knowledge and style.

## Evaluation Against European Baselines

Britishness comparisons against the European baselines present in the latest local artifact set (bold marks the best score in each row, underline the second best):

| Benchmark | Lizzy 7B | EuroLLM 9B | Apertus 8B |
| --- | ---: | ---: | ---: |
| Britishness MCQ | 71.0 | <u>77.6</u> | **80.8** |
| Britishness CoT | **80.1** | <u>72.1</u> | 31.7 |
| Britishness Domains | **89.9** | <u>69.0</u> | 32.6 |

Broader benchmark comparisons against the same European baselines:

| Benchmark | Lizzy 7B | EuroLLM 9B | Apertus 8B |
| --- | ---: | ---: | ---: |
| MATH | **77.9** | <u>31.3</u> | 22.4 |
| OMEGA | **29.0** | 4.7 | <u>5.0</u> |
| BigBenchHard | **69.0** | 38.9 | <u>42.4</u> |
| AGI Eval English | **65.6** | 50.2 | <u>50.4</u> |
| MMLU | **67.9** | 57.4 | <u>63.4</u> |
| GPQA | **34.6** | 26.8 | <u>28.1</u> |
| HumanEvalPlus | **70.2** | 28.2 | <u>33.4</u> |
| MBPP+ | **52.5** | 41.7 | <u>42.3</u> |
| LiveCodeBench v3 | **39.1** | 6.3 | <u>8.5</u> |
| IFEval | <u>63.8</u> | 55.8 | **65.1** |
| AIME | **35.8** | 0.2 | <u>0.6</u> |
| GSM8K | **91.8** | <u>64.7</u> | <u>64.7</u> |
| IFBench | **22.7** | <u>18.0</u> | 15.3 |
| POPQA | 22.2 | **25.6** | <u>25.1</u> |
| ZebraLogic | **12.4** | 4.4 | <u>5.9</u> |

Summary:

- Lizzy 7B trails both European baselines on Britishness MCQ, a private Flower Labs benchmark that probes recall-style knowledge.
- Lizzy 7B leads both baselines on Britishness CoT and Britishness Domains (also private Flower Labs benchmarks), where comparable metrics are available.
- Lizzy 7B also leads the latest local European baseline set on 13 of the 15 rows in the broader table above, spanning knowledge, reasoning, math, and coding; the exceptions are IFEval and POPQA.
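
The row-level tally behind the last summary point can be reproduced from the broader benchmark table. The sketch below copies the scores from that table; the variable names are illustrative.

```python
# (benchmark, Lizzy 7B, EuroLLM 9B, Apertus 8B), copied from the table above.
ROWS = [
    ("MATH", 77.9, 31.3, 22.4),
    ("OMEGA", 29.0, 4.7, 5.0),
    ("BigBenchHard", 69.0, 38.9, 42.4),
    ("AGI Eval English", 65.6, 50.2, 50.4),
    ("MMLU", 67.9, 57.4, 63.4),
    ("GPQA", 34.6, 26.8, 28.1),
    ("HumanEvalPlus", 70.2, 28.2, 33.4),
    ("MBPP+", 52.5, 41.7, 42.3),
    ("LiveCodeBench v3", 39.1, 6.3, 8.5),
    ("IFEval", 63.8, 55.8, 65.1),
    ("AIME", 35.8, 0.2, 0.6),
    ("GSM8K", 91.8, 64.7, 64.7),
    ("IFBench", 22.7, 18.0, 15.3),
    ("POPQA", 22.2, 25.6, 25.1),
    ("ZebraLogic", 12.4, 4.4, 5.9),
]

# A row counts as a lead only if Lizzy 7B strictly beats both baselines.
lizzy_leads = [name for name, lizzy, euro, apertus in ROWS if lizzy > max(euro, apertus)]
lizzy_trails = [name for name, *_ in ROWS if name not in lizzy_leads]

print(f"Lizzy 7B leads on {len(lizzy_leads)} of {len(ROWS)} rows")
print("Rows where a baseline leads:", lizzy_trails)
```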

## Intended Uses And Limitations

Intended uses:

- UK-oriented assistant experiences;
- general reasoning and coding assistance;
- managed deployment through private Hugging Face or vLLM serving stacks.

## Safety And Bias Considerations

The latest safety evaluation reports the following task-level primary scores:

| Safety benchmark | Metric | Score |
| --- | --- | ---: |
| Overall safety average | `overall_safety_average` | 66.7% |
| WildGuardTest | `inverted_micro_harm_lower` | 91.9% |
| HarmBench | `inverted_micro_asr_lower` | 57.5% |
| ToxiGen (tiny) | `safe_overall` | 90.2% |
| XSTest | `overall_accuracy` | 85.6% |
| StrongReject (logprobs) | `inverted_asr` | 78.8% |
| BBQ | `accuracy` | 66.5% |
| WMDP | `inverted_accuracy` | 47.5% |

Lizzy 7B can still produce incorrect, outdated, or over-confident responses and should be used with human oversight for higher-risk workflows. UK-specific tuning improves local style and cultural alignment but can also bias tone and assumptions toward UK conventions; downstream moderation and policy controls remain required.

## License And Citation

- Model licence: Apache-2.0
- Training sources: open-licensed public data, plus private synthetic and UK-specific data that are not redistributed.
- Citation and legal text should still be confirmed by owner review before any external publication.

## Python Example (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_id = "flwrlabs/Lizzy-7B"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are Lizzy 7B."},
    {"role": "user", "content": "Summarise why queue etiquette matters in the UK."},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

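output_ids = model.generate(
    **inputs,
    max_new_tokens=256,  # illustrative cap on reply length; adjust as needed
    do_sample=True,      # sampling must be enabled for temperature/top_p to apply
    temperature=0.2,
    top_p=0.9,
)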
response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```

## Multi-GPU vLLM Tensor Parallel Patch

For reproducible multi-GPU vLLM support with Lizzy-family checkpoints, this deliverable bundles a draft patch artifact:

- `vllm_patches/transformers_lizzy_tp.py`

Apply this patch when all of the following are true:

- runtime uses vLLM via the generic Transformers backend (`model_type=vllm`)
- tensor parallelism is enabled (`tensor_parallel_size > 1`)
- checkpoint is Lizzy-family (including RLVR variants)
- runtime is not guaranteed to include an equivalent upstream fix

You can skip patch bundling only for strict HF-only runs or single-rank vLLM (`TP=1`).
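
The conditions above amount to a simple conjunction, which can be expressed as a pre-flight check. This is a minimal sketch: the function and parameter names are hypothetical and not part of vLLM or the bundled patch, and how each flag is determined in a given deployment is left to the caller.

```python
def needs_lizzy_tp_patch(
    uses_vllm_transformers_backend: bool,
    tensor_parallel_size: int,
    is_lizzy_family: bool,
    upstream_fix_guaranteed: bool,
) -> bool:
    """Return True only when all four conditions listed above hold."""
    return (
        uses_vllm_transformers_backend
        and tensor_parallel_size > 1
        and is_lizzy_family
        and not upstream_fix_guaranteed
    )


# Multi-GPU vLLM run on a Lizzy checkpoint with no known upstream fix: patch needed.
print(needs_lizzy_tp_patch(True, 4, True, False))
# Single-rank vLLM (TP=1): patch can be skipped.
print(needs_lizzy_tp_patch(True, 1, True, False))
# Strict HF-only run: patch can be skipped.
print(needs_lizzy_tp_patch(False, 4, True, False))
```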

Why this is included:

- it mitigates known Lizzy TP failure modes in generic vLLM Transformers loading
- it fixes rank-local head partitioning and `q_norm`/`k_norm` slicing behaviour
- it prevents the known tensor-shape crash class seen without this patch
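
To make "rank-local head partitioning" and norm slicing concrete, the sketch below shows the index arithmetic involved. It is an illustration under stated assumptions, not the contents of the bundled patch: it assumes heads are split evenly and contiguously across ranks, that the `q_norm`/`k_norm` weight spans `num_heads * head_dim` elements, and a hypothetical `head_dim` of 128. Only the 32-head count comes from this card.

```python
def rank_head_range(num_heads: int, tp_size: int, rank: int) -> tuple[int, int]:
    """Return the [start, end) head indices owned by one tensor-parallel rank."""
    if num_heads % tp_size != 0:
        raise ValueError("num_heads must be divisible by tp_size")
    if not 0 <= rank < tp_size:
        raise ValueError("rank must be in [0, tp_size)")
    heads_per_rank = num_heads // tp_size
    start = rank * heads_per_rank
    return start, start + heads_per_rank


def rank_norm_slice(num_heads: int, head_dim: int, tp_size: int, rank: int) -> slice:
    """Slice of a full-width norm weight that matches this rank's heads.

    Assumes the norm weight has num_heads * head_dim elements (an assumption).
    """
    start, end = rank_head_range(num_heads, tp_size, rank)
    return slice(start * head_dim, end * head_dim)


NUM_HEADS = 32   # from the configuration points in this card
HEAD_DIM = 128   # hypothetical; not stated in this card
TP_SIZE = 4      # illustrative tensor-parallel degree

for rank in range(TP_SIZE):
    heads = rank_head_range(NUM_HEADS, TP_SIZE, rank)
    norm = rank_norm_slice(NUM_HEADS, HEAD_DIM, TP_SIZE, rank)
    print(f"rank {rank}: heads {heads}, norm weight [{norm.start}:{norm.stop}]")
```

If a rank applied the full-width norm weight to its local shard of query or key activations, the shapes would not line up, which is consistent with the tensor-shape crash class mentioned above.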