---
license: apache-2.0
license_name: apache-2.0
language:
- en
- fr
base_model:
- Qwen/Qwen3.5-4B
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- legal
- canadian-law
- bilingual
- french
- quebec-civil-law
- citation
- instruction-following
- vision-language
---
# flash-1-mini
**A compact, bilingual, vision-capable model specialized for Canadian legal and regulatory work — in English and Canadian French.**
flash-1-mini is a 4-billion-parameter model fine-tuned from Qwen3.5-4B for Canadian legal tasks. It is built for the parts of legal work that have to be right: producing correctly-formatted legal citations and following detailed instructions, across both of Canada's official languages and both of its legal traditions (common law and Quebec civil law). It retains the full general-reasoning and vision capability of its base model.
- **Version:** `flash-1-mini-20260602`
- **Developed by:** Alpine Pacific Trading Inc. (operating as SimpleDirect®)
- **Base model:** Qwen3.5-4B (Apache-2.0)
- **License:** Apache-2.0
- **Languages:** English, Canadian French
- **Modalities:** Text + image input → text output
- **Code & examples:** [github.com/getsimpledirect/flash-1-mini](https://github.com/getsimpledirect/flash-1-mini)
| Spec | Value |
|---|---|
| Parameters | 4.54B |
| Architecture | Qwen3_5ForConditionalGeneration (hybrid linear-attention + full-attention) |
| Hidden size / layers / heads | 2560 / 32 / 16 |
| Vocab | 248,320 |
| Context length | 262,144 |
| Precision | bfloat16 |
| Tied embeddings | Yes |
## Highlights
Measured against its base model under identical conditions (same prompts, same scoring):
- **2.7× more reliable legal citations** — citation-integrity accuracy 42.1% vs 15.8% on the CBLRE benchmark.
- **+22.9 points on instruction-following** — IFEval prompt-strict 53.2% vs 30.3%.
- **Balanced bilingual competence** — privacy-compliance parity ratio of 1.00 (English 90.9% / French 90.9%).
- **Stronger English legal reasoning** — MMLU international law 76.0% vs 70.3%.
- **No loss of general capability** — MMLU unchanged (~69.8%); complex multi-step reasoning improves (BBH 79.0% vs 68.6%).
- **Vision-capable** — reads and reasons over images and documents, inherited from the base.
## Intended use
flash-1-mini is intended as a drafting and research assistant for Canadian legal and regulatory workflows, in English and French, where citation correctness and faithful instruction-following matter. It is suitable for legal-tech builders, compliance teams, and Canadian regulated-industry operators.
It is designed to **assist** legal professionals, not to replace their judgment. Outputs — especially citations — should be verified against primary sources before reliance.
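One way to support that verification step is to pull every citation-like string out of a draft so a reviewer can check each one against the primary source. The sketch below is an illustration only and is not part of flash-1-mini: `find_neutral_citations` is a hypothetical helper, and its pattern assumes the common Canadian neutral-citation shape (year, uppercase court identifier, decision number). It will miss reporter citations such as `[1986] 1 SCR 103` and can return false positives, so treat its output as a checklist rather than a validator.

```python
import re

# Assumed shape of a neutral citation: year, uppercase court code, ordinal (e.g. "2019 SCC 65").
NEUTRAL_CITATION = re.compile(r"\b(?:19|20)\d{2} [A-Z]{2,8} \d+\b")


def find_neutral_citations(text: str) -> list[str]:
    """Return citation-like strings in order of first appearance, without duplicates."""
    seen: dict[str, None] = {}
    for match in NEUTRAL_CITATION.finditer(text):
        seen.setdefault(match.group(0), None)
    return list(seen)


draft = (
    "The standard of review is set out in Canada (Minister of Citizenship and "
    "Immigration) v Vavilov, 2019 SCC 65, which revisited Dunsmuir v New Brunswick, "
    "2008 SCC 9. See again 2019 SCC 65 at para 10."
)

# Each printed entry is something a human should look up before relying on the draft.
for citation in find_neutral_citations(draft):
    print("verify:", citation)
```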
## How to use
flash-1-mini uses the `Qwen3_5ForConditionalGeneration` architecture, which is **native to Transformers ≥ 5.5** — no `trust_remote_code` is required. Install a recent Transformers:
```bash
pip install "transformers>=5.5"
```
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "simpledirect/flash-1-mini"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# Text
messages = [{"role": "user", "content": [
    {"type": "text", "text": "What does section 1 of the Canadian Charter of Rights and Freedoms do?"}
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
For image input, include `{"type": "image"}` in the message content and pass `images=[img]` to the processor.
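As a rough sketch of the image path just described, the helpers below build a message with an image placeholder and pass the image through `images=[img]`. The names `build_image_messages` and `answer_about_image` are illustrative and not part of the model or Transformers. `model` and `processor` are assumed to be loaded as in the text example, and `img` is assumed to be a `PIL.Image` that you opened yourself.

```python
def build_image_messages(question: str) -> list[dict]:
    """One user turn: an image placeholder followed by the text question."""
    return [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]


def answer_about_image(model, processor, img, question: str, max_new_tokens: int = 256) -> str:
    """Greedy generation over one image plus a question (assumes a loaded model and processor)."""
    messages = build_image_messages(question)
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    # The image is supplied separately; the template only carries the placeholder.
    inputs = processor(text=[prompt], images=[img], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)


# Building the messages needs no model, so the structure can be inspected directly.
print(build_image_messages("Summarize the obligations in this scanned notice."))
```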
### Thinking mode
Like its base model, flash-1-mini **thinks by default** — it emits a `<think>...</think>` reasoning block before the final answer. For many legal drafting tasks you will want the direct answer only. Disable thinking by passing `enable_thinking=False` through the chat template:
```python
prompt = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False,  # direct kwarg; emits an empty <think></think> block so the model answers directly
)
```
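If thinking stays enabled, downstream code usually wants only the final answer. Here is a minimal sketch of separating the two, assuming the decoded text contains at most one leading `<think>...</think>` block. `split_thinking` is a hypothetical helper, not part of the model's tooling. Depending on the chat template, the opening `<think>` tag may sit in the prompt rather than in the generated text, so the helper also accepts output that contains only the closing tag.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split decoded output into (reasoning, answer)."""
    close_tag = "</think>"
    if close_tag not in text:
        # No reasoning block at all: everything is the answer.
        return "", text.strip()
    reasoning, answer = text.split(close_tag, 1)
    # Drop the opening tag if the model emitted it.
    reasoning = reasoning.replace("<think>", "", 1)
    return reasoning.strip(), answer.strip()


raw = (
    "<think>\nSection 1 is the limitations clause.\n</think>\n\n"
    "Section 1 allows reasonable limits on Charter rights."
)
reasoning, answer = split_thinking(raw)
print(answer)
```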
When serving via vLLM, pass `--reasoning-parser qwen3` so the reasoning block is returned separately from the final answer. To disable thinking for a single request, include `"chat_template_kwargs": {"enable_thinking": false}` in the JSON request body; keep thinking on for complex reasoning where it helps.
### Serving
Serve the model with **vLLM** for production text and multimodal inference; the serving environment needs Transformers ≥ 5.5. Greedy decoding (temperature 0) is recommended for legal tasks where determinism matters.
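A request against a vLLM deployment might look like the sketch below. It assumes vLLM's OpenAI-compatible chat endpoint is running locally at `http://localhost:8000/v1/chat/completions` and that the model is served under the name `simpledirect/flash-1-mini`; both are assumptions to adjust for your setup. `build_chat_request` and `send_chat_request` are illustrative helpers, and only the payload is built here, so nothing is sent unless you call `send_chat_request` yourself.

```python
import json
import urllib.request


def build_chat_request(question: str, enable_thinking: bool = False) -> dict:
    """Request body for an OpenAI-compatible chat endpoint, with greedy decoding."""
    return {
        "model": "simpledirect/flash-1-mini",  # assumed served model name
        "messages": [{"role": "user", "content": question}],
        "temperature": 0,  # greedy decoding for determinism
        "max_tokens": 256,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(payload: dict, url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the payload to an assumed local endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request("What is the legal test under section 1 of the Canadian Charter?")
print(json.dumps(payload, indent=2))
```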
### Quantized GGUF variants (text-only)
GGUF quantizations for CPU / edge inference via `llama.cpp` and Ollama are available in the [`gguf/`](./tree/main/gguf) folder of this repository:
| File | Quant | Size | Notes |
|---|---|---|---|
| `gguf/flash-1-mini-20260602-Q6_K.gguf` | Q6_K | 3.3 GB | Highest fidelity; closest to bf16 |
| `gguf/flash-1-mini-20260602-Q5_K_M.gguf` | Q5_K_M | 2.9 GB | Balanced quality / size |
| `gguf/flash-1-mini-20260602-Q4_K_M.gguf` | Q4_K_M | 2.6 GB | Smallest; quality holds on common tasks |
**Important — these GGUFs are text-only.** The vision tower is not carried in the GGUF format, so image input is **not** supported by the GGUF variants. For multimodal (image) inference, use the bf16 safetensors weights above. Quality scales with bit-depth: Q6_K tracks the bf16 model most closely; lower bit-depths trade some fidelity for size, and on the most demanding legal-citation tasks the higher-bit quants are recommended.
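The size and fidelity trade-off can be reduced to a simple rule of thumb: take the highest-bit file that fits your budget. The sketch below uses the file sizes from the table. `pick_quant` is a hypothetical helper, and it compares the budget against file size only, so it ignores KV-cache and runtime overhead.

```python
from typing import Optional

# (file name, quant, size in GB) from the table, ordered from highest to lowest fidelity.
QUANTS = [
    ("flash-1-mini-20260602-Q6_K.gguf", "Q6_K", 3.3),
    ("flash-1-mini-20260602-Q5_K_M.gguf", "Q5_K_M", 2.9),
    ("flash-1-mini-20260602-Q4_K_M.gguf", "Q4_K_M", 2.6),
]


def pick_quant(budget_gb: float) -> Optional[str]:
    """Return the highest-fidelity GGUF whose file size fits the budget, or None."""
    for file_name, _quant, size_gb in QUANTS:
        if size_gb <= budget_gb:
            return file_name
    return None


print(pick_quant(3.0))
```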
```bash
# llama.cpp
./llama-completion -m flash-1-mini-20260602-Q5_K_M.gguf \
  -p "What is the legal test under section 1 of the Canadian Charter?" -n 200 --temp 0

# Ollama (create a Modelfile pointing at the GGUF, then run)
printf 'FROM ./flash-1-mini-20260602-Q5_K_M.gguf\n' > Modelfile
ollama create flash-1-mini -f Modelfile
ollama run flash-1-mini
```
GGUF inference of this architecture (Qwen3.5 hybrid linear-attention / Gated DeltaNet) requires a recent `llama.cpp` build with support for these layers. The multi-token-prediction (MTP) head is excluded from the GGUF (not used at inference). To run the bf16 weights in lower precision instead, load them with `bitsandbytes` 4-bit/8-bit via `BitsAndBytesConfig`.
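A minimal sketch of the `bitsandbytes` route, assuming `bitsandbytes` is installed and a CUDA GPU is available. The NF4 settings are common defaults chosen for illustration, not values published for flash-1-mini, and `load_4bit` is a hypothetical helper. This card reports no quality figures for `bitsandbytes`-quantized weights, so evaluate citation tasks before relying on this configuration.

```python
MODEL_ID = "simpledirect/flash-1-mini"


def load_4bit(model_id: str = MODEL_ID):
    """Load the bf16 checkpoint with 4-bit NF4 weights (illustrative settings)."""
    # Imported lazily so defining this helper does not require a GPU stack.
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # assumed choice; "fp4" is the alternative
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute in the model's native precision
    )
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, quantization_config=quant_config, device_map="auto"
    )
    return model, processor
```

For 8-bit loading, the same pattern applies with `BitsAndBytesConfig(load_in_8bit=True)`.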
## Benchmarks
All figures compare flash-1-mini with the Qwen3.5-4B base under identical conditions (same prompts, few-shot counts, scoring, and greedy decoding). See the SimpleDirect benchmarking methodology and CBLRE eval-set documentation for the full protocol.
| Capability | Base | flash-1-mini |
|---|---|---|
| Legal citation integrity (CBLRE) | 15.8% | **42.1%** |
| Instruction-following (IFEval, prompt-strict) | 30.3% | **53.2%** |
| English legal — international law (MMLU) | 70.3% | **76.0%** |
| English legal — jurisprudence (MMLU) | 79.6% | **81.5%** |
| Complex reasoning (BBH) | 68.6% | **79.0%** |
| General knowledge (MMLU) | 69.8% | 69.8% |
| Privacy-compliance bilingual parity (FR/EN) | — | **1.00** |
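The headline figures follow directly from this table. As a quick arithmetic check, the sketch below restates the rows that have both a base and a fine-tuned score and derives the point gains and ratios; it adds no new measurements.

```python
# (base %, flash-1-mini %) copied from the table.
SCORES = {
    "CBLRE citation integrity": (15.8, 42.1),
    "IFEval prompt-strict": (30.3, 53.2),
    "MMLU international law": (70.3, 76.0),
    "MMLU jurisprudence": (79.6, 81.5),
    "BBH": (68.6, 79.0),
    "MMLU": (69.8, 69.8),
}


def delta_points(base: float, tuned: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(tuned - base, 1)


def ratio(base: float, tuned: float) -> float:
    """Fine-tuned score as a multiple of the base score, rounded to one decimal."""
    return round(tuned / base, 1)


for name, (base, tuned) in SCORES.items():
    print(f"{name}: {delta_points(base, tuned):+.1f} points, {ratio(base, tuned):.1f}x base")
```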
### Where it is weaker
Specialization carried measurable costs, reported here in full:
- **Retrieval (RAG):** source-attribution accuracy regressed (80.5% → 75.5% on a leak-proof held-out set). flash-1-mini is not a retrieval/RAG leader.
- **Function-calling (BFCL v4):** overall regressed (37.7% → 28.6%), with multi-turn the weakest sub-category.
- **French professional-law MCQ (Global-MMLU FR):** regressed (49.0% → 44.6%).
- **CBLRE Quebec civil law:** regressed (95.0% → 90.0%).
If your workload is primarily retrieval-grounded QA or tool/function-calling orchestration, evaluate carefully against these numbers.
## Training
flash-1-mini is a supervised fine-tune of Qwen3.5-4B using parameter-efficient adapters (LoRA with DoRA, rank 32 / alpha 64, RS-LoRA), with the vision tower frozen, on a bilingual Canadian legal corpus weighted toward citation production and Quebec civil-law content. The trained adapter was merged into the base weights and the checkpoint canonicalized for serving. The architecture is unchanged from the base.
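To make the adapter hyperparameters concrete: standard LoRA scales the low-rank update by `alpha / r`, while rank-stabilized LoRA (RS-LoRA) scales it by `alpha / sqrt(r)`. The sketch below only works out those two factors for the stated rank and alpha. It is not the training code, and the remaining settings (target modules, learning rate, data mix) are not given in this card.

```python
import math

RANK = 32   # LoRA rank stated in this section
ALPHA = 64  # LoRA alpha stated in this section


def lora_scaling(alpha: float, rank: int, rank_stabilized: bool) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    if rank_stabilized:
        return alpha / math.sqrt(rank)  # RS-LoRA
    return alpha / rank                 # standard LoRA


print(f"standard LoRA scaling: {lora_scaling(ALPHA, RANK, False):.2f}")
print(f"RS-LoRA scaling:       {lora_scaling(ALPHA, RANK, True):.2f}")
```

In the PEFT library these options would correspond to `LoraConfig(r=32, lora_alpha=64, use_dora=True, use_rslora=True)`, though whether PEFT was the library used here is an assumption.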
## Limitations and responsible use
- **Not legal advice.** flash-1-mini produces information to assist qualified professionals; it does not practice law and its outputs are not a substitute for a lawyer.
- **Verify citations.** Citation accuracy is materially improved over the base but is not perfect; verify against primary sources.
- **Bilingual, but uneven in French.** Parity is strong on the tested privacy-compliance track, but French professional-law MCQ regressed (49.0% → 44.6%); do not assume French performance matches English on every task.
- **Hallucination.** Like all LLMs, it can produce confident, incorrect output.
- **Quebec register.** The model is evaluated for legal correctness, not certified for Quebec-French dialectal register.
## License and attribution
flash-1-mini is released under the **Apache License 2.0**. It is a modified derivative work of **Qwen3.5-4B** (© Alibaba Cloud / Qwen Team, Apache-2.0). See the `LICENSE` and `NOTICE` files in this repository for the full license text and the required attribution and modification statement.
## Citation
```bibtex
@misc{simpledirect2026flash1mini,
title = {flash-1-mini: A Bilingual Canadian Legal Language Model},
author = {{Alpine Pacific Trading Inc. (operating as SimpleDirect)}},
year = {2026},
note = {Version flash-1-mini-20260602. Derivative of Qwen3.5-4B (Apache-2.0).},
howpublished = {\url{https://huggingface.co/simpledirect/flash-1-mini}}
}
```