Instructions to use simpledirect/flash-1-mini with libraries, inference providers, notebooks, and local apps.

Libraries

How to use simpledirect/flash-1-mini with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="simpledirect/flash-1-mini")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("simpledirect/flash-1-mini")
model = AutoModelForMultimodalLM.from_pretrained("simpledirect/flash-1-mini")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use simpledirect/flash-1-mini with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="simpledirect/flash-1-mini",
	filename="gguf/flash-1-mini-20260602-Q4_K_M.gguf",
)

# The GGUF files are text-only, so send a plain text prompt (no image input).
response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the legal test under section 1 of the Canadian Charter?"
		}
	]
)
# Print only the assistant's reply text.
print(response["choices"][0]["message"]["content"])

Local Apps

llama.cpp

How to use simpledirect/flash-1-mini with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf simpledirect/flash-1-mini:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf simpledirect/flash-1-mini:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf simpledirect/flash-1-mini:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf simpledirect/flash-1-mini:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf simpledirect/flash-1-mini:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf simpledirect/flash-1-mini:Q4_K_M

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf simpledirect/flash-1-mini:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf simpledirect/flash-1-mini:Q4_K_M

Use Docker

docker model run hf.co/simpledirect/flash-1-mini:Q4_K_M

vLLM

How to use simpledirect/flash-1-mini with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "simpledirect/flash-1-mini"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "simpledirect/flash-1-mini",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/simpledirect/flash-1-mini:Q4_K_M

SGLang

How to use simpledirect/flash-1-mini with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "simpledirect/flash-1-mini" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "simpledirect/flash-1-mini",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "simpledirect/flash-1-mini" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "simpledirect/flash-1-mini",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama

How to use simpledirect/flash-1-mini with Ollama:

```
ollama run hf.co/simpledirect/flash-1-mini:Q4_K_M
```

Unsloth Studio

How to use simpledirect/flash-1-mini with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for simpledirect/flash-1-mini to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for simpledirect/flash-1-mini to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for simpledirect/flash-1-mini to start chatting

Pi

How to use simpledirect/flash-1-mini with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf simpledirect/flash-1-mini:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "simpledirect/flash-1-mini:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use simpledirect/flash-1-mini with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf simpledirect/flash-1-mini:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default simpledirect/flash-1-mini:Q4_K_M

Run Hermes

hermes

Docker Model Runner

How to use simpledirect/flash-1-mini with Docker Model Runner:

```
docker model run hf.co/simpledirect/flash-1-mini:Q4_K_M
```

Lemonade

How to use simpledirect/flash-1-mini with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull simpledirect/flash-1-mini:Q4_K_M

Run and chat with the model

lemonade run user.flash-1-mini-Q4_K_M

List all available models

lemonade list

---
license: apache-2.0
license_name: apache-2.0
language:
- en
- fr
base_model:
- Qwen/Qwen3.5-4B
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- legal
- canadian-law
- bilingual
- french
- quebec-civil-law
- citation
- instruction-following
- vision-language
---

# flash-1-mini

**A compact, bilingual, vision-capable model specialized for Canadian legal and regulatory work — in English and Canadian French.**

flash-1-mini is a 4-billion-parameter model fine-tuned from Qwen3.5-4B for Canadian legal tasks. It is built for the parts of legal work that have to be right: producing correctly-formatted legal citations and following detailed instructions, across both of Canada's official languages and both of its legal traditions (common law and Quebec civil law). It retains the full general-reasoning and vision capability of its base model.

- **Version:** `flash-1-mini-20260602`
- **Developed by:** Alpine Pacific Trading Inc. (operating as SimpleDirect®)
- **Base model:** Qwen3.5-4B (Apache-2.0)
- **License:** Apache-2.0
- **Languages:** English, Canadian French
- **Modalities:** Text + image input → text output
- **Code & examples:** [github.com/getsimpledirect/flash-1-mini](https://github.com/getsimpledirect/flash-1-mini)

| Spec | Value |
|---|---|
| Parameters | 4.54B |
| Architecture | Qwen3_5ForConditionalGeneration (hybrid linear-attention + full-attention) |
| Hidden size / layers / heads | 2560 / 32 / 16 |
| Vocab | 248,320 |
| Context length | 262,144 |
| Precision | bfloat16 |
| Tied embeddings | Yes |

## Highlights

Measured against its base model under identical conditions (same prompts, same scoring):

- **2.7× more reliable legal citations** — citation-integrity accuracy 42.1% vs 15.8% on the CBLRE benchmark.
- **+22.9 points on instruction-following** — IFEval prompt-strict 53.2% vs 30.3%.
- **Balanced bilingual competence** — privacy-compliance parity ratio of 1.00 (English 90.9% / French 90.9%).
- **Stronger English legal reasoning** — MMLU international law 76.0% vs 70.3%.
- **No loss of general capability** — MMLU unchanged (~69.8%); complex multi-step reasoning improves (BBH 79.0% vs 68.6%).
- **Vision-capable** — reads and reasons over images and documents, inherited from the base.
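
The headline multipliers follow directly from the reported scores. As a quick arithmetic check, the sketch below recomputes them from the figures quoted in the list above; the helper names are illustrative only.

```python
# Scores quoted in the highlights, in percent: (base, flash-1-mini).
citation = (15.8, 42.1)      # CBLRE citation integrity
ifeval = (30.3, 53.2)        # IFEval prompt-strict
parity_en_fr = (90.9, 90.9)  # privacy-compliance: (English, French)


def ratio(base: float, tuned: float) -> float:
    """Relative improvement expressed as a multiplier."""
    return tuned / base


def point_gain(base: float, tuned: float) -> float:
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(tuned - base, 1)


citation_ratio = round(ratio(*citation), 1)               # 42.1 / 15.8
ifeval_gain = point_gain(*ifeval)                         # 53.2 - 30.3
parity = round(parity_en_fr[1] / parity_en_fr[0], 2)      # French / English

print(citation_ratio, ifeval_gain, parity)
```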

## Intended use

flash-1-mini is intended as a drafting and research assistant for Canadian legal and regulatory workflows, in English and French, where citation correctness and faithful instruction-following matter. It is suitable for legal-tech builders, compliance teams, and Canadian regulated-industry operators.

It is designed to **assist** legal professionals, not to replace their judgment. Outputs — especially citations — should be verified against primary sources before reliance.

## How to use

flash-1-mini uses the `Qwen3_5ForConditionalGeneration` architecture, which is **native to Transformers ≥ 5.5**, so no `trust_remote_code` is required. Install a recent version of Transformers:

```bash
pip install "transformers>=5.5"
```

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "simpledirect/flash-1-mini"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# Text
messages = [{"role": "user", "content": [
    {"type": "text", "text": "What does section 1 of the Canadian Charter of Rights and Freedoms do?"}
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

For image input, include `{"type": "image"}` in the message content and pass `images=[img]` to the processor.
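
A minimal sketch of the image path described above, assuming `model` and `processor` are the objects loaded in the previous example and `img` is a PIL image. The helper names (`build_image_messages`, `answer_about_image`) and the file name in the usage comment are hypothetical.

```python
from typing import Any, Dict, List


def build_image_messages(question: str) -> List[Dict[str, Any]]:
    """One user turn: an image placeholder followed by the text question."""
    return [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }]


def answer_about_image(model, processor, img, question: str,
                       max_new_tokens: int = 256) -> str:
    """Run a single image + text turn with greedy decoding."""
    messages = build_image_messages(question)
    prompt = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )
    # The image itself is passed separately from the chat-template text.
    inputs = processor(text=[prompt], images=[img], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]  # drop the prompt tokens
    return processor.decode(new_tokens, skip_special_tokens=True)


# Usage (hypothetical file name):
# from PIL import Image
# img = Image.open("scanned_notice.png")
# print(answer_about_image(model, processor, img, "Summarize this notice."))
```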

### Thinking mode

Like its base model, flash-1-mini **thinks by default** — it emits a `<think>...</think>` reasoning block before the final answer. For many legal drafting tasks you will want the direct answer only. Disable thinking by passing `enable_thinking=False` through the chat template:

```python
prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False,
    enable_thinking=False,   # direct kwarg; emits an empty <think></think> block so the model answers directly
)
```
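
If thinking is left on, the reasoning block usually needs to be separated from the final answer before display. The helper below is a sketch under one assumption: the decoded text still contains the literal `</think>` tag (whether the tags survive decoding depends on the tokenizer and chat template). `split_thinking` is a hypothetical name.

```python
from typing import Tuple


def split_thinking(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer).

    Handles both "<think>...</think>answer" and the case where the opening
    tag was part of the prompt, so only "...</think>answer" is generated.
    If no closing tag is found, the whole text is treated as the answer.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_thinking(
    "<think>\nRecall the Oakes test.\n</think>\n\nSection 1 permits reasonable limits."
)
print(answer)
```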

When serving via vLLM, pass `--reasoning-parser qwen3`. To disable thinking for a single request, set `"chat_template_kwargs": {"enable_thinking": false}` in the JSON request body; leave thinking on for complex reasoning where it helps.
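
As an illustration of the per-request switch, the sketch below builds a chat-completions payload with thinking disabled and greedy decoding. It assumes a vLLM server on the default `http://localhost:8000` endpoint; `build_chat_request` and `post_chat` are hypothetical helper names, and `post_chat` is defined but not called here.

```python
import json
import urllib.request


def build_chat_request(question: str, enable_thinking: bool = False,
                       max_tokens: int = 256) -> dict:
    """OpenAI-compatible request body with the thinking switch set per request."""
    return {
        "model": "simpledirect/flash-1-mini",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0,  # greedy decoding
        "max_tokens": max_tokens,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def post_chat(payload: dict,
              url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """Send the payload to a running server (URL is an assumption)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_chat_request(
    "What does section 1 of the Canadian Charter of Rights and Freedoms do?"
)
print(json.dumps(payload, indent=2))
```

With a server running, `post_chat(payload)["choices"][0]["message"]["content"]` would hold the reply text.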

### Serving

Serve the model with **vLLM** for production text and multimodal inference (with Transformers ≥ 5.5). Greedy decoding (temperature 0) is recommended for legal tasks where determinism matters.

### Quantized GGUF variants (text-only)

GGUF quantizations for CPU / edge inference via `llama.cpp` and Ollama are available in the [`gguf/`](./tree/main/gguf) folder of this repository:

| File | Quant | Size | Notes |
|---|---|---|---|
| `gguf/flash-1-mini-20260602-Q6_K.gguf` | Q6_K | 3.3 GB | Highest fidelity; closest to bf16 |
| `gguf/flash-1-mini-20260602-Q5_K_M.gguf` | Q5_K_M | 2.9 GB | Balanced quality / size |
| `gguf/flash-1-mini-20260602-Q4_K_M.gguf` | Q4_K_M | 2.6 GB | Smallest; quality holds on common tasks |

**Important — these GGUFs are text-only.** The vision tower is not carried in the GGUF format, so image input is **not** supported by the GGUF variants. For multimodal (image) inference, use the bf16 safetensors weights above. Quality scales with bit-depth: Q6_K tracks the bf16 model most closely; lower bit-depths trade some fidelity for size, and on the most demanding legal-citation tasks the higher-bit quants are recommended.
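
One way to apply the "higher bit-depth when it fits" guidance is to pick the highest-fidelity file under a size budget. This is only a sketch: the sizes are copied from the table above, `pick_gguf` is a hypothetical helper, and file size is a rough proxy, since actual memory use also depends on context length and runtime overhead.

```python
from typing import Optional

# (file, quant, size in GB) from the table above, highest fidelity first.
GGUF_VARIANTS = [
    ("gguf/flash-1-mini-20260602-Q6_K.gguf", "Q6_K", 3.3),
    ("gguf/flash-1-mini-20260602-Q5_K_M.gguf", "Q5_K_M", 2.9),
    ("gguf/flash-1-mini-20260602-Q4_K_M.gguf", "Q4_K_M", 2.6),
]


def pick_gguf(max_file_gb: float) -> Optional[str]:
    """Return the highest-fidelity GGUF whose file size fits the budget, else None."""
    for filename, _quant, size_gb in GGUF_VARIANTS:
        if size_gb <= max_file_gb:
            return filename
    return None


print(pick_gguf(3.0))
```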

```bash
# llama.cpp
./llama-completion -m flash-1-mini-20260602-Q5_K_M.gguf \
  -p "What is the legal test under section 1 of the Canadian Charter?" -n 200 --temp 0

# Ollama (create a Modelfile pointing at the GGUF, then run)
printf 'FROM ./flash-1-mini-20260602-Q5_K_M.gguf\n' > Modelfile
ollama create flash-1-mini -f Modelfile
ollama run flash-1-mini
```

GGUF inference of this architecture (Qwen3.5 hybrid linear-attention / Gated DeltaNet) requires a recent `llama.cpp` build with support for these layers. The multi-token-prediction (MTP) head is excluded from the GGUF (not used at inference). To run the bf16 weights in lower precision instead, load them with `bitsandbytes` 4-bit/8-bit via `BitsAndBytesConfig`.
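
A sketch of the `bitsandbytes` route mentioned above. It assumes `bitsandbytes` is installed and a supported GPU is available; the NF4 and double-quantization settings are common defaults chosen for illustration, not values published for this model, and `bnb_kwargs` / `load_quantized` are hypothetical helper names.

```python
def bnb_kwargs(bits: int) -> dict:
    """Keyword arguments for BitsAndBytesConfig (illustrative defaults)."""
    if bits == 4:
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",       # assumption: NF4 quantization
            "bnb_4bit_use_double_quant": True,  # assumption: nested quantization
        }
    if bits == 8:
        return {"load_in_8bit": True}
    raise ValueError("bits must be 4 or 8")


def load_quantized(model_id: str = "simpledirect/flash-1-mini", bits: int = 4):
    """Load the safetensors weights in 4-bit or 8-bit precision."""
    # Heavy imports are kept local so the helper above stays dependency-free.
    import torch
    from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

    kwargs = bnb_kwargs(bits)
    if bits == 4:
        kwargs["bnb_4bit_compute_dtype"] = torch.bfloat16
    config = BitsAndBytesConfig(**kwargs)
    return AutoModelForImageTextToText.from_pretrained(
        model_id, quantization_config=config, device_map="auto"
    )


# Usage: model = load_quantized(bits=4)
```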

## Benchmarks

All figures are flash-1-mini vs the Qwen3.5-4B base under identical conditions (same prompts, few-shot counts, scoring, greedy decoding). See the SimpleDirect benchmarking methodology and CBLRE eval-set documentation for the full protocol.

| Capability | Base | flash-1-mini |
|---|---|---|
| Legal citation integrity (CBLRE) | 15.8% | **42.1%** |
| Instruction-following (IFEval, prompt-strict) | 30.3% | **53.2%** |
| English legal — international law (MMLU) | 70.3% | **76.0%** |
| English legal — jurisprudence (MMLU) | 79.6% | **81.5%** |
| Complex reasoning (BBH) | 68.6% | **79.0%** |
| General knowledge (MMLU) | 69.8% | 69.8% |
| Privacy-compliance bilingual parity (FR/EN) | — | **1.00** |

### Where it is weaker

Specialization carried measurable costs, reported here in full:

- **Retrieval (RAG):** source-attribution accuracy regressed (80.5% → 75.5% on a leak-proof held-out set). flash-1-mini is not a retrieval/RAG leader.
- **Function-calling (BFCL v4):** overall regressed (37.7% → 28.6%), with multi-turn the weakest sub-category.
- **French professional-law MCQ (Global-MMLU FR):** regressed (49.0% → 44.6%).
- **CBLRE Quebec civil law:** regressed (95.0% → 90.0%).

If your workload is primarily retrieval-grounded QA or tool/function-calling orchestration, evaluate carefully against these numbers.
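
To make the size of each regression easier to compare, the sketch below computes the point deltas from the figures listed above and prints them from largest drop to smallest. The dictionary keys are shortened labels chosen for this example.

```python
# (base, flash-1-mini) in percent, copied from the list above.
REGRESSIONS = {
    "RAG source attribution": (80.5, 75.5),
    "Function-calling (BFCL v4)": (37.7, 28.6),
    "French professional-law MCQ (Global-MMLU FR)": (49.0, 44.6),
    "CBLRE Quebec civil law": (95.0, 90.0),
}


def deltas(scores: dict) -> dict:
    """Change in percentage points (negative means a regression)."""
    return {name: round(tuned - base, 1) for name, (base, tuned) in scores.items()}


point_deltas = deltas(REGRESSIONS)
for name, delta in sorted(point_deltas.items(), key=lambda kv: kv[1]):
    print(f"{delta:+.1f} pts  {name}")
```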

## Training

flash-1-mini is a supervised fine-tune of Qwen3.5-4B using parameter-efficient adapters (LoRA with DoRA, rank 32 / alpha 64, RS-LoRA), with the vision tower frozen, on a bilingual Canadian legal corpus weighted toward citation production and Quebec civil-law content. The trained adapter was merged into the base weights and the checkpoint canonicalized for serving. The architecture is unchanged from the base.
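
For readers unfamiliar with RS-LoRA, the stated rank and alpha imply a different adapter scaling than classic LoRA: `alpha / sqrt(r)` instead of `alpha / r`. The sketch below works that out and shows how the stated hyperparameters might map onto a `peft` `LoraConfig`. This is an assumption about tooling, since the training library is not named here; target modules and all other settings are not stated and are left unset.

```python
import math

RANK, ALPHA = 32, 64  # values stated in the training description


def lora_scale(alpha: float, rank: int, rslora: bool) -> float:
    """Scaling applied to the low-rank update.

    Classic LoRA uses alpha / r; rank-stabilized LoRA uses alpha / sqrt(r).
    """
    return alpha / math.sqrt(rank) if rslora else alpha / rank


def make_lora_config():
    """Hypothetical peft configuration matching the stated hyperparameters."""
    from peft import LoraConfig  # imported lazily; requires `pip install peft`

    return LoraConfig(
        r=RANK,
        lora_alpha=ALPHA,
        use_dora=True,    # DoRA weight decomposition
        use_rslora=True,  # rank-stabilized scaling
        # target_modules is not stated in the description, so it is left unset.
    )


classic = lora_scale(ALPHA, RANK, rslora=False)
stabilized = lora_scale(ALPHA, RANK, rslora=True)
print(f"classic: {classic:.2f}, rank-stabilized: {stabilized:.2f}")
```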

## Limitations and responsible use

- **Not legal advice.** flash-1-mini produces information to assist qualified professionals; it does not practice law and its outputs are not a substitute for a lawyer.
- **Verify citations.** Citation accuracy is materially improved over the base but is not perfect; verify against primary sources.
- **Bilingual, not omniscient in French.** Parity is strong on the tested tracks, but French professional-law MCQ regressed; do not assume uniformly strong French performance.
- **Hallucination.** Like all LLMs, it can produce confident, incorrect output.
- **Quebec register.** The model is evaluated for legal correctness, not certified for Quebec-French dialectal register.

## License and attribution

flash-1-mini is released under the **Apache License 2.0**. It is a modified derivative work of **Qwen3.5-4B** (© Alibaba Cloud / Qwen Team, Apache-2.0). See the `LICENSE` and `NOTICE` files in this repository for the full license text and the required attribution and modification statement.

## Citation

```bibtex
@misc{simpledirect2026flash1mini,
  title  = {flash-1-mini: A Bilingual Canadian Legal Language Model},
  author = {{Alpine Pacific Trading Inc. (operating as SimpleDirect)}},
  year   = {2026},
  note   = {Version flash-1-mini-20260602. Derivative of Qwen3.5-4B (Apache-2.0).},
  howpublished = {\url{https://huggingface.co/simpledirect/flash-1-mini}}
}
```