---
license: apache-2.0
license_name: apache-2.0
language:
- en
- fr
base_model:
- Qwen/Qwen3.5-4B
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- legal
- canadian-law
- bilingual
- french
- quebec-civil-law
- citation
- instruction-following
- vision-language
---
# flash-1-mini

**A compact, bilingual, vision-capable model specialized for Canadian legal and regulatory work, in English and Canadian French.**

flash-1-mini is a 4-billion-parameter model fine-tuned from Qwen3.5-4B for Canadian legal tasks. It is built for the parts of legal work that have to be right: producing correctly formatted legal citations and following detailed instructions, across both of Canada's official languages and both of its legal traditions (common law and Quebec civil law). It keeps the base model's vision capability and its general-knowledge performance (MMLU is unchanged); the regressions that specialization did cause are listed under "Where it is weaker".

- **Version:** `flash-1-mini-20260602`
- **Developed by:** Alpine Pacific Trading Inc. (operating as SimpleDirect®)
- **Base model:** Qwen3.5-4B (Apache-2.0)
- **License:** Apache-2.0
- **Languages:** English, Canadian French
- **Modalities:** Text + image input → text output
- **Code & examples:** [github.com/getsimpledirect/flash-1-mini](https://github.com/getsimpledirect/flash-1-mini)

| Spec | Value |
|---|---|
| Parameters | 4.54B |
| Architecture | Qwen3_5ForConditionalGeneration (hybrid linear-attention + full-attention) |
| Hidden size / layers / heads | 2560 / 32 / 16 |
| Vocab | 248,320 |
| Context length (tokens) | 262,144 |
| Precision | bfloat16 |
| Tied embeddings | Yes |
## Highlights

Measured against its base model under identical conditions (same prompts, same scoring):

- **2.7× higher legal-citation accuracy**: citation-integrity accuracy 42.1% vs 15.8% on the CBLRE benchmark.
- **+22.9 points on instruction-following**: IFEval prompt-strict 53.2% vs 30.3%.
- **Balanced bilingual competence**: privacy-compliance parity ratio of 1.00 (English 90.9% / French 90.9%).
- **Stronger English legal reasoning**: MMLU international law 76.0% vs 70.3%.
- **No loss of general knowledge**: MMLU unchanged (~69.8%); complex multi-step reasoning improves (BBH 79.0% vs 68.6%).
- **Vision-capable**: reads and reasons over images and documents, inherited from the base.
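
The headline multipliers in the list above can be reproduced from the raw percentages. The short sketch below only restates that arithmetic; the helper names are illustrative, and the parity ratio is assumed to be French accuracy divided by English accuracy.

```python
# Percentages copied from the Highlights list (base vs flash-1-mini).
citation_base, citation_tuned = 15.8, 42.1
ifeval_base, ifeval_tuned = 30.3, 53.2
privacy_en, privacy_fr = 90.9, 90.9


def ratio(numerator, denominator):
    """Relative factor between two accuracies."""
    return numerator / denominator


def point_gain(tuned, base):
    """Absolute change in percentage points."""
    return tuned - base


print(f"citation accuracy factor: {ratio(citation_tuned, citation_base):.1f}x")  # 2.7x
print(f"IFEval gain: {point_gain(ifeval_tuned, ifeval_base):+.1f} points")      # +22.9 points
print(f"FR/EN parity ratio: {ratio(privacy_fr, privacy_en):.2f}")                # 1.00
```
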
## Intended use

flash-1-mini is intended as a drafting and research assistant for Canadian legal and regulatory workflows, in English and French, where citation correctness and faithful instruction-following matter. It is suitable for legal-tech builders, compliance teams, and Canadian regulated-industry operators.

It is designed to **assist** legal professionals, not to replace their judgment. Outputs, especially citations, should be verified against primary sources before reliance.
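
One way to make the verification step routine is to pull every citation out of a draft into a checklist for a human reviewer. The sketch below is an assumption-laden illustration, not part of the model or its tooling: the regular expression only recognizes simple neutral citations of the form `2019 SCC 65` (year, upper-case court code, decision number), the function name is hypothetical, and the sample draft is hand-written rather than model output.

```python
import re

# Simplified pattern for Canadian neutral citations such as "2019 SCC 65".
# Assumption: real citations (reporters, CanLII identifiers, pinpoints) are
# more varied than this pattern covers.
NEUTRAL_CITATION = re.compile(r"\b((?:19|20)\d{2})\s+([A-Z]{2,8})\s+(\d{1,5})\b")


def extract_neutral_citations(text):
    """Return unique neutral citations in order of first appearance."""
    seen, found = set(), []
    for match in NEUTRAL_CITATION.finditer(text):
        citation = " ".join(match.groups())
        if citation not in seen:
            seen.add(citation)
            found.append(citation)
    return found


draft = (
    "The reasonableness standard is set out in Canada (Minister of Citizenship "
    "and Immigration) v Vavilov, 2019 SCC 65, which revisited Dunsmuir v New "
    "Brunswick, 2008 SCC 9. See again Vavilov, 2019 SCC 65 at para 10."
)

for citation in extract_neutral_citations(draft):
    print(f"[ ] verify against primary source: {citation}")
```

The script only lists what needs checking; confirming that each citation exists and supports the stated proposition remains a manual step.
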
## How to use

flash-1-mini uses the `Qwen3_5ForConditionalGeneration` architecture, which is **native to Transformers ≥ 5.5**, so no `trust_remote_code` is required. Install a recent Transformers:

```bash
pip install "transformers>=5.5"
```

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "simpledirect/flash-1-mini"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# Text-only request
messages = [{"role": "user", "content": [
    {"type": "text", "text": "What does section 1 of the Canadian Charter of Rights and Freedoms do?"}
]}]

# Render the chat template to a prompt string, then tokenize it
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

For image input, include `{"type": "image"}` in the message content and pass `images=[img]` to the processor.
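
The image path described above might look like the following sketch. It reuses the loading and decoding calls from the text example; the helper names, the use of Pillow to open the file, and the example file name in the usage note are assumptions rather than anything the repository prescribes.

```python
def build_image_messages(question):
    """Chat messages with one image placeholder followed by the text question."""
    return [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]


def describe_image(image_path, question, model_id="simpledirect/flash-1-mini"):
    """Answer a question about a single image using the bf16 weights."""
    # Heavy imports are deferred so build_image_messages works without them.
    import torch
    from PIL import Image
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, dtype=torch.bfloat16, device_map="auto"
    )

    image = Image.open(image_path).convert("RGB")
    prompt = processor.apply_chat_template(
        build_image_messages(question), add_generation_prompt=True, tokenize=False
    )
    # The image list lines up with the {"type": "image"} placeholder in the prompt.
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

A call such as `describe_image("lease_page.png", "Summarize the termination clause on this page.")` would then return the decoded answer; the file name is hypothetical.
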
### Thinking mode

Like its base model, flash-1-mini **thinks by default**: it emits a `<think>...</think>` reasoning block before the final answer. For many legal drafting tasks you will want the direct answer only. Disable thinking by passing `enable_thinking=False` through the chat template:

```python
prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False,
    enable_thinking=False,  # direct kwarg; emits an empty <think></think> block so the model answers directly
)
```

When serving via vLLM, pass `--reasoning-parser qwen3`. To disable thinking for a single request, set `"chat_template_kwargs": {"enable_thinking": false}` in the JSON request body; keep thinking on for complex reasoning where it helps.
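
As a rough illustration of the per-request switch, the sketch below builds an OpenAI-compatible chat-completions body with `chat_template_kwargs` and strips any `<think>` block from raw text. The function names are hypothetical, `temperature: 0` follows the card's greedy-decoding recommendation, and the stripping helper is only needed when the raw output still contains the reasoning block (for example, when no reasoning parser is configured).

```python
import json
import re


def build_chat_request(question, enable_thinking=False, model="simpledirect/flash-1-mini"):
    """Request body for an OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0,  # greedy decoding for determinism
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


# Matches a complete <think>...</think> block, including an empty one.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text):
    """Remove reasoning blocks and return only the final answer text."""
    return THINK_BLOCK.sub("", text).strip()


payload = build_chat_request(
    "What does section 1 of the Canadian Charter of Rights and Freedoms do?"
)
print(json.dumps(payload, indent=2))
print(strip_thinking("<think>\n\n</think>\n\nDirect answer text."))  # Direct answer text.
```
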
### Serving

The model can be served with **vLLM** for production text and multimodal inference (Transformers ≥ 5.5). Greedy decoding (temperature 0) is recommended for legal tasks where determinism matters.

### Quantized GGUF variants (text-only)

GGUF quantizations for CPU / edge inference via `llama.cpp` and Ollama are available in the [`gguf/`](./tree/main/gguf) folder of this repository:

| File | Quant | Size | Notes |
|---|---|---|---|
| `gguf/flash-1-mini-20260602-Q6_K.gguf` | Q6_K | 3.3 GB | Highest fidelity; closest to bf16 |
| `gguf/flash-1-mini-20260602-Q5_K_M.gguf` | Q5_K_M | 2.9 GB | Balanced quality / size |
| `gguf/flash-1-mini-20260602-Q4_K_M.gguf` | Q4_K_M | 2.6 GB | Smallest; quality holds on common tasks |

**Important: these GGUFs are text-only.** The vision tower is not carried in the GGUF files, so image input is **not** supported by the GGUF variants. For multimodal (image) inference, use the bf16 safetensors weights above. Quality scales with bit-depth: Q6_K tracks the bf16 model most closely, and lower bit-depths trade some fidelity for size. For the most demanding legal-citation tasks, the higher-bit quants are recommended.
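
Choosing among the three files usually comes down to a size budget. The sketch below encodes the table above and returns the highest-fidelity file that fits; the function name and the budget values are illustrative, and file size is only a lower bound on memory use, since the runtime also needs room for the context cache.

```python
# (quant, path, size in GB) copied from the table, ordered highest fidelity first.
GGUF_VARIANTS = [
    ("Q6_K", "gguf/flash-1-mini-20260602-Q6_K.gguf", 3.3),
    ("Q5_K_M", "gguf/flash-1-mini-20260602-Q5_K_M.gguf", 2.9),
    ("Q4_K_M", "gguf/flash-1-mini-20260602-Q4_K_M.gguf", 2.6),
]


def pick_gguf(max_size_gb):
    """Return (quant, path) for the highest-fidelity file within the budget, or None."""
    for quant, path, size_gb in GGUF_VARIANTS:
        if size_gb <= max_size_gb:
            return quant, path
    return None


print(pick_gguf(3.0))  # ('Q5_K_M', 'gguf/flash-1-mini-20260602-Q5_K_M.gguf')
print(pick_gguf(2.0))  # None
```
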
```bash
# Both examples assume the GGUF file has been downloaded to the current directory.

# llama.cpp
./llama-completion -m flash-1-mini-20260602-Q5_K_M.gguf \
  -p "What is the legal test under section 1 of the Canadian Charter?" -n 200 --temp 0

# Ollama (create a Modelfile pointing at the GGUF, then run)
printf 'FROM ./flash-1-mini-20260602-Q5_K_M.gguf\n' > Modelfile
ollama create flash-1-mini -f Modelfile
ollama run flash-1-mini
```

GGUF inference of this architecture (Qwen3.5 hybrid linear-attention / Gated DeltaNet) requires a recent `llama.cpp` build with support for these layers. The multi-token-prediction (MTP) head is excluded from the GGUF (not used at inference). To run the bf16 weights in lower precision instead, load them with `bitsandbytes` 4-bit/8-bit via `BitsAndBytesConfig`.
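
A minimal sketch of the `bitsandbytes` route follows. The NF4 quantization type and bfloat16 compute dtype are assumptions (the card does not name specific settings), the helper names are hypothetical, and whether every layer of this hybrid architecture quantizes cleanly has not been checked here.

```python
def bnb_settings(bits):
    """Keyword arguments for BitsAndBytesConfig at 4-bit or 8-bit."""
    if bits == 4:
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",          # assumption: NF4 rather than FP4
            "bnb_4bit_compute_dtype": "bfloat16",  # matches the released precision
        }
    if bits == 8:
        return {"load_in_8bit": True}
    raise ValueError("bits must be 4 or 8")


def load_quantized(bits=4, model_id="simpledirect/flash-1-mini"):
    """Load the safetensors weights with on-the-fly bitsandbytes quantization."""
    # Deferred imports: needs transformers, bitsandbytes, and a device that
    # bitsandbytes supports.
    from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(**bnb_settings(bits))
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, quantization_config=quant_config, device_map="auto"
    )
    return processor, model
```

Unlike the GGUF files, this path keeps the vision tower, so image input should remain available; quality at 4-bit should be re-checked on citation-heavy tasks before relying on it.
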
## Benchmarks

All figures are flash-1-mini vs the Qwen3.5-4B base under identical conditions (same prompts, few-shot counts, scoring, greedy decoding). See the SimpleDirect benchmarking methodology and CBLRE eval-set documentation for full protocol.

| Capability | Base | flash-1-mini |
|---|---|---|
| Legal citation integrity (CBLRE) | 15.8% | **42.1%** |
| Instruction-following (IFEval, prompt-strict) | 30.3% | **53.2%** |
| English legal: international law (MMLU) | 70.3% | **76.0%** |
| English legal: jurisprudence (MMLU) | 79.6% | **81.5%** |
| Complex reasoning (BBH) | 68.6% | **79.0%** |
| General knowledge (MMLU) | 69.8% | 69.8% |
| Privacy-compliance bilingual parity (FR/EN) | not reported | **1.00** |
### Where it is weaker

Specialization carried measurable costs, reported here in full:

- **Retrieval (RAG):** source-attribution accuracy regressed (80.5% → 75.5% on a leak-proof held-out set). flash-1-mini is not a retrieval/RAG leader.
- **Function-calling (BFCL v4):** overall regressed (37.7% → 28.6%), with multi-turn the weakest sub-category.
- **French professional-law MCQ (Global-MMLU FR):** regressed (49.0% → 44.6%).
- **CBLRE Quebec civil law:** regressed (95.0% → 90.0%).

If your workload is primarily retrieval-grounded QA or tool/function-calling orchestration, evaluate carefully against these numbers.
## Training

flash-1-mini is a supervised fine-tune of Qwen3.5-4B using parameter-efficient adapters (LoRA with DoRA, rank 32 / alpha 64, RS-LoRA), with the vision tower frozen, on a bilingual Canadian legal corpus weighted toward citation production and Quebec civil-law content. The trained adapter was merged into the base weights and the checkpoint canonicalized for serving. The architecture is unchanged from the base.
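
To make the adapter hyperparameters concrete: rank-stabilized LoRA scales the low-rank update by alpha / sqrt(r) instead of alpha / r. The sketch below computes both factors for the stated rank and alpha, and shows how the same settings could be expressed with the `peft` library. Using `peft` is an assumption; the card does not name the training framework, target modules, or other hyperparameters, so none are filled in.

```python
import math

RANK, ALPHA = 32, 64  # rank and alpha as stated in the Training section


def lora_scaling(alpha, rank, rank_stabilized):
    """Scale on the low-rank update: alpha/r, or alpha/sqrt(r) for RS-LoRA."""
    return alpha / math.sqrt(rank) if rank_stabilized else alpha / rank


def make_lora_config():
    """Adapter settings expressed as a peft LoraConfig (assumes peft is installed)."""
    from peft import LoraConfig  # deferred import

    return LoraConfig(r=RANK, lora_alpha=ALPHA, use_dora=True, use_rslora=True)


print(f"standard LoRA scaling: {lora_scaling(ALPHA, RANK, False):.2f}")  # 2.00
print(f"RS-LoRA scaling:       {lora_scaling(ALPHA, RANK, True):.2f}")   # 11.31
```
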
## Limitations and responsible use

- **Not legal advice.** flash-1-mini produces information to assist qualified professionals; it does not practice law and its outputs are not a substitute for a lawyer.
- **Verify citations.** Citation accuracy is materially improved over the base but is not perfect; verify against primary sources.
- **Bilingual, not uniformly strong in French.** Parity is strong on the tested tracks, but French professional-law MCQ regressed; do not assume French performance matches English on every task.
- **Hallucination.** Like all LLMs, it can produce confident, incorrect output.
- **Quebec register.** The model is evaluated for legal correctness, not certified for Quebec-French dialectal register.
## License and attribution

flash-1-mini is released under the **Apache License 2.0**. It is a modified derivative work of **Qwen3.5-4B** (© Alibaba Cloud / Qwen Team, Apache-2.0). See the `LICENSE` and `NOTICE` files in this repository for the full license text and the required attribution and modification statement.

## Citation

```bibtex
@misc{simpledirect2026flash1mini,
  title        = {flash-1-mini: A Bilingual Canadian Legal Language Model},
  author       = {{Alpine Pacific Trading Inc. (operating as SimpleDirect)}},
  year         = {2026},
  note         = {Version flash-1-mini-20260602. Derivative of Qwen3.5-4B (Apache-2.0).},
  howpublished = {\url{https://huggingface.co/simpledirect/flash-1-mini}}
}
```