Quasar-Preview-mlx-4bit

A 4-bit (≈4.5 bits/weight) MLX conversion of silx-ai/Quasar-Preview, runnable on Apple Silicon with mlx-lm.

MLX support for this architecture is added by ml-explore/mlx-lm#1407. Until that PR is merged, install mlx-lm from the PR branch as shown under Usage.

Usage

# mlx-lm with the quasar_long model (until #1407 is merged):
pip install "mlx-lm @ git+https://github.com/SahilChachra/mlx-lm@add-quasar-long-model"

python -m mlx_lm.generate \
  --model sahilchachra/Quasar-Preview-mlx-4bit \
  --prompt "The capital of France is" \
  --max-tokens 60 --temp 0.0 --ignore-chat-template

Use --ignore-chat-template. This is a base (preview) checkpoint, not an instruction-tuned one, so applying the chat template produces degenerate output. Prompt it as a text-completion model.

Example output:

The capital of France is Paris. The city is located in the northeastern part of
France, along the banks of the Seine River. Paris is known for its rich history,
art, culture, and fashion. It is also a ...
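The same completion can also be run from Python. The sketch below is a minimal example, assuming the patched mlx-lm from the install step above exposes the usual `load` and `generate` helpers; the wrapper name `complete` is hypothetical. Passing the raw string and never calling the tokenizer's chat template is the Python counterpart of --ignore-chat-template. No sampler is passed, which in recent mlx-lm releases should fall back to greedy decoding (the equivalent of --temp 0.0); that default is worth checking against the installed version.

```python
def complete(prompt: str, max_tokens: int = 60) -> str:
    """Continue `prompt` as plain text, with no chat template applied.

    `complete` is a hypothetical helper name used only for this sketch.
    """
    # Imported lazily so the module can be imported on machines without MLX.
    from mlx_lm import load, generate

    # Downloads the weights from the Hub on first use.
    model, tokenizer = load("sahilchachra/Quasar-Preview-mlx-4bit")

    # The raw string goes straight to generate(); the chat template
    # is deliberately never applied to this base checkpoint.
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)


# Usage (needs Apple Silicon and the patched mlx-lm):
#   print(complete("The capital of France is"))
```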

Architecture

Quasar-Long is a hybrid linear-attention MoE model. Every layer runs standard GQA softmax attention (partial RoPE + NoPE-after-512, QK-norm). Layers 4–19 each additionally run one linear-attention branch, assigned per layer by hybrid_layerwise_cycle; the branch's gated output is added to the attention output. The MLP is a 256-expert DeepSeek-V3-style sparse MoE (sigmoid router, group top-k, shared expert + expert bias), except in layer 0, where the MLP is dense.
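To make "sigmoid router, group top-k, expert bias" concrete, here is a rough single-token sketch of routing as described in the published DeepSeek-V3 scheme. It is not taken from this model's code: the function name, the group count, and the top-k values are hypothetical, and details such as scoring a group by the sum of its two best experts are assumptions carried over from DeepSeek-V3.

```python
import math


def route_token(logits, expert_bias, n_groups, topk_groups, top_k):
    """Pick experts and mixing weights for one token (illustrative only)."""
    group_size = len(logits) // n_groups

    # Sigmoid router: every expert gets an independent score in (0, 1).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # The expert bias only influences WHICH experts are selected.
    biased = [s + b for s, b in zip(scores, expert_bias)]

    # Group top-k: score each group by the sum of its two best biased scores
    # (assumption, as in DeepSeek-V3), then keep the best `topk_groups` groups.
    group_scores = []
    for g in range(n_groups):
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))
    kept = sorted(range(n_groups), key=lambda g: group_scores[g], reverse=True)
    kept = kept[:topk_groups]

    # Only experts inside the kept groups compete for the final top-k.
    candidates = [e for g in kept
                  for e in range(g * group_size, (g + 1) * group_size)]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]

    # Mixing weights use the unbiased scores, renormalised over the chosen set.
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


# Toy example: 8 experts in 4 groups, keep 2 groups, pick 2 experts.
weights = route_token([0.0, 0.0, 3.0, 2.0, -1.0, -1.0, 1.0, 4.0],
                      [0.0] * 8, n_groups=4, topk_groups=2, top_k=2)
print(sorted(weights))  # [2, 7]
```

In a full layer the routed experts' outputs would be combined with these weights and the shared expert's output added for every token; that step is omitted here.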

Branch	Layers	Underlying op
GLA	8, 13, 18	gated linear attention (`fla.ops.simple_gla`)
Raven	5, 10, 15	gated slot attention (`fla.ops.gsa`), Mamba2 decay + top-k slot router
Quasar	4, 6, 7, 9, 11, 12, 14, 16, 17, 19	gated delta-rule (`fla.ops.quasar`)
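The layer assignment in the table is periodic: starting at layer 4, the pattern Quasar, Raven, Quasar, Quasar, GLA repeats every five layers. The sketch below encodes that observation and checks it against the table. The `CYCLE` tuple and the function name are inferred from the table for illustration; the actual value and format of hybrid_layerwise_cycle in the model config are not shown in this card.

```python
# Five-entry cycle inferred from the table (an assumption, not the config value).
CYCLE = ("Quasar", "Raven", "Quasar", "Quasar", "GLA")
FIRST_HYBRID, LAST_HYBRID = 4, 19  # layers that carry a linear-attention branch

# Layer lists copied from the table above.
TABLE = {
    "GLA": [8, 13, 18],
    "Raven": [5, 10, 15],
    "Quasar": [4, 6, 7, 9, 11, 12, 14, 16, 17, 19],
}


def branch_for_layer(layer: int):
    """Return the branch name for a layer, or None if it has no branch."""
    if not FIRST_HYBRID <= layer <= LAST_HYBRID:
        return None
    return CYCLE[(layer - FIRST_HYBRID) % len(CYCLE)]


# Rebuild the table from the cycle and confirm both agree.
derived = {name: [] for name in TABLE}
for layer in range(FIRST_HYBRID, LAST_HYBRID + 1):
    derived[branch_for_layer(layer)].append(layer)

assert derived == TABLE
print(branch_for_layer(13))  # GLA
```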

Conversion & verification

Converted with mlx_lm.convert -q --q-bits 4 --q-group-size 64. Three checks were made: the MLX port's GLA and Raven recurrences were validated against the reference PyTorch fla naive ops (to 1e-6 / 1e-7); all 580 checkpoint tensors map exactly; and the 4-bit model generates coherent text (see the example output under Usage).
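The ≈4.5 bits/weight figure quoted at the top follows from these settings. The arithmetic below assumes MLX's affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights; the function name is hypothetical.

```python
def bits_per_weight(q_bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Effective storage cost of one quantized weight, in bits.

    Assumes each group shares one scale and one bias of `meta_bits` bits each.
    """
    return q_bits + 2 * meta_bits / group_size


print(bits_per_weight(4, 64))  # 4.5
```

The figure is approximate for the model as a whole because any tensors left unquantized (stored as BF16) raise the average slightly.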

Credits & license

Base model: silx-ai/Quasar-Preview.
The Raven branch is goombalab/raven's RavenAttention (Gated Slot Attention).
License inherited from the base model (Apache-2.0).

Model size: 17B params (Safetensors). Tensor types: BF16, U32.
