How to use sahilchachra/Quasar-Preview-mlx-4bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/Quasar-Preview-mlx-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
How to use sahilchachra/Quasar-Preview-mlx-4bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "sahilchachra/Quasar-Preview-mlx-4bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server (port set explicitly so it matches the curl call below)
mlx_lm.server --model "sahilchachra/Quasar-Preview-mlx-4bit" --port 8000
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "sahilchachra/Quasar-Preview-mlx-4bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
Quasar-Preview-mlx-4bit
A 4-bit (≈4.5 bits/weight) MLX conversion of silx-ai/Quasar-Preview, runnable on Apple Silicon with mlx-lm.
MLX support for this architecture is added in ml-explore/mlx-lm#1407. Until that PR is merged, install mlx-lm from the branch (see below).
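The 4.5 bits/weight figure can be reproduced with a little arithmetic. The sketch below assumes the group size of 64 used for this conversion and assumes that each quantization group stores one 16-bit scale and one 16-bit bias alongside its 4-bit weights; the function name and the 16-bit defaults are illustrative, not taken from the model card.

```python
def effective_bits_per_weight(q_bits, group_size, scale_bits=16, bias_bits=16):
    """Average storage per weight for group-wise affine quantization.

    Assumption: every group of `group_size` weights carries one scale and
    one bias, each stored in 16 bits by default.
    """
    overhead_per_weight = (scale_bits + bias_bits) / group_size
    return q_bits + overhead_per_weight


# 4-bit weights, groups of 64: 4 + 32 / 64
print(effective_bits_per_weight(4, 64))  # 4.5
```

The figure is approximate for the checkpoint as a whole, since any tensors left unquantized would raise the average.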
Usage
# mlx-lm with the quasar_long model (until #1407 is merged):
pip install "mlx-lm @ git+https://github.com/SahilChachra/mlx-lm@add-quasar-long-model"
python -m mlx_lm.generate \
--model sahilchachra/Quasar-Preview-mlx-4bit \
--prompt "The capital of France is" \
--max-tokens 60 --temp 0.0 --ignore-chat-template
Use --ignore-chat-template. This is a base / preview checkpoint, not instruction-tuned; applying the chat template produces degenerate output. Prompt it as a text-completion model.
Example output:
The capital of France is Paris. The city is located in the northeastern part of
France, along the banks of the Seine River. Paris is known for its rich history,
art, culture, and fashion. It is also a ...
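Since the checkpoint is meant to be prompted as a text-completion model, the same prompt can also be sent to a locally running mlx_lm.server through its text-completion endpoint instead of the chat endpoint. The following is a minimal sketch using only the Python standard library. It assumes the server is reachable at http://localhost:8000 and exposes /v1/completions with an OpenAI-style response; the helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "sahilchachra/Quasar-Preview-mlx-4bit"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=60, temperature=0.0):
    """Build (but do not send) a raw text-completion request.

    Assumption: the server listens on `base_url`; adjust it to whatever
    port the server was actually started on.
    """
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,  # raw text, no chat template applied
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated continuation.

    Assumption: the response follows the OpenAI completions shape,
    with the text under choices[0]["text"].
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("The capital of France is")` would return the continuation as a string.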
Architecture
Quasar-Long is a hybrid linear-attention MoE model. Every layer runs standard GQA softmax attention (partial RoPE + NoPE-after-512, QK-norm). Layers 4–19 additionally run one linear-attention branch, assigned per layer by hybrid_layerwise_cycle, whose gated output is added to the attention output. The MLP is a 256-expert DeepSeek-V3-style sparse MoE (sigmoid router, group top-k, shared expert + expert bias); layer 0 is dense.
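To make the routing description concrete, here is a toy sketch of sigmoid routing with group-limited top-k and a selection-only expert bias, in the style publicly described for DeepSeek-V3. It is not the model's actual router: the group count, number of groups kept, and top-k below are hypothetical toy values (the card does not state them), ranking groups by the sum of their two best scores is an assumption, and the shared expert is omitted.

```python
import math


def route_token(logits, bias, n_groups, topk_groups, top_k):
    """Pick experts for one token; returns {expert_index: mixing_weight}."""
    group_size = len(logits) // n_groups
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # sigmoid affinities
    biased = [s + b for s, b in zip(scores, bias)]          # bias affects selection only

    def group_score(g):
        # Assumption: a group is ranked by the sum of its two best biased scores.
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        return sum(members[:2])

    kept_groups = sorted(range(n_groups), key=group_score, reverse=True)[:topk_groups]
    candidates = [e for g in kept_groups
                  for e in range(g * group_size, (g + 1) * group_size)]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]

    # Mixing weights use the unbiased scores, normalized over the chosen experts.
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


logits = [2.0, -1.0, 0.5, 0.0, -2.0, 3.0, 1.0, -0.5]  # 8 toy experts in 4 groups
no_bias = [0.0] * 8
print(route_token(logits, no_bias, n_groups=4, topk_groups=2, top_k=2))

with_bias = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # boost expert 4 for selection
print(route_token(logits, with_bias, n_groups=4, topk_groups=2, top_k=2))
```

In the first call, expert 5 has the highest individual score but is not selected, because its group ranks below the two groups that are kept. In the second call, the bias on expert 4 pulls its group in, yet the mixing weights still come from the unbiased sigmoid scores.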
| Branch | Layers | Underlying op |
|---|---|---|
| GLA | 8, 13, 18 | gated linear attention (fla.ops.simple_gla) |
| Raven | 5, 10, 15 | gated slot attention (fla.ops.gsa), Mamba2 decay + top-k slot router |
| Quasar | 4,6,7,9,11,12,14,16,17,19 | gated delta-rule (fla.ops.quasar) |
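The table can be turned into a small lookup to check which linear-attention branch a given layer runs. This is only a sketch built from the table; the names `BRANCH_LAYERS`, `LAYER_TO_BRANCH`, and `linear_branch` are hypothetical, and the period-5 cycle at the end is an observation about the table, not the actual contents of hybrid_layerwise_cycle.

```python
# Layer assignments copied from the table above.
BRANCH_LAYERS = {
    "gla": [8, 13, 18],
    "raven": [5, 10, 15],
    "quasar": [4, 6, 7, 9, 11, 12, 14, 16, 17, 19],
}

LAYER_TO_BRANCH = {
    layer: branch
    for branch, layers in BRANCH_LAYERS.items()
    for layer in layers
}


def linear_branch(layer_idx):
    """Return the branch name for a layer, or None if it runs softmax attention only."""
    return LAYER_TO_BRANCH.get(layer_idx)


# Every layer from 4 to 19 is assigned exactly one branch.
assert sorted(LAYER_TO_BRANCH) == list(range(4, 20))
assert sum(len(layers) for layers in BRANCH_LAYERS.values()) == 16

# Observation: the assignments repeat with period 5, starting at layer 4.
CYCLE = ["quasar", "raven", "quasar", "quasar", "gla"]
assert all(CYCLE[(layer - 4) % 5] == branch
           for layer, branch in LAYER_TO_BRANCH.items())

print(linear_branch(3), linear_branch(8), linear_branch(10))
```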
Conversion & verification
Converted with mlx_lm.convert -q --q-bits 4 --q-group-size 64. The MLX port's GLA and Raven recurrences were validated against the reference PyTorch fla naive ops (agreement to 1e-6 / 1e-7). All 580 checkpoint tensors map exactly, and the 4-bit model generates coherent text (see the example output above).
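The kind of check described here, comparing a recurrent implementation against a naive reference, can be illustrated on a toy gated linear attention. The sketch below is not the port's actual test and does not call fla or MLX. It assumes a simplified single-head form with one scalar log-decay per time step, and all function names are hypothetical.

```python
import math
import random


def gla_recurrent(q, k, v, log_g):
    """Recurrent form: S_t = exp(log_g[t]) * S_{t-1} + outer(k_t, v_t), o_t = q_t @ S_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t, lg in zip(q, k, v, log_g):
        decay = math.exp(lg)
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay * state[i][j] + k_t[i] * v_t[j]
        outputs.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


def gla_naive(q, k, v, log_g):
    """Quadratic reference: o_t = sum_{s<=t} (q_t . k_s) * exp(sum_{r=s+1..t} log_g[r]) * v_s."""
    d_v = len(v[0])
    outputs = []
    for t in range(len(q)):
        o_t = [0.0] * d_v
        for s in range(t + 1):
            weight = sum(a * b for a, b in zip(q[t], k[s]))
            weight *= math.exp(sum(log_g[s + 1:t + 1]))  # decay accumulated after step s
            for j in range(d_v):
                o_t[j] += weight * v[s][j]
        outputs.append(o_t)
    return outputs


def max_abs_diff(a, b):
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D_K, D_V = 6, 4, 3
q, k, v = random_matrix(rng, T, D_K), random_matrix(rng, T, D_K), random_matrix(rng, T, D_V)
log_g = [-rng.uniform(0.0, 1.0) for _ in range(T)]  # log-decay <= 0, so each decay is in (0, 1]

err = max_abs_diff(gla_recurrent(q, k, v, log_g), gla_naive(q, k, v, log_g))
print(f"max abs difference: {err:.2e}")
```

Both functions compute the same quantity in double precision, so the difference here is far below the 1e-6 / 1e-7 tolerances quoted above, which reflect lower-precision tensors and much larger shapes.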
Credits & license
- Base model: silx-ai/Quasar-Preview.
- The Raven branch is goombalab/raven's RavenAttention (Gated Slot Attention).
- License inherited from the base model (Apache-2.0).