modeling_quasar_long.py imports a raven/ package that isn't included: model can't be instantiated
Summary
Thanks for releasing the Quasar-Preview weights and the fla-based modeling code. I'm working on an MLX port so the model can run on Apple Silicon, and I've hit a blocker: the published repo references a local raven/ package that isn't part of the release, so the model can't be instantiated.
Details
In modeling_quasar_long.py, QuasarLongHybridReplacementSdpaAttention.__init__ runs this for every hybrid attention layer:
if not os.path.isdir(os.path.join(_HERE, "raven")):
    raise ModuleNotFoundError("Quasar requires the bundled repo-local raven/ folder for Raven hybrid layers")
from raven.layers.raven import RavenAttention
The raven/ folder is not in the model repo (nor in Quasar-3B-A1B-Preview, the SILX-LABS GitHub org, or PyPI). With config.json setting hybrid_attention_layers = [4..19], this guard fires for all of them, so AutoModelForCausalLM.from_pretrained(...) raises before any forward pass.
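The guard reduces to a directory check next to the modeling file, so it can be reproduced without loading any weights. A minimal sketch, assuming `repo_dir` points at a local snapshot of the model repo; the helper name `has_raven_package` is hypothetical and not part of the release, and the demo uses throwaway directories instead of a real download:

```python
import os
import tempfile


def has_raven_package(repo_dir: str) -> bool:
    """Mirror the guard quoted above: is there a raven/ folder in repo_dir?"""
    return os.path.isdir(os.path.join(repo_dir, "raven"))


# Demonstrate on a temporary directory standing in for a repo snapshot.
with tempfile.TemporaryDirectory() as snapshot:
    missing = has_raven_package(snapshot)         # no raven/ folder yet
    os.makedirs(os.path.join(snapshot, "raven"))  # simulate the package being shipped
    present = has_raven_package(snapshot)         # the guard would now pass

print(missing, present)
```

Run against the actual snapshot directory, the first case is what I see today.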
Per the config, the Raven branch is used for layers 5, 10, and 15 (hybrid_layerwise_cycle = ["quasar","raven","quasar","quasar","gla"], decay_type="Mamba2", slots=64, topk=32). The checkpoint has the matching weights, but the recurrence they drive isn't recoverable from tensor names alone.
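For reference, this is how I arrive at layers 5, 10, and 15. It is a sketch under one assumption that I have not confirmed against the released code: the cycle is indexed by position within hybrid_attention_layers, not by absolute layer number. The function name `assign_branches` is hypothetical.

```python
from typing import Dict, List

# Values quoted from config.json in this post.
HYBRID_ATTENTION_LAYERS: List[int] = list(range(4, 20))  # [4..19]
HYBRID_LAYERWISE_CYCLE: List[str] = ["quasar", "raven", "quasar", "quasar", "gla"]


def assign_branches(layers: List[int], cycle: List[str]) -> Dict[int, str]:
    """ASSUMPTION: the cycle restarts at the first hybrid layer (position 0)."""
    return {layer: cycle[pos % len(cycle)] for pos, layer in enumerate(layers)}


branches = assign_branches(HYBRID_ATTENTION_LAYERS, HYBRID_LAYERWISE_CYCLE)
raven_layers = [layer for layer, kind in branches.items() if kind == "raven"]
print(raven_layers)  # [5, 10, 15]
```

Indexing by absolute layer number instead would put Raven on layers 6, 11, and 16, so the offset convention is worth confirming.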
My Ask
Could you publish the raven/ package, specifically raven/layers/raven.py (RavenAttention) and anything it imports? That's the one missing piece blocking instantiation. The Quasar branch (fla.layers.quasar.QuasarAttention), GLA, the MoE block, and the standard attention are all already present in the repo.
If it's easier, even a minimal reference forward pass / equations for the Mamba2-style slot+top-k recurrence would be enough for me to reimplement it faithfully and verify against the original.
Why
Without it, the published checkpoint can't be loaded with the published code, and a faithful port (or any independent reimplementation) can't be verified for correctness. Happy to share the MLX port back once it works. Thanks!
Read the previous post in the discussion and I am good to go. Thanks!