modeling_quasar_long.py imports a raven/ package that isn't included — model can't be instantiated

by sahilchachra - opened 20 days ago

Discussion

sahilchachra

20 days ago

Summary

Thanks for releasing the Quasar-Preview weights and the fla-based modeling code. I'm working on an MLX port so the model can run on Apple Silicon, and I've hit a blocker: the published repo references a local raven/ package that isn't part of the release, so the model can't be instantiated.

Details

In modeling_quasar_long.py, QuasarLongHybridReplacementSdpaAttention.__init__ runs this for every hybrid attention layer:

if not os.path.isdir(os.path.join(_HERE, "raven")):
    raise ModuleNotFoundError("Quasar requires the bundled repo-local raven/ folder for Raven hybrid layers")
from raven.layers.raven import RavenAttention

The raven/ folder is not in the model repo (nor in Quasar-3B-A1B-Preview, the SILX-LABS GitHub org, or PyPI). With config.json setting hybrid_attention_layers = [4..19], this guard fires for all of them, so AutoModelForCausalLM.from_pretrained(...) raises before any forward pass.
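To confirm the missing folder before calling from_pretrained, a local snapshot of the repo can be checked directly. This is a minimal sketch, not part of the published code: has_raven_package and snapshot_dir are hypothetical names, and the check is slightly stricter than the guard above because it looks for raven/layers/raven.py rather than just the raven/ directory. The temporary directory stands in for a downloaded snapshot.

```python
import os
import tempfile


def has_raven_package(snapshot_dir: str) -> bool:
    """Return True if snapshot_dir contains raven/layers/raven.py.

    Hypothetical helper: the published guard only tests that the raven/
    directory exists; this also requires the module it imports from.
    """
    return os.path.isfile(os.path.join(snapshot_dir, "raven", "layers", "raven.py"))


# Stand-in for a local copy of the model repo (assumption for illustration).
with tempfile.TemporaryDirectory() as snapshot_dir:
    before = has_raven_package(snapshot_dir)

    # Simulate the package being added to the snapshot.
    os.makedirs(os.path.join(snapshot_dir, "raven", "layers"))
    open(os.path.join(snapshot_dir, "raven", "layers", "raven.py"), "w").close()
    after = has_raven_package(snapshot_dir)

print(before, after)
```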

Per the config, the Raven branch is used for layers 5, 10, and 15 (hybrid_layerwise_cycle = ["quasar","raven","quasar","quasar","gla"], decay_type="Mamba2", slots=64, topk=32). The checkpoint has the matching weights, but the recurrence they drive isn't recoverable from tensor names alone.
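The layer numbers can be reproduced from the two config values. The sketch below assumes the cycle is indexed by a layer's position within hybrid_attention_layers (so layer 4 is position 0), which is the reading that yields layers 5, 10, and 15; branch_for_layer is a hypothetical helper, not a function from the model repo.

```python
# Values as quoted from config.json in this post.
hybrid_attention_layers = list(range(4, 20))  # [4..19]
hybrid_layerwise_cycle = ["quasar", "raven", "quasar", "quasar", "gla"]


def branch_for_layer(layer_idx: int) -> str:
    """Map a hybrid layer index to its branch name.

    Assumption: the cycle repeats over positions within
    hybrid_attention_layers, not over absolute layer indices.
    """
    position = hybrid_attention_layers.index(layer_idx)
    return hybrid_layerwise_cycle[position % len(hybrid_layerwise_cycle)]


raven_layers = [i for i in hybrid_attention_layers if branch_for_layer(i) == "raven"]
print(raven_layers)  # [5, 10, 15]
```

If the cycle were indexed by absolute layer number instead, the Raven layers would fall at 6, 11, and 16, so the offset matters when matching checkpoint tensors to branches.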

My Ask

Could you publish the raven/ package — specifically raven/layers/raven.py (RavenAttention) and anything it imports? That's the one missing piece blocking instantiation. The Quasar branch (fla.layers.quasar.QuasarAttention), GLA, the MoE block, and the standard attention are all already present in the repo.

If it's easier, even a minimal reference forward pass / equations for the Mamba2-style slot+top-k recurrence would be enough for me to reimplement it faithfully and verify against the original.
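For the verification step, a parity check along these lines is what I have in mind. It is only a sketch under stated assumptions: the nested lists stand in for output tensors from the original layer and from a port, the numbers are made up for illustration, and max_abs_diff, outputs_match, and the 1e-4 tolerance are hypothetical choices rather than anything from the repo.

```python
def flatten(values):
    """Yield every number in an arbitrarily nested list or tuple as a float."""
    if isinstance(values, (list, tuple)):
        for item in values:
            yield from flatten(item)
    else:
        yield float(values)


def max_abs_diff(reference, candidate) -> float:
    """Largest element-wise absolute difference between two nested lists."""
    ref = list(flatten(reference))
    cand = list(flatten(candidate))
    if len(ref) != len(cand):
        raise ValueError("reference and candidate have different sizes")
    return max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)


def outputs_match(reference, candidate, atol: float = 1e-4) -> bool:
    """True when every element agrees within the absolute tolerance."""
    return max_abs_diff(reference, candidate) <= atol


# Illustrative values only, not real model outputs.
reference_out = [[0.10, -0.25], [1.50, 2.00]]
ported_out = [[0.10, -0.25], [1.50, 2.00005]]
broken_out = [[0.10, -0.25], [1.50, 2.01]]

print(outputs_match(reference_out, ported_out))
print(outputs_match(reference_out, broken_out))
```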

Why

Without it, the published checkpoint can't be loaded with the published code, and a faithful port (or any independent reimplementation) can't be verified for correctness. Happy to share the MLX port back once it works. Thanks!

sahilchachra

19 days ago

Read the previous post in the discussion and I am good to go. Thanks!

sahilchachra changed discussion status to closed 19 days ago
