Add note about vLLM pure-recurrent placement limitation
Custom placements using only GDN/KDA mixers (no FA or SWA) are not
supported due to a vLLM KV cache coordinator constraint. All shipped
presets include attention-type layers and are unaffected.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
README.md CHANGED

@@ -166,6 +166,8 @@ print(output[0].outputs[0].text)
 
 Available presets: `all-attention`, `Reg_Lklhd-26`, `Reg_Lklhd-18`, `Reg_Lklhd-13`, `Reg_Lklhd-10`, `Idealized_All-18`, `Idealized_All-6`, `Idealized_Lklhd-6`, `extra_bayesian-mix-7`.
 
+> **Note:** Custom placements must include at least one attention-type layer (FA or SWA). Configurations using only recurrent mixers (GDN/KDA) are not currently supported due to a vLLM KV cache coordinator limitation. All shipped presets satisfy this requirement.
+
 ### Supernet Mode (Runtime Switching)
 
 Load the full supernet and switch presets at runtime without reloading the engine. Use `enforce_eager=True` on a single GPU (saves memory by skipping CUDA graph capture):

@@ -376,6 +378,7 @@ It is **not intended** for use in safety-critical applications without human ove
 - **Language:** Strongest performance is in English. Output quality may degrade in underrepresented languages.
 - **Critical use:** Not suitable for medical, legal, financial, or other high-risk applications without safeguards.
 - **Long-context degradation:** Aggressive placements (fewer FA layers) may show reduced performance on long-range tasks like RULER/NIAH.
+- **vLLM pure-recurrent limitation:** Custom placements using only GDN/KDA mixers (no FA or SWA layers) are not supported with vLLM due to a KV cache coordinator constraint. All shipped presets include attention-type layers.
 
 ## Security and Responsible Use
 
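The constraint added in this change (a custom placement needs at least one FA or SWA layer) can be checked before a placement is handed to vLLM. The following is a minimal sketch, not code from the repository: it assumes a placement is a plain list of per-layer mixer labels, and the names `check_placement` and `is_supported` are hypothetical. Only the four mixer labels (FA, SWA, GDN, KDA) come from the README text.

```python
from typing import Sequence

# Mixer labels as named in the README note. Representing a placement as a
# list of these strings is an assumption made for this sketch.
ATTENTION_MIXERS = frozenset({"FA", "SWA"})
RECURRENT_MIXERS = frozenset({"GDN", "KDA"})


def check_placement(placement: Sequence[str]) -> None:
    """Raise ValueError if the placement breaks the stated vLLM constraint.

    Hypothetical helper: it only mirrors the rule in the README note and
    does not call into vLLM or the model code.
    """
    if not placement:
        raise ValueError("placement is empty")
    unknown = sorted(set(placement) - (ATTENTION_MIXERS | RECURRENT_MIXERS))
    if unknown:
        raise ValueError(f"unknown mixer types: {unknown}")
    if not any(mixer in ATTENTION_MIXERS for mixer in placement):
        raise ValueError(
            "placement has no attention-type layer (FA or SWA); "
            "pure GDN/KDA placements are not supported with vLLM"
        )


def is_supported(placement: Sequence[str]) -> bool:
    """Return True when check_placement accepts the placement."""
    try:
        check_placement(placement)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    print(is_supported(["FA", "GDN", "KDA", "GDN"]))  # mixed placement
    print(is_supported(["GDN", "KDA", "GDN", "KDA"]))  # pure recurrent
```

Running such a check up front turns the limitation into an early, readable error instead of a failure inside the engine. The shipped presets need no check, since the note states they all include attention-type layers.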