# Ollama-Compatible API
SGLang provides Ollama API compatibility, allowing you to use the Ollama CLI and Python library with SGLang as the inference backend.
## Prerequisites
```bash
# Install the Ollama Python library (for Python client usage)
pip install ollama
```
> **Note**: You don't need the Ollama server installed; SGLang acts as the backend. You only need the `ollama` CLI or Python library as the client.
## Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET, HEAD | Health check for Ollama CLI |
| `/api/tags` | GET | List available models |
| `/api/chat` | POST | Chat completions (streaming & non-streaming) |
| `/api/generate` | POST | Text generation (streaming & non-streaming) |
| `/api/show` | POST | Model information |
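Because these endpoints follow Ollama's standard request and response shapes, any plain HTTP client can call them directly. Below is a minimal sketch using the third-party `requests` library; it assumes a server listening on port 30001 (as launched in the Quick Start) and that SGLang mirrors Ollama's usual JSON fields, such as the `response` field returned by `/api/generate`.

```python
import requests  # pip install requests

BASE = "http://localhost:30001"  # port used in the Quick Start below

# List available models (what `ollama list` calls under the hood)
print(requests.get(f"{BASE}/api/tags").json())

# Non-streaming text generation; Ollama's standard body and
# response shape (a top-level "response" field) is assumed here.
payload = {
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Say hello in one sentence.",
    "stream": False,
}
print(requests.post(f"{BASE}/api/generate", json=payload).json()["response"])
```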
## Quick Start
### 1. Launch SGLang Server
```bash
python -m sglang.launch_server \
--model Qwen/Qwen2.5-1.5B-Instruct \
--port 30001 \
--host 0.0.0.0
```
> **Note**: The model name used with `ollama run` must exactly match the value you passed to `--model` (here, `Qwen/Qwen2.5-1.5B-Instruct`).
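Before pointing clients at the server, you can confirm the Ollama-compatible layer is reachable via the `/` health-check endpoint listed above. A minimal sketch, assuming the server from the launch command is on `localhost:30001` and using the third-party `requests` library:

```python
import requests  # pip install requests

# The Ollama CLI probes `/` before issuing commands;
# a 200 status means the compatibility layer is up.
resp = requests.get("http://localhost:30001/")
print(resp.status_code)
```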
### 2. Use Ollama CLI
```bash
# List available models
OLLAMA_HOST=http://localhost:30001 ollama list
# Interactive chat
OLLAMA_HOST=http://localhost:30001 ollama run "Qwen/Qwen2.5-1.5B-Instruct"
```
If connecting to a remote server behind a firewall:
```bash
# SSH tunnel
ssh -L 30001:localhost:30001 user@gpu-server -N &
# Then use Ollama CLI as above
OLLAMA_HOST=http://localhost:30001 ollama list
```
### 3. Use Ollama Python Library
```python
import ollama
client = ollama.Client(host='http://localhost:30001')
# Non-streaming
response = client.chat(
model='Qwen/Qwen2.5-1.5B-Instruct',
messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response['message']['content'])
# Streaming
stream = client.chat(
model='Qwen/Qwen2.5-1.5B-Instruct',
messages=[{'role': 'user', 'content': 'Tell me a story'}],
stream=True
)
for chunk in stream:
print(chunk['message']['content'], end='', flush=True)
```
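The same client also covers the other endpoints in the table: `generate()` maps to `/api/generate` and `list()` maps to `/api/tags`. A short sketch, assuming SGLang returns Ollama's standard fields (`response` for generation, `models` for listing):

```python
# Text generation (prompt in, completion out) via /api/generate
result = client.generate(
    model='Qwen/Qwen2.5-1.5B-Instruct',
    prompt='Write a haiku about GPUs.'
)
print(result['response'])

# List models served by the backend (mirrors `ollama list`)
for m in client.list()['models']:
    print(m)
```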
## Smart Router
To route requests intelligently between a fast local Ollama instance and a powerful remote SGLang server, using an LLM judge to pick the backend per request, see the [Smart Router documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/ollama/README.md).
## Summary
| Component | Purpose |
|-----------|---------|
| **Ollama API** | Familiar CLI/API that developers already know |
| **SGLang Backend** | High-performance inference engine |
| **Smart Router** | Intelligent routing: a fast local backend for simple tasks, a powerful remote one for complex tasks |