---
title: Ollama FastAPI Streaming Server
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
# Ollama FastAPI Real-Time Streaming Server

A fast, optimized FastAPI server that streams real-time inference from Ollama using the deepseek-r1:1.5b model.
## Authentication

All streaming requests require the connect key: `manus-ollama-2024`
## API Endpoints

### GET /

Health check endpoint returning service status and the endpoint URL.

Response:

```json
{
  "status": "online",
  "model": "deepseek-r1:1.5b",
  "endpoint": "https://your-space-url.hf.space"
}
```
### POST /stream

Real-time streaming chat completions.

Request:

```json
{
  "prompt": "Explain quantum computing",
  "key": "manus-ollama-2024"
}
```

Response: a Server-Sent Events (SSE) stream

```
data: {"text": "Quantum", "done": false}
data: {"text": " computing", "done": false}
data: {"text": " is...", "done": true}
```
### GET /models

List available models.

Response:

```json
{
  "models": ["deepseek-r1:1.5b"],
  "default": "deepseek-r1:1.5b"
}
```
### GET /health

Detailed health check including Ollama connection status.
## Usage Examples

### Python with httpx

```python
import httpx
import json

url = "https://your-space-url.hf.space/stream"
payload = {
    "prompt": "What is artificial intelligence?",
    "key": "manus-ollama-2024"
}

with httpx.stream("POST", url, json=payload, timeout=300) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            data = json.loads(line[6:])
            print(data.get("text", ""), end="", flush=True)
            if data.get("done"):
                break
```
### JavaScript/TypeScript

```javascript
const response = await fetch('https://your-space-url.hf.space/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'What is artificial intelligence?',
    key: 'manus-ollama-2024'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

// Labeled loop so `break read` exits the stream loop when the final
// event arrives; a plain `break` would only leave the inner for-loop.
read: while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const text = decoder.decode(value);
  for (const line of text.split('\n')) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log(data.text);
      if (data.done) break read;
    }
  }
}
```
### cURL

```bash
curl -X POST "https://your-space-url.hf.space/stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "key": "manus-ollama-2024"}' \
  --no-buffer
```
## Performance Optimizations
- Async I/O: Full async/await architecture for non-blocking operations
- Connection pooling: Reusable HTTP connections with httpx
- Streaming: Real-time token streaming with minimal latency
- Model caching: Model preloaded on startup
- Optimized parameters: Tuned temperature, top_k, and top_p for speed
## Security
- Connect key authentication required for all streaming endpoints
- CORS enabled for browser access
- Input validation on all requests
## Model Information
- Model: deepseek-r1:1.5b
- Size: ~1.5B parameters
- Optimized for: Fast inference and low latency
- Max tokens: 2048 per request
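The 2048-token limit maps to Ollama's `num_predict` generation option. A sketch of what the tuned options dictionary might look like; only `num_predict` comes from the documented limit, the sampling values are placeholder assumptions:

```python
# Illustrative Ollama generation options. num_predict reflects the
# documented 2048-token cap; the sampling values are assumed examples.
OLLAMA_OPTIONS = {
    "num_predict": 2048,  # max tokens per request (documented limit)
    "temperature": 0.7,   # assumed value
    "top_k": 40,          # assumed value
    "top_p": 0.9,         # assumed value
}
```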
## Development

Built with:
- FastAPI 0.109.0
- Ollama (latest)
- Python 3.11
- Uvicorn ASGI server
## License
MIT License