---
title: Ollama FastAPI Streaming Server
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
# Ollama FastAPI Real-Time Streaming Server
A fast, optimized FastAPI server that wraps Ollama for real-time streaming inference with the **deepseek-r1:1.5b** model.
## Authentication
All streaming requests must include the connect key in the request body's `key` field: `manus-ollama-2024`
## API Endpoints
### GET `/`
Health check endpoint returning service status and endpoint URL.
**Response:**
```json
{
  "status": "online",
  "model": "deepseek-r1:1.5b",
  "endpoint": "https://your-space-url.hf.space"
}
```
### POST `/stream`
Real-time streaming chat completions.
**Request:**
```json
{
  "prompt": "Explain quantum computing",
  "key": "manus-ollama-2024"
}
```
**Response:** Server-Sent Events (SSE) stream
```
data: {"text": "Quantum", "done": false}
data: {"text": " computing", "done": false}
data: {"text": " is...", "done": true}
```
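The stream format above can be consumed with a few lines of Python, independent of any HTTP client. This is a sketch of the parsing logic only; `parse_sse_lines` is an illustrative helper, not part of the server's API:

```python
import json

def parse_sse_lines(lines):
    """Collect the text chunks from 'data: {...}' SSE lines until done."""
    chunks = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank lines and SSE comments
        event = json.loads(line[len("data: "):])
        chunks.append(event.get("text", ""))
        if event.get("done"):
            break
    return "".join(chunks)

stream = [
    'data: {"text": "Quantum", "done": false}',
    'data: {"text": " computing", "done": false}',
    'data: {"text": " is...", "done": true}',
]
print(parse_sse_lines(stream))  # Quantum computing is...
```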
### GET `/models`
List available models.
**Response:**
```json
{
  "models": ["deepseek-r1:1.5b"],
  "default": "deepseek-r1:1.5b"
}
```
### GET `/health`
Detailed health check with Ollama connection status.
## Usage Example
### Python with httpx
```python
import httpx
import json
url = "https://your-space-url.hf.space/stream"
payload = {
    "prompt": "What is artificial intelligence?",
    "key": "manus-ollama-2024"
}
with httpx.stream("POST", url, json=payload, timeout=300) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            data = json.loads(line[6:])
            print(data.get("text", ""), end="", flush=True)
            if data.get("done"):
                break
```
### JavaScript/TypeScript
```javascript
const response = await fetch('https://your-space-url.hf.space/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'What is artificial intelligence?',
    key: 'manus-ollama-2024'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Decode incrementally so multi-byte characters split across chunks survive.
  const text = decoder.decode(value, { stream: true });
  const lines = text.split('\n');
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log(data.text);
      if (data.done) break;
    }
  }
}
```
### cURL
```bash
curl -X POST "https://your-space-url.hf.space/stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "key": "manus-ollama-2024"}' \
  --no-buffer
```
## Performance Optimizations
- **Async I/O**: Full async/await architecture for non-blocking operations
- **Connection pooling**: Reusable HTTP connections with httpx
- **Streaming**: Real-time token streaming with minimal latency
- **Model caching**: Model preloaded on startup
- **Optimized parameters**: Tuned temperature, top_k, and top_p for speed
## Security
- Connect key authentication required for all streaming endpoints
- CORS enabled for browser access
- Input validation on all requests
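The key check and input validation described above might look like the following sketch; this is illustrative only, and the constant and function names are assumptions, not the Space's actual code:

```python
# Illustrative sketch of connect-key authentication and input validation;
# not the Space's actual implementation.
CONNECT_KEY = "manus-ollama-2024"

def validate_request(payload: dict) -> str:
    """Return the prompt if the payload is well-formed and authorized."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if payload.get("key") != CONNECT_KEY:
        raise PermissionError("invalid connect key")
    return prompt

print(validate_request({"prompt": "Hello", "key": CONNECT_KEY}))  # Hello
```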
## Model Information
- **Model**: deepseek-r1:1.5b
- **Size**: ~1.5B parameters
- **Optimized for**: Fast inference and low latency
- **Max tokens**: 2048 per request
## Development
Built with:
- FastAPI 0.109.0
- Ollama (latest)
- Python 3.11
- Uvicorn ASGI server
## License
MIT License