---
title: Ollama FastAPI Streaming Server
emoji: πŸš€
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

# Ollama FastAPI Real-Time Streaming Server

A fast, optimized FastAPI server that wraps Ollama for real-time streaming inference with the **deepseek-r1:1.5b** model.

## πŸ”‘ Authentication

All streaming requests require a connect key: `manus-ollama-2024`

## πŸ“‘ API Endpoints

### GET `/`
Health check endpoint returning service status and endpoint URL.

**Response:**
```json
{
  "status": "online",
  "model": "deepseek-r1:1.5b",
  "endpoint": "https://your-space-url.hf.space"
}
```

### POST `/stream`
Real-time streaming chat completions.

**Request:**
```json
{
  "prompt": "Explain quantum computing",
  "key": "manus-ollama-2024"
}
```

**Response:** Server-Sent Events (SSE) stream
```
data: {"text": "Quantum", "done": false}
data: {"text": " computing", "done": false}
data: {"text": " is...", "done": true}
```

### GET `/models`
List available models.

**Response:**
```json
{
  "models": ["deepseek-r1:1.5b"],
  "default": "deepseek-r1:1.5b"
}
```

### GET `/health`
Detailed health check with Ollama connection status.
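
The source does not specify the response schema for this endpoint; a plausible shape, with all field names purely illustrative, might be:

```json
{
  "status": "healthy",
  "ollama": "connected",
  "model": "deepseek-r1:1.5b"
}
```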

## πŸš€ Usage Example

### Python with httpx
```python
import httpx
import json

url = "https://your-space-url.hf.space/stream"
payload = {
    "prompt": "What is artificial intelligence?",
    "key": "manus-ollama-2024"
}

with httpx.stream("POST", url, json=payload, timeout=300) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            data = json.loads(line[6:])
            print(data.get("text", ""), end="", flush=True)
            if data.get("done"):
                break
```

### JavaScript/TypeScript
```javascript
const response = await fetch('https://your-space-url.hf.space/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'What is artificial intelligence?',
    key: 'manus-ollama-2024'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;

  // { stream: true } handles multi-byte characters split across chunks
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep any partial line for the next chunk

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log(data.text);
      if (data.done) finished = true; // stop the outer read loop as well
    }
  }
}
```

### cURL
```bash
curl -X POST "https://your-space-url.hf.space/stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "key": "manus-ollama-2024"}' \
  --no-buffer
```

## ⚑ Performance Optimizations

- **Async I/O**: Full async/await architecture for non-blocking operations
- **Connection pooling**: Reusable HTTP connections with httpx
- **Streaming**: Real-time token streaming with minimal latency
- **Model caching**: Model preloaded on startup
- **Optimized parameters**: Tuned temperature, top_k, and top_p for speed
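
The streaming path described above amounts to a small translation layer: Ollama's `/api/generate` endpoint emits newline-delimited JSON chunks with `response` and `done` fields, which the server presumably reformats into the SSE events documented earlier. A minimal illustrative sketch (the function name and wiring are assumptions, not the actual implementation):

```python
import json


def ollama_to_sse(ndjson_lines):
    """Translate Ollama NDJSON chunks into the SSE events documented above.

    Each input line is one JSON object from Ollama's /api/generate stream,
    e.g. {"response": "Quantum", "done": false}.
    """
    for raw in ndjson_lines:
        if not raw.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(raw)
        event = {"text": chunk.get("response", ""), "done": chunk.get("done", False)}
        # SSE events are "data: <payload>" terminated by a blank line
        yield f"data: {json.dumps(event)}\n\n"
```

In a FastAPI app, a generator like this would typically be handed to a `StreamingResponse` with `media_type="text/event-stream"`.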

## πŸ”’ Security

- Connect key authentication required for all streaming endpoints
- CORS enabled for browser access
- Input validation on all requests
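
The key check and input validation might look roughly like the sketch below (the helper name and error handling are illustrative, not the server's actual code; `hmac.compare_digest` is used because it avoids timing side channels when comparing secrets):

```python
import hmac

CONNECT_KEY = "manus-ollama-2024"


def validate_request(payload: dict) -> str:
    """Validate a /stream request body; return the prompt or raise ValueError."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    key = str(payload.get("key", ""))
    # constant-time comparison of the connect key
    if not hmac.compare_digest(key, CONNECT_KEY):
        raise ValueError("invalid connect key")
    return prompt
```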

## πŸ“Š Model Information

- **Model**: deepseek-r1:1.5b
- **Size**: ~1.5B parameters
- **Optimized for**: Fast inference and low latency
- **Max tokens**: 2048 per request

## πŸ› οΈ Development

Built with:
- FastAPI 0.109.0
- Ollama (latest)
- Python 3.11
- Uvicorn ASGI server

## πŸ“ License

MIT License