---
title: Ollama FastAPI Streaming Server
emoji: πŸš€
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

# Ollama FastAPI Real-Time Streaming Server

A fast, optimized FastAPI server that wraps Ollama for real-time streaming inference with the **deepseek-r1:1.5b** model.

## πŸ”‘ Authentication

All streaming requests require a connect key: `manus-ollama-2024`

## πŸ“‘ API Endpoints

### GET `/`
Health check endpoint returning service status and endpoint URL.

**Response:**
```json
{
  "status": "online",
  "model": "deepseek-r1:1.5b",
  "endpoint": "https://your-space-url.hf.space"
}
```

### POST `/stream`
Real-time streaming chat completions.

**Request:**
```json
{
  "prompt": "Explain quantum computing",
  "key": "manus-ollama-2024"
}
```

**Response:** Server-Sent Events (SSE) stream
```
data: {"text": "Quantum", "done": false}
data: {"text": " computing", "done": false}
data: {"text": " is...", "done": true}
```

### GET `/models`
List available models.

**Response:**
```json
{
  "models": ["deepseek-r1:1.5b"],
  "default": "deepseek-r1:1.5b"
}
```

### GET `/health`
Detailed health check with Ollama connection status.
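
The source does not specify the response schema for this endpoint; a plausible shape, with all field names purely illustrative, might be:

```json
{
  "status": "healthy",
  "ollama": "connected",
  "model": "deepseek-r1:1.5b"
}
```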

## πŸš€ Usage Example

### Python with httpx
```python
import httpx
import json

url = "https://your-space-url.hf.space/stream"
payload = {
    "prompt": "What is artificial intelligence?",
    "key": "manus-ollama-2024"
}

with httpx.stream("POST", url, json=payload, timeout=300) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            data = json.loads(line[6:])
            print(data.get("text", ""), end="", flush=True)
            if data.get("done"):
                break
```

### JavaScript/TypeScript
```javascript
const response = await fetch('https://your-space-url.hf.space/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'What is artificial intelligence?',
    key: 'manus-ollama-2024'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;

  // { stream: true } handles multi-byte characters split across chunks
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep any partial line for the next chunk

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log(data.text);
      if (data.done) finished = true; // stop the outer read loop as well
    }
  }
}
```

### cURL
```bash
curl -X POST "https://your-space-url.hf.space/stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "key": "manus-ollama-2024"}' \
  --no-buffer
```

## ⚑ Performance Optimizations

- **Async I/O**: Full async/await architecture for non-blocking operations
- **Connection pooling**: Reusable HTTP connections with httpx
- **Streaming**: Real-time token streaming with minimal latency
- **Model caching**: Model preloaded on startup
- **Optimized parameters**: Tuned temperature, top_k, and top_p for speed
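
The streaming path described above amounts to a small translation layer: Ollama's `/api/generate` endpoint emits newline-delimited JSON chunks with `response` and `done` fields, which the server presumably reformats into the SSE events documented earlier. A minimal illustrative sketch (the function name and wiring are assumptions, not the actual implementation):

```python
import json


def ollama_to_sse(ndjson_lines):
    """Translate Ollama NDJSON chunks into the SSE events documented above.

    Each input line is one JSON object from Ollama's /api/generate stream,
    e.g. {"response": "Quantum", "done": false}.
    """
    for raw in ndjson_lines:
        if not raw.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(raw)
        event = {"text": chunk.get("response", ""), "done": chunk.get("done", False)}
        # SSE events are "data: <payload>" terminated by a blank line
        yield f"data: {json.dumps(event)}\n\n"
```

In a FastAPI app, a generator like this would typically be handed to a `StreamingResponse` with `media_type="text/event-stream"`.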

## πŸ”’ Security

- Connect key authentication required for all streaming endpoints
- CORS enabled for browser access
- Input validation on all requests
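
The key check and input validation might look roughly like the sketch below (the helper name and error handling are illustrative, not the server's actual code; `hmac.compare_digest` is used because it avoids timing side channels when comparing secrets):

```python
import hmac

CONNECT_KEY = "manus-ollama-2024"


def validate_request(payload: dict) -> str:
    """Validate a /stream request body; return the prompt or raise ValueError."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    key = str(payload.get("key", ""))
    # constant-time comparison of the connect key
    if not hmac.compare_digest(key, CONNECT_KEY):
        raise ValueError("invalid connect key")
    return prompt
```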

## πŸ“Š Model Information

- **Model**: deepseek-r1:1.5b
- **Size**: ~1.5B parameters
- **Optimized for**: Fast inference and low latency
- **Max tokens**: 2048 per request

## πŸ› οΈ Development

Built with:
- FastAPI 0.109.0
- Ollama (latest)
- Python 3.11
- Uvicorn ASGI server

## πŸ“ License

MIT License