---
title: Ollama FastAPI Streaming Server
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

# Ollama FastAPI Real-Time Streaming Server

A fast, optimized FastAPI server that wraps Ollama for real-time streaming inference with the deepseek-r1:1.5b model.

πŸ”‘ Authentication

All streaming requests require the connect key `manus-ollama-2024`.

πŸ“‘ API Endpoints

### GET /

Health check endpoint returning service status and endpoint URL.

Response:

```json
{
  "status": "online",
  "model": "deepseek-r1:1.5b",
  "endpoint": "https://your-space-url.hf.space"
}
```

### POST /stream

Real-time streaming chat completions.

Request:

```json
{
  "prompt": "Explain quantum computing",
  "key": "manus-ollama-2024"
}
```

Response: Server-Sent Events (SSE) stream

```
data: {"text": "Quantum", "done": false}
data: {"text": " computing", "done": false}
data: {"text": " is...", "done": true}
```
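To illustrate how a client reassembles the streamed tokens, here is a minimal, dependency-free sketch that reuses the sample events above (the helper name `accumulate_sse` is ours, not part of the API):

```python
import json

def accumulate_sse(lines):
    """Concatenate the `text` fields of SSE data lines until `done` is true."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator/keep-alive lines between events
        event = json.loads(line[len("data: "):])
        parts.append(event.get("text", ""))
        if event.get("done"):
            break
    return "".join(parts)

events = [
    'data: {"text": "Quantum", "done": false}',
    'data: {"text": " computing", "done": false}',
    'data: {"text": " is...", "done": true}',
]
print(accumulate_sse(events))  # → Quantum computing is...
```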

### GET /models

List available models.

Response:

```json
{
  "models": ["deepseek-r1:1.5b"],
  "default": "deepseek-r1:1.5b"
}
```

### GET /health

Detailed health check with Ollama connection status.

πŸš€ Usage Example

### Python with httpx

```python
import httpx
import json

url = "https://your-space-url.hf.space/stream"
payload = {
    "prompt": "What is artificial intelligence?",
    "key": "manus-ollama-2024"
}

# Stream the response and print tokens as they arrive
with httpx.stream("POST", url, json=payload, timeout=300) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            data = json.loads(line[6:])
            print(data.get("text", ""), end="", flush=True)
            if data.get("done"):
                break
```

### JavaScript/TypeScript

```javascript
const response = await fetch('https://your-space-url.hf.space/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'What is artificial intelligence?',
    key: 'manus-ollama-2024'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

let streaming = true;
while (streaming) {
  const { done, value } = await reader.read();
  if (done) break;

  // { stream: true } keeps multi-byte characters intact across chunks
  const text = decoder.decode(value, { stream: true });

  for (const line of text.split('\n')) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log(data.text);
      if (data.done) streaming = false; // a bare break would only exit this inner loop
    }
  }
}
```

Note: a robust client should also buffer lines that are split across network chunks before calling `JSON.parse`.

### cURL

```bash
curl -X POST "https://your-space-url.hf.space/stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "key": "manus-ollama-2024"}' \
  --no-buffer
```

## ⚡ Performance Optimizations

- **Async I/O:** Full async/await architecture for non-blocking operations
- **Connection pooling:** Reusable HTTP connections with httpx
- **Streaming:** Real-time token streaming with minimal latency
- **Model caching:** Model preloaded on startup
- **Optimized parameters:** Tuned temperature, top_k, and top_p for speed
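The async streaming design can be sketched with a plain asyncio generator. Everything below is illustrative: the token source is faked, and in the real server this generator's output would come from Ollama and be wrapped in a FastAPI `StreamingResponse`:

```python
import asyncio
import json

async def sse_events(tokens):
    """Yield SSE-framed lines for each token, flagging the last one as done."""
    for i, tok in enumerate(tokens):
        done = i == len(tokens) - 1
        # Each event mirrors the {"text": ..., "done": ...} wire format above
        yield f'data: {json.dumps({"text": tok, "done": done})}\n\n'
        await asyncio.sleep(0)  # yield control, as a non-blocking server would

async def main():
    async for line in sse_events(["Hello", " world"]):
        print(line, end="")

asyncio.run(main())
```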

πŸ”’ Security

- Connect key authentication required for all streaming endpoints
- CORS enabled for browser access
- Input validation on all requests
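A minimal sketch of what the key check might look like server-side (the function name is ours, and we assume the server simply compares the request's key against a configured secret; `hmac.compare_digest` is used so the comparison runs in constant time and does not leak timing information):

```python
import hmac

CONNECT_KEY = "manus-ollama-2024"

def is_authorized(payload: dict) -> bool:
    """Return True only if the request carries the expected connect key."""
    supplied = payload.get("key", "")
    # compare_digest avoids short-circuiting on the first mismatched byte
    return hmac.compare_digest(supplied, CONNECT_KEY)

print(is_authorized({"prompt": "hi", "key": "manus-ollama-2024"}))  # → True
print(is_authorized({"prompt": "hi"}))  # → False
```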

πŸ“Š Model Information

- **Model:** deepseek-r1:1.5b
- **Size:** ~1.5B parameters
- **Optimized for:** fast inference and low latency
- **Max tokens:** 2048 per request

πŸ› οΈ Development

Built with:

- FastAPI 0.109.0
- Ollama (latest)
- Python 3.11
- Uvicorn ASGI server

πŸ“ License

MIT License