---
title: Ollama FastAPI Streaming Server
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

# Ollama FastAPI Real-Time Streaming Server

A fast, optimized FastAPI server that wraps Ollama for real-time streaming inference with the deepseek-r1:1.5b model.

πŸ”‘ Authentication

All streaming requests require the connect key `manus-ollama-2024`.

πŸ“‘ API Endpoints

### GET /

Health check endpoint returning service status and endpoint URL.

Response:

```json
{
  "status": "online",
  "model": "deepseek-r1:1.5b",
  "endpoint": "https://your-space-url.hf.space"
}
```

### POST /stream

Real-time streaming chat completions.

Request:

```json
{
  "prompt": "Explain quantum computing",
  "key": "manus-ollama-2024"
}
```

Response: Server-Sent Events (SSE) stream

```
data: {"text": "Quantum", "done": false}
data: {"text": " computing", "done": false}
data: {"text": " is...", "done": true}
```
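To illustrate how a client reassembles the streamed tokens, here is a minimal, dependency-free sketch that reuses the sample events above (the helper name `accumulate_sse` is ours, not part of the API):

```python
import json

def accumulate_sse(lines):
    """Concatenate the `text` fields of SSE data lines until `done` is true."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator/keep-alive lines between events
        event = json.loads(line[len("data: "):])
        parts.append(event.get("text", ""))
        if event.get("done"):
            break
    return "".join(parts)

events = [
    'data: {"text": "Quantum", "done": false}',
    'data: {"text": " computing", "done": false}',
    'data: {"text": " is...", "done": true}',
]
print(accumulate_sse(events))  # → Quantum computing is...
```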

### GET /models

List available models.

Response:

```json
{
  "models": ["deepseek-r1:1.5b"],
  "default": "deepseek-r1:1.5b"
}
```

### GET /health

Detailed health check with Ollama connection status.

πŸš€ Usage Example

### Python with httpx

```python
import httpx
import json

url = "https://your-space-url.hf.space/stream"
payload = {
    "prompt": "What is artificial intelligence?",
    "key": "manus-ollama-2024"
}

# Stream the response and print tokens as they arrive
with httpx.stream("POST", url, json=payload, timeout=300) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            data = json.loads(line[6:])
            print(data.get("text", ""), end="", flush=True)
            if data.get("done"):
                break
```

### JavaScript/TypeScript

```javascript
const response = await fetch('https://your-space-url.hf.space/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'What is artificial intelligence?',
    key: 'manus-ollama-2024'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

let streaming = true;
while (streaming) {
  const { done, value } = await reader.read();
  if (done) break;

  // { stream: true } keeps multi-byte characters intact across chunks
  const text = decoder.decode(value, { stream: true });

  for (const line of text.split('\n')) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log(data.text);
      if (data.done) streaming = false; // a bare break would only exit this inner loop
    }
  }
}
```

Note: a robust client should also buffer lines that are split across network chunks before calling `JSON.parse`.

### cURL

```bash
curl -X POST "https://your-space-url.hf.space/stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "key": "manus-ollama-2024"}' \
  --no-buffer
```

## ⚡ Performance Optimizations

- **Async I/O:** Full async/await architecture for non-blocking operations
- **Connection pooling:** Reusable HTTP connections with httpx
- **Streaming:** Real-time token streaming with minimal latency
- **Model caching:** Model preloaded on startup
- **Optimized parameters:** Tuned temperature, top_k, and top_p for speed
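The async streaming design can be sketched with a plain asyncio generator. Everything below is illustrative: the token source is faked, and in the real server this generator's output would come from Ollama and be wrapped in a FastAPI `StreamingResponse`:

```python
import asyncio
import json

async def sse_events(tokens):
    """Yield SSE-framed lines for each token, flagging the last one as done."""
    for i, tok in enumerate(tokens):
        done = i == len(tokens) - 1
        # Each event mirrors the {"text": ..., "done": ...} wire format above
        yield f'data: {json.dumps({"text": tok, "done": done})}\n\n'
        await asyncio.sleep(0)  # yield control, as a non-blocking server would

async def main():
    async for line in sse_events(["Hello", " world"]):
        print(line, end="")

asyncio.run(main())
```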

πŸ”’ Security

- Connect key authentication required for all streaming endpoints
- CORS enabled for browser access
- Input validation on all requests
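A minimal sketch of what the key check might look like server-side (the function name is ours, and we assume the server simply compares the request's key against a configured secret; `hmac.compare_digest` is used so the comparison runs in constant time and does not leak timing information):

```python
import hmac

CONNECT_KEY = "manus-ollama-2024"

def is_authorized(payload: dict) -> bool:
    """Return True only if the request carries the expected connect key."""
    supplied = payload.get("key", "")
    # compare_digest avoids short-circuiting on the first mismatched byte
    return hmac.compare_digest(supplied, CONNECT_KEY)

print(is_authorized({"prompt": "hi", "key": "manus-ollama-2024"}))  # → True
print(is_authorized({"prompt": "hi"}))  # → False
```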

πŸ“Š Model Information

- **Model:** deepseek-r1:1.5b
- **Size:** ~1.5B parameters
- **Optimized for:** fast inference and low latency
- **Max tokens:** 2048 per request

πŸ› οΈ Development

Built with:

- FastAPI 0.109.0
- Ollama (latest)
- Python 3.11
- Uvicorn ASGI server

πŸ“ License

MIT License