---
title: Ollama FastAPI Streaming Server
emoji: πŸš€
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
# Ollama FastAPI Real-Time Streaming Server
A fast, optimized FastAPI server with Ollama for real-time streaming inference using the **deepseek-r1:1.5b** model.
## πŸ”‘ Authentication
All streaming requests require the connect key `manus-ollama-2024`, passed as the `key` field of the JSON request body.
## πŸ“‘ API Endpoints
### GET `/`
Health check endpoint returning service status and endpoint URL.
**Response:**
```json
{
  "status": "online",
  "model": "deepseek-r1:1.5b",
  "endpoint": "https://your-space-url.hf.space"
}
```
### POST `/stream`
Real-time streaming chat completions.
**Request:**
```json
{
  "prompt": "Explain quantum computing",
  "key": "manus-ollama-2024"
}
```
**Response:** Server-Sent Events (SSE) stream
```
data: {"text": "Quantum", "done": false}
data: {"text": " computing", "done": false}
data: {"text": " is...", "done": true}
```
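Each `data:` line carries one JSON payload. A minimal helper for decoding a single SSE line of this format (the `parse_sse_line` name is illustrative, not part of the server's API):

```python
import json

def parse_sse_line(line: str):
    """Decode one SSE line of the form 'data: {...}'.

    Returns the parsed JSON payload, or None for lines that
    are not data lines (e.g. blank keep-alive lines).
    """
    prefix = "data: "
    if line.startswith(prefix):
        return json.loads(line[len(prefix):])
    return None
```

A payload with `"done": true` signals the final chunk of the stream.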
### GET `/models`
List available models.
**Response:**
```json
{
  "models": ["deepseek-r1:1.5b"],
  "default": "deepseek-r1:1.5b"
}
```
### GET `/health`
Detailed health check with Ollama connection status.
## πŸš€ Usage Example
### Python with httpx
```python
import httpx
import json
url = "https://your-space-url.hf.space/stream"
payload = {
    "prompt": "What is artificial intelligence?",
    "key": "manus-ollama-2024"
}

with httpx.stream("POST", url, json=payload, timeout=300) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            data = json.loads(line[6:])
            print(data.get("text", ""), end="", flush=True)
            if data.get("done"):
                break
```
### JavaScript/TypeScript
```javascript
const response = await fetch('https://your-space-url.hf.space/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'What is artificial intelligence?',
    key: 'manus-ollama-2024'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters split across chunks intact
  const text = decoder.decode(value, { stream: true });
  for (const line of text.split('\n')) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log(data.text);
      if (data.done) finished = true; // a bare break here would only exit the inner loop
    }
  }
}
```
### cURL
```bash
curl -X POST "https://your-space-url.hf.space/stream" \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello, how are you?", "key": "manus-ollama-2024"}' \
--no-buffer
```
## ⚑ Performance Optimizations
- **Async I/O**: Full async/await architecture for non-blocking operations
- **Connection pooling**: Reusable HTTP connections with httpx
- **Streaming**: Real-time token streaming with minimal latency
- **Model caching**: Model preloaded on startup
- **Optimized parameters**: Tuned temperature, top_k, and top_p for speed
## πŸ”’ Security
- Connect key authentication required for all streaming endpoints
- CORS enabled for browser access
- Input validation on all requests
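The server's key check isn't shown here; a minimal sketch of how it might be implemented (the `is_authorized` helper is hypothetical), using a constant-time comparison from the standard library:

```python
import hmac

CONNECT_KEY = "manus-ollama-2024"  # the connect key from the Authentication section

def is_authorized(key: str) -> bool:
    # compare_digest runs in constant time, avoiding timing side channels
    # that a plain == comparison could leak.
    return hmac.compare_digest(key, CONNECT_KEY)
```

A handler would call this on the `key` field of each request and return HTTP 401 when it fails.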
## πŸ“Š Model Information
- **Model**: deepseek-r1:1.5b
- **Size**: ~1.5B parameters
- **Optimized for**: Fast inference and low latency
- **Max tokens**: 2048 per request
## πŸ› οΈ Development
Built with:
- FastAPI 0.109.0
- Ollama (latest)
- Python 3.11
- Uvicorn ASGI server
## πŸ“ License
MIT License