# Complete Ollama OpenELM Setup Guide
This guide provides complete instructions for setting up a local Ollama instance with Apple's OpenELM model and using it through OpenAI- and Anthropic-compatible APIs.

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Quick Start (One Command)](#quick-start-one-command)
3. [Manual Setup](#manual-setup)
4. [Testing](#testing)
5. [API Usage](#api-usage)
6. [Docker Compose Setup](#docker-compose-setup)
7. [Troubleshooting](#troubleshooting)

---

## Prerequisites

### Required Software

- **Docker**: [Install Docker](https://docs.docker.com/get-docker/)
- **NVIDIA Driver**: required for GPU support (check with `nvidia-smi`)
- **NVIDIA Container Toolkit**: [Install Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
### Verify GPU Access

```bash
# Check that the NVIDIA driver is installed and working
nvidia-smi

# Verify Docker can see the GPU (any recent CUDA base image works)
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```
---

## Quick Start (One Command)

### For Linux/macOS

```bash
# Download and run the complete setup script
curl -O https://raw.githubusercontent.com/your-repo/setup_ollama_openelm.sh
chmod +x setup_ollama_openelm.sh
./setup_ollama_openelm.sh
```

### For Windows (PowerShell)

```powershell
# There is no one-command script for Windows;
# run each command from Manual Setup below individually
```
---

## Manual Setup

### Step 1: Start Ollama Container

```bash
# Start Ollama with GPU support
docker run -d \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 127.0.0.1:11434:11434 \
  --gpus all \
  ollama/ollama

# Verify it's running
docker ps | grep ollama
```

### Step 2: Pull OpenELM Model

```bash
# Pull the 3B parameter model (2.1 GB)
docker exec -it ollama ollama pull apple/OpenELM-3B-Instruct

# Verify installation
docker exec ollama ollama list
```

Expected output:

```
NAME                              ID         SIZE    MODIFIED
apple/OpenELM-3B-Instruct:latest  abc123...  2.1 GB  About a minute ago
```

### Step 3: Install Python Dependencies

```bash
# Create a virtual environment (optional but recommended)
python3 -m venv ollama_env
source ollama_env/bin/activate      # Linux/macOS
# or: .\ollama_env\Scripts\activate # Windows

# Install dependencies
pip install -r requirements_local.txt
```
### Step 4: Run the API Server

```bash
# Start the FastAPI server (listens on port 8001)
python app_ollama.py
```

Or using uvicorn directly:

```bash
uvicorn app_ollama:app --host 0.0.0.0 --port 8001
```
---

## Testing

### Test 1: Verify Ollama is Running

```bash
# Check Ollama status
curl http://127.0.0.1:11434/api/tags

# Should return something like:
# {"models":[{"name":"apple/OpenELM-3B-Instruct"...}]}
```

### Test 2: Quick Generation Test

```bash
# Test basic generation
curl http://127.0.0.1:11434/api/generate \
  -d '{"model": "apple/OpenELM-3B-Instruct", "prompt": "Say hello!", "stream": false}'
```

### Test 3: Run Test Scripts

```bash
# Make the test script executable
chmod +x test_curl.sh

# Run curl tests
./test_curl.sh

# Run Python tests (requires the openai package)
python test_python.py
```

### Test 4: Test the Full API Server

```bash
# Test the OpenAI-format endpoint
curl -X POST http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple/OpenELM-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

# Test the Anthropic-format endpoint
curl -X POST http://127.0.0.1:8001/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple/OpenELM-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
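The two endpoints above return differently shaped JSON. A small helper (hypothetical, for illustration only) shows where the assistant text lives in each format:

```python
def extract_text(response: dict) -> str:
    """Pull the assistant text out of an OpenAI- or Anthropic-style response."""
    if "choices" in response:  # OpenAI chat-completions format
        return response["choices"][0]["message"]["content"]
    return response["content"][0]["text"]  # Anthropic messages format

# Example payloads shaped like the two endpoints' responses
openai_style = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
anthropic_style = {"content": [{"type": "text", "text": "Hello!"}], "role": "assistant"}

print(extract_text(openai_style))     # → Hello!
print(extract_text(anthropic_style))  # → Hello!
```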
---

## API Usage

### Using OpenAI SDK (Python)

```python
from openai import OpenAI

# Connect to the local Ollama server
client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",
    api_key="ollama",  # any string works; Ollama ignores it
)

# Basic usage
response = client.chat.completions.create(
    model="apple/OpenELM-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing simply."}
    ],
    max_tokens=200,
    temperature=0.7
)
print(response.choices[0].message.content)
```

### Using Anthropic SDK (Python)

```python
import anthropic

# Connect to the local API server.
# Note: the SDK appends /v1/messages to base_url itself,
# so base_url must NOT already end in /v1.
client = anthropic.Anthropic(
    base_url="http://127.0.0.1:8001",
    api_key="ollama",  # any string works; the local server ignores it
)

# Basic usage
message = client.messages.create(
    model="apple/OpenELM-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(message.content[0].text)
```
### Using cURL

```bash
# Basic generation (Ollama's native API)
curl http://127.0.0.1:11434/api/generate \
  -d '{"model": "apple/OpenELM-3B-Instruct", "prompt": "Your prompt here"}'

# Chat completion (OpenAI format, via the API server)
curl http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple/OpenELM-3B-Instruct",
    "messages": [{"role": "user", "content": "Your prompt here"}],
    "max_tokens": 100
  }'
```

---

## Docker Compose Setup

For easier deployment, use Docker Compose:

### Step 1: Start All Services

```bash
# Start Ollama and the API server together
docker-compose up -d

# View logs
docker-compose logs -f
```

### Step 2: Access the Services

- **Ollama API**: http://localhost:11434
- **FastAPI Server**: http://localhost:8001

### Step 3: Stop Services

```bash
docker-compose down
```
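The repo ships a `docker-compose.yml`; if you need to recreate it, a minimal sketch along these lines would wire the two services together (the service names and the `Dockerfile.api` build stanza are assumptions based on the file list in this guide):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  api:
    build:
      context: .
      dockerfile: Dockerfile.api
    ports:
      - "8001:8001"
    environment:
      # Inside the compose network, the Ollama service is reachable by name
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama:
```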
---

## Troubleshooting

### Issue: GPU Not Detected

**Error**: `Error response from daemon: could not select device driver "" with capabilities: [[gpu]]`

**Solution**:

```bash
# Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```

### Issue: Model Download Fails

**Error**: `Error: pull model manifest`

**Solution**:

```bash
# Check the network connection
curl -I https://huggingface.co

# Retry the pull
docker exec -it ollama ollama pull apple/OpenELM-3B-Instruct
```

### Issue: API Server Can't Connect to Ollama

**Error**: `Connection refused` or `Ollama not responding`

**Solution**:

```bash
# Check if Ollama is running
docker ps | grep ollama

# Check Ollama logs
docker logs ollama

# Restart Ollama
docker restart ollama
```

### Issue: Out of Memory

**Error**: `CUDA out of memory`

**Solution**:

- Reduce the `max_tokens` parameter
- Use smaller batch sizes
- Restart the Ollama container to free memory

### Issue: Port Already in Use

**Error**: `Address already in use`

**Solution**:

```bash
# Find the process using the port
lsof -i :11434                 # Linux/macOS
netstat -ano | findstr :11434  # Windows

# Kill the process or use a different port
```
---

## File Structure

```
ollama-openelm/
├── setup_ollama_openelm.sh   # Complete setup script
├── app_ollama.py             # FastAPI server
├── requirements_local.txt    # Python dependencies
├── docker-compose.yml        # Docker Compose configuration
├── Dockerfile.api            # API server Docker image
├── test_python.py            # Python test script
├── test_curl.sh              # cURL test script
└── README.md                 # This file
```
---

## Environment Variables

### For API Server

| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_BASE_URL` | `http://127.0.0.1:11434` | Ollama server URL |
| `OLLAMA_MODEL` | `apple/OpenELM-3B-Instruct` | Model name |
| `PORT` | `8001` | API server port |
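Inside `app_ollama.py`, these variables would typically be read with `os.environ.get`, falling back to the defaults in the table above (a sketch of the pattern; the actual file may differ):

```python
import os

# Read configuration from the environment, falling back to documented defaults
OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://127.0.0.1:11434")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "apple/OpenELM-3B-Instruct")
PORT = int(os.environ.get("PORT", "8001"))

print(OLLAMA_BASE_URL, OLLAMA_MODEL, PORT)
```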
### For Ollama Container

| Variable | Description |
|----------|-------------|
| `OLLAMA_HOST` | Bind address the Ollama server listens on (default `127.0.0.1:11434`) |
| `OLLAMA_MODELS` | Path to model storage |

---

## Performance Tips

1. **GPU Memory**: The 3B model uses ~6 GB of GPU memory
2. **CPU Inference**: Falls back to CPU if no GPU is available (slower)
3. **Output Length**: Use `num_predict` to cap how many tokens are generated
4. **Temperature**: Lower values (0.0-0.5) give more deterministic output
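Tips 3 and 4 map onto the `options` field of Ollama's native `/api/generate` request body. Building the payload in Python makes the mapping explicit (the prompt and parameter values here are only examples):

```python
import json

# Request body for Ollama's /api/generate endpoint:
# num_predict caps output length, low temperature makes output more deterministic
payload = {
    "model": "apple/OpenELM-3B-Instruct",
    "prompt": "Summarize this in one sentence: ...",
    "stream": False,
    "options": {
        "num_predict": 128,   # maximum tokens to generate
        "temperature": 0.2,   # 0.0-0.5 range for deterministic output
    },
}
print(json.dumps(payload, indent=2))
```

The printed JSON can be sent as-is, e.g. `curl http://127.0.0.1:11434/api/generate -d "$(python build_payload.py)"` (script name hypothetical).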
---

## Additional Resources

- [Ollama Documentation](https://ollama.com/)
- [OpenELM Model Card](https://huggingface.co/apple/OpenELM-3B-Instruct)
- [OpenAI API Compatibility](https://platform.openai.com/docs/api-reference)
- [FastAPI Documentation](https://fastapi.tiangolo.com/)

---

## License

This setup is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses.