# Complete Ollama OpenELM Setup Guide
This guide provides complete instructions to set up a local Ollama instance with Apple's OpenELM model and use it via OpenAI/Anthropic compatible APIs.
## Table of Contents
- Prerequisites
- Quick Start (One Command)
- Manual Setup
- Testing
- API Usage
- Docker Compose Setup
- Troubleshooting
## Prerequisites

### Required Software

- **Docker**: Install Docker
- **NVIDIA Driver**: for GPU support (check with `nvidia-smi`)
- **NVIDIA Container Toolkit**: Install Guide
### Verify GPU Access

```bash
# Check the NVIDIA driver
nvidia-smi

# Verify Docker can see the GPU
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```
## Quick Start (One Command)

### For Linux/macOS

```bash
# Download and run the complete setup script
curl -O https://raw.githubusercontent.com/your-repo/setup_ollama_openelm.sh
chmod +x setup_ollama_openelm.sh
./setup_ollama_openelm.sh
```

### For Windows (PowerShell)

Run each command manually (see Manual Setup below).
## Manual Setup

### Step 1: Start the Ollama Container

```bash
# Start Ollama with GPU support
docker run -d \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 127.0.0.1:11434:11434 \
  --gpus all \
  ollama/ollama

# Verify it's running
docker ps | grep ollama
```
### Step 2: Pull the OpenELM Model

```bash
# Pull the 3B parameter model (2.1 GB)
docker exec -it ollama ollama pull apple/OpenELM-3B-Instruct

# Verify the installation
docker exec ollama ollama list
```

Expected output:

```
NAME                              ID         SIZE    MODIFIED
apple/OpenELM-3B-Instruct:latest  abc123...  2.1 GB  About a minute ago
```
### Step 3: Install Python Dependencies

```bash
# Create a virtual environment (optional but recommended)
python3 -m venv ollama_env
source ollama_env/bin/activate       # Linux/macOS
# or: .\ollama_env\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements_local.txt
```
### Step 4: Run the API Server

```bash
# Start the FastAPI server (listens on port 8001)
python app_ollama.py
```

Or using uvicorn directly:

```bash
uvicorn app_ollama:app --host 0.0.0.0 --port 8001
```
## Testing

### Test 1: Verify Ollama Is Running

```bash
# Check Ollama status
curl http://127.0.0.1:11434/api/tags

# Should return something like:
# {"models":[{"name":"apple/OpenELM-3B-Instruct"...}]}
```
### Test 2: Quick Generation Test

```bash
# Test basic generation
curl http://127.0.0.1:11434/api/generate \
  -d '{"model": "apple/OpenELM-3B-Instruct", "prompt": "Say hello!", "stream": false}'
```
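The same request can be issued from Python with only the standard library. A minimal sketch using this guide's defaults (the helper names here are illustrative, not part of the repo's scripts):

```python
import json
from urllib import request

OLLAMA_URL = "http://127.0.0.1:11434"  # default port from this guide

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(model: str, prompt: str) -> str:
    """Send a non-streaming generation request and return the response text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama instance):
# print(generate("apple/OpenELM-3B-Instruct", "Say hello!"))
```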
### Test 3: Run the Test Scripts

```bash
# Make the test scripts executable
chmod +x test_curl.sh

# Run the cURL tests
./test_curl.sh

# Run the Python tests (requires the openai package)
python test_python.py
```
### Test 4: Test the Full API Server

```bash
# Test the OpenAI-format endpoint
curl -X POST http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple/OpenELM-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

# Test the Anthropic-format endpoint
curl -X POST http://127.0.0.1:8001/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple/OpenELM-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
## API Usage

### Using the OpenAI SDK (Python)

```python
from openai import OpenAI

# Connect to the local Ollama instance
client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",
    api_key="ollama",  # Any string works
)

# Basic usage
response = client.chat.completions.create(
    model="apple/OpenELM-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing simply."},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
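If you parse the raw JSON yourself instead of using the SDK, the OpenAI-compatible body nests the reply under `choices[0].message.content`. A small helper, with a made-up example body for illustration:

```python
def extract_reply(completion: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion body."""
    return completion["choices"][0]["message"]["content"]

# Illustrative response body (not real model output)
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hi there!"}}
    ]
}
print(extract_reply(sample))  # Hi there!
```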
### Using the Anthropic SDK (Python)

```python
import anthropic

# Connect to the local API server (which proxies to Ollama)
client = anthropic.Anthropic(
    base_url="http://127.0.0.1:8001",  # The SDK appends /v1/messages itself
    api_key="ollama",  # Any string works
)

# Basic usage
message = client.messages.create(
    model="apple/OpenELM-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(message.content[0].text)
```
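Anthropic-style bodies carry a list of content blocks rather than a single string, which is why the text lives at `content[0].text`. A helper that joins all text blocks when parsing the raw JSON (the example body is illustrative, not real output):

```python
def extract_text(message: dict) -> str:
    """Join the text blocks of an Anthropic-style message body."""
    return "".join(
        block["text"]
        for block in message["content"]
        if block.get("type") == "text"
    )

# Illustrative response body (not real model output)
sample = {
    "role": "assistant",
    "content": [{"type": "text", "text": "Hello!"}],
}
print(extract_text(sample))  # Hello!
```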
### Using cURL

```bash
# Basic generation
curl http://127.0.0.1:11434/api/generate \
  -d '{"model": "apple/OpenELM-3B-Instruct", "prompt": "Your prompt here"}'

# Chat completion (OpenAI format)
curl http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple/OpenELM-3B-Instruct",
    "messages": [{"role": "user", "content": "Your prompt here"}],
    "max_tokens": 100
  }'
```
## Docker Compose Setup

For easier deployment, use Docker Compose.

### Step 1: Start All Services

```bash
# Start Ollama and the API server together
docker-compose up -d

# View logs
docker-compose logs -f
```

### Step 2: Access the Services

- Ollama API: http://localhost:11434
- FastAPI Server: http://localhost:8001

### Step 3: Stop Services

```bash
docker-compose down
```
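The repo ships its own `docker-compose.yml`; if you need to recreate it, a minimal sketch might look like the following. The service names, the `Dockerfile.api` build, and the `OLLAMA_BASE_URL` wiring are assumptions based on the files and variables this guide mentions, not a copy of the actual file:

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  api:
    build:
      context: .
      dockerfile: Dockerfile.api
    ports:
      - "8001:8001"
    environment:
      # Inside the compose network, Ollama is reachable by service name
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama:
```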
## Troubleshooting

### Issue: GPU Not Detected

**Error:** `Error response from daemon: could not select device driver "" with capabilities: [[gpu]]`

**Solution:**

```bash
# Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
### Issue: Model Download Fails

**Error:** `Error: pull model manifest`

**Solution:**

```bash
# Check the network connection
curl -I https://huggingface.co

# Retry with verbose output
docker exec -it ollama ollama pull apple/OpenELM-3B-Instruct --verbose
```
### Issue: API Server Can't Connect to Ollama

**Error:** `Connection refused` or Ollama not responding

**Solution:**

```bash
# Check whether Ollama is running
docker ps | grep ollama

# Check the Ollama logs
docker logs ollama

# Restart Ollama
docker restart ollama
```
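A startup script can also wait for Ollama programmatically before launching the API server. A standard-library sketch (the URL and timeout values follow this guide's defaults; the function name is illustrative):

```python
import time
from urllib import error, request

def wait_for_ollama(url: str = "http://127.0.0.1:11434/api/tags",
                    timeout: float = 60.0) -> bool:
    """Poll Ollama's /api/tags endpoint until it answers.

    Returns True as soon as the server responds, or False once
    `timeout` seconds have elapsed without a successful reply.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with request.urlopen(url, timeout=5):
                return True
        except (error.URLError, OSError):
            time.sleep(2)  # server not up yet; retry shortly
    return False

# Usage:
# if not wait_for_ollama():
#     raise SystemExit("Ollama did not come up in time")
```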
### Issue: Out of Memory

**Error:** `CUDA out of memory`

**Solution:**

- Reduce the `max_tokens` parameter
- Use smaller batch sizes
- Restart the Ollama container to free memory
### Issue: Port Already in Use

**Error:** `Address already in use`

**Solution:**

```bash
# Find the process using the port
lsof -i :11434                   # Linux/macOS
netstat -ano | findstr :11434    # Windows

# Kill the process or use a different port
```
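A quick cross-platform way to check a port before starting a server, using only the Python standard library (the helper name is illustrative):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        # connect_ex returns 0 on a successful TCP connection
        return s.connect_ex((host, port)) == 0

# Usage:
# if port_in_use(11434):
#     print("Ollama (or something else) is already on 11434")
```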
## File Structure

```
ollama-openelm/
├── setup_ollama_openelm.sh   # Complete setup script
├── app_ollama.py             # FastAPI server
├── requirements_local.txt    # Python dependencies
├── docker-compose.yml        # Docker Compose configuration
├── Dockerfile.api            # API server Docker image
├── test_python.py            # Python test script
├── test_curl.sh              # cURL test script
└── README.md                 # This file
```
## Environment Variables

### For the API Server

| Variable | Default | Description |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://127.0.0.1:11434` | Ollama server URL |
| `OLLAMA_MODEL` | `apple/OpenELM-3B-Instruct` | Model name |
| `PORT` | `8001` | API server port |

### For the Ollama Container

| Variable | Description |
|---|---|
| `OLLAMA_HOST` | Override the Ollama server URL |
| `OLLAMA_MODELS` | Path to model storage |
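An API server can pick these variables up with plain `os.environ` lookups. A minimal sketch with the defaults from the table above (the `load_config` function is illustrative, not taken from `app_ollama.py`):

```python
import os

def load_config(env=os.environ) -> dict:
    """Resolve the API server's settings, falling back to this guide's defaults."""
    return {
        "base_url": env.get("OLLAMA_BASE_URL", "http://127.0.0.1:11434"),
        "model": env.get("OLLAMA_MODEL", "apple/OpenELM-3B-Instruct"),
        "port": int(env.get("PORT", "8001")),
    }

# Usage:
# cfg = load_config()
# uvicorn.run(app, host="0.0.0.0", port=cfg["port"])
```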
## Performance Tips

- **GPU Memory**: the 3B model uses about 6 GB of GPU memory
- **CPU Inference**: Ollama falls back to CPU if no GPU is available (slower)
- **Output Length**: use `num_predict` to control how many tokens are generated
- **Temperature**: lower values (0.0–0.5) give more deterministic output
## License

This setup is provided for educational and research purposes. Apple's OpenELM models are released under their own respective licenses.