
Complete Ollama OpenELM Setup Guide

This guide provides complete instructions for setting up a local Ollama instance with Apple's OpenELM model and using it through OpenAI- and Anthropic-compatible APIs.

Table of Contents

  1. Prerequisites
  2. Quick Start (One Command)
  3. Manual Setup
  4. Testing
  5. API Usage
  6. Docker Compose Setup
  7. Troubleshooting

Prerequisites

Required Software

  • Docker (with the NVIDIA Container Toolkit for GPU support)
  • NVIDIA GPU driver (optional; Ollama falls back to CPU without it)
  • Python 3.8+ with pip
  • curl (used by the test scripts)

Verify GPU Access

# Check NVIDIA driver
nvidia-smi

# Verify Docker can see GPU
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

Quick Start (One Command)

For Linux/macOS

# Download and run the complete setup script
curl -O https://raw.githubusercontent.com/your-repo/setup_ollama_openelm.sh
chmod +x setup_ollama_openelm.sh
./setup_ollama_openelm.sh

For Windows (PowerShell)

# Run each command manually (see Manual Setup below)

Manual Setup

Step 1: Start Ollama Container

# Start Ollama with GPU support
docker run -d \
    --name ollama \
    -v ollama:/root/.ollama \
    -p 127.0.0.1:11434:11434 \
    --gpus all \
    ollama/ollama

# Verify it's running
docker ps | grep ollama

Step 2: Pull OpenELM Model

# Pull the 3B parameter model (2.1 GB)
docker exec -it ollama ollama pull apple/OpenELM-3B-Instruct

# Verify installation
docker exec ollama ollama list

Expected output:

NAME                               ID         SIZE    MODIFIED
apple/OpenELM-3B-Instruct:latest   abc123...  2.1 GB  About a minute ago

Step 3: Install Python Dependencies

# Create virtual environment (optional but recommended)
python3 -m venv ollama_env
source ollama_env/bin/activate  # Linux/macOS
# or: .\ollama_env\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements_local.txt

Step 4: Run the API Server

# Start the FastAPI server (runs on port 8001)
python app_ollama.py

Or using uvicorn directly:

uvicorn app_ollama:app --host 0.0.0.0 --port 8001

Testing

Test 1: Verify Ollama is Running

# Check Ollama status
curl http://127.0.0.1:11434/api/tags

# Should return something like:
# {"models":[{"name":"apple/OpenELM-3B-Instruct"...}]}
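The same check can be done from Python with only the standard library. The sketch below queries the /api/tags endpoint and extracts the installed model names; the JSON shape assumed here matches the example response above:

```python
import json
import urllib.request

def list_model_names(tags_json: dict) -> list:
    """Extract model names from an Ollama /api/tags response body."""
    return [m.get("name", "") for m in tags_json.get("models", [])]

def check_ollama(base_url: str = "http://127.0.0.1:11434") -> list:
    """Return installed model names, or raise if Ollama is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return list_model_names(json.load(resp))
```

With Ollama running, `check_ollama()` should include `apple/OpenELM-3B-Instruct` in its result.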

Test 2: Quick Generation Test

# Test basic generation
curl http://127.0.0.1:11434/api/generate \
    -d '{"model": "apple/OpenELM-3B-Instruct", "prompt": "Say hello!", "stream": false}'
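The test above sets `"stream": false` to get a single JSON body. With `"stream": true` (Ollama's default when the flag is omitted), /api/generate emits one JSON object per line instead. A small sketch of reassembling the streamed text, using the `response`/`done` field names from Ollama's streaming format:

```python
import json
from typing import Iterable

def assemble_stream(lines: Iterable) -> str:
    """Join the 'response' fragments of an Ollama NDJSON stream until 'done'."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```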

Test 3: Run Test Scripts

# Make test scripts executable
chmod +x test_curl.sh

# Run curl tests
./test_curl.sh

# Run Python tests (requires openai package)
python test_python.py

Test 4: Test the Full API Server

# Test OpenAI format endpoint
curl -X POST http://127.0.0.1:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "apple/OpenELM-3B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }'

# Test Anthropic format endpoint
curl -X POST http://127.0.0.1:8001/v1/messages \
    -H "Content-Type: application/json" \
    -d '{
        "model": "apple/OpenELM-3B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }'
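The /v1/messages endpoint presumably translates Anthropic-style requests into the OpenAI chat format before forwarding them. The actual logic lives in app_ollama.py; the hypothetical helper below is only a sketch of that translation, based on the public difference between the two formats (Anthropic carries the system prompt as a top-level field, OpenAI as the first message):

```python
def anthropic_to_openai(body: dict) -> dict:
    """Map an Anthropic /v1/messages request to an OpenAI chat.completions one."""
    messages = list(body.get("messages", []))
    if body.get("system"):
        # Anthropic: top-level "system" field -> OpenAI: leading system message
        messages.insert(0, {"role": "system", "content": body["system"]})
    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens", 256),
        "temperature": body.get("temperature", 0.7),
    }
```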

API Usage

Using OpenAI SDK (Python)

from openai import OpenAI

# Connect to local Ollama
client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",
    api_key="ollama",  # Any string works
)

# Basic usage
response = client.chat.completions.create(
    model="apple/OpenELM-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing simply."}
    ],
    max_tokens=200,
    temperature=0.7
)

print(response.choices[0].message.content)

Using Anthropic SDK (Python)

import anthropic

# Connect to local Ollama (via API server)
client = anthropic.Anthropic(
    base_url="http://127.0.0.1:8001/v1",
    api_key="ollama",  # Any string works
)

# Basic usage
message = client.messages.create(
    model="apple/OpenELM-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)

print(message.content[0].text)

Using cURL

# Basic generation
curl http://127.0.0.1:11434/api/generate \
    -d '{"model": "apple/OpenELM-3B-Instruct", "prompt": "Your prompt here"}'

# Chat completion (OpenAI format)
curl http://127.0.0.1:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "apple/OpenELM-3B-Instruct",
        "messages": [{"role": "user", "content": "Your prompt here"}],
        "max_tokens": 100
    }'
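If neither SDK is installed, the same chat call can be made with just the Python standard library. The sketch below builds the identical request the curl example sends; actually sending it requires the API server from Step 4 to be running, so only construction is shown:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request matching the OpenAI-format curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To send: urllib.request.urlopen(build_chat_request(...), timeout=60)
```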

Docker Compose Setup

For easier deployment, use Docker Compose:

Step 1: Start All Services

# Start Ollama and API server together
docker-compose up -d

# View logs
docker-compose logs -f

Step 2: Access the Services

  • Ollama API: http://127.0.0.1:11434
  • API server (OpenAI/Anthropic compatible): http://127.0.0.1:8001

Step 3: Stop Services

docker-compose down

Troubleshooting

Issue: GPU Not Detected

Error: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

Solution:

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Issue: Model Download Fails

Error: Error: pull model manifest

Solution:

# Check network connection
curl -I https://huggingface.co

# Retry the pull
docker exec -it ollama ollama pull apple/OpenELM-3B-Instruct

Issue: API Server Can't Connect to Ollama

Error: Connection refused or Ollama not responding

Solution:

# Check if Ollama is running
docker ps | grep ollama

# Check Ollama logs
docker logs ollama

# Restart Ollama
docker restart ollama
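Transient connection errors are common right after `docker restart` while the model reloads. A small retry helper with exponential backoff (a generic sketch, not part of app_ollama.py) can absorb them in client scripts:

```python
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 0.5):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))
```

For example, wrap a health check: `with_retries(lambda: urllib.request.urlopen("http://127.0.0.1:11434/api/tags", timeout=5))`.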

Issue: Out of Memory

Error: CUDA out of memory

Solution:

  • Reduce max_tokens parameter
  • Use smaller batch sizes
  • Restart the Ollama container to free memory

Issue: Port Already in Use

Error: Address already in use

Solution:

# Find the process using the port
lsof -i :11434  # Linux/macOS
netstat -ano | findstr :11434  # Windows

# Kill the process or use a different port
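The same check can be scripted portably (Linux, macOS, and Windows) by attempting a TCP connection from Python:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on a successful connection
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(11434)` tells you whether the default Ollama port is taken.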

File Structure

ollama-openelm/
├── setup_ollama_openelm.sh    # Complete setup script
├── app_ollama.py              # FastAPI server
├── requirements_local.txt     # Python dependencies
├── docker-compose.yml         # Docker Compose configuration
├── Dockerfile.api             # API server Docker image
├── test_python.py             # Python test script
├── test_curl.sh               # cURL test script
└── README.md                  # This file

Environment Variables

For API Server

Variable         Default                    Description
OLLAMA_BASE_URL  http://127.0.0.1:11434     Ollama server URL
OLLAMA_MODEL     apple/OpenELM-3B-Instruct  Model name
PORT             8001                       API server port
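Resolving these variables in Python with the defaults above (a sketch of what app_ollama.py is assumed to do at startup):

```python
import os

def load_config(env=None) -> dict:
    """Resolve API-server settings from environment variables with defaults."""
    if env is None:
        env = os.environ
    return {
        "base_url": env.get("OLLAMA_BASE_URL", "http://127.0.0.1:11434"),
        "model": env.get("OLLAMA_MODEL", "apple/OpenELM-3B-Instruct"),
        "port": int(env.get("PORT", "8001")),
    }
```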

For Ollama Container

Variable       Description
OLLAMA_HOST    Address and port the Ollama server binds to
OLLAMA_MODELS  Path to model storage

Performance Tips

  1. GPU Memory: The 3B model uses ~6 GB of GPU memory
  2. CPU Inference: Ollama falls back to CPU if no GPU is available (slower)
  3. Output Length: Use the num_predict option to cap the number of generated tokens
  4. Temperature: Lower values (0.0-0.5) give more deterministic output
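The length and temperature tips map onto Ollama's `options` object in the /api/generate payload; `num_predict` and `temperature` are Ollama's documented option keys. A small payload builder as a sketch:

```python
def generate_payload(model: str, prompt: str, num_predict: int = 128,
                     temperature: float = 0.3) -> dict:
    """Build an /api/generate body with length and determinism controls."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_predict": num_predict,   # caps output length in tokens
            "temperature": temperature,   # lower = more deterministic
        },
    }
```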

Additional Resources

  • Ollama documentation (REST API and Docker usage)
  • Apple OpenELM model card on Hugging Face
  • FastAPI documentation

License

This setup is provided for educational and research purposes. The OpenELM models from Apple are released under their respective licenses.