---
title: AGI Multi-Model API
emoji: 😻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: Multi-model LLM API with caching and web search
---
# AGI Multi-Model API
A high-performance FastAPI server for running multiple LLM models with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.
## ✨ Features
- **πŸ”„ Dynamic Model Switching**: Hot-swap between 5 different LLM models
- **⚑ Intelligent Caching**: LRU cache keeps up to 2 models in memory for instant switching
- **🌐 Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
- **πŸ“š OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
- **πŸ“– Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
- **πŸš€ Optimized Performance**: Continuous batching and multi-port architecture
## πŸ€– Available Models
| Model | Use Case | Size |
|-------|----------|------|
| **deepseek-chat** (default) | General purpose conversation | 7B |
| **mistral-7b** | Financial analysis & summarization | 7B |
| **openhermes-7b** | Advanced instruction following | 7B |
| **deepseek-coder** | Specialized coding assistance | 6.7B |
| **llama-7b** | Lightweight & fast responses | 7B |
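
Once the server is running (see Quick Start below), you can confirm which models it exposes by querying the `/models` endpoint. A minimal check (the response shape is whatever `app.py` returns, so it is simply printed here):

```python
import requests

# List the models the server advertises (see "API Endpoints" below)
response = requests.get("http://localhost:8000/models")
print(response.json())
```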
## πŸš€ Quick Start
### Prerequisites
- Python 3.10+
- `llama-server` (llama.cpp)
- 8GB+ RAM (16GB+ recommended for caching multiple models)
### Installation
```bash
# Clone the repository
git clone <your-repo-url>
cd AGI
# Install dependencies
pip install -r requirements.txt
# or
uv pip install -r pyproject.toml
# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
```
### Docker Deployment
```bash
# Build the image
docker build -t agi-api .
# Run the container
docker run -p 8000:8000 agi-api
```
## πŸ“– API Documentation
Once the server is running, access the interactive documentation:
- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI JSON**: http://localhost:8000/openapi.json
## πŸ”§ Usage Examples
### Basic Chat Completion
```python
import requests
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
```
### Web-Augmented Chat
```python
response = requests.post(
    "http://localhost:8000/v1/web-chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What are the latest AI developments?"}
        ],
        "max_tokens": 512,
        "max_search_results": 5
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
print(f"Sources: {result['web_search']['sources']}")
```
### Switch Models
```python
# Switch to coding model
response = requests.post(
    "http://localhost:8000/switch-model",
    json={"model_name": "deepseek-coder"}
)
print(response.json())
# Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
```
### Check Cache Status
```python
response = requests.get("http://localhost:8000/cache/info")
cache_info = response.json()
print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
for model in cache_info['cached_models']:
    print(f"  - {model['name']} on port {model['port']}")
```
## πŸ—οΈ Architecture
### Model Caching System
The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Request: Switch to Model A                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1. Check if A is current β†’ Skip           β”‚
β”‚ 2. Check cache for A                      β”‚
β”‚    β”œβ”€ Cache Hit  β†’ Instant switch (< 1s)  β”‚
β”‚    └─ Cache Miss β†’ Load model (~2-3 min)  β”‚
β”‚ 3. If cache full β†’ Evict LRU model        β”‚
β”‚ 4. Add A to cache                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
**Benefits:**
- First load: ~2-3 minutes (model download + initialization)
- Subsequent switches: < 1 second (from cache)
- Automatic memory management with LRU eviction
- Each model runs on a separate port (8080, 8081, etc.)
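
For reference, the eviction behaviour above can be sketched with Python's `OrderedDict`. This is only an illustration of the LRU logic, not the actual code in `app.py`; the `load_model` callable and the port bookkeeping are hypothetical:

```python
from collections import OrderedDict
from typing import Callable

class ModelCache:
    """Illustrative LRU cache: the most recently used models stay loaded."""

    def __init__(self, max_size: int = 2):
        self.max_size = max_size
        self._cache: OrderedDict[str, int] = OrderedDict()  # model name -> port

    def get_or_load(self, name: str, load_model: Callable[[str], int]) -> int:
        if name in self._cache:
            self._cache.move_to_end(name)          # cache hit: mark as most recently used
            return self._cache[name]
        if len(self._cache) >= self.max_size:      # cache full: evict least recently used
            evicted, port = self._cache.popitem(last=False)
            print(f"Evicting {evicted} (was on port {port})")
        port = load_model(name)                    # cache miss: start a llama-server (~2-3 min)
        self._cache[name] = port
        return port
```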
### Multi-Port Architecture
```
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚    FastAPI Server    β”‚
        β”‚     (Port 8000)      β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                     β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”
β”‚  llama-server  β”‚   β”‚  llama-server  β”‚
β”‚  Model A:8080  β”‚   β”‚  Model B:8081  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
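
Conceptually, the FastAPI process on port 8000 forwards each chat request to whichever llama-server backend is currently active. A minimal sketch of that proxying step, assuming an `httpx` client and an `ACTIVE_PORT` variable maintained by the switching logic (both are assumptions, not the exact code in `app.py`):

```python
import httpx
from fastapi import FastAPI, Request

app = FastAPI()
ACTIVE_PORT = 8080  # assumed: updated whenever /switch-model changes the active model

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    payload = await request.json()
    # Forward the OpenAI-style payload to the active llama-server instance
    async with httpx.AsyncClient(timeout=300.0) as client:
        upstream = await client.post(
            f"http://127.0.0.1:{ACTIVE_PORT}/v1/chat/completions",
            json=payload,
        )
    return upstream.json()
```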
## βš™οΈ Configuration
### Environment Variables
```bash
# Maximum cached models (default: 2)
MAX_CACHED_MODELS=2
# Base port for llama-server instances (default: 8080)
BASE_PORT=8080
# Default model on startup
DEFAULT_MODEL=deepseek-chat
```
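
These are plain environment variables, so a sketch of how they might be read at startup (illustrative; the actual handling lives in `app.py`):

```python
import os

# Fall back to the documented defaults when a variable is not set
MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
BASE_PORT = int(os.getenv("BASE_PORT", "8080"))
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "deepseek-chat")
```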
### Model Configuration
Edit `AVAILABLE_MODELS` in `app.py` to add custom models:
```python
AVAILABLE_MODELS = {
    "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
}
```
## πŸ“Š API Endpoints
### Status & Models
- `GET /` - API status and current model
- `GET /models` - List available models
- `GET /cache/info` - Cache statistics and cached models
### Model Management
- `POST /switch-model` - Switch active model (with caching)
### Chat Completions
- `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
- `POST /v1/web-chat/completions` - Web-augmented chat with search
### Documentation
- `GET /docs` - Swagger UI interactive documentation
- `GET /redoc` - ReDoc alternative documentation
- `GET /openapi.json` - OpenAPI 3.0 specification export
## πŸ§ͺ Testing
```bash
# Test basic chat
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
# Check cache status
curl http://localhost:8000/cache/info
# Switch models
curl -X POST http://localhost:8000/switch-model \
  -H "Content-Type: application/json" \
  -d '{"model_name": "deepseek-coder"}'
```
## πŸ” Web Search Integration
The web-augmented chat endpoint automatically:
1. Extracts the user's query from the last message
2. Performs a DuckDuckGo web search
3. Injects search results into the LLM context
4. Returns response with source citations
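
One plausible implementation of steps 2 and 3 uses the `duckduckgo_search` package; this is a hedged sketch, and the actual search helper in `app.py` may differ:

```python
from duckduckgo_search import DDGS

def build_search_context(query: str, max_results: int = 5) -> str:
    """Fetch web results and format them for injection into the LLM prompt."""
    with DDGS() as ddgs:
        results = list(ddgs.text(query, max_results=max_results))
    # Each result is a dict with 'title', 'href', and 'body' keys
    lines = [f"- {r['title']} ({r['href']}): {r['body']}" for r in results]
    return "Web search results:\n" + "\n".join(lines)
```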
**Use cases:**
- Current events and news
- Recent developments beyond training data
- Fact-checking with live web data
- Research with source attribution
## πŸ“ˆ Performance Tips
1. **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model ~4-5GB)
2. **CPU Threads**: Adjust `-t` parameter in `start_llama_server()` based on your CPU cores
3. **Batch Size**: Modify `-b` parameter for throughput vs. latency tradeoff
4. **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp with GPU support)
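
These tips map onto the `llama-server` command line that `start_llama_server()` assembles. A hypothetical invocation with example flag values rather than the project's actual defaults (the real function also resolves the GGUF file for the selected model):

```python
import subprocess

def start_llama_server(model_path: str, port: int) -> subprocess.Popen:
    """Hypothetical sketch: launch one llama-server backend with tuning flags."""
    return subprocess.Popen([
        "llama-server",
        "-m", model_path,      # path to the GGUF file
        "--port", str(port),   # one port per cached model (8080, 8081, ...)
        "-t", "8",             # CPU threads (tip 2)
        "-b", "512",           # batch size (tip 3)
        "-ngl", "0",           # GPU layers; set > 0 to offload to a GPU (tip 4)
    ])
```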
## πŸ› οΈ Development
### Project Structure
```
AGI/
β”œβ”€β”€ app.py # Main FastAPI application
β”œβ”€β”€ client_multi_model.py # Example client
β”œβ”€β”€ Dockerfile # Docker configuration
β”œβ”€β”€ pyproject.toml # Python dependencies
└── README.md # This file
```
### Adding New Models
1. Find a GGUF model on HuggingFace
2. Add to `AVAILABLE_MODELS` dict
3. Restart the server
4. Switch to your new model via API
## πŸ“ License
Apache 2.0 - See LICENSE file for details
## 🀝 Contributing
Contributions welcome! Please feel free to submit a Pull Request.
## πŸ› Troubleshooting
### Model fails to load
- Ensure `llama-server` is in your PATH
- Check available disk space for model downloads
- Verify internet connectivity for first-time model downloads
### Out of memory errors
- Reduce `MAX_CACHED_MODELS` to 1
- Use smaller quantized models (Q4_K_M instead of Q8)
- Increase system swap space
### Port conflicts
- Change `BASE_PORT` if 8080+ are in use
- Check for other llama-server instances: `ps aux | grep llama`
## πŸ“š Additional Resources
- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [Hugging Face Models](https://huggingface.co/models?library=gguf)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)
---
Built with ❀️ using FastAPI and llama.cpp