---
title: AGI Multi-Model API
emoji: 💻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: Multi-model LLM API with caching and web search
---
# AGI Multi-Model API

A high-performance FastAPI server for running multiple LLM models with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.
## ✨ Features

- **🔄 Dynamic Model Switching**: Hot-swap between 5 different LLM models
- **⚡ Intelligent Caching**: LRU cache keeps up to 2 models in memory for instant switching
- **🌐 Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
- **🔌 OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
- **📚 Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
- **🚀 Optimized Performance**: Continuous batching and multi-port architecture
## 🤖 Available Models

| Model | Use Case | Size |
|-------|----------|------|
| **deepseek-chat** (default) | General purpose conversation | 7B |
| **mistral-7b** | Financial analysis & summarization | 7B |
| **openhermes-7b** | Advanced instruction following | 7B |
| **deepseek-coder** | Specialized coding assistance | 6.7B |
| **llama-7b** | Lightweight & fast responses | 7B |
## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- `llama-server` (llama.cpp)
- 8GB+ RAM (16GB+ recommended for caching multiple models)

### Installation

```bash
# Clone the repository
git clone <your-repo-url>
cd AGI

# Install dependencies
pip install -r requirements.txt
# or
uv pip install -r pyproject.toml

# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
```
### Docker Deployment

```bash
# Build the image
docker build -t agi-api .

# Run the container
docker run -p 8000:8000 agi-api
```
## 📚 API Documentation

Once the server is running, access the interactive documentation:

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI JSON**: http://localhost:8000/openapi.json
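
For example, the OpenAPI spec can be pulled programmatically to list every route (a minimal sketch using `requests`; this works against any FastAPI app):

```python
import requests

# Fetch the OpenAPI 3.0 specification and print the available routes
spec = requests.get("http://localhost:8000/openapi.json").json()
for path, methods in spec["paths"].items():
    print(path, "->", ", ".join(m.upper() for m in methods))
```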
## 🔧 Usage Examples

### Basic Chat Completion

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
```
### Web-Augmented Chat

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/web-chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What are the latest AI developments?"}
        ],
        "max_tokens": 512,
        "max_search_results": 5
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
print(f"Sources: {result['web_search']['sources']}")
```
### Switch Models

```python
import requests

# Switch to the coding model
response = requests.post(
    "http://localhost:8000/switch-model",
    json={"model_name": "deepseek-coder"}
)
print(response.json())
# Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
```
### Check Cache Status

```python
import requests

response = requests.get("http://localhost:8000/cache/info")
cache_info = response.json()
print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
for model in cache_info['cached_models']:
    print(f"  - {model['name']} on port {model['port']}")
```
## 🏗️ Architecture

### Model Caching System

The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:

```
┌────────────────────────────────────────────────────┐
│  Request: Switch to Model A                        │
├────────────────────────────────────────────────────┤
│  1. Check if A is current → Skip                   │
│  2. Check cache for A                              │
│     ├─ Cache Hit  → Instant switch (< 1s)          │
│     └─ Cache Miss → Load model (~2-3 min)          │
│  3. If cache full → Evict LRU model                │
│  4. Add A to cache                                 │
└────────────────────────────────────────────────────┘
```

**Benefits:**

- First load: ~2-3 minutes (model download + initialization)
- Subsequent switches: < 1 second (from cache)
- Automatic memory management with LRU eviction
- Each model runs on a separate port (8080, 8081, etc.)
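
For illustration, the eviction logic can be sketched with `collections.OrderedDict`; the class and method names below are hypothetical, not the actual implementation in `app.py`:

```python
from collections import OrderedDict

class ModelCache:
    """Illustrative LRU cache mapping model names to llama-server ports."""

    def __init__(self, max_size: int = 2):
        self.max_size = max_size
        self.models: OrderedDict[str, int] = OrderedDict()  # name -> port

    def get(self, name: str) -> int | None:
        if name not in self.models:
            return None  # cache miss: the caller must load the model
        self.models.move_to_end(name)  # mark as most recently used
        return self.models[name]

    def put(self, name: str, port: int) -> None:
        if len(self.models) >= self.max_size:
            # Evict the least recently used model to free memory
            evicted_name, evicted_port = self.models.popitem(last=False)
            print(f"Evicting {evicted_name} (port {evicted_port})")
        self.models[name] = port
```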
### Multi-Port Architecture

```
        ┌──────────────────────┐
        │    FastAPI Server    │
        │     (Port 8000)      │
        └──────────┬───────────┘
                   │
        ┌──────────┴───────────┐
        │                      │
┌───────▼────────┐     ┌───────▼────────┐
│  llama-server  │     │  llama-server  │
│  Model A:8080  │     │  Model B:8081  │
└────────────────┘     └────────────────┘
```
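
In this layout the FastAPI process acts as a thin proxy: each chat request is forwarded to the llama-server port that belongs to the active model. A minimal sketch of that forwarding step (the actual routing in `app.py` may differ):

```python
import requests

ACTIVE_PORT = 8080  # illustrative; in practice tracked by the model cache

def forward_chat(payload: dict) -> dict:
    """Forward an OpenAI-style request to the active llama-server instance."""
    url = f"http://127.0.0.1:{ACTIVE_PORT}/v1/chat/completions"
    return requests.post(url, json=payload, timeout=300).json()
```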
## ⚙️ Configuration

### Environment Variables

```bash
# Maximum cached models (default: 2)
MAX_CACHED_MODELS=2

# Base port for llama-server instances (default: 8080)
BASE_PORT=8080

# Default model on startup
DEFAULT_MODEL=deepseek-chat
```
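
A minimal sketch of how these variables might be read at startup (the variable names in `app.py` may differ):

```python
import os

# Defaults mirror the values documented above
MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
BASE_PORT = int(os.getenv("BASE_PORT", "8080"))
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "deepseek-chat")
```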
### Model Configuration

Edit `AVAILABLE_MODELS` in `app.py` to add custom models:

```python
AVAILABLE_MODELS = {
    "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
}
```
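
The value appears to pack a Hugging Face repo ID and a GGUF filename separated by a colon; a hedged sketch of splitting such a spec (the string here is the placeholder from above, not a real model):

```python
spec = "username/model-name-GGUF:model-file.Q4_K_M.gguf"
repo_id, filename = spec.split(":", 1)
print(repo_id)   # username/model-name-GGUF
print(filename)  # model-file.Q4_K_M.gguf
```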
## 📊 API Endpoints

### Status & Models

- `GET /` - API status and current model
- `GET /models` - List available models
- `GET /cache/info` - Cache statistics and cached models

### Model Management

- `POST /switch-model` - Switch the active model (with caching)

### Chat Completions

- `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
- `POST /v1/web-chat/completions` - Web-augmented chat with search

### Documentation

- `GET /docs` - Swagger UI interactive documentation
- `GET /redoc` - ReDoc alternative documentation
- `GET /openapi.json` - OpenAPI 3.0 specification export
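
For instance, the status endpoints can be queried directly (a minimal sketch):

```python
import requests

BASE = "http://localhost:8000"

print(requests.get(f"{BASE}/").json())        # API status and current model
print(requests.get(f"{BASE}/models").json())  # available model names
```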
## 🧪 Testing

```bash
# Test basic chat
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check cache status
curl http://localhost:8000/cache/info

# Switch models
curl -X POST http://localhost:8000/switch-model \
  -H "Content-Type: application/json" \
  -d '{"model_name": "deepseek-coder"}'
```
## 🌐 Web Search Integration

The web-augmented chat endpoint automatically (see the sketch after this list):

1. Extracts the user's query from the last message
2. Performs a DuckDuckGo web search
3. Injects search results into the LLM context
4. Returns the response with source citations
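
A hedged sketch of that flow using the `duckduckgo_search` package (the function name and prompt template are illustrative, not the server's exact format):

```python
from duckduckgo_search import DDGS

def build_augmented_prompt(query: str, max_results: int = 5) -> str:
    """Search DuckDuckGo and prepend the results as context for the LLM."""
    results = DDGS().text(query, max_results=max_results)
    context = "\n".join(
        f"- {r['title']}: {r['body']} ({r['href']})" for r in results
    )
    return f"Web search results:\n{context}\n\nQuestion: {query}"
```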
**Use cases:**

- Current events and news
- Recent developments beyond training data
- Fact-checking with live web data
- Research with source attribution
## 📈 Performance Tips

1. **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model needs ~4-5GB)
2. **CPU Threads**: Adjust the `-t` parameter in `start_llama_server()` to match your CPU core count (see the sketch after this list)
3. **Batch Size**: Modify the `-b` parameter to trade throughput against latency
4. **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp built with GPU support)
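
A hedged sketch of what the launch command inside `start_llama_server()` might look like with these flags (the values, and the function itself, are illustrative):

```python
import subprocess

def start_llama_server(model_path: str, port: int) -> subprocess.Popen:
    """Launch a llama-server instance with tuned threading and batching."""
    return subprocess.Popen([
        "llama-server",
        "-m", model_path,     # path to the GGUF model file
        "--port", str(port),  # one port per cached model (8080, 8081, ...)
        "-t", "8",            # CPU threads; match your physical core count
        "-b", "512",          # batch size; larger favors throughput over latency
        "-ngl", "0",          # GPU layers; > 0 offloads to GPU if supported
    ])
```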
## 🛠️ Development

### Project Structure

```
AGI/
├── app.py                   # Main FastAPI application
├── client_multi_model.py    # Example client
├── Dockerfile               # Docker configuration
├── pyproject.toml           # Python dependencies
└── README.md                # This file
```
### Adding New Models

1. Find a GGUF model on Hugging Face
2. Add it to the `AVAILABLE_MODELS` dict
3. Restart the server
4. Switch to your new model via the API
## 📄 License

Apache 2.0 - See the LICENSE file for details.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## 🐛 Troubleshooting

### Model fails to load

- Ensure `llama-server` is on your `PATH` (see the check below)
- Check available disk space for model downloads
- Verify internet connectivity for first-time model downloads
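
To confirm the binary is visible to the server process, a quick check (a minimal sketch):

```python
import shutil

# Prints the resolved path of llama-server, or None if it is not on PATH
print(shutil.which("llama-server"))
```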
### Out of memory errors

- Reduce `MAX_CACHED_MODELS` to 1
- Use smaller quantized models (Q4_K_M instead of Q8)
- Increase system swap space

### Port conflicts

- Change `BASE_PORT` if ports 8080+ are in use
- Check for other llama-server instances: `ps aux | grep llama`
## 📚 Additional Resources

- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [Hugging Face Models](https://huggingface.co/models?library=gguf)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)

---

Built with ❤️ using FastAPI and llama.cpp