---
title: AGI Multi-Model API
emoji: 💻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: Multi-model LLM API with caching and web search
---
# AGI Multi-Model API

A high-performance FastAPI server for running multiple LLM models with intelligent in-memory caching and web search capabilities. Switch between models instantly and augment responses with real-time web data.
## ✨ Features

- **🔄 Dynamic Model Switching**: Hot-swap between 5 different LLM models
- **⚡ Intelligent Caching**: LRU cache keeps up to 2 models in memory for instant switching
- **🌐 Web-Augmented Chat**: Real-time web search integration via DuckDuckGo
- **🔌 OpenAI-Compatible API**: Drop-in replacement for OpenAI chat completions
- **📚 Auto-Generated Documentation**: Interactive API docs with Swagger UI and ReDoc
- **🚀 Optimized Performance**: Continuous batching and multi-port architecture
## 🤖 Available Models

| Model | Use Case | Size |
|-------|----------|------|
| **deepseek-chat** (default) | General purpose conversation | 7B |
| **mistral-7b** | Financial analysis & summarization | 7B |
| **openhermes-7b** | Advanced instruction following | 7B |
| **deepseek-coder** | Specialized coding assistance | 6.7B |
| **llama-7b** | Lightweight & fast responses | 7B |
## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- `llama-server` (llama.cpp)
- 8GB+ RAM (16GB+ recommended for caching multiple models)

### Installation

```bash
# Clone the repository
git clone <your-repo-url>
cd AGI

# Install dependencies
pip install -r requirements.txt
# or
uv pip install -r pyproject.toml

# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
```
### Docker Deployment

```bash
# Build the image
docker build -t agi-api .

# Run the container
docker run -p 8000:8000 agi-api
```
## 📚 API Documentation

Once the server is running, access the interactive documentation:

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI JSON**: http://localhost:8000/openapi.json
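
For example, the OpenAPI spec can be pulled programmatically to list every route (a minimal sketch using `requests`; this works against any FastAPI app):

```python
import requests

# Fetch the OpenAPI 3.0 specification and print the available routes
spec = requests.get("http://localhost:8000/openapi.json").json()
for path, methods in spec["paths"].items():
    print(path, "->", ", ".join(m.upper() for m in methods))
```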
## 🔧 Usage Examples

### Basic Chat Completion

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())
```
### Web-Augmented Chat

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/web-chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What are the latest AI developments?"}
        ],
        "max_tokens": 512,
        "max_search_results": 5
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
print(f"Sources: {result['web_search']['sources']}")
```
### Switch Models

```python
import requests

# Switch to the coding model
response = requests.post(
    "http://localhost:8000/switch-model",
    json={"model_name": "deepseek-coder"}
)
print(response.json())
# Output: {"message": "Switched to model: deepseek-coder (from cache)", "model": "deepseek-coder"}
```
### Check Cache Status

```python
import requests

response = requests.get("http://localhost:8000/cache/info")
cache_info = response.json()
print(f"Cached models: {cache_info['current_size']}/{cache_info['max_size']}")
for model in cache_info['cached_models']:
    print(f"  - {model['name']} on port {model['port']}")
```
## 🏗️ Architecture

### Model Caching System

The API uses an intelligent LRU (Least Recently Used) cache to manage models in memory:

```
┌────────────────────────────────────────────────────┐
│  Request: Switch to Model A                        │
├────────────────────────────────────────────────────┤
│  1. Check if A is current → Skip                   │
│  2. Check cache for A                              │
│     ├─ Cache Hit  → Instant switch (< 1s)          │
│     └─ Cache Miss → Load model (~2-3 min)          │
│  3. If cache full → Evict LRU model                │
│  4. Add A to cache                                 │
└────────────────────────────────────────────────────┘
```

**Benefits:**

- First load: ~2-3 minutes (model download + initialization)
- Subsequent switches: < 1 second (from cache)
- Automatic memory management with LRU eviction
- Each model runs on a separate port (8080, 8081, etc.)
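
For illustration, the eviction logic can be sketched with `collections.OrderedDict`; the class and method names below are hypothetical, not the actual implementation in `app.py`:

```python
from collections import OrderedDict

class ModelCache:
    """Illustrative LRU cache mapping model names to llama-server ports."""

    def __init__(self, max_size: int = 2):
        self.max_size = max_size
        self.models: OrderedDict[str, int] = OrderedDict()  # name -> port

    def get(self, name: str) -> int | None:
        if name not in self.models:
            return None  # cache miss: the caller must load the model
        self.models.move_to_end(name)  # mark as most recently used
        return self.models[name]

    def put(self, name: str, port: int) -> None:
        if len(self.models) >= self.max_size:
            # Evict the least recently used model to free memory
            evicted_name, evicted_port = self.models.popitem(last=False)
            print(f"Evicting {evicted_name} (port {evicted_port})")
        self.models[name] = port
```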
### Multi-Port Architecture

```
        ┌──────────────────────┐
        │    FastAPI Server    │
        │     (Port 8000)      │
        └──────────┬───────────┘
                   │
        ┌──────────┴───────────┐
        │                      │
┌───────▼────────┐     ┌───────▼────────┐
│  llama-server  │     │  llama-server  │
│  Model A:8080  │     │  Model B:8081  │
└────────────────┘     └────────────────┘
```
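
In this layout the FastAPI process acts as a thin proxy: each chat request is forwarded to the llama-server port that belongs to the active model. A minimal sketch of that forwarding step (the actual routing in `app.py` may differ):

```python
import requests

ACTIVE_PORT = 8080  # illustrative; in practice tracked by the model cache

def forward_chat(payload: dict) -> dict:
    """Forward an OpenAI-style request to the active llama-server instance."""
    url = f"http://127.0.0.1:{ACTIVE_PORT}/v1/chat/completions"
    return requests.post(url, json=payload, timeout=300).json()
```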
## ⚙️ Configuration

### Environment Variables

```bash
# Maximum cached models (default: 2)
MAX_CACHED_MODELS=2

# Base port for llama-server instances (default: 8080)
BASE_PORT=8080

# Default model on startup
DEFAULT_MODEL=deepseek-chat
```
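
A minimal sketch of how these variables might be read at startup (the variable names in `app.py` may differ):

```python
import os

# Defaults mirror the values documented above
MAX_CACHED_MODELS = int(os.getenv("MAX_CACHED_MODELS", "2"))
BASE_PORT = int(os.getenv("BASE_PORT", "8080"))
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "deepseek-chat")
```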
### Model Configuration

Edit `AVAILABLE_MODELS` in `app.py` to add custom models:

```python
AVAILABLE_MODELS = {
    "my-model": "username/model-name-GGUF:model-file.Q4_K_M.gguf"
}
```
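
The value appears to pack a Hugging Face repo ID and a GGUF filename separated by a colon; a hedged sketch of splitting such a spec (the string here is the placeholder from above, not a real model):

```python
spec = "username/model-name-GGUF:model-file.Q4_K_M.gguf"
repo_id, filename = spec.split(":", 1)
print(repo_id)   # username/model-name-GGUF
print(filename)  # model-file.Q4_K_M.gguf
```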
## 📊 API Endpoints

### Status & Models

- `GET /` - API status and current model
- `GET /models` - List available models
- `GET /cache/info` - Cache statistics and cached models

### Model Management

- `POST /switch-model` - Switch the active model (with caching)

### Chat Completions

- `POST /v1/chat/completions` - Standard chat completions (OpenAI-compatible)
- `POST /v1/web-chat/completions` - Web-augmented chat with search

### Documentation

- `GET /docs` - Swagger UI interactive documentation
- `GET /redoc` - ReDoc alternative documentation
- `GET /openapi.json` - OpenAPI 3.0 specification export
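
For instance, the status endpoints can be queried directly (a minimal sketch):

```python
import requests

BASE = "http://localhost:8000"

print(requests.get(f"{BASE}/").json())        # API status and current model
print(requests.get(f"{BASE}/models").json())  # available model names
```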
## 🧪 Testing

```bash
# Test basic chat
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check cache status
curl http://localhost:8000/cache/info

# Switch models
curl -X POST http://localhost:8000/switch-model \
  -H "Content-Type: application/json" \
  -d '{"model_name": "deepseek-coder"}'
```
## 🌐 Web Search Integration

The web-augmented chat endpoint automatically (see the sketch after this list):

1. Extracts the user's query from the last message
2. Performs a DuckDuckGo web search
3. Injects search results into the LLM context
4. Returns the response with source citations
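
A hedged sketch of that flow using the `duckduckgo_search` package (the function name and prompt template are illustrative, not the server's exact format):

```python
from duckduckgo_search import DDGS

def build_augmented_prompt(query: str, max_results: int = 5) -> str:
    """Search DuckDuckGo and prepend the results as context for the LLM."""
    results = DDGS().text(query, max_results=max_results)
    context = "\n".join(
        f"- {r['title']}: {r['body']} ({r['href']})" for r in results
    )
    return f"Web search results:\n{context}\n\nQuestion: {query}"
```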
**Use cases:**

- Current events and news
- Recent developments beyond training data
- Fact-checking with live web data
- Research with source attribution
## 📈 Performance Tips

1. **Cache Size**: Increase `MAX_CACHED_MODELS` if you have sufficient RAM (each model needs ~4-5GB)
2. **CPU Threads**: Adjust the `-t` parameter in `start_llama_server()` to match your CPU core count (see the sketch after this list)
3. **Batch Size**: Modify the `-b` parameter to trade throughput against latency
4. **GPU Acceleration**: Set `-ngl` > 0 if you have a GPU (requires llama.cpp built with GPU support)
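
A hedged sketch of what the launch command inside `start_llama_server()` might look like with these flags (the values, and the function itself, are illustrative):

```python
import subprocess

def start_llama_server(model_path: str, port: int) -> subprocess.Popen:
    """Launch a llama-server instance with tuned threading and batching."""
    return subprocess.Popen([
        "llama-server",
        "-m", model_path,     # path to the GGUF model file
        "--port", str(port),  # one port per cached model (8080, 8081, ...)
        "-t", "8",            # CPU threads; match your physical core count
        "-b", "512",          # batch size; larger favors throughput over latency
        "-ngl", "0",          # GPU layers; > 0 offloads to GPU if supported
    ])
```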
## 🛠️ Development

### Project Structure

```
AGI/
├── app.py                   # Main FastAPI application
├── client_multi_model.py    # Example client
├── Dockerfile               # Docker configuration
├── pyproject.toml           # Python dependencies
└── README.md                # This file
```
### Adding New Models

1. Find a GGUF model on Hugging Face
2. Add it to the `AVAILABLE_MODELS` dict
3. Restart the server
4. Switch to your new model via the API
## 📄 License

Apache 2.0 - See the LICENSE file for details.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## 🐛 Troubleshooting

### Model fails to load

- Ensure `llama-server` is on your `PATH` (see the check below)
- Check available disk space for model downloads
- Verify internet connectivity for first-time model downloads
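
To confirm the binary is visible to the server process, a quick check (a minimal sketch):

```python
import shutil

# Prints the resolved path of llama-server, or None if it is not on PATH
print(shutil.which("llama-server"))
```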
### Out of memory errors

- Reduce `MAX_CACHED_MODELS` to 1
- Use smaller quantized models (Q4_K_M instead of Q8)
- Increase system swap space

### Port conflicts

- Change `BASE_PORT` if ports 8080+ are in use
- Check for other llama-server instances: `ps aux | grep llama`
## 📚 Additional Resources

- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [Hugging Face Models](https://huggingface.co/models?library=gguf)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)

---

Built with ❤️ using FastAPI and llama.cpp