# Usage Examples - FDA Task Classifier

## Basic Usage

### 1. Start the Server

```bash
./run_server.sh
```
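(`run_server.sh` is assumed to wrap a `llama-server` invocation like the ones shown under Advanced Configuration below.)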
### 2. Check Server Health

```bash
curl http://127.0.0.1:8000/health
```
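When scripting against the server, it helps to wait until the model has finished loading before sending requests. A minimal sketch in Python, assuming `/health` returns HTTP 200 once the server is ready (llama.cpp's server typically returns 503 while the model is still loading):

```python
import time

import requests

def wait_until_ready(base_url="http://127.0.0.1:8000", timeout_s=120):
    """Poll /health until the server reports ready or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            # 200 means the model is loaded; 503 usually means still loading.
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(1)
    return False

if wait_until_ready():
    print("server is ready")
```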
### 3. Simple Completion

```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
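On success the server responds with a JSON object whose `content` field holds the generated text; the Python client below reads exactly that field.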
### 4. Streaming Response

```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "max_tokens": 500,
    "temperature": 0.8,
    "stream": true
  }'
```
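With `"stream": true` the response arrives as server-sent events: each non-empty line is prefixed with `data: ` and carries a JSON chunk with a partial `content` field. A minimal Python sketch of consuming the stream, assuming llama.cpp's `/completion` schema (where the final chunk sets `stop` to true):

```python
import json

import requests

payload = {
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "max_tokens": 500,
    "temperature": 0.8,
    "stream": True,
}

with requests.post("http://127.0.0.1:8000/completion", json=payload,
                   stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # skip blank keep-alive lines between events
        text = line.decode("utf-8")
        if text.startswith("data: "):
            chunk = json.loads(text[len("data: "):])
            # Print each partial piece of the completion as it arrives.
            print(chunk.get("content", ""), end="", flush=True)
            if chunk.get("stop"):
                break
print()
```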
## Advanced Configuration

### Custom Server Settings

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --threads 8 \
  --chat-template "" \
  --log-disable
```
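Here `--n-gpu-layers` sets how many model layers are offloaded to the GPU (0 keeps everything on the CPU), `--ctx-size` is the context window in tokens, and `--threads` the number of CPU threads used for inference.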
### GPU Acceleration (macOS with Metal)

On macOS, Metal support is built into the standard llama.cpp binaries, so no separate runtime flag is needed; GPU offload is controlled entirely by `--n-gpu-layers`:

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```
### GPU Acceleration (Linux/Windows with CUDA)

CUDA is a build-time option rather than a runtime flag: use a llama.cpp build compiled with the CUDA backend enabled, then control offload with `--n-gpu-layers` as above:

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```
## Python Client Example

```python
import requests

def complete_with_model(prompt, max_tokens=200, temperature=0.7):
    """Send a completion request and return the generated text."""
    url = "http://127.0.0.1:8000/completion"
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    # A timeout prevents the client from hanging if the server stalls;
    # requests sets the Content-Type header automatically when json= is used.
    response = requests.post(url, json=payload, timeout=120)
    if response.status_code == 200:
        # The generated text is returned in the "content" field.
        return response.json()["content"]
    return f"Error: {response.status_code}"

# Example usage
prompt = "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is awful!\n\nResponse: "
response = complete_with_model(prompt)
print(response)
```
## Troubleshooting

### Common Issues

1. **Memory Errors**

   ```
   Error: not enough memory
   ```

   **Solution**: Reduce `--n-gpu-layers` to 0 or use a smaller value

2. **Context Window Too Large**

   ```
   Error: context size exceeded
   ```

   **Solution**: Reduce `--ctx-size` (e.g., `--ctx-size 2048`)
3. **CUDA Not Available**

   ```
   Error: CUDA not found
   ```

   **Solution**: Install CUDA drivers and use a llama.cpp build compiled with CUDA support, or fall back to CPU-only inference with `--n-gpu-layers 0`
4. **Port Already in Use**

   ```
   Error: bind failed
   ```

   **Solution**: Use a different port with `--port 8001`
### Performance Tuning

- **For faster inference**: Increase `--n-gpu-layers` to offload more of the model to the GPU
- **For lower memory use**: Reduce `--ctx-size` (a smaller context window shrinks the KV cache)
- **For more consistent output**: Lower the `temperature` (and optionally `top_p`) in the request
- **For more varied output**: Raise the `temperature` and adjust `top_k`, as in the sketch below
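Sampling can also be tuned per request rather than through server defaults. A small sketch, assuming the `/completion` endpoint accepts `temperature`, `top_p`, and `top_k` in the JSON payload (as llama.cpp's server does):

```python
import requests

payload = {
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "max_tokens": 100,
    # Lower temperature -> more deterministic output; raise it for variety.
    "temperature": 0.3,
    # Nucleus (top_p) and top-k sampling narrow the candidate token pool.
    "top_p": 0.9,
    "top_k": 40,
}

response = requests.post("http://127.0.0.1:8000/completion", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["content"])
```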
### System Requirements

- **RAM**: Minimum 8GB, recommended 16GB+
- **GPU**: Optional, but recommended for faster inference
- **Storage**: The model file size, plus roughly twice that in headroom for temporary files