# Usage Examples - FDA Task Classifier
## Basic Usage
### 1. Start the Server
```bash
./run_server.sh
```
### 2. Check Server Health
```bash
curl http://127.0.0.1:8000/health
```
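The health endpoint is useful for scripting startup. Below is a minimal Python sketch that polls `/health` until the server answers, assuming a healthy server returns HTTP 200 (the `wait_for_server` helper name is ours, not part of llama-server):
```python
import time

import requests


def wait_for_server(base_url="http://127.0.0.1:8000", timeout=60):
    """Poll /health once a second until the server responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server process not accepting connections yet
        time.sleep(1)
    return False


if __name__ == "__main__":
    print("ready" if wait_for_server() else "server did not come up in time")
```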
### 3. Simple Completion
```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "n_predict": 100,
    "temperature": 0.7
  }'
```
### 4. Streaming Response
```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "n_predict": 500,
    "temperature": 0.8,
    "stream": true
  }'
```
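With `"stream": true`, the native endpoint emits server-sent events: one `data: {...}` JSON line per generated chunk. A hedged Python consumer sketch, assuming each chunk carries a `content` fragment and the final chunk sets `stop` to true:
```python
import json

import requests


def stream_completion(prompt, n_predict=500, temperature=0.8):
    """Print tokens from /completion as they arrive over the SSE stream."""
    payload = {"prompt": prompt, "n_predict": n_predict,
               "temperature": temperature, "stream": True}
    with requests.post("http://127.0.0.1:8000/completion",
                       json=payload, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line.startswith(b"data: "):
                continue  # skip keep-alives and blank lines
            chunk = json.loads(line[len(b"data: "):])
            print(chunk.get("content", ""), end="", flush=True)
            if chunk.get("stop"):
                break
    print()
```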
## Advanced Configuration
### Custom Server Settings
```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --threads 8 \
  --chat-template "" \
  --log-disable
```
### GPU Acceleration (macOS with Metal)
Metal support is compiled into llama.cpp at build time (it is on by default for Apple Silicon builds); there is no runtime `--metal` flag. GPU offload is controlled with `--n-gpu-layers`:
```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```
### GPU Acceleration (Linux/Windows with CUDA)
CUDA support is likewise a build-time option (e.g. compile llama.cpp with `-DGGML_CUDA=ON`); there is no runtime `--cuda` flag. With a CUDA-enabled build, the same `--n-gpu-layers` option controls offload:
```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```
## Python Client Example
```python
import requests


def complete_with_model(prompt, n_predict=200, temperature=0.7):
    """Send a prompt to the /completion endpoint and return the generated text."""
    url = "http://127.0.0.1:8000/completion"
    payload = {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": temperature,
    }
    # requests sets the Content-Type: application/json header for json= payloads
    response = requests.post(url, json=payload)
    if response.status_code == 200:
        # The native endpoint returns the generated text in the `content` field
        return response.json()["content"]
    return f"Error: {response.status_code}"


# Example usage
prompt = "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is awful!\n\nResponse: "
response = complete_with_model(prompt)
print(response)
```
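Building on `complete_with_model`, a short batch loop that detoxifies several inputs with the same instruction template (the sample inputs are illustrative):
```python
inputs = ["This is terrible!", "This sucks so bad!", "What a useless idea."]
template = ("Instruction: Rewrite the provided text to remove the toxicity."
            "\n\nInput: {text}\n\nResponse: ")

for text in inputs:
    # Reuses complete_with_model() from the client above
    print(f"{text!r} -> {complete_with_model(template.format(text=text))!r}")
```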
## Troubleshooting
### Common Issues
1. **Memory Errors**
```
Error: not enough memory
```
**Solution**: Lower `--n-gpu-layers` (down to 0 for CPU-only) so fewer layers are offloaded to GPU memory
2. **Context Window Too Large**
```
Error: context size exceeded
```
**Solution**: Reduce `--ctx-size` (e.g., `--ctx-size 2048`)
3. **CUDA Not Available**
```
Error: CUDA not found
```
**Solution**: Install the CUDA drivers and use a CUDA-enabled build of llama.cpp, or set `--n-gpu-layers 0` to run on the CPU
4. **Port Already in Use**
```
Error: bind failed
```
**Solution**: Use a different port with `--port 8001`
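For the port conflict above, a small Python sketch can probe for a free port before launching the server (the 8000-8009 range is an arbitrary choice):
```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex((host, port)) == 0


# Pick the first free port in the range and pass it to llama-server via --port
port = next((p for p in range(8000, 8010) if not port_in_use(p)), None)
print(f"launch llama-server with --port {port}" if port else "no free port found")
```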
### Performance Tuning
- **For faster inference**: Increase `--n-gpu-layers` to offload more layers to the GPU
- **For lower latency**: Reduce `--ctx-size`
- **For more consistent output**: Lower `--temp` and tighten `--top-p`
- **For more creativity**: Raise `--temp` and loosen `--top-k`
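These sampler settings can also be supplied per request instead of on the command line. A hedged sketch of a `/completion` payload, assuming the JSON field names `temperature`, `top_k`, and `top_p` mirror the CLI flags:
```python
import requests

# Lower temperature/top_p for consistency; raise them (and top_k) for variety
payload = {
    "prompt": ("Instruction: Rewrite the provided text to remove the toxicity."
               "\n\nInput: This is dreadful!\n\nResponse: "),
    "n_predict": 200,
    "temperature": 0.3,
    "top_k": 40,
    "top_p": 0.9,
}
print(requests.post("http://127.0.0.1:8000/completion", json=payload).json()["content"])
```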
### System Requirements
- **RAM**: Minimum 8GB, recommended 16GB+
- **GPU**: Optional but recommended for better performance
- **Storage**: The model file size, plus roughly 2x that again for temporary files
---
Generated on 2025-10-16 19:13:23