# Usage Examples - FDA Task Classifier
## Basic Usage
### 1. Start the Server
```bash
./run_server.sh
```
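The script is assumed to wrap `llama-server` with the model path and the host/port used throughout this document; the underlying flags are shown under Advanced Configuration below.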
### 2. Check Server Health
```bash
curl http://127.0.0.1:8000/health
```
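A healthy server typically answers with a small JSON status; the exact fields vary across llama.cpp versions, e.g.:
```
{"status": "ok"}
```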
### 3. Simple Completion
```bash
curl -X POST http://127.0.0.1:8000/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
"n_predict": 100,
"temperature": 0.7
}'
```
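Note that llama.cpp's native `/completion` endpoint names the token limit `n_predict` (the OpenAI-compatible `/v1/completions` route uses `max_tokens`). The reply is a JSON object whose `content` field holds the generated text; other fields, such as token counts, vary by version:
```
{"content": " ...generated text... ", "stop": true, ...}
```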
### 4. Streaming Response
```bash
curl -N -X POST http://127.0.0.1:8000/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
"n_predict": 500,
"temperature": 0.8,
"stream": true
}'
```
The `-N` flag disables curl's output buffering so tokens appear as they stream.
## Advanced Configuration
### Custom Server Settings
```bash
llama-server \
-m model.gguf \
--host 127.0.0.1 \
--port 8000 \
--n-gpu-layers 35 \
--ctx-size 4096 \
--threads 8 \
--chat-template "" \
--log-disable
```
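Here `--n-gpu-layers` controls how many model layers are offloaded to the GPU, `--ctx-size` sets the context window in tokens, and `--threads` sets the CPU thread count. The empty `--chat-template` appears intended to skip chat templating so the raw Instruction/Input/Response prompts above pass through unchanged, and `--log-disable` suppresses server logging.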
### GPU Acceleration (macOS with Metal)
llama.cpp builds with Metal enabled use the GPU automatically on Apple Silicon; there is no separate `--metal` flag. Offloading is controlled with `--n-gpu-layers`:
```bash
llama-server \
-m model.gguf \
--host 127.0.0.1 \
--port 8000 \
--n-gpu-layers 50
```
### GPU Acceleration (Linux/Windows with CUDA)
CUDA support is selected when llama.cpp is built, not with a runtime flag (there is no `--cuda` option). With a CUDA-enabled build, the same `--n-gpu-layers` flag controls offloading:
```bash
llama-server \
-m model.gguf \
--host 127.0.0.1 \
--port 8000 \
--n-gpu-layers 50
```
## Python Client Example
```python
import requests


def complete_with_model(prompt, n_predict=200, temperature=0.7):
    """Request a completion from the local llama-server instance."""
    url = "http://127.0.0.1:8000/completion"
    payload = {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": temperature,
    }
    # requests sets the Content-Type: application/json header for us.
    response = requests.post(url, json=payload)
    if response.status_code == 200:
        # The generated text is returned in the "content" field.
        return response.json()["content"]
    return f"Error: {response.status_code}"


# Example usage
prompt = "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is awful!\n\nResponse: "
print(complete_with_model(prompt))
```
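For the streaming endpoint shown earlier: with `"stream": true`, the server emits server-sent events, one `data: {...}` line per generated chunk. A minimal streaming client sketch (field names assume llama.cpp's native `/completion` API; adjust if your build differs):
```python
import json

import requests

# Minimal streaming sketch: with "stream": true the server emits
# server-sent events, one 'data: {...}' line per generated chunk.
payload = {
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "n_predict": 500,
    "temperature": 0.8,
    "stream": True,
}
with requests.post("http://127.0.0.1:8000/completion", json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        decoded = line.decode("utf-8")
        if decoded.startswith("data: "):
            chunk = json.loads(decoded[len("data: "):])
            # Each chunk carries a fragment of the generated text.
            print(chunk.get("content", ""), end="", flush=True)
            if chunk.get("stop"):
                break
print()
```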
## Troubleshooting
### Common Issues
1. **Memory Errors**
```
Error: not enough memory
```
**Solution**: Lower `--n-gpu-layers`, down to 0 to run entirely on the CPU
2. **Context Window Too Large**
```
Error: context size exceeded
```
**Solution**: Reduce `--ctx-size` (e.g., `--ctx-size 2048`)
3. **CUDA Not Available**
```
Error: CUDA not found
```
**Solution**: Use a llama.cpp build compiled with CUDA support and install the CUDA drivers, or fall back to a CPU-only build
4. **Port Already in Use**
```
Error: bind failed
```
**Solution**: Use a different port with `--port 8001`
### Performance Tuning
- **For faster inference**: Increase `--n-gpu-layers` to offload more layers to the GPU
- **For lower memory use and faster prompt processing**: Reduce `--ctx-size`
- **For more focused, consistent output**: Lower `--temp` and reduce `--top-p`
- **For more creative output**: Raise `--temp` and adjust `--top-k`

The sampling knobs can also be set per request rather than as server defaults, as in the sketch below.
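A sketch of a per-request payload using llama.cpp's `/completion` field names (values are illustrative):
```python
# Per-request sampling parameters; server flags set only the defaults.
payload = {
    "prompt": "Instruction: ...\n\nResponse: ",
    "n_predict": 200,
    "temperature": 0.3,  # lower = more deterministic
    "top_p": 0.9,        # nucleus sampling cutoff
    "top_k": 40,         # candidate pool size
}
```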
### System Requirements
- **RAM**: Minimum 8GB, recommended 16GB+
- **GPU**: Optional but recommended for better performance
- **Storage**: At least the model file size, plus headroom (roughly 2x) for temporary files
---
Generated on 2025-10-16 19:13:23