# Usage Examples - FDA Task Classifier

## Basic Usage

### 1. Start the Server
```bash
./run_server.sh
```

### 2. Check Server Health
```bash
curl http://127.0.0.1:8000/health
```
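
The `/health` endpoint returns HTTP 200 once the model is loaded (and typically 503 while it is still loading). A small Python helper that waits for readiness, useful in scripts that start the server and immediately send requests:

```python
import time
import requests

def wait_until_ready(url="http://127.0.0.1:8000/health", timeout=120):
    """Poll /health until the server reports ready or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # 200 means the model is loaded; 503 typically means still loading
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(1)
    return False

if wait_until_ready():
    print("server is ready")
```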

### 3. Simple Completion
```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

### 4. Streaming Response
```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "max_tokens": 500,
    "temperature": 0.8,
    "stream": true
  }'
```
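
With `"stream": true`, llama-server responds with server-sent events: each chunk is a line of the form `data: {...}` whose JSON carries a `content` fragment and a `stop` flag. A minimal Python consumer, reusing the payload above:

```python
import json
import requests

payload = {
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "max_tokens": 500,
    "temperature": 0.8,
    "stream": True,
}

with requests.post("http://127.0.0.1:8000/completion",
                   json=payload, stream=True, timeout=120) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip blank keep-alive lines between events
        chunk = json.loads(line[len(b"data: "):])
        print(chunk.get("content", ""), end="", flush=True)
        if chunk.get("stop"):  # final chunk sets stop to true
            break
print()
```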

## Advanced Configuration

### Custom Server Settings
```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --threads 8 \
  --chat-template "" \
  --log-disable
```
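
The same invocation can be driven from Python when the server's lifetime needs to be managed programmatically. A minimal sketch, assuming `llama-server` is on `PATH` and reusing the flags above:

```python
import subprocess

# Start llama-server with the settings shown above
server = subprocess.Popen([
    "llama-server",
    "-m", "model.gguf",
    "--host", "127.0.0.1",
    "--port", "8000",
    "--n-gpu-layers", "35",
    "--ctx-size", "4096",
    "--threads", "8",
])

# ... poll /health until ready, then send requests (see the Python client below) ...

server.terminate()  # shut the server down when done
server.wait()
```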

### GPU Acceleration (macOS with Metal)
Metal support is selected when llama.cpp is compiled; there is no `--metal` runtime flag. On a Metal-enabled build, offloaded layers run on the GPU automatically:
```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```

### GPU Acceleration (Linux/Windows with CUDA)
Likewise, CUDA support is selected at build time rather than with a `--cuda` runtime flag. On a CUDA build, `--n-gpu-layers` controls how much of the model is offloaded:
```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```

## Python Client Example

```python
import requests

def complete_with_model(prompt, max_tokens=200, temperature=0.7):
    url = "http://127.0.0.1:8000/completion"

    # Note: older llama-server builds expect "n_predict" rather than
    # "max_tokens"; recent builds accept either name.
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature
    }

    # requests sets Content-Type: application/json automatically with json=
    response = requests.post(url, json=payload, timeout=120)

    if response.status_code == 200:
        return response.json()["content"]
    return f"Error: {response.status_code}: {response.text}"

# Example usage
prompt = "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is awful!\n\nResponse: "
response = complete_with_model(prompt)
print(response)
```

## Troubleshooting

### Common Issues

1. **Memory Errors**
   ```
   Error: not enough memory
   ```
   **Solution**: Lower `--n-gpu-layers` (down to 0 for CPU-only inference)

2. **Context Window Too Large**
   ```
   Error: context size exceeded
   ```
   **Solution**: Reduce `--ctx-size` (e.g., `--ctx-size 2048`)

3. **CUDA Not Available**
   ```
   Error: CUDA not found
   ```
   **Solution**: Install the CUDA drivers, or fall back to a CPU-only build with `--n-gpu-layers 0`

4. **Port Already in Use**
   ```
   Error: bind failed
   ```
   **Solution**: Use a different port with `--port 8001`, or free the port first (a quick check is sketched below)
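
A quick way to check whether a port is free before launching, using only the Python standard library (`port_is_free` is a hypothetical helper, not part of the project):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0  # 0 means something answered

print(port_is_free(8000))
```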

### Performance Tuning

- **For faster inference**: Increase `--n-gpu-layers`
- **For lower memory use**: Reduce `--ctx-size`
- **For more focused, consistent output**: Lower `temperature` and reduce `top_p` in the request
- **For more creative output**: Raise `temperature` and loosen `top_k` in the request (see the payload sketch below)
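
As referenced in the list above, sampling behavior is set per request. A sketch of a more exploratory payload, using the llama.cpp server's `temperature`, `top_k`, and `top_p` fields (the specific values are illustrative):

```python
import requests

payload = {
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "max_tokens": 200,
    "temperature": 1.0,   # higher = more varied output
    "top_k": 60,          # consider more candidate tokens per step
    "top_p": 0.95,        # nucleus sampling cutoff
}
result = requests.post("http://127.0.0.1:8000/completion",
                       json=payload, timeout=120).json()
print(result["content"])
```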

### System Requirements

- **RAM**: Minimum 8GB, recommended 16GB+
- **GPU**: Optional but recommended for better performance
- **Storage**: The model file size plus extra headroom for temporary files (roughly 2x the model size)

---
Generated on 2025-10-16 19:13:23