# Usage Examples - FDA Task Classifier

## Basic Usage

### 1. Start the Server

```bash
./run_server.sh
```
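(`run_server.sh` is assumed to wrap a `llama-server` invocation like the ones shown under Advanced Configuration below.)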
### 2. Check Server Health

```bash
curl http://127.0.0.1:8000/health
```
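When scripting against the server, it helps to wait until the model has finished loading before sending requests. A minimal sketch in Python, assuming `/health` returns HTTP 200 once the server is ready (llama.cpp's server typically returns 503 while the model is still loading):

```python
import time

import requests

def wait_until_ready(base_url="http://127.0.0.1:8000", timeout_s=120):
    """Poll /health until the server reports ready or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            # 200 means the model is loaded; 503 usually means still loading.
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(1)
    return False

if wait_until_ready():
    print("server is ready")
```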
### 3. Simple Completion

```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
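On success the server responds with a JSON object whose `content` field holds the generated text; the Python client below reads exactly that field.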
### 4. Streaming Response

```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "max_tokens": 500,
    "temperature": 0.8,
    "stream": true
  }'
```
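With `"stream": true` the response arrives as server-sent events: each non-empty line is prefixed with `data: ` and carries a JSON chunk with a partial `content` field. A minimal Python sketch of consuming the stream, assuming llama.cpp's `/completion` schema (where the final chunk sets `stop` to true):

```python
import json

import requests

payload = {
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "max_tokens": 500,
    "temperature": 0.8,
    "stream": True,
}

with requests.post("http://127.0.0.1:8000/completion", json=payload,
                   stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # skip blank keep-alive lines between events
        text = line.decode("utf-8")
        if text.startswith("data: "):
            chunk = json.loads(text[len("data: "):])
            # Print each partial piece of the completion as it arrives.
            print(chunk.get("content", ""), end="", flush=True)
            if chunk.get("stop"):
                break
print()
```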
## Advanced Configuration

### Custom Server Settings

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --threads 8 \
  --chat-template "" \
  --log-disable
```
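Here `--n-gpu-layers` sets how many model layers are offloaded to the GPU (0 keeps everything on the CPU), `--ctx-size` is the context window in tokens, and `--threads` the number of CPU threads used for inference.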
### GPU Acceleration (macOS with Metal)

On macOS, Metal support is built into the standard llama.cpp binaries, so no separate runtime flag is needed; GPU offload is controlled entirely by `--n-gpu-layers`:

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```
### GPU Acceleration (Linux/Windows with CUDA)

CUDA is a build-time option rather than a runtime flag: use a llama.cpp build compiled with the CUDA backend enabled, then control offload with `--n-gpu-layers` as above:

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```
## Python Client Example

```python
import requests

def complete_with_model(prompt, max_tokens=200, temperature=0.7):
    """Send a completion request and return the generated text."""
    url = "http://127.0.0.1:8000/completion"
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    # A timeout prevents the client from hanging if the server stalls;
    # requests sets the Content-Type header automatically when json= is used.
    response = requests.post(url, json=payload, timeout=120)
    if response.status_code == 200:
        # The generated text is returned in the "content" field.
        return response.json()["content"]
    return f"Error: {response.status_code}"

# Example usage
prompt = "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is awful!\n\nResponse: "
response = complete_with_model(prompt)
print(response)
```
## Troubleshooting

### Common Issues

1. **Memory Errors**

   ```
   Error: not enough memory
   ```

   **Solution**: Reduce `--n-gpu-layers` to 0 or use a smaller value

2. **Context Window Too Large**

   ```
   Error: context size exceeded
   ```

   **Solution**: Reduce `--ctx-size` (e.g., `--ctx-size 2048`)
3. **CUDA Not Available**

   ```
   Error: CUDA not found
   ```

   **Solution**: Install CUDA drivers and use a llama.cpp build compiled with CUDA support, or fall back to CPU-only inference with `--n-gpu-layers 0`
4. **Port Already in Use**

   ```
   Error: bind failed
   ```

   **Solution**: Use a different port with `--port 8001`
### Performance Tuning

- **For faster inference**: Increase `--n-gpu-layers` to offload more of the model to the GPU
- **For lower memory use**: Reduce `--ctx-size` (a smaller context window shrinks the KV cache)
- **For more consistent output**: Lower the `temperature` (and optionally `top_p`) in the request
- **For more varied output**: Raise the `temperature` and adjust `top_k`, as in the sketch below
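Sampling can also be tuned per request rather than through server defaults. A small sketch, assuming the `/completion` endpoint accepts `temperature`, `top_p`, and `top_k` in the JSON payload (as llama.cpp's server does):

```python
import requests

payload = {
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "max_tokens": 100,
    # Lower temperature -> more deterministic output; raise it for variety.
    "temperature": 0.3,
    # Nucleus (top_p) and top-k sampling narrow the candidate token pool.
    "top_p": 0.9,
    "top_k": 40,
}

response = requests.post("http://127.0.0.1:8000/completion", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["content"])
```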
### System Requirements

- **RAM**: Minimum 8GB, recommended 16GB+
- **GPU**: Optional, but recommended for faster inference
- **Storage**: The model file size, plus roughly twice that in headroom for temporary files