CodeLlama 34B base model fine-tuned on raw text chunks from the OpenAssistant-Guanaco dataset rather than on Q&A pairs, so it struggles to determine where an answer ends. It is recommended to use a stop string such as `### Human:` to prevent the model from talking to itself.
Prompt template:

```
### Human: {prompt}
### Assistant:
```
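A minimal sketch of filling in this template before sending a request; the `format_prompt` helper and the example prompt are illustrative, not part of the model card.

```python
def format_prompt(prompt: str) -> str:
    """Wrap a user prompt in the Guanaco-style template expected by this model."""
    return f"### Human: {prompt}\n### Assistant:"

print(format_prompt("Write a Python function that reverses a string."))
# ### Human: Write a Python function that reverses a string.
# ### Assistant:
```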
Run the model via text-generation-inference
One GPU:

```bash
sudo docker run --gpus all --shm-size 1g -p 5000:80 -v $PWD/models:/data ghcr.io/huggingface/text-generation-inference:latest --max-total-tokens 4096 --quantize awq --model-id mzbac/CodeLlama-34b-guanaco-awq
```

Two GPUs:

```bash
docker run --gpus all --shm-size 1g -p 5000:80 -v $PWD/models:/data ghcr.io/huggingface/text-generation-inference:latest --max-total-tokens 4096 --max-input-length 4000 --max-batch-prefill-tokens 4096 --quantize awq --num-shard 2 --model-id mzbac/CodeLlama-34b-guanaco-awq
```
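Loading a 34B model can take a while. A hedged sketch that polls text-generation-inference's `/info` endpoint until the server is ready, assuming the host port 5000 mapping from the commands above; the helper name and timeout are illustrative.

```python
import time

import requests


def wait_for_server(url: str = "http://127.0.0.1:5000/info", timeout: float = 300.0) -> dict:
    """Poll the TGI /info endpoint until the model has finished loading."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            response = requests.get(url, timeout=5)
            if response.ok:
                return response.json()  # model id, dtype, token limits, etc.
        except requests.ConnectionError:
            pass  # server still starting up
        time.sleep(5)
    raise TimeoutError(f"TGI server at {url} did not become ready in {timeout}s")


print(wait_for_server())
```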
Query the model via curl (the docker commands above map the container's port 80 to host port 5000):

```bash
curl 127.0.0.1:5000/generate \
    -X POST \
    -d '{"inputs":"### Human: Plan a trip to Japan for me\n### Assistant:", "parameters":{"max_new_tokens":2048, "stop": ["### Human:"]}}' \
    -H 'Content-Type: application/json'
```
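For programmatic access, the same `/generate` endpoint can be called from Python. A minimal sketch using the `requests` library, assuming the single-GPU server above on host port 5000; the `generate` helper is illustrative, and the stop string keeps the model from starting a new turn on its own.

```python
import requests

TGI_URL = "http://127.0.0.1:5000/generate"  # matches the -p 5000:80 mapping above


def generate(prompt: str, max_new_tokens: int = 2048) -> str:
    """Send a templated prompt to the TGI /generate endpoint and return the completion."""
    payload = {
        "inputs": f"### Human: {prompt}\n### Assistant:",
        "parameters": {
            "max_new_tokens": max_new_tokens,
            # Stop before the model begins a new "### Human:" turn by itself.
            "stop": ["### Human:"],
        },
    }
    response = requests.post(TGI_URL, json=payload, timeout=600)
    response.raise_for_status()
    return response.json()["generated_text"]


print(generate("Plan a trip to Japan for me."))
```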