CodeLlama 34B base model fine-tuned on text chunks from the OpenAssistant-Guanaco dataset rather than on Q&A pairs, so it struggles to determine where an answer ends. Using a stop string such as "### Human:" is recommended to prevent the model from talking to itself.

Prompt template:

### Human: {prompt}
### Assistant:
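
As a rough illustration of the template and the stop string, the sketch below formats a single-turn prompt and trims a completion on the client side in case the model starts a new "### Human:" turn. The helper names `build_prompt` and `truncate_at_stop` are hypothetical and not part of any library.

```python
# Hypothetical helpers for illustration; the names are not from the model card.
STOP = "### Human:"


def build_prompt(user_message: str) -> str:
    """Format a single-turn prompt using the template above."""
    return f"### Human: {user_message}\n### Assistant:"


def truncate_at_stop(text: str, stop: str = STOP) -> str:
    """Cut generated text at the first occurrence of the stop string."""
    index = text.find(stop)
    if index != -1:
        text = text[:index]
    return text.strip()
```

Trimming on the client is only a safety net: passing the stop string to the inference server ends generation early, while `truncate_at_stop` removes any trailing turn marker that still appears in the returned text.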

Run the model via text-generation-inference

One GPU:

sudo docker run --gpus all --shm-size 1g -p 5000:80 -v $PWD/models:/data ghcr.io/huggingface/text-generation-inference:latest --max-total-tokens 4096 --quantize awq --model-id mzbac/CodeLlama-34b-guanaco-awq

Two GPUs:

docker run --gpus all --shm-size 1g -p 5000:80 -v $PWD/models:/data ghcr.io/huggingface/text-generation-inference:latest --max-total-tokens 4096 --max-input-length 4000 --max-batch-prefill-tokens 4096 --quantize awq --num-shard 2 --model-id mzbac/CodeLlama-34b-guanaco-awq

Query the model via curl

curl 127.0.0.1:5000/generate \
    -X POST \
    -d '{"inputs":"### Human: Prepare a travel plan for a trip to Japan for me\n### Assistant:", "parameters":{"max_new_tokens":2048, "stop": [
      "### Human:"
    ]}}' \
    -H 'Content-Type: application/json'
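
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server is reachable on host port 5000 (the port published by the docker commands above via `-p 5000:80`) and that the `/generate` response is a JSON object with a `generated_text` field. The names `TGI_URL`, `build_payload`, and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: host port 5000, as published by `-p 5000:80` in the docker commands.
TGI_URL = "http://127.0.0.1:5000/generate"


def build_payload(prompt, max_new_tokens=2048, stop=("### Human:",)):
    """Build the JSON body for /generate using the model's prompt template."""
    return {
        "inputs": f"### Human: {prompt}\n### Assistant:",
        "parameters": {"max_new_tokens": max_new_tokens, "stop": list(stop)},
    }


def generate(prompt, url=TGI_URL, timeout=300):
    """POST the prompt to the server and return the generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    return result["generated_text"]
```

With a running server, `print(generate("Write a function that reverses a string"))` would print the completion. The returned text may still end with the stop string, so stripping it on the client can be worthwhile.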