CodeLlama 34B base model fine-tuned on text chunks from the OpenAssistant-Guanaco dataset rather than on Q&A pairs, so it struggles to determine where an answer ends. Using a stop string such as "### Human:" is recommended to prevent the model from talking to itself.
Prompt template:
### Human: {prompt}
### Assistant:
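The template above can be applied with a few lines of Python. This is only a minimal sketch: `build_prompt` and `truncate_at_stop` are hypothetical helper names, not part of any library or of the model's own tooling. The second helper cuts generated text at the stop string on the client side, which may be useful if the serving backend is not configured with a stop sequence.

```python
STOP_STRING = "### Human:"  # stop string recommended in this model card


def build_prompt(user_message: str) -> str:
    """Wrap a user message in the Human/Assistant prompt template."""
    return f"### Human: {user_message}\n### Assistant:"


def truncate_at_stop(generated: str, stop: str = STOP_STRING) -> str:
    """Drop everything from the first occurrence of the stop string onward."""
    idx = generated.find(stop)
    text = generated if idx == -1 else generated[:idx]
    return text.strip()
```

Here `build_prompt("Write a bubble sort in Python")` produces the two-line prompt, and `truncate_at_stop` can be applied to whatever text the model returns.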
Run the model via text-generation-inference
One GPU:
sudo docker run --gpus all --shm-size 1g -p 5000:80 -v $PWD/models:/data ghcr.io/huggingface/text-generation-inference:latest --max-total-tokens 4096 --quantize awq --model-id mzbac/CodeLlama-34b-guanaco-awq
Two GPUs:
docker run --gpus all --shm-size 1g -p 5000:80 -v $PWD/models:/data ghcr.io/huggingface/text-generation-inference:latest --max-total-tokens 4096 --max-input-length 4000 --max-batch-prefill-tokens 4096 --quantize awq --num-shard 2 --model-id mzbac/CodeLlama-34b-guanaco-awq
Query the model via curl
curl 127.0.0.1:5000/generate \
-X POST \
-d '{"inputs":"### Human: Prepare a travel plan for a trip to Japan for me\n### Assistant:", "parameters":{"max_new_tokens":2048, "stop": ["### Human:"]}}' \
-H 'Content-Type: application/json'
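The same request can be sent from Python using only the standard library. This is a sketch under stated assumptions: the server is assumed to be reachable on host port 5000 (the port published by the `docker run` commands above; adjust `TGI_URL` if your mapping differs), and `build_payload`, `make_request`, and `generate` are hypothetical helper names. The `/generate` endpoint of text-generation-inference returns a JSON object with a `generated_text` field.

```python
import json
import urllib.request

# Assumption: host port 5000 is mapped to the container, as in "-p 5000:80".
TGI_URL = "http://127.0.0.1:5000/generate"


def build_payload(prompt: str, max_new_tokens: int = 2048) -> dict:
    """Build the JSON body, using the prompt template and the stop string."""
    return {
        "inputs": f"### Human: {prompt}\n### Assistant:",
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "stop": ["### Human:"],
        },
    }


def make_request(payload: dict, url: str = TGI_URL) -> urllib.request.Request:
    """Prepare a POST request carrying the payload as JSON."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, url: str = TGI_URL) -> str:
    """Send the prompt to the server and return the generated text."""
    request = make_request(build_payload(prompt), url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["generated_text"]
```

With the container running, `generate("Prepare a travel plan for a trip to Japan for me")` returns the model's answer as a string.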