Initial GPTQ model commit
README.md CHANGED
@@ -45,10 +45,12 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GGML)
 * [Meta's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/CodeLlama-7B-Python-fp16)

-## Prompt template:

 ```
-
 ```

 ## Provided files and GPTQ parameters
@@ -74,12 +76,12 @@ All GPTQ files are made with AutoGPTQ.

 | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
 | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
-| [main](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/main) | 4 | 128 | No | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) |
-| [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) |
-| [gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) |
-| [gptq-4bit-128g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/gptq-4bit-128g-actorder_True) | 4 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) |
-| [gptq-8bit--1g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) |
-| [gptq-8bit-128g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) |

 ## How to download from branches

@@ -139,7 +141,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
         use_safetensors=True,
-        trust_remote_code=
         device="cuda:0",
         use_triton=use_triton,
         quantize_config=None)
@@ -151,13 +153,15 @@ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
         revision="gptq-4bit-32g-actorder_True",
         use_safetensors=True,
-        trust_remote_code=
         device="cuda:0",
         quantize_config=None)
 """

 prompt = "Tell me about AI"
-prompt_template=f'''
 '''

 print("\n\n*** Generate:")
@@ -214,7 +218,7 @@ Donaters will get priority support on any and all AI/LLM/model questions and req

 **Special thanks to**: Aemon Algiz.

-**Patreon special mentions**:


 Thank you to all my generous patrons and donaters!
@@ -45,10 +45,12 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GGML)
 * [Meta's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/CodeLlama-7B-Python-fp16)

+## Prompt template: CodeLlama

 ```
+[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
+{prompt}
+[/INST]
 ```

 ## Provided files and GPTQ parameters
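The prompt template added in this hunk can be filled in programmatically. The following is a minimal sketch, assuming the three template lines are joined by single newlines as they appear in the added lines; `build_prompt` and `INSTRUCTION` are hypothetical names used only for illustration and do not come from the README.

```python
# Hypothetical helper: fills the CodeLlama prompt template shown in the hunk above.
FENCE = "`" * 3  # the literal triple backtick that the instruction text refers to

INSTRUCTION = (
    "Write code to solve the following coding problem that obeys the "
    "constraints and passes the example test cases. "
    f"Please wrap your code answer using {FENCE}:"
)


def build_prompt(problem: str) -> str:
    """Return the [INST] ... [/INST] prompt for a single coding problem."""
    # Assumption: instruction, problem and closing tag each sit on their own line.
    return f"[INST] {INSTRUCTION}\n{problem.strip()}\n[/INST]\n"


if __name__ == "__main__":
    print(build_prompt("Return the sum of a list of integers."))
```

Keeping the instruction text in one constant means the wording only has to match the README in a single place.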
@@ -74,12 +76,12 @@ All GPTQ files are made with AutoGPTQ.

 | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
 | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
+| [main](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/main) | 4 | 128 | No | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 4096 | 3.90 GB | Yes | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
+| [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 4096 | 4.28 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
+| [gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 4096 | 4.02 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+| [gptq-4bit-128g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/gptq-4bit-128g-actorder_True) | 4 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 4096 | 3.90 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+| [gptq-8bit--1g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 4096 | 7.01 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
+| [gptq-8bit-128g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 4096 | 7.16 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |

 ## How to download from branches

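The branch names in the table encode the Bits, GS and Act Order columns. As a rough illustration, the sketch below recovers those values from a branch name and builds the matching `tree/` URL. `parse_branch`, `branch_url` and `BranchParams` are hypothetical names; treating a group size of `-1` as the table's "None" is inferred from the `gptq-8bit--1g-actorder_True` row. The `main` branch does not follow the naming pattern (the table lists it as 4-bit, 128g, no Act Order), so the parser rejects it.

```python
import re
from typing import NamedTuple, Optional

REPO_URL = "https://huggingface.co/TheBloke/CodeLlama-7B-Python-GPTQ"


class BranchParams(NamedTuple):
    bits: int
    group_size: Optional[int]  # None mirrors the table's "None" (branch name uses -1)
    act_order: bool


# Pattern inferred from the branch names listed in the table.
_BRANCH_RE = re.compile(r"^gptq-(\d+)bit-(-?\d+)g-actorder_(True|False)$")


def parse_branch(name: str) -> BranchParams:
    """Decode bits, group size and act-order from a GPTQ branch name."""
    match = _BRANCH_RE.match(name)
    if match is None:
        raise ValueError(f"not a parameterised GPTQ branch name: {name!r}")
    bits, group_size, act_order = match.groups()
    gs = int(group_size)
    return BranchParams(int(bits), None if gs == -1 else gs, act_order == "True")


def branch_url(name: str) -> str:
    """Return the repository URL for a branch, as linked in the table."""
    return f"{REPO_URL}/tree/{name}"


if __name__ == "__main__":
    for branch in ("gptq-4bit-32g-actorder_True", "gptq-8bit--1g-actorder_True"):
        print(branch, parse_branch(branch), branch_url(branch))
```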
@@ -139,7 +141,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
         use_safetensors=True,
+        trust_remote_code=False,
         device="cuda:0",
         use_triton=use_triton,
         quantize_config=None)
@@ -151,13 +153,15 @@ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
         revision="gptq-4bit-32g-actorder_True",
         use_safetensors=True,
+        trust_remote_code=False,
         device="cuda:0",
         quantize_config=None)
 """

 prompt = "Tell me about AI"
+prompt_template=f'''[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
+{prompt}
+[/INST]
 '''

 print("\n\n*** Generate:")
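The two `from_quantized` calls touched by this commit differ only in whether a `revision` is passed. One way to keep them in sync is to build the keyword arguments in a single place, sketched below. `from_quantized_kwargs` is a hypothetical helper, and passing `use_triton` together with `revision` is an assumption: the README's branch example omits that argument.

```python
from typing import Any, Dict, Optional


def from_quantized_kwargs(revision: Optional[str] = None,
                          use_triton: bool = False) -> Dict[str, Any]:
    """Collect the keyword arguments used by the README's from_quantized calls."""
    kwargs: Dict[str, Any] = {
        "use_safetensors": True,
        "trust_remote_code": False,  # the value this commit sets in both calls
        "device": "cuda:0",
        "use_triton": use_triton,    # assumption: also passed when a revision is set
        "quantize_config": None,
    }
    if revision is not None:
        # Branch name from the "Provided files" table, e.g. gptq-4bit-32g-actorder_True
        kwargs["revision"] = revision
    return kwargs


if __name__ == "__main__":
    print(from_quantized_kwargs())
    print(from_quantized_kwargs("gptq-4bit-32g-actorder_True"))
```

With AutoGPTQ installed, the result would be passed as `AutoGPTQForCausalLM.from_quantized(model_name_or_path, **from_quantized_kwargs(...))`, with `model_name_or_path` defined as in the README.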
@@ -214,7 +218,7 @@ Donaters will get priority support on any and all AI/LLM/model questions and req

 **Special thanks to**: Aemon Algiz.

+**Patreon special mentions**: Kacper Wikieł, knownsqashed, Leonard Tan, Asp the Wyvern, Daniel P. Andersen, Luke Pendergrass, Stanislav Ovsiannikov, RoA, Dave, Ai Maven, Kalila, Will Dee, Imad Khwaja, Nitin Borwankar, Joseph William Delisle, Tony Hughes, Cory Kujawski, Rishabh Srivastava, Russ Johnson, Stephen Murray, Lone Striker, Johann-Peter Hartmann, Elle, J, Deep Realms, SuperWojo, Raven Klaugh, Sebastain Graf, ReadyPlayerEmma, Alps Aficionado, Mano Prime, Derek Yates, Gabriel Puliatti, Mesiah Bishop, Magnesian, Sean Connelly, biorpg, Iucharbius, Olakabola, Fen Risland, Space Cruiser, theTransient, Illia Dulskyi, Thomas Belote, Spencer Kim, Pieter, John Detwiler, Fred von Graf, Michael Davis, Swaroop Kallakuri, subjectnull, Clay Pascal, Subspace Studios, Chris Smitley, Enrico Ros, usrbinkat, Steven Wood, alfie_i, David Ziegler, Willem Michiel, Matthew Berman, Andrey, Pyrater, Jeffrey Morgan, vamX, LangChain4j, Luke @flexchar, Trenton Dambrowitz, Pierre Kircher, Alex, Sam, James Bentley, Edmond Seymore, Eugene Pentland, Pedro Madruga, Rainer Wilmers, Dan Guido, Nathan LeClaire, Spiking Neurons AB, Talal Aujan, zynix, Artur Olbinski, Michael Levine, 阿明, K, John Villwock, Nikolai Manek, Femi Adebogun, senxiiz, Deo Leter, NimbleBox.ai, Viktor Bowallius, Geoffrey Montalvo, Mandus, Ajan Kanaga, ya boyyy, Jonathan Leane, webtim, Brandon Frisco, danny, Alexandros Triantafyllidis, Gabriel Tamborski, Randy H, terasurfer, Vadim, Junyu Yang, Vitor Caleffi, Chadd, transmissions 11


 Thank you to all my generous patrons and donaters!