Instructions for using TheBloke/Codegen25-7B-mono-GPTQ with libraries and local apps.

Libraries

How to use TheBloke/Codegen25-7B-mono-GPTQ with Transformers:

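# Use a pipeline as a high-level helper
from transformers import pipeline

# trust_remote_code=True lets Transformers load the repo's custom CodeGen25Tokenizer (needs tiktoken installed)
pipe = pipeline("text-generation", model="TheBloke/Codegen25-7B-mono-GPTQ", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# Without trust_remote_code=True the tokenizer fails with "Tokenizer class CodeGen25Tokenizer does not exist"
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Codegen25-7B-mono-GPTQ", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("TheBloke/Codegen25-7B-mono-GPTQ", trust_remote_code=True)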

Local Apps

vLLM

How to use TheBloke/Codegen25-7B-mono-GPTQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "TheBloke/Codegen25-7B-mono-GPTQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TheBloke/Codegen25-7B-mono-GPTQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/TheBloke/Codegen25-7B-mono-GPTQ

SGLang

How to use TheBloke/Codegen25-7B-mono-GPTQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TheBloke/Codegen25-7B-mono-GPTQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TheBloke/Codegen25-7B-mono-GPTQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TheBloke/Codegen25-7B-mono-GPTQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TheBloke/Codegen25-7B-mono-GPTQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use TheBloke/Codegen25-7B-mono-GPTQ with Docker Model Runner:
```
docker model run hf.co/TheBloke/Codegen25-7B-mono-GPTQ
```

text-generation-webui issue

by bonswouar - opened Jul 23, 2023

Discussion

bonswouar

Jul 23, 2023

When I try to load the model I get:
[...] env\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 699, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class CodeGen25Tokenizer does not exist or is not currently imported.
Did I miss something?

DewEfresh

Jul 24, 2023

I get the same error

TheBloke

Owner Jul 24, 2023

You need to install tiktoken - I included the install command in the README. If you're using the text-generation-webui one-click-installer, tiktoken will need to be installed in the conda environment created for text-generation-webui.

You also need Trust Remote Code = ticked, which I forgot to mention in the README. I'll add that now.

I've never tested this in text-generation-webui as it's a code generation model which isn't really suited for UI use, but I do believe it works if the above are done.
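The two requirements above (tiktoken installed, remote code trusted) can be checked outside the web UI with a few lines of Python. This is only a minimal sketch: the helper names `missing_packages` and `load_tokenizer` are made up for illustration, and it assumes the script runs in the same Python environment that text-generation-webui uses.

```python
import importlib.util

MODEL_ID = "TheBloke/Codegen25-7B-mono-GPTQ"


def missing_packages(required=("tiktoken", "transformers")):
    """Return the names of required top-level packages that cannot be imported."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def load_tokenizer(model_id=MODEL_ID):
    """Load the tokenizer, failing early with an install hint if a package is missing."""
    missing = missing_packages()
    if missing:
        raise RuntimeError("Install first: pip install " + " ".join(missing))
    # Imported lazily so the package check above can run without transformers.
    from transformers import AutoTokenizer

    # trust_remote_code=True is the Python equivalent of ticking
    # "Trust Remote Code" in the UI; it allows the repo's custom
    # CodeGen25Tokenizer class to be loaded.
    return AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```

Calling `load_tokenizer()` downloads the tokenizer files on first use. If it succeeds here but the web UI still raises the `ValueError`, the UI is probably running in a different environment than the one where tiktoken was installed.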

TheBloke

Owner Jul 24, 2023

I've updated the README to make the instructions clearer

Fusseldieb

Jul 24, 2023 • edited Jul 24, 2023

Trust Remote Code = ticked solved it for me. However, I got some warnings:

2023-07-24 17:54:50 INFO:Loading TheBloke_Codegen25-7B-mono-GPTQ...
2023-07-24 17:54:50 INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit-128g', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': True, 'max_memory': {0: '7390MiB', 'cpu': '99GiB'}, 'quantize_config': None, 'use_cuda_fp16': True}
2023-07-24 17:54:52 WARNING:The safetensors archive passed at models\TheBloke_Codegen25-7B-mono-GPTQ\gptq_model-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
2023-07-24 17:54:56 WARNING:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
2023-07-24 17:54:56 INFO:Loaded the model in 5.81 seconds.
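For anyone who wants to reproduce this load without the web UI, the parameters in the log map onto keyword arguments of AutoGPTQ's `from_quantized`. The following is a rough sketch, assuming the `auto-gptq` package, tiktoken, and a CUDA GPU are available; the helper names are hypothetical, and the machine-specific `max_memory` entry from the log is left out.

```python
def autogptq_kwargs(device="cuda:0"):
    """Keyword arguments mirroring the AutoGPTQ params printed in the log."""
    return {
        "model_basename": "gptq_model-4bit-128g",
        "device": device,
        # With use_triton=False, AutoGPTQ skips fused-MLP injection,
        # which is what the "skip module injection" warning reports.
        "use_triton": False,
        "inject_fused_attention": True,
        "inject_fused_mlp": True,
        "use_safetensors": True,
        "trust_remote_code": True,
        # None means the quantization config is read from the model directory.
        "quantize_config": None,
        "use_cuda_fp16": True,
    }


def load_quantized(model_dir):
    """Load the quantized model from a local directory (needs auto-gptq and a GPU)."""
    # Imported lazily so the kwargs can be inspected without auto-gptq installed.
    from auto_gptq import AutoGPTQForCausalLM

    return AutoGPTQForCausalLM.from_quantized(model_dir, **autogptq_kwargs())
```

A call such as `load_quantized("models/TheBloke_Codegen25-7B-mono-GPTQ")` should print the same safetensors-metadata and fused-MLP warnings; the model still loads, as the final INFO line in the log shows.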
