Instructions for using TheBloke/deepseek-coder-33B-instruct-GPTQ with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use TheBloke/deepseek-coder-33B-instruct-GPTQ with Transformers:
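Heads-up before the snippets below (an assumption based on typical GPTQ setups, not something stated on this page): loading a GPTQ checkpoint through Transformers generally needs a CUDA GPU plus a GPTQ kernel backend, and the exact packages depend on your Transformers version, so verify against the current docs:

```sh
pip install transformers optimum
# One GPTQ backend is required; auto-gptq on older stacks, gptqmodel on newer ones
pip install auto-gptq
```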
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TheBloke/deepseek-coder-33B-instruct-GPTQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBloke/deepseek-coder-33B-instruct-GPTQ")
model = AutoModelForCausalLM.from_pretrained("TheBloke/deepseek-coder-33B-instruct-GPTQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use TheBloke/deepseek-coder-33B-instruct-GPTQ with vLLM:
Install from pip and serve the model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "TheBloke/deepseek-coder-33B-instruct-GPTQ"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TheBloke/deepseek-coder-33B-instruct-GPTQ",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker
```sh
docker model run hf.co/TheBloke/deepseek-coder-33B-instruct-GPTQ
```
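Since the server speaks the OpenAI-compatible API, you can also call it from Python instead of curl; a minimal sketch assuming the `openai` package (v1+) and the server running on localhost:8000 as above:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused but required
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TheBloke/deepseek-coder-33B-instruct-GPTQ",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```

The same client works against the SGLang server below by changing `base_url` to port 30000.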
- SGLang
How to use TheBloke/deepseek-coder-33B-instruct-GPTQ with SGLang:
Install from pip and serve the model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "TheBloke/deepseek-coder-33B-instruct-GPTQ" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TheBloke/deepseek-coder-33B-instruct-GPTQ",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "TheBloke/deepseek-coder-33B-instruct-GPTQ" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TheBloke/deepseek-coder-33B-instruct-GPTQ",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
- Docker Model Runner
How to use TheBloke/deepseek-coder-33B-instruct-GPTQ with Docker Model Runner:
```sh
docker model run hf.co/TheBloke/deepseek-coder-33B-instruct-GPTQ
```
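`docker model run` pulls the model and drops you into an interactive chat; it also accepts a one-shot prompt argument. The sketch below follows the Docker Model Runner CLI shape as I understand it, so verify against `docker model run --help`:

```sh
# One-shot prompt instead of the interactive chat
docker model run hf.co/TheBloke/deepseek-coder-33B-instruct-GPTQ "Write a Python function that reverses a string."
```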
Model doesn't work
INPUT:

```python
import sys
import time

# (config, model, tokenizer and the ExLlamaV2 streaming generator are assumed
# to be initialized earlier in the script)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1
# settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

max_new_tokens = 10

# Prompt
prompt = f"""Write a working python code.
### Instruction:
Write a working python code to generate 100 random numbers.
### Response:
"""

input_ids = tokenizer.encode(prompt)
prompt_tokens = input_ids.shape[-1]

# Make sure CUDA is initialized so we can measure performance
generator.warmup()

# Send prompt to generator to begin stream
time_begin_prompt = time.time()
print(prompt, end="")
sys.stdout.flush()

generator.set_stop_conditions([])
generator.begin_stream(input_ids, settings)

time_begin_stream = time.time()
generated_tokens = 0

while True:
    chunk, eos, _ = generator.stream()
    generated_tokens += 1
    print(chunk, end="")
    sys.stdout.flush()
    if eos or generated_tokens == max_new_tokens:
        break

time_end = time.time()

time_prompt = time_begin_stream - time_begin_prompt
time_tokens = time_end - time_begin_stream

print()
print()
print(f"Prompt processed in {time_prompt:.2f} seconds, {prompt_tokens} tokens, {prompt_tokens / time_prompt:.2f} tokens/second")
print(f"Response generated in {time_tokens:.2f} seconds, {generated_tokens} tokens, {generated_tokens / time_tokens:.2f} tokens/second")
```
OUTPUT:

```text
Write a working python code.
### Instruction:
Write a working python code to generate 100 random numbers.
### Response:

Prompt processed in 0.00 seconds, 32 tokens, 27396.96 tokens/second
Response generated in 0.43 seconds, 10 tokens, 23.49 tokens/second
```
Okay, I had to manually set the rope_scale to 4.0. But GPTQ doesn't print the EOS token.
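For anyone else hitting this: in ExLlamaV2 the linear RoPE scaling factor lives on the config object. A minimal sketch — the attribute name `scale_pos_emb` is my assumption from the ExLlamaV2 source of that era, so double-check against your version:

```python
from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/deepseek-coder-33B-instruct-GPTQ"
config.prepare()
# deepseek-coder ships with a linear rope scaling factor of 4.0 in config.json,
# which some loaders don't pick up automatically.
config.scale_pos_emb = 4.0
```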
Hi, I'm running into a similar issue with vLLM. Do you mean the root cause is rope_scale? Where can I modify this? Thank you.
I have an issue loading this model with Text Generation Web UI. It gives me the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 125: character maps to <undefined>". Does anyone have an idea how to solve this?
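That error usually means a UTF-8 file (often a config or tokenizer file) is being read with Windows' default cp1252 codec. One hedged workaround is to force Python into UTF-8 mode (PEP 540) before launching; the `server.py` entry point and `--model` flag are assumptions about a typical webui setup:

```sh
# Windows (cmd): force UTF-8 file I/O regardless of the system locale
set PYTHONUTF8=1
python server.py --model TheBloke_deepseek-coder-33B-instruct-GPTQ
```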