Instructions to use TheBloke/CodeLlama-13B-AWQ with libraries, inference providers, notebooks, and local apps.

Libraries

How to use TheBloke/CodeLlama-13B-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TheBloke/CodeLlama-13B-AWQ")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBloke/CodeLlama-13B-AWQ")
model = AutoModelForCausalLM.from_pretrained("TheBloke/CodeLlama-13B-AWQ", device_map="auto")


vLLM

How to use TheBloke/CodeLlama-13B-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "TheBloke/CodeLlama-13B-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TheBloke/CodeLlama-13B-AWQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
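
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`. The helper names `build_completion_request` and `complete` are hypothetical and are not part of vLLM. The response parsing assumes the usual OpenAI-style completions shape (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request, timeout=60):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Mirrors the curl call above; building the request does not contact the server.
request = build_completion_request(
    "http://localhost:8000", "TheBloke/CodeLlama-13B-AWQ", "Once upon a time,"
)
# print(complete(request))  # uncomment once the server is running
```

Passing `http://localhost:30000` as `base_url` should work the same way for a server that exposes the same route on that port.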

Use Docker

docker model run hf.co/TheBloke/CodeLlama-13B-AWQ

SGLang

How to use TheBloke/CodeLlama-13B-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TheBloke/CodeLlama-13B-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TheBloke/CodeLlama-13B-AWQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TheBloke/CodeLlama-13B-AWQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TheBloke/CodeLlama-13B-AWQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

codellama-13b-awq RuntimeError: CUDA error: an illegal memory access was encountered

by leojames - opened Oct 31, 2023

When I run inference on CodeLlama using the method above, I get a CUDA kernel error. Can anyone help me understand why? The log and traceback follow.

WARNING: WatchFiles detected changes in 'fastapi_vllm_codellama.py'. Reloading...
INFO 10-31 16:58:55 llm_engine.py:72] Initializing an LLM engine with config: model='./CodeLlama-13B-AWQ', tokenizer='./CodeLlama-13B-AWQ', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-31 16:58:55 tokenizer.py:30] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Process SpawnProcess-46:
Traceback (most recent call last):
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/_subprocess.py", line 76, in subprocess_started
target(sockets=sockets)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/server.py", line 61, in run
return asyncio.run(self.serve(sockets=sockets))
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/server.py", line 68, in serve
config.load()
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/config.py", line 473, in load
self.loaded_app = import_from_string(self.app)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/importer.py", line 21, in import_from_string
module = importlib.import_module(module_str)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
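File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/mnt/gpu/code/fastapi_vllm_codellama.py", line 22, in <module>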
llm = LLM(model="./CodeLlama-13B-AWQ", quantization="awq")
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 89, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 229, in from_engine_args
engine = cls(*engine_configs,
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 111, in __init__
self._init_cache()
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 191, in _init_cache
num_blocks = self._run_workers(
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 692, in _run_workers
output = executor(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/worker/worker.py", line 109, in profile_num_available_blocks
self.model(
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 297, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
hidden_states = layer(
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 216, in forward
hidden_states = self.mlp(hidden_states)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 81, in forward
gate_up, _ = self.gate_up_proj(x)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 238, in forward
output_parallel = self.apply_weights(input_parallel, bias)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/layers/quantized_linear/awq.py", line 55, in apply_weights
out = quantization_ops.awq_gemm(reshaped_x, self.qweight, self.scales,
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
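
CUDA errors are often reported asynchronously, so the frame shown in the traceback may not be the kernel that actually faulted. One way to narrow this down is to repeat the failing `LLM(...)` call from the traceback in a standalone script, outside uvicorn's reloader, with `CUDA_LAUNCH_BLOCKING=1` set so that launches run synchronously. The sketch below is a diagnostic aid only and is not a fix. The helper names `cuda_debug_env` and `run_repro` are hypothetical, and the prompt is an arbitrary placeholder.

```python
import os


def cuda_debug_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_repro(model_path="./CodeLlama-13B-AWQ"):
    """Load the model as in the traceback and generate a few tokens.

    Requires vLLM and a CUDA GPU; nothing here runs until the function is called.
    """
    # The variable must be set before CUDA is initialised, so set it first
    # and import vLLM only afterwards.
    os.environ.update(cuda_debug_env())
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path, quantization="awq")  # same arguments as the failing call
    outputs = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=32))
    return outputs[0].outputs[0].text
```

Calling `run_repro()` on the GPU machine should show whether the error also occurs outside the FastAPI process. If it does, the synchronous traceback is more likely to point at the kernel that faulted.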
