Instructions to use TheBloke/CodeLlama-13B-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use TheBloke/CodeLlama-13B-AWQ with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="TheBloke/CodeLlama-13B-AWQ")# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("TheBloke/CodeLlama-13B-AWQ") model = AutoModelForCausalLM.from_pretrained("TheBloke/CodeLlama-13B-AWQ") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use TheBloke/CodeLlama-13B-AWQ with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "TheBloke/CodeLlama-13B-AWQ" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "TheBloke/CodeLlama-13B-AWQ", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/TheBloke/CodeLlama-13B-AWQ
- SGLang
How to use TheBloke/CodeLlama-13B-AWQ with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "TheBloke/CodeLlama-13B-AWQ" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "TheBloke/CodeLlama-13B-AWQ", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "TheBloke/CodeLlama-13B-AWQ" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "TheBloke/CodeLlama-13B-AWQ", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use TheBloke/CodeLlama-13B-AWQ with Docker Model Runner:
docker model run hf.co/TheBloke/CodeLlama-13B-AWQ
codellama-13b-awq RuntimeError: CUDA error: an illegal memory access was encountered
When I use the above method for inference with Codellama, I encounter CUDA kernel errors. Please help me understand why?
WARNING: WatchFiles detected changes in 'fastapi_vllm_codellama.py'. Reloading...
INFO 10-31 16:58:55 llm_engine.py:72] Initializing an LLM engine with config: model='./CodeLlama-13B-AWQ', tokenizer='./CodeLlama-13B-AWQ', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-31 16:58:55 tokenizer.py:30] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Process SpawnProcess-46:
Traceback (most recent call last):
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/_subprocess.py", line 76, in subprocess_started
target(sockets=sockets)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/server.py", line 61, in run
return asyncio.run(self.serve(sockets=sockets))
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/server.py", line 68, in serve
config.load()
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/config.py", line 473, in load
self.loaded_app = import_from_string(self.app)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/importer.py", line 21, in import_from_string
module = importlib.import_module(module_str)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1006, in _find_and_load_unlocked
File "", line 688, in _load_unlocked
File "", line 883, in exec_module
File "", line 241, in _call_with_frames_removed
File "/mnt/gpu/code/fastapi_vllm_codellama.py", line 22, in
llm = LLM(model="./CodeLlama-13B-AWQ", quantization="awq")
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 89, in init
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 229, in from_engine_args
engine = cls(*engine_configs,
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 111, in init
self._init_cache()
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 191, in _init_cache
num_blocks = self._run_workers(
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 692, in _run_workers
output = executor(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/worker/worker.py", line 109, in profile_num_available_blocks
self.model(
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 297, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
hidden_states = layer(
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 216, in forward
hidden_states = self.mlp(hidden_states)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 81, in forward
gate_up, _ = self.gate_up_proj(x)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 238, in forward
output_parallel = self.apply_weights(input_parallel, bias)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/layers/quantized_linear/awq.py", line 55, in apply_weights
out = quantization_ops.awq_gemm(reshaped_x, self.qweight, self.scales,
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.