ibm-ai-platform/codellama-13b-accelerator

JRosenkranz committed on Apr 23, 2024

Commit 005b255 (verified) · 1 Parent(s): e1d9017

Update README.md

Files changed (1)

README.md +54 -6

@@ -40,17 +40,45 @@ _Note: For all samples, your environment must have access to cuda_
 #### Setup
 ```bash
-docker pull quay.io/wxpe/text-gen-server:main.ee927a4
 docker run -d --rm --gpus all \
     --name my-tgis-server \
     -p 8033:8033 \
-    -v /path/to/all/models:/models \
-    -e MODEL_NAME=/models/model_weights/llama/CodeLlama-13b-Instruct-hf \
-    -e SPECULATOR_NAME=/models/speculator_weights/llama/codellama-13b-accelerator \
     -e FLASH_ATTENTION=true \
     -e PAGED_ATTENTION=true \
-    -e DTYPE_STR=float16 \
-    quay.io/wxpe/text-gen-server:main.ee927a4
 # check logs and wait for "gRPC server started on port 8033" and "HTTP server started on port 3000"
 docker logs my-tgis-server -f
@@ -74,6 +102,26 @@ _Note: first prompt may be slower as there is a slight warmup time_
 ### Minimal Sample
 #### Install
 ```bash

 #### Setup
 ```bash
+HF_HUB_CACHE=/hf_hub_cache
+HF_HUB_TOKEN="your huggingface hub token"
+TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.ee927a4
+docker pull $TGIS_IMAGE
+# optionally download CodeLlama-13b-Instruct-hf if the weights do not already exist
+docker run --rm \
+    -v $HF_HUB_CACHE:/models \
+    -e HF_HUB_CACHE=/models \
+    -e TRANSFORMERS_CACHE=/models \
+    $TGIS_IMAGE \
+    text-generation-server download-weights \
+    codellama/CodeLlama-13b-Instruct-hf \
+    --token $HF_HUB_TOKEN
+# optionally download the speculator model if the weights do not already exist
+docker run --rm \
+    -v $HF_HUB_CACHE:/models \
+    -e HF_HUB_CACHE=/models \
+    -e TRANSFORMERS_CACHE=/models \
+    $TGIS_IMAGE \
+    text-generation-server download-weights \
+    ibm-fms/codellama-13b-accelerator \
+    --token $HF_HUB_TOKEN
+# note: if the weights were downloaded separately (not with the above commands), place them in the HF_HUB_CACHE directory and refer to them as /models/<model_name>
 docker run -d --rm --gpus all \
     --name my-tgis-server \
     -p 8033:8033 \
+    -v $HF_HUB_CACHE:/models \
+    -e HF_HUB_CACHE=/models \
+    -e TRANSFORMERS_CACHE=/models \
+    -e MODEL_NAME=codellama/CodeLlama-13b-Instruct-hf \
+    -e SPECULATOR_NAME=ibm-fms/codellama-13b-accelerator \
     -e FLASH_ATTENTION=true \
     -e PAGED_ATTENTION=true \
+    -e DTYPE=float16 \
+    $TGIS_IMAGE
 # check logs and wait for "gRPC server started on port 8033" and "HTTP server started on port 3000"
 docker logs my-tgis-server -f
 ### Minimal Sample
+*To try this out with the fms-native compiled model, please execute the following:*
+#### Install
+```bash
+git clone https://github.com/foundation-model-stack/fms-extras
+(cd fms-extras && pip install -e .)
+pip install transformers==4.35.0 sentencepiece numpy
+```
+#### Run Sample
+```bash
+python sample_client.py
+```
+_Note: first prompt may be slower as there is a slight warmup time_
+### Minimal Sample
 #### Install
 ```bash