Instructions to use togethercomputer/GPT-JT-6B-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use togethercomputer/GPT-JT-6B-v1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="togethercomputer/GPT-JT-6B-v1")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")

Local Apps

vLLM

How to use togethercomputer/GPT-JT-6B-v1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "togethercomputer/GPT-JT-6B-v1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/GPT-JT-6B-v1",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
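
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on localhost:8000 and that the response follows the OpenAI completions shape (`choices[0].text`). The helper names `build_completion_payload` and `complete` are hypothetical, not part of vLLM.

```python
import json
import urllib.request

MODEL_NAME = "togethercomputer/GPT-JT-6B-v1"


def build_completion_payload(prompt, max_tokens=512, temperature=0.5):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, base_url="http://localhost:8000"):
    """POST the payload to the completions endpoint and return the generated text.

    Assumption: the server is already running at base_url.
    """
    data = json.dumps(build_completion_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style completions responses carry the text under choices[0].text.
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` should return the continuation as a string. Passing a different `base_url` (for example one ending in port 30000) points the same helper at another OpenAI-compatible server.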

Use Docker

docker model run hf.co/togethercomputer/GPT-JT-6B-v1

SGLang

How to use togethercomputer/GPT-JT-6B-v1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "togethercomputer/GPT-JT-6B-v1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/GPT-JT-6B-v1",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "togethercomputer/GPT-JT-6B-v1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/GPT-JT-6B-v1",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use togethercomputer/GPT-JT-6B-v1 with Docker Model Runner:
```
docker model run hf.co/togethercomputer/GPT-JT-6B-v1
```

Hardware requirements for inference?

by spartanml - opened Dec 5, 2022

Discussion

spartanml

Dec 5, 2022

Where can I find the hardware requirements for this model? (Specifically, can it run on 3060/12GB)?

juewang

Together org Dec 7, 2022

Theoretically, GPT-JT cannot run on a single 3060 12GB: the model weights alone take up ~12GB, so there is not enough memory left for inference. I'd recommend VRAM >= 16GB. An alternative is to use multiple 3060 GPUs with accelerate:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

# Load model to CPU in float16 so the weights match the dtype used for the device map below
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-JT-6B-v1", torch_dtype=torch.float16
)

# Compute a per-GPU memory budget that spreads the model evenly across all visible GPUs
max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTJBlock"],
    dtype='float16',
    low_zero=False,
)

# Assign each module to a device, keeping every GPTJBlock on a single GPU
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTJBlock"],
    dtype='float16'
)

# Move the weights to their assigned devices
model = dispatch_model(model, device_map=device_map)
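
For reference, the ~12GB figure quoted above is roughly what you get from multiplying the parameter count by the bytes per parameter. A rough sketch of that arithmetic follows; the round figure of 6 billion parameters is an assumption taken from the model name, and the result covers the weights only, not activations or other runtime overhead.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Assumption: a round 6 billion parameters, taken from the "6B" in the model name.
N_PARAMS = 6e9


def weight_memory_gb(n_params, dtype):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def weights_fit(vram_gb, n_params=N_PARAMS, dtype="float16"):
    """True only if the weights are strictly smaller than the card's memory.

    Even when this returns True, inference still needs headroom on top.
    """
    return weight_memory_gb(n_params, dtype) < vram_gb


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(N_PARAMS, dtype):.0f} GB of weights")
```

Under these assumptions the float16 weights already fill a 12GB card, which matches the recommendation of 16GB or more, or splitting the model across several GPUs.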

billy-ai

Dec 7, 2022

I'm using this code and inference still takes ~12 seconds. I use NVIDIA T4 x 2. For inference I call model.generate. Do you know if I need to do anything else to make it use the GPU?

Do you have a code snippet with an inference example, which uses GPU? :) That would be awesome.

Thanks for the good work!

juewang

Together org Dec 9, 2022

@billy-ai Sorry for the late reply. If you use this code, inference should run on the GPU. How many tokens were you trying to generate? Generation can be slow if max_new_tokens is large.

If you use a T4 with 16GB VRAM, simply moving the model to the GPU with model = model.half().to('cuda:0') and calling output = model.generate(input_ids, max_new_tokens=10) is enough to run on the GPU.
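
Since a complete snippet was requested, here is a minimal single-GPU sketch of the steps described above: load in half precision, move the model to the GPU, and put the tokenized prompt on the same device before calling generate. The helper names `pick_device` and `generate_text` are hypothetical, and the CPU fallback is an addition for convenience rather than something from this thread.

```python
MODEL_NAME = "togethercomputer/GPT-JT-6B-v1"


def pick_device(cuda_available):
    """Use the first GPU when CUDA is available, otherwise fall back to CPU."""
    return "cuda:0" if cuda_available else "cpu"


def generate_text(prompt, max_new_tokens=10):
    """Generate a continuation of prompt, on GPU when one is available."""
    # Heavy imports live inside the function so the helper above stays lightweight.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    device = pick_device(torch.cuda.is_available())
    # float16 halves the memory footprint on GPU; keep float32 on CPU.
    dtype = torch.float16 if device != "cpu" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=dtype)
    model = model.to(device).eval()

    # The inputs must live on the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Calling `generate_text("Once upon a time,")` downloads the model on first use. Generation time grows with `max_new_tokens`, so a small value is a quick way to check that the GPU is actually being used.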

Ascendant

Jan 23, 2023

If I only have a 3070 with 8 GB of VRAM but a lot of regular RAM (46 GB), can I get away with running it on the CPU instead? I don't mind if it's much slower.

juewang

Together org Jan 26, 2023

Sure, you can run it on CPU without any problem. You can also try quantization: model = AutoModelForCausalLM.from_pretrained('togethercomputer/GPT-JT-6B-v1', device_map='auto', load_in_8bit=True, int8_threshold=6.0) :)
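
The keyword arguments in the line above reflect the transformers release available at the time. In more recent releases the 8-bit options are, as far as I know, passed through a `BitsAndBytesConfig` object instead. A hedged sketch of that variant follows; it assumes the bitsandbytes and accelerate packages are installed and a CUDA GPU is present, and the helper names `int8_options` and `load_int8_model` are hypothetical.

```python
MODEL_NAME = "togethercomputer/GPT-JT-6B-v1"


def int8_options(threshold=6.0):
    """8-bit settings mirroring load_in_8bit=True and a threshold of 6.0."""
    return {"load_in_8bit": True, "llm_int8_threshold": threshold}


def load_int8_model(threshold=6.0):
    """Load the model with 8-bit weights and automatic device placement.

    Assumption: a recent transformers release with bitsandbytes available.
    """
    # Imported lazily because these packages are only needed for the actual load.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(**int8_options(threshold))
    return AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        device_map="auto",
        quantization_config=quant_config,
    )
```

Whether the quantized model fits entirely in 8 GB of VRAM is not confirmed in this thread, so treat this as something to try rather than a guarantee.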

spartanml

Feb 4, 2023

Thanks! Sadly, won't be able to get another GPU soon!
