Hardware requirements for inference?
Where can I find the hardware requirements for this model? (Specifically, can it run on 3060/12GB)?
Theoretically, GPT-JT cannot run on a single 3060 12GB: the model weights alone take up ~12GB in float16, so there is not enough memory left for inference. I'd recommend VRAM >= 16GB. An alternative is to use multiple 3060 GPUs with accelerate:
from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

# Load model to CPU
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")

max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTJBlock"],
    dtype='float16',
    low_zero=False,
)

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTJBlock"],
    dtype='float16'
)

model = dispatch_model(model, device_map=device_map)
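The ~12GB figure in the answer above can be checked with back-of-the-envelope arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes about 6 billion parameters (a round number taken from the "6B" in the model name, not an exact count) and ignores activations and framework overhead, which is why headroom beyond the weights is needed.

```python
# Rough weight-memory estimate. PARAM_COUNT is an assumed round figure
# based on the "6B" in the model name, not an exact parameter count.
PARAM_COUNT = 6_000_000_000

# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(dtype: str, param_count: int = PARAM_COUNT) -> float:
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(dtype):.0f} GB")
```

Under these assumptions the float16 weights alone come to about 12 GB, which already fills a 12GB card before any memory is used for generation.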
I'm using this code and inference still takes ~12 seconds. I'm running on 2 x NVIDIA T4. For inference I call model.generate; do you know if I need to do anything else to make it use the GPU?
Do you have a code snippet with an inference example that uses the GPU? :) That would be awesome.
Thanks for the good work!
@billy-ai Sorry for the late reply. If you use this code, inference should run on the GPU.
-- How many tokens were you trying to generate? It can be slow if max_new_tokens is large.
If you use a T4 with 16GB VRAM, simply moving the model to the GPU with model = model.half().to('cuda:0') and calling output = model.generate(input_ids, max_new_tokens=10) is enough to run inference on the GPU.
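Put together, the single-GPU path described above might look like the following. This is a minimal sketch, not the authors' reference code: `generate_on_gpu` is a hypothetical helper name, and it assumes one CUDA device with enough VRAM for the float16 weights (such as the 16GB T4 mentioned above). The detail that is easy to miss is that the tokenized inputs must be moved to the same device as the model.

```python
def generate_on_gpu(prompt, max_new_tokens=10, device="cuda:0"):
    """Generate a short continuation of `prompt` on a single GPU.

    Hypothetical helper for illustration; assumes `device` is a CUDA GPU
    with enough memory for the float16 weights.
    """
    # Imported inside the function so that defining the helper does not
    # load the heavy dependencies until generation is requested.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "togethercomputer/GPT-JT-6B-v1"
    tokenizer = AutoTokenizer.from_pretrained(name)

    # Load on CPU, cast to half precision, then move to the GPU,
    # mirroring model.half().to('cuda:0') from the reply above.
    model = AutoModelForCausalLM.from_pretrained(name).half().to(device)
    model.eval()

    # The inputs must live on the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)

    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Calling `generate_on_gpu("Once upon a time,")` would return the prompt plus up to 10 new tokens. Generation time grows with `max_new_tokens`, so a small value is a quick way to check that the GPU is actually being used.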
If I only have a 3070 with only 8 GB of VRAM but a lot of regular RAM (46 GB), can I get away with running it on the CPU instead? I don't mind if it's much slower.
Sure, you can run it on CPU without any problem. You can also try 8-bit quantization: model = AutoModelForCausalLM.from_pretrained('togethercomputer/GPT-JT-6B-v1', device_map='auto', load_in_8bit=True, int8_threshold=6.0) :)
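Depending on the installed transformers version, the 8-bit options may need to be passed through a `BitsAndBytesConfig` object rather than as direct keyword arguments. The following is a minimal sketch of that variant, assuming bitsandbytes and accelerate are installed and a CUDA GPU is available; `load_8bit` is a hypothetical helper name, and the threshold simply reuses the 6.0 from the reply above.

```python
def load_8bit(name="togethercomputer/GPT-JT-6B-v1", threshold=6.0):
    """Load the model with 8-bit weights (hypothetical helper).

    Assumes the bitsandbytes and accelerate packages are installed and
    that a CUDA GPU is available.
    """
    # Imported lazily so that defining the helper stays lightweight.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Same settings as the one-liner above, expressed as a config object.
    quant_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_threshold=threshold,
    )

    # device_map="auto" lets accelerate place the layers on the
    # available devices.
    return AutoModelForCausalLM.from_pretrained(
        name,
        device_map="auto",
        quantization_config=quant_config,
    )
```

With 8-bit weights the model needs roughly half the memory of float16, so whether it fits on an 8 GB card depends on the overhead left for generation.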
Thanks! Sadly, won't be able to get another GPU soon!