Feature requests and suggestions for V2
We are starting to work on V2 and would love to hear your suggestions and top requests!
Increase the input sequence length beyond 2048 tokens.
Can you also train it with reinforcement learning, like OpenAI?
Sparse Upcycling might be cool to try! https://twitter.com/arankomatsuzaki/status/1602126140696629249?s=20&t=qnFaselW3mXcm-UZn7ISlA
Great work! Any timeline on when V2 will be available?
We are very interested in using GPT-JT for our BLIP-2 model: https://twitter.com/LiJunnan0409/status/1620259379223343107
From our current experiments, GPT-JT v1 outperforms OPT-6.7B but still underperforms FLAN-T5.
I have an RTX 3090. How long should I expect the model to take to load and respond if it's loaded locally? I was hoping the model would stay loaded, like with Stable Diffusion, so that I could continue to use it without having to reload it each time I call the program.
If you're using the from_pretrained function to load the model locally, it typically takes around 2-3 minutes; most of this time is spent on random initialization. And sure, you can keep it loaded so that you don't have to reload it each time you call the program.
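One way to keep the model resident, as suggested above, is to load it once inside a long-running process and reuse the same object for every request. The sketch below is only an illustration: `ModelHolder` and `load_gpt_jt` are hypothetical names, not part of Transformers, and the loader assumes the standard `pipeline("text-generation", ...)` helper with default settings.

```python
from typing import Any, Callable

MODEL_NAME = "togethercomputer/GPT-JT-6B-v1"


class ModelHolder:
    """Hypothetical helper: loads a model on first use, then reuses it."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Any = None
        self._loaded = False
        self.load_count = 0  # how many times the slow load actually ran

    def get(self) -> Any:
        # Pay the load cost only once per process.
        if not self._loaded:
            self._model = self._loader()
            self._loaded = True
            self.load_count += 1
        return self._model


def load_gpt_jt() -> Any:
    # Imported lazily so that defining the helper stays cheap.
    # Assumption: default dtype and device; adjust for your GPU.
    from transformers import pipeline

    return pipeline("text-generation", model=MODEL_NAME)


holder = ModelHolder(load_gpt_jt)  # nothing is loaded yet
```

In an interactive session or a small server loop, `holder.get()("Once upon a time,", max_new_tokens=32)` would trigger the slow load on the first call only. Starting a fresh Python process still means loading the model again.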
The inference response time depends on your generation configuration, particularly the max_new_tokens setting. Generally, the response time grows linearly with max_new_tokens. For most configurations, the response time is several seconds at most; if your expected response is short, you can set a small max_new_tokens value to speed up inference.
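To check the linear relationship on your own hardware, you could time a few generations at different max_new_tokens values and fit a straight line through the measurements. This is a rough sketch, not a benchmark: `time_generation` and `fit_line` are hypothetical helpers, and `generate` stands for any callable that accepts a prompt and a `max_new_tokens` keyword (a Transformers text-generation pipeline does).

```python
import time
from typing import Callable, List, Sequence, Tuple


def time_generation(
    generate: Callable[..., object],
    prompt: str,
    token_budgets: Sequence[int],
) -> List[Tuple[int, float]]:
    """Return (max_new_tokens, seconds) pairs, one per budget."""
    timings = []
    for budget in token_budgets:
        start = time.perf_counter()
        generate(prompt, max_new_tokens=budget)
        timings.append((budget, time.perf_counter() - start))
    return timings


def fit_line(points: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # approximate seconds per generated token
    intercept = mean_y - slope * mean_x  # approximate fixed overhead
    return slope, intercept
```

With a loaded pipeline `pipe`, `fit_line(time_generation(pipe, "Once upon a time,", [16, 64, 256]))` gives a per-token cost and a fixed overhead. Generation can stop early at an end-of-sequence token, so measured times may fall below the fitted line for large budgets.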