Instructions to use togethercomputer/GPT-NeoXT-Chat-Base-20B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use togethercomputer/GPT-NeoXT-Chat-Base-20B with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="togethercomputer/GPT-NeoXT-Chat-Base-20B")# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-NeoXT-Chat-Base-20B") model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-NeoXT-Chat-Base-20B") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use togethercomputer/GPT-NeoXT-Chat-Base-20B with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "togethercomputer/GPT-NeoXT-Chat-Base-20B" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "togethercomputer/GPT-NeoXT-Chat-Base-20B", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/togethercomputer/GPT-NeoXT-Chat-Base-20B
- SGLang
How to use togethercomputer/GPT-NeoXT-Chat-Base-20B with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "togethercomputer/GPT-NeoXT-Chat-Base-20B" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "togethercomputer/GPT-NeoXT-Chat-Base-20B", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "togethercomputer/GPT-NeoXT-Chat-Base-20B" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "togethercomputer/GPT-NeoXT-Chat-Base-20B", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use togethercomputer/GPT-NeoXT-Chat-Base-20B with Docker Model Runner:
docker model run hf.co/togethercomputer/GPT-NeoXT-Chat-Base-20B
Will it be possible to run this on PC with 8 GeForce RTX 3060 with 8 Gb VRAM each?
#11
by ai2p - opened
Can it correctly span VRAM between many GPU cards? Or it needs to have all required VRAM in one videocard only?
Yes
@ai2p Sure you can! Here is an example to load model across multiple devices (need to install accelerate first):
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from accelerate.utils import get_balanced_memory, infer_auto_device_map
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM
import torch
def load_model(model_name):
weights_path = snapshot_download(model_name)
config = AutoConfig.from_pretrained(model_name)
# This will init model with meta tensors, which basically does nothing.
with init_empty_weights():
model = AutoModelForCausalLM.from_config(config)
max_memory = get_balanced_memory(
model,
max_memory=None,
no_split_module_classes=["GPTNeoXLayer"],
dtype='float16',
low_zero=False,
)
device_map = infer_auto_device_map(
model,
max_memory=max_memory,
no_split_module_classes=["GPTNeoXLayer"],
dtype='float16'
)
model = load_checkpoint_and_dispatch(
model, weights_path, device_map=device_map, no_split_module_classes=["GPTNeoXLayer"]
)
return model
model_name = 'togethercomputer/GPT-NeoXT-Chat-Base-20B'
model = load_model(model_name)