Confused about bidirectional attention when implementing custom sampling loop

#25

by ericanthonymitchell - opened Mar 5, 2023

Discussion

ericanthonymitchell

Mar 5, 2023

I'm trying to implement a custom sampling loop for GPT-JT because I need some features that model.generate doesn't support. However, I'm a bit confused about how the bidirectional attention mask is tracked. Can someone point me to the code where GPT-JT's bidirectional vs. causal masking is controlled?

In this answer, @juewang mentions that the causal attention mask for GPT-JT is set to 1 by default. However, loading GPT-JT with transformers.AutoModelForCausalLM.from_pretrained just loads a normal GPT-J model, and the attention bias for GPT-J defaults to causal attention, as far as I can tell from here.

Could someone explain what I'm missing? I'm confused about how GPT-JT can implement custom attention masking, when there doesn't seem to be any GPT-JT-specific code in HuggingFace (just relying on GPT-J).

Thanks!
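For readers following the question, here is a minimal sketch of the difference between a causal mask and an unrestricted one. It assumes the attention layer slices a square boolean buffer by query and key length, which is how the GPT-J attention in transformers appeared to work at the time of this thread. The helper slice_mask, the toy size of 6, and the all-True buffer are illustrative assumptions, not values read from the GPT-JT checkpoint.

```python
import torch


def slice_mask(bias, query_length, key_length):
    """Pick the part of a (1, 1, P, P) mask buffer used in one forward pass."""
    return bias[:, :, key_length - query_length:key_length, :key_length]


max_positions = 6  # toy value; real models use a much larger context length

# GPT-J-style default: lower-triangular (causal) mask.
causal_bias = torch.tril(
    torch.ones(max_positions, max_positions, dtype=torch.bool)
).view(1, 1, max_positions, max_positions)

# Hypothetical alternative: an all-True buffer, so nothing is masked
# within a single forward pass.
full_bias = torch.ones(1, 1, max_positions, max_positions, dtype=torch.bool)

# Prompt of 4 tokens processed in one pass (no cache yet).
prompt_causal = slice_mask(causal_bias, 4, 4)[0, 0]
prompt_full = slice_mask(full_bias, 4, 4)[0, 0]

# One new token decoded with 4 cached positions.
step_causal = slice_mask(causal_bias, 1, 5)[0, 0]
step_full = slice_mask(full_bias, 1, 5)[0, 0]

print(prompt_causal)
print(prompt_full)
print(torch.equal(step_causal, step_full))
```

Under these assumptions the two buffers differ only on the prompt pass. Once tokens are decoded one at a time with a key/value cache, the newest token attends to every earlier position in both cases.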

ericanthonymitchell

Mar 8, 2023

I was confused because I didn't realize that the attention_mask is actually a PyTorch registered buffer, i.e., part of the weights checkpoint; it's not controlled in code. The mask is in model.transformer.h[i].attn.bias.data[:].
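To illustrate the point about registered buffers, here is a small sketch using a toy module rather than the real GPT-J attention class. ToyAttention, its size of 4, and the all-True checkpoint are hypothetical. The only claim is the general PyTorch behavior: a persistent buffer is saved in state_dict() and is replaced by whatever the checkpoint contains.

```python
import torch
from torch import nn


class ToyAttention(nn.Module):
    """Hypothetical stand-in for an attention layer that stores its mask as a buffer."""

    def __init__(self, max_positions=4):
        super().__init__()
        mask = torch.tril(torch.ones(max_positions, max_positions, dtype=torch.bool))
        # Buffers are persistent by default, so this one is saved in state_dict().
        self.register_buffer("bias", mask.view(1, 1, max_positions, max_positions))


layer = ToyAttention()
print("bias" in layer.state_dict())  # the mask travels with the weights

# Pretend this came from a checkpoint whose stored mask is all True.
checkpoint = {"bias": torch.ones(1, 1, 4, 4, dtype=torch.bool)}
layer.load_state_dict(checkpoint)

# The causal mask built in __init__ has been replaced by the checkpoint's mask.
mask_after_load = layer.bias[0, 0]
print(mask_after_load)
```

A buffer registered with persistent=False is not part of state_dict() and would keep the value built in code, so it may be worth checking how the bias buffer is registered in the transformers version you have installed.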

My simple sampling loop looks like this, for reference:

import torch

@torch.no_grad()
def gptjt_sample(model, tokenizer, prompt_text, max_length=100, eos_token_id=None, do_sample=False):
    # max_length is the number of new tokens to generate, not the total sequence length.
    # Run on whichever device holds the model weights.
    dev = next(model.parameters()).device
    input_ids = tokenizer(prompt_text, return_tensors='pt').input_ids.to(dev)
    past_key_values = None
    output_ids = input_ids
    for _ in range(max_length):
        # The first step feeds the whole prompt; later steps feed only the newest
        # token and reuse the cached keys/values.
        step_input = output_ids[:, -1:] if past_key_values is not None else output_ids
        outputs = model(step_input, use_cache=True, past_key_values=past_key_values)
        past_key_values = outputs.past_key_values

        next_token_logits = outputs.logits[:, -1, :]
        if do_sample:
            next_token = torch.multinomial(torch.softmax(next_token_logits, dim=-1), num_samples=1)
        else:
            next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
        output_ids = torch.cat([output_ids, next_token], dim=-1)
        # Assumes a batch of one prompt, so next_token holds a single id.
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

ericanthonymitchell changed discussion status to closed Mar 8, 2023

juewang

Together org Mar 10, 2023

Yeah, you are right, attention_mask is a registered buffer and will be overwritten after loading the ckpt.
