Poor performance?

by Fionn - opened Jul 4, 2023

Discussion

Fionn

Jul 4, 2023

Saw some reports that this performs better than Falcon-7B. Was interested to try that out!

Unfortunately, in a handful of tests the performance seems quite poor. For example, the prompt: "Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral': Tweet: I can say that there isn't anything I would change. " returns

"  Tweet: @jessicajayne haha yay for us!  Tweet: @gabrielladixon I have no idea how you do it.  Tweet: @jessicajayne haha yay for us!  Tweet: @gabrielladixon I have no idea how you do it.  Tweet: @gabrielladixon I have no idea how you do it.  Tweet: @gabrielladixon I have no idea how you do it.  Tweet: @gabrielladixon I"

Using the following parameters:

"parameters": {
    "max_new_tokens": 128,
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50
}

Is this expected or unexpected?

juewang

Together org Jul 4, 2023

@Fionn Thank you for your interest! In general, this is not expected. Here are some tips that should help improve the model's output:

Always append "Label:", "Output:", or "Answer:" at the end of the prompt. This helps the model understand that it needs to provide the answer instead of completing the input tweet.
Feel free to use newlines to separate instructions, input, and output for better organization.

Based on these tips, you can format your input as follows:

Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral'.

Tweet: I can say that there isn't anything I would change.
Label:

This formatting will prompt the model to provide a label for the given tweet, which should be positive.

Additionally, it would be very helpful to include examples to help the model better understand what you're looking for. For example:

Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral'.

Tweet: The weather is good.
Label: positive

Tweet: I can say that there isn't anything I would change.
Label:
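The formatting advice above can be sketched as a small prompt builder. This is only an illustration, not part of any library: `build_prompt` and its arguments are hypothetical names. It joins the instruction, any labeled examples, and the final tweet with blank lines, and ends the prompt exactly at "Label:" with no trailing space or newline.

```python
def build_prompt(instruction, examples, query, input_key="Tweet", output_key="Label"):
    """Assemble a few-shot prompt that ends exactly at the output key.

    build_prompt is a hypothetical helper, not a transformers API.
    examples is a list of (text, label) pairs; query is the text to classify.
    """
    blocks = [instruction.strip()]
    for text, label in examples:
        blocks.append(f"{input_key}: {text.strip()}\n{output_key}: {label.strip()}")
    # The last block leaves the label empty so the model fills it in.
    blocks.append(f"{input_key}: {query.strip()}\n{output_key}:")
    # Blank lines separate the blocks; nothing follows the final colon.
    return "\n\n".join(blocks)


prompt = build_prompt(
    "Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral'.",
    [("The weather is good.", "positive")],
    "I can say that there isn't anything I would change. ",  # stray space is stripped
)
print(prompt)
```

Passing an empty list for `examples` gives the zero-shot form of the prompt shown earlier.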

Fionn

Jul 4, 2023

Thanks for the quick and detailed response @juewang !

I tested with your feedback, but unfortunately, it's still quite poor. For me, inputting the second example you gave returns:

{'generated_text': "\n    Tweet: @marcus I'm glad you're feeling better!\n    Label: positive\n    \n    Tweet: @kylegriffin1  you're not going to be happy until you've turned the whole world into a bunch of democrats.\n    Label: negative\n    \n    Tweet: @TheTweetOfGod i don't have twitter, but i did read the article and it was very good!\n    Label: positive\n    \n    Tweet: @DjThunderLips I think you should try it. I've never been to one, but"}

For reference, I'm running this using the HuggingFace text inference docker.

juewang

Together org Jul 4, 2023

@Fionn
The code snippet below works for me:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('togethercomputer/RedPajama-INCITE-7B-Instruct', torch_dtype=torch.float16).to('cuda:0')
tokenizer = AutoTokenizer.from_pretrained('togethercomputer/RedPajama-INCITE-7B-Instruct')

inputs = tokenizer("""Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral'.

Tweet: The weather is good.
Label: positive

Tweet: I can say that there isn't anything I would change.
Label:""", return_tensors='pt').to(model.device)

output = model.generate(**inputs, max_new_tokens=32)[0, inputs.input_ids.size(1):]
print(tokenizer.decode(output))
# ==>
'''
 positive

Tweet: @jennifer_truax I'm so sorry.  I hope you feel better soon.  I'm glad you
'''

Can you check that there are no extra spaces or "\n" at the end of the prompt? Trailing whitespace is very harmful for models that use a BPE tokenizer, because it changes how the end of the prompt is tokenized.
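One way to guard against this is to normalize the prompt before sending it. The sketch below rests on an assumption about what went wrong: the "\n    " runs in the output posted earlier suggest, but do not prove, that the prompt carried indentation from a triple-quoted string. `normalize_prompt` and `first_label` are hypothetical helper names, not library functions. `first_label` keeps only the first line of the completion, since the model keeps generating further tweets after the label.

```python
import textwrap


def normalize_prompt(raw):
    """Drop shared leading indentation and any leading or trailing whitespace.

    Hypothetical helper: the goal is a prompt that ends exactly at "Label:".
    """
    return textwrap.dedent(raw).strip()


def first_label(completion):
    """Return the first non-empty line of a completion, or '' if there is none."""
    stripped = completion.strip()
    return stripped.splitlines()[0].strip() if stripped else ""


# An indented prompt with a trailing space and newline, written with explicit
# escapes so the whitespace is visible.
raw = (
    "\n"
    "    Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral'.\n"
    "\n"
    "    Tweet: The weather is good.\n"
    "    Label: positive\n"
    "\n"
    "    Tweet: I can say that there isn't anything I would change.\n"
    "    Label: \n"
)

cleaned = normalize_prompt(raw)
print(repr(cleaned[-7:]))
print(first_label(" positive\n\nTweet: @jennifer_truax I'm so sorry."))
```

Note that `textwrap.dedent` only removes indentation shared by every non-blank line, so a prompt whose first line is flush left while later lines are indented would need a different cleanup.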

Fionn

Jul 4, 2023

Thanks @juewang. As I said, I'm using the inference API. I haven't been able to reproduce your results, but I'll keep trying. Thanks for the support!
