Poor performance?
Saw some reports that this performs better than Falcon-7B. Was interested to try that out!
Unfortunately, in a handful of tests the performance seems quite poor. For example, the prompt "Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral': Tweet: I can say that there isn't anything I would change. " returns:
" Tweet: @jessicajayne haha yay for us! Tweet: @gabrielladixon I have no idea how you do it. Tweet: @jessicajayne haha yay for us! Tweet: @gabrielladixon I have no idea how you do it. Tweet: @gabrielladixon I have no idea how you do it. Tweet: @gabrielladixon I have no idea how you do it. Tweet: @gabrielladixon I"
Using the following parameters:
"parameters": {
    "max_new_tokens": 128,
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50
}
Is this expected or unexpected?
@Fionn Thank you for your interest! In general, this is not expected. Here are some tips that may help improve the model's output:
- Always append "Label:", "Output:", or "Answer:" at the end of the prompt. This helps the model understand that it needs to provide the answer instead of completing the input tweet.
- Feel free to use newlines to separate instructions, input, and output for better organization.
Based on these tips, you can format your input as follows:
Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral'.
Tweet: I can say that there isn't anything I would change.
Label:
This formatting will prompt the model to provide a label for the given tweet, which should be positive.
Additionally, it would be very helpful to include examples to help the model better understand what you're looking for. For example:
Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral'.
Tweet: The weather is good.
Label: positive
Tweet: I can say that there isn't anything I would change.
Label:
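The few-shot layout above can also be assembled programmatically. The following is a minimal sketch, not part of the model's tooling: the helper name `build_few_shot_prompt` and its parameters are hypothetical, and it only assumes the format shown above, with one line per field and a prompt that ends exactly with "Label:".

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_name="Tweet", output_name="Label"):
    """Assemble instruction, labeled examples, and the query into one prompt.

    Hypothetical helper: the returned prompt ends with "<output_name>:" so the
    model is asked for the answer rather than a continuation of the input.
    """
    lines = [instruction]
    for text, label in examples:
        lines.append(f"{input_name}: {text}")
        lines.append(f"{output_name}: {label}")
    lines.append(f"{input_name}: {query}")
    lines.append(f"{output_name}:")
    # Join with newlines only; nothing is appended after the final "Label:".
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral'.",
    [("The weather is good.", "positive")],
    "I can say that there isn't anything I would change.",
)
print(prompt)
```

Adding more `(text, label)` pairs to the examples list, ideally covering each of the four labels, is one way to give the model more context.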
Thanks for the quick and detailed response @juewang !
I tested with your feedback, but unfortunately, it's still quite poor. For me, inputting the second example you gave returns:
{'generated_text': "\n Tweet: @marcus I'm glad you're feeling better!\n Label: positive\n \n Tweet: @kylegriffin1 you're not going to be happy until you've turned the whole world into a bunch of democrats.\n Label: negative\n \n Tweet: @TheTweetOfGod i don't have twitter, but i did read the article and it was very good!\n Label: positive\n \n Tweet: @DjThunderLips I think you should try it. I've never been to one, but"}
For reference, I'm running this using the HuggingFace text inference Docker image.
@Fionn
The code snippet below works for me:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('togethercomputer/RedPajama-INCITE-7B-Instruct', torch_dtype=torch.float16).to('cuda:0')
tokenizer = AutoTokenizer.from_pretrained('togethercomputer/RedPajama-INCITE-7B-Instruct')
inputs = tokenizer("""Label the tweets as either 'positive', 'negative', 'mixed', or 'neutral'.
Tweet: The weather is good.
Label: positive
Tweet: I can say that there isn't anything I would change.
Label:""", return_tensors='pt').to(model.device)
output = model.generate(**inputs, max_new_tokens=32)[0, inputs.input_ids.size(1):]
print(tokenizer.decode(output))
# ==>
'''
positive
Tweet: @jennifer_truax I'm so sorry. I hope you feel better soon. I'm glad you
'''
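As the output above shows, the model gives the label first and then keeps generating further tweets until it runs out of tokens. One possible way to handle this is to keep only the first non-empty line of the generation. This is a sketch under that assumption; `extract_label` and `ALLOWED_LABELS` are hypothetical names, not part of transformers.

```python
ALLOWED_LABELS = ("positive", "negative", "mixed", "neutral")


def extract_label(generated_text, allowed=ALLOWED_LABELS):
    """Return the label on the first non-empty line, or None if it is not a known label."""
    for line in generated_text.splitlines():
        candidate = line.strip().lower()
        if candidate:
            # Only the first non-empty line is considered; the rest is ignored.
            return candidate if candidate in allowed else None
    return None


sample = " positive\nTweet: @jennifer_truax I'm so sorry. I hope you feel better soon. I'm glad you"
print(extract_label(sample))  # positive
```

Returning None for anything outside the allowed set makes it easy to spot cases where the model continued the tweet list instead of answering.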
Can you check that there are no extra spaces or "\n" at the end of the prompt? They are very harmful for BPE tokenizer-based models.
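A quick way to run that check before sending a prompt is sketched below. The helper names `trailing_whitespace` and `clean_prompt` are hypothetical, and the sketch assumes that stripping all trailing whitespace (spaces, tabs, newlines) is what you want for this prompt format.

```python
def trailing_whitespace(prompt):
    """Return the run of whitespace characters at the end of the prompt ('' if none)."""
    return prompt[len(prompt.rstrip()):]


def clean_prompt(prompt):
    """Remove trailing spaces, tabs, and newlines so the prompt ends on 'Label:'."""
    return prompt.rstrip()


good = "Tweet: The weather is good.\nLabel:"
bad = good + " \n"

print(repr(trailing_whitespace(good)))  # ''
print(repr(trailing_whitespace(bad)))   # ' \n'
print(clean_prompt(bad) == good)        # True
```

Printing the `repr` makes invisible characters such as a trailing space or "\n" visible, which helps when the prompt passes through a JSON payload or a shell command before it reaches the model.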