Instructions to use togethercomputer/StripedHyena-Nous-7B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use togethercomputer/StripedHyena-Nous-7B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="togethercomputer/StripedHyena-Nous-7B", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("togethercomputer/StripedHyena-Nous-7B", trust_remote_code=True, dtype="auto")
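A quick way to sanity-check the pipeline is to generate a short completion. A minimal sketch; the prompt and sampling settings are illustrative, and trust_remote_code=True is required because StripedHyena ships custom modeling code:

from transformers import pipeline

# StripedHyena ships custom modeling code, so trust_remote_code is required
pipe = pipeline(
    "text-generation",
    model="togethercomputer/StripedHyena-Nous-7B",
    trust_remote_code=True,
    device_map="auto",
)

# Illustrative prompt and sampling settings
out = pipe("Once upon a time,", max_new_tokens=64, do_sample=True, temperature=0.5)
print(out[0]["generated_text"])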
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use togethercomputer/StripedHyena-Nous-7B with vLLM:
Install from pip and serve the model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "togethercomputer/StripedHyena-Nous-7B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "togethercomputer/StripedHyena-Nous-7B",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
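Since the server exposes an OpenAI-compatible API, it can also be called from Python. A minimal sketch using the openai client; the base URL matches the server started above, and the api_key value is a placeholder (vLLM ignores it by default):

from openai import OpenAI

# Point the client at the local vLLM server; a key is required by the client but unused
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="togethercomputer/StripedHyena-Nous-7B",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)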
Use Docker
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model "togethercomputer/StripedHyena-Nous-7B"
- SGLang
How to use togethercomputer/StripedHyena-Nous-7B with SGLang:
Install from pip and serve the model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "togethercomputer/StripedHyena-Nous-7B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "togethercomputer/StripedHyena-Nous-7B",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
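The same endpoint can be called from Python as well; a minimal sketch with the requests library, assuming the server above is running on port 30000:

import requests

# POST to the OpenAI-compatible completions endpoint of the local SGLang server
response = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "togethercomputer/StripedHyena-Nous-7B",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5,
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])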
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "togethercomputer/StripedHyena-Nous-7B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "togethercomputer/StripedHyena-Nous-7B",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
- Docker Model Runner
How to use togethercomputer/StripedHyena-Nous-7B with Docker Model Runner:
docker model run hf.co/togethercomputer/StripedHyena-Nous-7B
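Docker Model Runner also exposes an OpenAI-compatible API. A minimal Python sketch; the base URL below (port 12434, path /engines/v1) assumes host-side TCP access is enabled with the default port, which may differ in your setup:

from openai import OpenAI

# Assumed default Model Runner endpoint; adjust if TCP access uses another port
client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="unused")

completion = client.chat.completions.create(
    model="hf.co/togethercomputer/StripedHyena-Nous-7B",
    messages=[{"role": "user", "content": "Once upon a time,"}],
)
print(completion.choices[0].message.content)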
Error loading model
#1 by NickyNicky - opened
%%time
!pip install git+https://github.com/huggingface/transformers -qqq
# trl
!pip install git+https://github.com/huggingface/trl -qqq
!pip install datasets peft accelerate safetensors --upgrade -qqq
!pip install ninja packaging --upgrade -qqq
!pip install sentencepiece bitsandbytes -qqq
!pip install -U xformers deepspeed -qqq
!pip install attention_sinks -qqq
!python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"
# Use %env so CUDA_HOME persists for later cells; a plain !export only affects its own subshell
%env CUDA_HOME=/usr/local/cuda-11.8
!MAX_JOBS=4 pip install flash-attn --no-build-isolation -qqq
# !pip install git+"https://github.com/Dao-AILab/flash-attention.git"
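After the installs, a quick sanity check that flash-attn built and the GPU is supported can catch problems before loading the model; a minimal sketch:

import torch
import flash_attn

# Flash Attention 2 needs an Ampere-or-newer GPU (compute capability >= 8)
major, _ = torch.cuda.get_device_capability()
print("flash_attn version:", flash_attn.__version__)
print("compute capability >= 8:", major >= 8)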
Load the model
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
HfArgumentParser,
TrainingArguments,
pipeline,
logging,
GenerationConfig,
TextIteratorStreamer,
)
# from attention_sinks import AutoModelForCausalLM
import torch
model_ID_1 = "togethercomputer/StripedHyena-Nous-7B"

model = AutoModelForCausalLM.from_pretrained(
    model_ID_1,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True,  # True, False
    low_cpu_mem_usage=True,
)
max_length = 2048  # get_max_length()
print("max_length", max_length)

tokenizer = AutoTokenizer.from_pretrained(
    model_ID_1,
    # use_fast=False,  # True, False
    max_length=max_length,
)
Error:
Can you share your environment? I repeated your steps and loaded the model correctly without any issues.
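For reference, a minimal load-and-generate sketch along the lines of the steps above, assuming the model loads cleanly (prompt and generation settings are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/StripedHyena-Nous-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Illustrative prompt; decode the completion back to text
inputs = tokenizer("Once upon a time,", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))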
Zymrael changed discussion status to closed