Any chance of a 16K model?
Thank you for your wonderful work. Unfortunately, on Linux, even with FlashAttention-2 (FA2), the 128g GPTQ version of this model cannot be loaded at 32K context on 2x3090s. Are there any plans to train a 16K version that would be usable for a broader audience? Truncating max_seq_length to 16K on load seems to degrade performance. I'm going to quantize to EXL2 format so that I can load at 32K, but that will mean a very low bitrate.
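For a rough sense of why 32K is tight on two 24 GB cards, here is a back-of-envelope estimate. It is only a sketch: the architecture numbers are the published Llama 2 70B shape, while the FP16 cache, the nominal 70B parameter count, and the flat 4 bits per weight are assumptions, and activations and loader overhead are ignored entirely.

```python
# Back-of-envelope VRAM estimate. All figures are assumptions, not measurements.
LAYERS = 80        # Llama 2 70B decoder layers
KV_HEADS = 8       # grouped-query attention key/value heads
HEAD_DIM = 128     # dimension per attention head
CACHE_BYTES = 2    # assumed FP16 cache entries
PARAMS = 70e9      # nominal parameter count (assumed round figure)
WEIGHT_BITS = 4    # assumed; ignores GPTQ group-size metadata

GIB = 2**30


def kv_cache_gib(seq_len: int) -> float:
    """Size of the key/value cache for one sequence, in GiB."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * CACHE_BYTES  # keys + values
    return per_token * seq_len / GIB


def weights_gib() -> float:
    """Approximate size of the quantized weights, in GiB."""
    return PARAMS * WEIGHT_BITS / 8 / GIB


for ctx in (16 * 1024, 32 * 1024):
    total = weights_gib() + kv_cache_gib(ctx)
    print(f"{ctx} tokens: cache {kv_cache_gib(ctx):.1f} GiB, total {total:.1f} GiB")
```

Under these assumptions the cache alone is about 10 GiB at 32K, on top of roughly 33 GiB of weights, which leaves little of the 48 GB across two 3090s for activations and everything else the loader allocates.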
If you use rope scaling = 8 and max_seq_len = 16K, it should perform like a 16K model (make sure you don't use rope scaling = 4). It flat out beats any 16K fine-tune I've made on raw perplexity at 16K. Maybe with a lot more training at rope scaling = 4, a dedicated 16K model might do better? But I don't think that's worth much: the PPL drops monotonically all the way to 32K at rope scaling = 8.
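To illustrate why the scaling factor should stay at 8 even when the context is capped at 16K, here is a minimal sketch. It assumes the model uses linear RoPE scaling (position interpolation) on top of Llama 2's 4096-token base context, which this thread does not spell out; the function names and the `settings` keys are hypothetical and will differ between loaders.

```python
# Assumption: linear RoPE scaling divides each position index by the
# scaling factor before the rotary angles are computed.
BASE_CONTEXT = 4096  # Llama 2 pretraining context length (assumed base)


def trained_context(rope_scale: float, base_context: int = BASE_CONTEXT) -> int:
    """Context length that a given linear scaling factor stretches the base to."""
    return int(base_context * rope_scale)


def position_step(rope_scale: float) -> float:
    """Spacing between adjacent tokens after interpolation."""
    return 1.0 / rope_scale


def max_effective_position(seq_len: int, rope_scale: float) -> float:
    """Largest interpolated position index seen for a sequence of seq_len tokens."""
    return (seq_len - 1) / rope_scale


# Hypothetical loader settings matching the advice above.
settings = {"rope_scale": 8.0, "max_seq_len": 16 * 1024}

for scale in (8.0, 4.0):
    print(
        f"scale={scale}: trained context {trained_context(scale)}, "
        f"step {position_step(scale)}, "
        f"max position at 16K {max_effective_position(16 * 1024, scale)}"
    )
```

With a factor of 8, a 16K prompt keeps the 0.125 token spacing the model was fine-tuned on and simply uses the lower half of the position range. With a factor of 4, the spacing becomes 0.25, which under these assumptions the fine-tuned model never saw.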
Interesting, thank you. I'll give that a go. I was trying 16K at a scaling factor of 4.