VRAM requirements for full size
Does anyone know how much VRAM we need to run this? Will it run OK on an RTX 6000 Ada 48GB with 100k-ish context?
The full 36B model in BF16 is about 70GB for the weights alone, so it won't fit in 48GB, but a quant will. I am running a 4.22bpw EXL3 quant with 150k of Q8 context on 2x 3090 Ti 24GB with tensor parallel, and it works alright. You can try that, or GGUF quants up to q6_k, or sglang/vllm with FP8, GPTQ 8-bit, or AWQ/GPTQ 4-bit quants. With 4.22bpw EXL3 and a Q4 KV cache you should be able to push up to 300-350k context. I've had normal chats with it up to about 120k context so far and it was perfectly stable.
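For anyone who wants to redo this arithmetic for their own card, here is a rough back-of-the-envelope sketch. The weight size follows directly from parameter count times bits per weight. The KV cache part assumes 64 layers, 8 KV heads, and a head dimension of 128; those architecture values are my assumption and should be checked against the model's config.json. The helper names are made up for illustration, and the numbers ignore activations, quantization scales, and framework overhead, so treat them as a lower bound.

```python
GB = 1e9  # decimal gigabytes, to match the "about 70GB" figure above


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_value: float) -> float:
    """Approximate KV cache size; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bits_per_value / 8


N_PARAMS = 36e9
# ASSUMPTION: architecture values, verify against config.json before relying on them.
ARCH = dict(n_layers=64, n_kv_heads=8, head_dim=128)

bf16_gb = weight_bytes(N_PARAMS, 16) / GB      # full-precision weights
exl3_gb = weight_bytes(N_PARAMS, 4.22) / GB    # 4.22bpw quant
kv_q8_150k_gb = kv_cache_bytes(150_000, bits_per_value=8, **ARCH) / GB
kv_q4_350k_gb = kv_cache_bytes(350_000, bits_per_value=4, **ARCH) / GB

print(f"BF16 weights:        {bf16_gb:.2f} GB")
print(f"4.22bpw weights:     {exl3_gb:.2f} GB")
print(f"Q8 KV cache at 150k: {kv_q8_150k_gb:.2f} GB")
print(f"Q4 KV cache at 350k: {kv_q4_350k_gb:.2f} GB")
```

Under those assumptions, the 4.22bpw weights plus a Q4 cache at 350k tokens come to roughly 42GB, which is in line with the 300-350k estimate for 48GB total.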
I can fit 100K context in 24GB with exllamav3.
In fact, you'd have room to batch calls in parallel with 48GB if you wish.
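To get a feel for how much batching headroom that leaves, here is a rough sketch. Everything in it is an assumption for illustration: the post does not say which quant or cache precision was used, so I plug in a hypothetical 3.0bpw quant, a 4-bit KV cache at about 64 KiB per token (64 layers x 8 KV heads x 128 head dim, to be checked against config.json), and a guessed 2GB of runtime overhead. The function name is made up.

```python
def max_parallel_sequences(vram_bytes: float, weights_bytes: float,
                           kv_bytes_per_token: float, context_tokens: int,
                           overhead_bytes: float = 2e9) -> int:
    """How many full-length sequences fit after weights and overhead."""
    free = vram_bytes - weights_bytes - overhead_bytes
    per_sequence = kv_bytes_per_token * context_tokens
    return max(0, int(free // per_sequence))


# HYPOTHETICAL inputs, not taken from the thread:
weights = 36e9 * 3.0 / 8   # 3.0bpw quant of a 36B model, about 13.5 GB
kv_per_token = 65536       # 4-bit K+V per token under the assumed architecture

on_24gb = max_parallel_sequences(24e9, weights, kv_per_token, 100_000)
on_48gb = max_parallel_sequences(48e9, weights, kv_per_token, 100_000)
print(on_24gb, on_48gb)
```

With these inputs, one 100K sequence fits on 24GB and several fit on 48GB; the exact count depends heavily on the quant and cache settings you pick.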