Instructions for using Trelis/openchat_3.5-function-calling-v3 with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use Trelis/openchat_3.5-function-calling-v3 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Trelis/openchat_3.5-function-calling-v3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Trelis/openchat_3.5-function-calling-v3")
model = AutoModelForCausalLM.from_pretrained("Trelis/openchat_3.5-function-calling-v3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
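Because this checkpoint is a function-calling fine-tune, you will usually want to describe the available functions in the prompt. A minimal sketch follows; the `get_weather` function and the OpenAI-style JSON schema are illustrative assumptions, since the exact metadata format this model was trained on is defined in its model card:

```python
# Illustrative only: the precise function-metadata format this model expects
# is specified in its model card; the schema below is an assumption.
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Trelis/openchat_3.5-function-calling-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical function description in OpenAI-style JSON schema.
functions = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

# Embed the metadata in the user turn to stay chat-template-agnostic.
messages = [
    {
        "role": "user",
        "content": "You have access to the following functions:\n"
        + json.dumps(functions, indent=2)
        + "\n\nWhat's the weather in Paris?",
    },
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```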
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Trelis/openchat_3.5-function-calling-v3 with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Trelis/openchat_3.5-function-calling-v3"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Trelis/openchat_3.5-function-calling-v3",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
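Once the server is running, any OpenAI-compatible client can talk to it. A sketch using the official `openai` Python package (assumes `pip install openai`; the API key can be any placeholder unless you started the server with `--api-key`):

```python
# Point the OpenAI client at the local vLLM server (OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Trelis/openchat_3.5-function-calling-v3",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```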
Use Docker

```shell
docker model run hf.co/Trelis/openchat_3.5-function-calling-v3
```
- SGLang
How to use Trelis/openchat_3.5-function-calling-v3 with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Trelis/openchat_3.5-function-calling-v3" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Trelis/openchat_3.5-function-calling-v3",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
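The OpenAI-compatible endpoint also supports streaming, which is useful for chat UIs. A sketch with the `openai` Python client pointed at the SGLang port (again assuming `pip install openai`):

```python
# Stream tokens from the local SGLang server as they are generated.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Trelis/openchat_3.5-function-calling-v3",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```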
Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Trelis/openchat_3.5-function-calling-v3" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Trelis/openchat_3.5-function-calling-v3",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
- Docker Model Runner
How to use Trelis/openchat_3.5-function-calling-v3 with Docker Model Runner:
```shell
docker model run hf.co/Trelis/openchat_3.5-function-calling-v3
```
Looking for the right graphics card for this model
I'm looking at:
EVGA GeForce RTX 3090 FTW3 Ultra Gaming, 24GB GDDR6X, iCX3 Technology, ARGB LED, Metal Backplate - pretty expensive...
Or this one, quite a bit less expensive...
RTX 4060 Ti 16GB:
Powered by NVIDIA DLSS 3, ultra-efficient Ada Lovelace architecture, and full ray tracing
4th Generation Tensor Cores: Up to 4x performance with DLSS 3
3rd Generation RT Cores: Up to 2x ray tracing performance
Powered by GeForce RTX 4060 Ti
Integrated with 16GB GDDR6 128-bit memory interface
WINDFORCE Cooling System, Protection metal back plate
Graphics cards are super confusing to me, since manufacturers use all these different designations with different amounts of memory and tensor cores.
But since I'm putting together a build for it, do you have a recommendation?
I want the best function-calling responses, and it's important that when no function is required, the model just answers. I won't be serving more than 10 concurrent requests, and even then, might it be cheaper to build more servers with lower-VRAM cards?
Lastly, does the rest of the computer really matter? I'll probably have PCIe 3, DDR4 (32 GB), and a 9th-gen Intel i7.
Howdy! Base computer shouldn't matter too much.
If you go with a 16 GB GPU, it will just about run a 7B model in 16-bit precision. But if you run Text Generation Inference, you can run in 8-bit with EETQ, which is fast and will halve your memory requirement.
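Rough arithmetic behind that sizing, as a sketch (rule-of-thumb for weights only; it ignores KV-cache and activation overhead, which grow with context length and concurrency):

```python
# Back-of-the-envelope VRAM estimate for model weights alone.
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(f"7B @ fp16: {weight_vram_gb(7, 2):.1f} GB")  # ~13.0 GB: tight on a 16 GB card
print(f"7B @ int8: {weight_vram_gb(7, 1):.1f} GB")  # ~6.5 GB: comfortable headroom
```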
I haven't dug too deep nor run my own GPU at home (other than a Mac), but your cheaper choice seems OK. Check out r/LocalLLaMA on Reddit for info on GPUs: The LLM GPU Buying Guide - August 2023 : r/LocalLLaMA
This is very helpful. I've ordered a used 3090 from eBay.