Instructions for using Trelis/openchat_3.5-function-calling-v3 with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use Trelis/openchat_3.5-function-calling-v3 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Trelis/openchat_3.5-function-calling-v3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Trelis/openchat_3.5-function-calling-v3")
model = AutoModelForCausalLM.from_pretrained("Trelis/openchat_3.5-function-calling-v3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
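Because this checkpoint is a function-calling fine-tune, you will usually want to describe the available functions in the prompt. A minimal sketch follows; the `get_weather` function and the OpenAI-style JSON schema are illustrative assumptions, since the exact metadata format this model was trained on is defined in its model card:

```python
# Illustrative only: the precise function-metadata format this model expects
# is specified in its model card; the schema below is an assumption.
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Trelis/openchat_3.5-function-calling-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical function description in OpenAI-style JSON schema.
functions = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

# Embed the metadata in the user turn to stay chat-template-agnostic.
messages = [
    {
        "role": "user",
        "content": "You have access to the following functions:\n"
        + json.dumps(functions, indent=2)
        + "\n\nWhat's the weather in Paris?",
    },
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```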
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Trelis/openchat_3.5-function-calling-v3 with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Trelis/openchat_3.5-function-calling-v3"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Trelis/openchat_3.5-function-calling-v3",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
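Once the server is running, any OpenAI-compatible client can talk to it. A sketch using the official `openai` Python package (assumes `pip install openai`; the API key can be any placeholder unless you started the server with `--api-key`):

```python
# Point the OpenAI client at the local vLLM server (OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Trelis/openchat_3.5-function-calling-v3",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```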
Use Docker

```shell
docker model run hf.co/Trelis/openchat_3.5-function-calling-v3
```
- SGLang
How to use Trelis/openchat_3.5-function-calling-v3 with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Trelis/openchat_3.5-function-calling-v3" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Trelis/openchat_3.5-function-calling-v3",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
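The OpenAI-compatible endpoint also supports streaming, which is useful for chat UIs. A sketch with the `openai` Python client pointed at the SGLang port (again assuming `pip install openai`):

```python
# Stream tokens from the local SGLang server as they are generated.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Trelis/openchat_3.5-function-calling-v3",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```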
Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Trelis/openchat_3.5-function-calling-v3" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Trelis/openchat_3.5-function-calling-v3",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
- Docker Model Runner
How to use Trelis/openchat_3.5-function-calling-v3 with Docker Model Runner:
```shell
docker model run hf.co/Trelis/openchat_3.5-function-calling-v3
```
Looking for the right graphics card for this model
I'm looking at:
EVGA GeForce RTX 3090 FTW3 Ultra Gaming, 24GB GDDR6X, iCX3 Technology, ARGB LED, Metal Backplate - pretty expensive...
Or this one, quite a bit less expensive...
RTX 4060 Ti 16GB:
Powered by NVIDIA DLSS 3, ultra-efficient Ada Lovelace architecture, and full ray tracing
4th Generation Tensor Cores: Up to 4x performance with DLSS 3
3rd Generation RT Cores: Up to 2x ray tracing performance
Powered by GeForce RTX 4060 Ti
Integrated with 16GB GDDR6 128-bit memory interface
WINDFORCE Cooling System, Protection metal back plate
Graphics cards are super confusing to me, since manufacturers use all these different designations with different amounts of memory and tensor cores.
But since I'm putting together a build for it, do you have a recommendation?
I want the best function-calling responses, and it's important that when no function is required, the model just answers. I won't be serving more than 10 concurrent requests, and even then, might it be cheaper to build more servers with lower-VRAM cards?
Lastly, does the rest of the computer really matter? I'll probably have PCIe 3, DDR4 (32 GB), and a 9th-gen Intel i7.
Howdy! Base computer shouldn't matter too much.
If you go with a 16 GB GPU, it will just about run a 7B model in 16-bit precision. But if you run Text Generation Inference, you can run in 8-bit with EETQ, which is fast and will halve your memory requirement.
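Rough arithmetic behind that sizing, as a sketch (rule-of-thumb for weights only; it ignores KV-cache and activation overhead, which grow with context length and concurrency):

```python
# Back-of-the-envelope VRAM estimate for model weights alone.
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(f"7B @ fp16: {weight_vram_gb(7, 2):.1f} GB")  # ~13.0 GB: tight on a 16 GB card
print(f"7B @ int8: {weight_vram_gb(7, 1):.1f} GB")  # ~6.5 GB: comfortable headroom
```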
I haven't dug too deep nor run my own GPU at home (other than a Mac), but your cheaper choice seems OK. Check out r/LocalLLaMA on Reddit for info on GPUs: The LLM GPU Buying Guide - August 2023 : r/LocalLLaMA
This is very helpful. I've ordered a used 3090 from eBay.