Instructions to use ToolBench/ToolLLaMA-2-7b-v2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ToolBench/ToolLLaMA-2-7b-v2 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="ToolBench/ToolLLaMA-2-7b-v2")# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("ToolBench/ToolLLaMA-2-7b-v2") model = AutoModelForCausalLM.from_pretrained("ToolBench/ToolLLaMA-2-7b-v2") - Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use ToolBench/ToolLLaMA-2-7b-v2 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "ToolBench/ToolLLaMA-2-7b-v2" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "ToolBench/ToolLLaMA-2-7b-v2", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/ToolBench/ToolLLaMA-2-7b-v2
- SGLang
How to use ToolBench/ToolLLaMA-2-7b-v2 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "ToolBench/ToolLLaMA-2-7b-v2" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "ToolBench/ToolLLaMA-2-7b-v2", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "ToolBench/ToolLLaMA-2-7b-v2" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "ToolBench/ToolLLaMA-2-7b-v2", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use ToolBench/ToolLLaMA-2-7b-v2 with Docker Model Runner:
docker model run hf.co/ToolBench/ToolLLaMA-2-7b-v2
Error when running inference
#2
by fortiag - opened
I have been trying to do some tests using the following code:
tokenizer = AutoTokenizer.from_pretrained(model_hf_id, cache_dir=cache_path)
model = AutoModelForCausalLM.from_pretrained(model_hf_id, torch_dtype=torch.float16, cache_dir=cache_path)
test = pipeline(model=model, tokenizer=tokenizer)
example = "Hello, how are you?"
print("QUESTION: "+example)
result = test(example)
Unfortunately, when I run the code I get the following error (which seems not to be my fault):
Traceback (most recent call last):
File "/home/eve/Documents/llm-agent-poc/scripts/download_model_hf.py", line 24, in <module>
test = pipeline(model=model, tokenizer=tokenizer)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 801, in pipeline
raise RuntimeError(
RuntimeError: Inferring the task automatically requires to check the hub with a model_id defined as a `str`. LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaSdpaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
) is not a valid model_id.
I Googled the error and found a thread of someone reporting a similar problem with another model. The way to solve the error, they said, was to add a parameter to the pipeline:
test = pipeline(task='text-generation',model=model, tokenizer=tokenizer)
In my case it didn't solve the error, but generated another one:
Traceback (most recent call last):
File "/home/eve/Documents/llm-agent-poc/scripts/download_model_hf.py", line 28, in <module>
result = test(example)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 208, in __call__
return super().__call__(text_inputs, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1140, in __call__
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1147, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1046, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 271, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
return self.greedy_search(
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
outputs = self(
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
outputs = self.model(
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1068, in forward
layer_outputs = decoder_layer(
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 796, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 691, in forward
query_states = self.q_proj(hidden_states)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eve/Documents/llm-agent-poc/.venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'