Enhancement Request: Model Sharding for ToolLLaMA-2-7b-v2 for Better Accessibility
Hello ToolBench Community,
I hope this message finds you well. I am reaching out with a suggestion that could make the ToolLLaMA-2-7b-v2 model accessible to a much broader audience. As it stands, running a model of this size requires high-spec hardware, which not all users have.
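To put rough numbers on the hardware requirement, here is a back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. The parameter count is an assumption (a round 7 billion, taken from the "7b" in the model name), and `weight_memory_gib` is a hypothetical helper written for this post, not part of any library. Activations, the KV cache, and framework overhead are not included.

```python
# Rough weight-memory estimate. The parameter count is an assumption
# based on the "7b" in the model name, not a measured value.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


ASSUMED_PARAMS = 7_000_000_000  # hypothetical round figure

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(ASSUMED_PARAMS, dtype):.1f} GiB")
```

Under that assumption, half-precision weights come to about 13 GiB, which is already more system RAM than many consumer machines can spare.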
To address this, I propose sharding the ToolLLaMA-2-7b-v2 model. Sharding would split the checkpoint into smaller, more manageable files that can be loaded sequentially or spread across devices, which would put less strain on lower-spec PCs.
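To make the idea concrete, here is a minimal sketch of how a checkpoint could be divided into shards: walk the tensors in order and start a new file whenever the next tensor would push the current one past a size limit. The function `plan_shards` and the tensor names and sizes are hypothetical and purely illustrative; this is a simplified picture of the general approach, not the implementation used by any particular library.

```python
from typing import Dict, List


def plan_shards(tensor_sizes: Dict[str, int], max_shard_bytes: int) -> List[List[str]]:
    """Greedily group tensors, in order, into shards of at most max_shard_bytes.

    A single tensor larger than the limit gets a shard of its own.
    """
    shards: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would exceed the limit.
        if current and current_bytes + size > max_shard_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        shards.append(current)
    return shards


# Hypothetical tensor names and sizes in bytes, for illustration only.
example = {"embed": 500, "layer0": 300, "layer1": 300, "layer2": 300, "head": 500}
print(plan_shards(example, max_shard_bytes=800))
# → [['embed', 'layer0'], ['layer1', 'layer2'], ['head']]
```

In Transformers, `save_pretrained` accepts a `max_shard_size` argument, so a re-upload with smaller shards should not require custom code. Sharding mainly lowers peak memory while the checkpoint is being loaded. The full set of weights still has to fit in GPU memory, CPU RAM, or disk offload once loading is done.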
Moreover, considering the growing popularity of cloud-based platforms like Google Colab and Kaggle, which provide limited but free access to powerful computational resources, model sharding could also improve the user experience on these platforms. Users could load a sharded model piece by piece and run experiments and larger workloads without hitting the resource limits that often come with free tiers.
By enabling model sharding, we could democratize access to state-of-the-art models and foster greater experimentation and inclusivity within the community.
I would love to hear your thoughts on this proposal or any alternative solutions that could facilitate running large models on less powerful machines or within the resource constraints of popular cloud services.
Thank you for considering this enhancement.