Do you know anything about this error?
Hello, sorry to bother you. Do you know anything about a bug that now occurs in oobabooga after updating it with bitsandbytes 0.39.0? I can no longer use my RTX 3090 GPU. I thought it was the Windows installer, so I downloaded it again and did a clean install, but the error still persists.
bin A:\LLMs_LOCAL\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.dll
R:\LLMs_LOCAL\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\cextension.py:34: User Warning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiply, and GPU quantization are not available.
warn("The installed version of bitsandbytes was compiled without GPU support. "
function 'cadam32bit_grad_fp32' not found
Hmm, sorry, I'm not quite sure why you're getting that. The text-generation-webui installation should take care of it.
Maybe ask on the text-generation-webui Reddit page or Discord? I'm afraid I've never seen that error before.
Ahh thanks for letting me know!
It's working perfectly fine for me on Win 11, RTX 4090 and bitsandbytes 0.39.0. @RedXeol: support for Falcon was just merged into the main branch. Feel free to git pull ;)
Red, try this tutorial: https://agi-sphere.com/text-generation-webui-windows It is highly specific to Windows and text-generation-webui.
I nearly pulled my hair out trying to get past it. I got that error on every try until I discovered the pre-built Windows wheels for bitsandbytes.
Good luck!
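For anyone hitting the same warning, here is a rough sketch of how to check which native bitsandbytes libraries are actually on disk, without importing the package. It assumes the libraries sit in the package directory with names starting with `libbitsandbytes`, as the `bin` line in the log above suggests. The helper names `bitsandbytes_binaries` and `looks_cpu_only` are made up for illustration and are not part of bitsandbytes.

```python
from importlib.util import find_spec
from pathlib import Path


def bitsandbytes_binaries(package_dir=None):
    """Return the sorted file names of native bitsandbytes libraries on disk.

    If package_dir is None, the installed package is located with find_spec,
    which does not import bitsandbytes itself. Returns [] if it is not installed.
    """
    if package_dir is None:
        spec = find_spec("bitsandbytes")
        if spec is None or not spec.submodule_search_locations:
            return []
        package_dir = list(spec.submodule_search_locations)[0]
    # Assumption: native libraries are named libbitsandbytes*, e.g.
    # libbitsandbytes_cpu.dll as seen in the log above.
    return sorted(p.name for p in Path(package_dir).glob("libbitsandbytes*"))


def looks_cpu_only(binaries):
    """True when at least one library was found and all of them are CPU builds."""
    return bool(binaries) and all("_cpu" in name for name in binaries)


if __name__ == "__main__":
    found = bitsandbytes_binaries()
    print("native libraries:", found)
    print("CPU-only install:", looks_cpu_only(found))
```

If this reports a CPU-only install, that would be consistent with the "compiled without GPU support" warning. It does not prove that is the cause, and it does not check whether PyTorch itself can see the GPU.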
