Instructions to use tiiuae/falcon-40b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use tiiuae/falcon-40b with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="tiiuae/falcon-40b", trust_remote_code=True)

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use tiiuae/falcon-40b with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "tiiuae/falcon-40b"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "tiiuae/falcon-40b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
Use Docker
    docker model run hf.co/tiiuae/falcon-40b
- SGLang
How to use tiiuae/falcon-40b with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
      --model-path "tiiuae/falcon-40b" \
      --host 0.0.0.0 \
      --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "tiiuae/falcon-40b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
Use Docker images
    docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HF_TOKEN=<secret>" \
      --ipc=host \
      lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
        --model-path "tiiuae/falcon-40b" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "tiiuae/falcon-40b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
- Docker Model Runner
How to use tiiuae/falcon-40b with Docker Model Runner:
docker model run hf.co/tiiuae/falcon-40b
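For anyone who prefers Python over curl, here is a minimal sketch of the same completions call using only the standard library. It assumes one of the servers above is already running locally; the helper names `build_completion_request` and `complete` are made up for this example, and the default port 8000 matches the vLLM command (pass `port=30000` for SGLang).

```python
import json
import urllib.request


def build_completion_request(prompt, port=8000, model="tiiuae/falcon-40b",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"http://localhost:{port}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first choice.

    Assumes the server answers in the OpenAI completions format.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("Once upon a time,", port=30000)` would send the same payload as the SGLang curl example.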
Custom 4-bit Finetuning 5-7 times faster inference than QLora
Excuse me, I have some questions for you:
- What is the difference between your falcontune and QLoRA?
- What is the difference between fine-tuning (with a new dataset) using bitsandbytes + peft and using your code? Or is your script simply a lighter form of bitsandbytes + peft?
- Can I activate 'nf4' (4-bit NormalFloat) in GPTQ?
I have the same questions!
Doesn't the 40B model require something like 48 GB of VRAM? Also, if anyone reads this, I would really appreciate any insight into cost-efficient, realistic hardware for ML. It seems the cheapest build is somewhere in the neighborhood of $5-6k, and I think I would rather have my own hardware than rely on Amazon/Google/Azure. Thanks.
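A rough back-of-the-envelope estimate may help with the VRAM question. The sketch below only counts memory for the weights, assuming a nominal 40 billion parameters (the exact parameter count is not stated in this thread). Activations, the KV cache, and framework overhead come on top, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: nominal parameter count taken from the "40b" in the model name.
N_PARAMS = 40e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under that assumption the weights need about 80 GB in 16-bit, 40 GB in 8-bit, and 20 GB in 4-bit, so a single 48 GB card is only plausible with quantization.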
Falcon-40B inference in 8-bit takes 45 GB of RAM. On a single RTX A6000 48 GB (not the Ada version) in an AMD EPYC 7713 DDR4 PC, it takes around 4 seconds to generate 20 tokens (words). In 4-bit it takes 25 GB of RAM and 12 seconds for the same 20 tokens; not sure why.
...
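import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit configuration. The bnb_4bit_* options apply only when
# load_in_4bit=True, so they have no effect on this 8-bit run.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# PATH (model directory or repo id) is presumably set in the elided code above.
model = AutoModelForCausalLM.from_pretrained(
    PATH,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)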
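The snippet above shows only the 8-bit configuration; the 4-bit configuration behind the 25 GB / 12 second figures is not given. As an assumption, a 4-bit NF4 run would probably look like the sketch below, which reuses the same options and only switches the load flag. `NF4_KWARGS` and `load_4bit` are hypothetical names, and the compute dtype is passed as a string so the dict can be inspected without importing torch.

```python
# Keyword arguments for a hypothetical 4-bit NF4 run (an assumption, not the
# poster's actual settings), kept as a plain dict so they are easy to inspect.
NF4_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": False,
    "bnb_4bit_compute_dtype": "bfloat16",
}


def load_4bit(path):
    """Load a causal LM in 4-bit NF4.

    Needs transformers, bitsandbytes, accelerate, and a GPU with enough
    memory; the imports are deferred so this module loads without them.
    """
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(**NF4_KWARGS)
    return AutoModelForCausalLM.from_pretrained(
        path,
        device_map="auto",
        trust_remote_code=True,
        quantization_config=bnb_config,
    )
```

Calling `load_4bit("tiiuae/falcon-40b")` would download and quantize the full model, so it is only worth trying on hardware like that described above.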
Can anyone help me, please?
I have text data stored in a .txt file; it is simple information about a technology.
I want to fine-tune the Falcon model and then ask it questions based on that .txt file.
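One common first step for this kind of task is to split the .txt file into overlapping chunks that can serve as training examples. The sketch below covers only that data preparation step, not the fine-tuning itself. The function names, the word-based splitting, and the chunk sizes are all illustrative assumptions; a real pipeline would more likely count tokens with the model's tokenizer.

```python
def chunk_words(text, chunk_size=256, overlap=32):
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def load_chunks(path, **kwargs):
    """Read a UTF-8 text file and return its chunks."""
    with open(path, encoding="utf-8") as handle:
        return chunk_words(handle.read(), **kwargs)
```

For example, `load_chunks("technology.txt", chunk_size=128)` (a hypothetical file name) would return a list of strings ready to be tokenized.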
Falcon-40B inference in 8-bit takes 45 GB of RAM. On a single RTX A6000 48 GB (not the Ada version) in an AMD EPYC 7713 DDR4 PC, it takes around 4 seconds to generate 20 tokens (words). In 4-bit it takes 25 GB of RAM and 12 seconds for the same 20 tokens; not sure why.
I would also love to know why the 4-bit run takes so long.
My main reasons (and, I suspect, many people's) for using GPT alternatives are that they are open source and, hopefully, faster. Reducing the memory footprint while increasing the latency seems like a lateral move.
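To put the numbers reported in this thread side by side, here is a small sketch that converts them to tokens per second, plus a generic timing helper for anyone who wants to measure their own setup. `tokens_per_second` and `time_call` are made-up helper names; the 20-token, 4-second, and 12-second figures come from the post quoted above.

```python
import time


def tokens_per_second(n_tokens, seconds):
    """Throughput for a single generation call."""
    return n_tokens / seconds


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Figures reported in the thread: 20 tokens in 4 s (8-bit) and 12 s (4-bit).
rate_8bit = tokens_per_second(20, 4)
rate_4bit = tokens_per_second(20, 12)
print(f"8-bit: {rate_8bit:.2f} tok/s, 4-bit: {rate_4bit:.2f} tok/s, "
      f"slowdown: {rate_8bit / rate_4bit:.1f}x")
```

That works out to roughly 5 tokens per second in 8-bit against about 1.7 in 4-bit, a threefold slowdown in exchange for roughly 20 GB less memory.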