Hardware Question - Single system or multiple?
Nice work - From the readme,
Training took ~30hrs on 5x3090s and used almost 23gb of vram on each. DDP was used for pytorch parallelism.
I think I know the answer to this, but given that you used DDP, does that mean this was trained across multiple CPUs/systems? (I am on the hunt for a motherboard/system that can support several GPUs in one system, and was initially excited that you may have used such a system.)
If this was on a single system, do you happen to know the motherboard/system specs that support 5x3090s? And if not, then the search continues...
1 CPU, 1 system, multiple GPUs. It's server components: Tyan S8030GM2NE, Epyc 7532 32-core.
DDP is pytorch's parallel implementation for multiple gpus or multiple systems, https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
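To make the data-parallel idea concrete, here is a toy pure-Python simulation of what DDP does conceptually: each replica computes a gradient on its own shard of the batch, and the gradients are averaged so every replica applies the same update. This is only a sketch and does not use PyTorch; the one-weight model, the data, and the helper names (`mse_grad`, `shard`, `all_reduce_mean`) are made up for illustration. Only the replica count of 5 comes from the thread.

```python
import math

def mse_grad(w, batch):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def shard(data, world_size):
    """Split data into equal-sized interleaved shards, one per replica."""
    return [data[rank::world_size] for rank in range(world_size)]

def all_reduce_mean(values):
    """Stand-in for the all-reduce step: every replica gets the mean."""
    return sum(values) / len(values)

# Hypothetical data for y = 3x, and an arbitrary starting weight.
data = [(float(x), 3.0 * x) for x in range(1, 11)]
w = 0.5
world_size = 5  # one replica per GPU, as in the 5x3090 setup

# Each replica sees only its own shard of the global batch.
local_grads = [mse_grad(w, s) for s in shard(data, world_size)]

# After averaging, all replicas hold the same gradient.
synced_grad = all_reduce_mean(local_grads)

# With equal-sized shards this matches the gradient of the full batch.
full_grad = mse_grad(w, data)
print(math.isclose(synced_grad, full_grad))
```

Because each replica keeps its own full copy of the model, this scheme does not reduce per-GPU memory; it spreads the batch across devices, whether they sit in one machine or several.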
Just a newbie question: why did it take so many resources to train an 8-bit 7B LoRA? Usually in text generation webui I can do the same with 8GB of VRAM at batch size 1. Is it due to the data size?
Wow, only 8GB of VRAM, that's impressive. There are a number of factors: one is the batch size, as well as gradient accumulation steps. I also used the adamw_bnb_8bit optimizer, and the optimizer could affect VRAM usage... My sequence length was also 1000, and padded, so every sample would be 1000 tokens... that contributes for sure. I'm not sure about the total data size; I don't know if the entire dataset is loaded into VRAM. The higher lora_r and lora_alpha would also contribute to higher VRAM usage.
So to be fully honest, I'm not 100% certain on everything. Your VRAM usage seems surprisingly low, and mine seemed surprisingly high.
Edit: update, dataset size seems to have zero effect on VRAM usage, which makes sense, because it's loaded in batches of batch_size.
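A rough way to see how two of these knobs scale is to count things rather than guess at gigabytes. The sketch below counts LoRA trainable parameters as a function of lora_r and the number of tokens processed per step when every sample is padded to a fixed length. The hidden size, layer count, and adapted modules are assumptions (typical of a LLaMA-style 7B model with q_proj and v_proj adapted), as are the example batch sizes; the thread does not state them. lora_alpha is left out because in the standard LoRA formulation it is only a scaling factor and adds no parameters. None of this predicts actual VRAM use.

```python
def lora_trainable_params(r, d_in, d_out, n_layers, modules_per_layer):
    """Each adapted linear layer gains A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out) * modules_per_layer * n_layers

def padded_tokens_per_step(batch_size, seq_len):
    """With padding to a fixed length, every sample costs seq_len tokens."""
    return batch_size * seq_len

# Assumed architecture values, not taken from the thread.
HIDDEN = 4096   # hidden size of a LLaMA-style 7B model
LAYERS = 32     # number of transformer blocks
MODULES = 2     # assuming only q_proj and v_proj are adapted

# Adapter size grows linearly with lora_r.
for r in (8, 16, 64):
    n = lora_trainable_params(r, HIDDEN, HIDDEN, LAYERS, MODULES)
    print(f"lora_r={r}: {n:,} trainable parameters")

# Tokens per step depend on batch size and sequence length only,
# not on how many samples the dataset holds.
SEQ_LEN = 1000  # padded sequence length mentioned in the thread
for batch_size in (1, 4):  # hypothetical per-GPU batch sizes
    tokens = padded_tokens_per_step(batch_size, SEQ_LEN)
    print(f"batch_size={batch_size}: {tokens} tokens per step")
```

The second function takes no dataset-size argument at all, which is consistent with the edit above: only one batch at a time needs to sit on the GPU.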