FlashAttention2 support not working during training
@Qubitium Yes, I am able to use FlashAttention 2. I'm not doing a full fine-tune though; what I tested was a LoRA tune.
https://github.com/LagPixelLOL/qlora/blob/main/scripts/finetune_schizogpt_132b.sh
This is the script I used for testing, with eval disabled so that DeepSpeed works.
https://huggingface.co/v2ray/SchizoGPT-132B-QLoRA
This is the result of the training run.
Despite what the name suggests, it's actually just a regular LoRA rather than a QLoRA, because I set the bits to 16. I trained it on 8x A100 80GB.
All the libraries I used are at their latest release versions (not dev versions), and the CUDA version I used is 12.2.
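As a rough sanity check on why a 16-bit LoRA run can fit on that hardware, the weight memory can be estimated with simple arithmetic. This is only a sketch: the parameter count is an assumption read off the "132B" in the model name, the weights are assumed to be sharded evenly across the GPUs, and activations, LoRA parameters, and optimizer state are ignored.

```python
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits / 8 / 1e9


total_params = 132e9   # assumption: taken from the "132B" in the model name
num_gpus = 8           # 8x A100, as stated in the post
gpu_memory_gb = 80     # 80GB per A100, as stated in the post

weights_gb = weight_memory_gb(total_params, bits=16)
per_gpu_gb = weights_gb / num_gpus  # assumption: weights sharded evenly

print(f"weights: {weights_gb:.0f} GB total, {per_gpu_gb:.0f} GB per GPU")
print(f"headroom per GPU: {gpu_memory_gb - per_gpu_gb:.0f} GB")
```

Under these assumptions the 16-bit weights take about 264 GB in total, or about 33 GB per GPU, which leaves room on each 80GB card for everything else the training run needs.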
Hey v2ray,
thank you for the conversion.
I'm using TRL for fine-tuning and I'm getting stuck on the target_modules for PEFT. In the repo you linked there's a function to extract all linear layers, but I get an error.
Which modules did you use?
@ChristianPalaArtificialy Hello, I'm using:
"target_modules": [
    "v1",
    "Wqkv",
    "layer",
    "out_proj",
    "w1",
    "w2"
],
Also, what's the error you were getting?
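For anyone wiring the list above into PEFT: as far as I understand it, when `target_modules` is a list, PEFT selects a module if its full name equals one of the entries or ends with `.` plus an entry. Below is a minimal sketch of that matching rule in plain Python. The helper `is_targeted` and the module names in `example_names` are hypothetical and only for illustration; the real names can be printed from the loaded model with `model.named_modules()`.

```python
TARGET_MODULES = ["v1", "Wqkv", "layer", "out_proj", "w1", "w2"]


def is_targeted(module_name: str, targets: list[str]) -> bool:
    """True if the name equals a target or ends with '.<target>'."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in targets
    )


# Hypothetical module names, for illustration only. Check the real ones with:
#   for name, _ in model.named_modules(): print(name)
example_names = [
    "transformer.blocks.0.norm_attn_norm.attn.Wqkv",
    "transformer.blocks.0.norm_attn_norm.attn.out_proj",
    "transformer.blocks.0.ffn.router.layer",
    "transformer.blocks.0.ffn.experts.mlp.0.w1",
    "transformer.blocks.0.norm_attn_norm.norm_1",
    "lm_head",
]

for name in example_names:
    print(f"{name}: {is_targeted(name, TARGET_MODULES)}")
```

The same list can then be passed as the `target_modules` argument of `peft.LoraConfig`. If none of the printed names match, the list probably does not fit the module layout of the checkpoint being loaded.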