Instructions to use v2ray/dbrx-base-fixed with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use v2ray/dbrx-base-fixed with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="v2ray/dbrx-base-fixed", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("v2ray/dbrx-base-fixed", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("v2ray/dbrx-base-fixed", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use v2ray/dbrx-base-fixed with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "v2ray/dbrx-base-fixed"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "v2ray/dbrx-base-fixed",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/v2ray/dbrx-base-fixed

SGLang

How to use v2ray/dbrx-base-fixed with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "v2ray/dbrx-base-fixed" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "v2ray/dbrx-base-fixed",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "v2ray/dbrx-base-fixed" \
        --host 0.0.0.0 \
        --port 30000


FlashAttention2 support not working during training

by Qubitium - opened Mar 30, 2024

Discussion

Qubitium

Mar 30, 2024

@v2ray First of all, huge thanks for the first working model that I can actually full fine-tune with bfloat16. However, the VRAM usage is insane due to the FlashAttention 2 incompatibility. Are you able to train (bfloat16) with FlashAttention 2 enabled on your end?

v2ray

Owner Mar 30, 2024 (edited Mar 30, 2024)

@Qubitium Yes, I am able to use FlashAttention 2. I'm not doing a full fine-tune though; it's a LoRA tune I tested.
https://github.com/LagPixelLOL/qlora/blob/main/scripts/finetune_schizogpt_132b.sh
This is the script I used to test, with eval disabled so that DeepSpeed works.
https://huggingface.co/v2ray/SchizoGPT-132B-QLoRA
This is the result of the training run.
Despite what the name suggests, it's actually a regular LoRA rather than a QLoRA, because I set the bits to 16. I trained it on 8x A100 80GB.
All the libraries I used are at the latest release version (not the dev version), and the CUDA version I used is 12.2.
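For readers who want to try the same combination (bfloat16 weights with FlashAttention 2), here is a minimal sketch. It is not taken from the linked script: the helper names `flash_attn_load_kwargs` and `load_model` are hypothetical, and it assumes a recent Transformers release where `from_pretrained` accepts `attn_implementation`, plus an installed `flash-attn` package. Transformers only enables FlashAttention 2 for half-precision dtypes (fp16 or bf16), which is why the dtype is set explicitly.

```python
def flash_attn_load_kwargs(dtype_name="bfloat16"):
    """Keyword arguments for from_pretrained (hypothetical helper).

    The dtype is kept as a string here so this function has no heavy
    dependencies; load_model() converts it to a torch dtype.
    """
    return {
        "trust_remote_code": True,  # same flag the thread's model requires
        "attn_implementation": "flash_attention_2",
        "torch_dtype": dtype_name,
    }


def load_model(model_id="v2ray/dbrx-base-fixed"):
    """Load the model in bf16 with FlashAttention 2 (assumed setup)."""
    # Imported lazily so the kwargs helper can be inspected without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM

    kwargs = flash_attn_load_kwargs()
    kwargs["torch_dtype"] = getattr(torch, kwargs["torch_dtype"])
    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)


if __name__ == "__main__":
    # Only prints the arguments; call load_model() on a machine with
    # enough GPU memory to actually load the weights.
    print(flash_attn_load_kwargs())
```

If loading fails with an attention-related error, dropping `attn_implementation` falls back to the default attention path, which is a quick way to check whether FlashAttention 2 is the cause.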

ChristianPalaArtificialy

Mar 30, 2024 (edited Mar 30, 2024)

Hey v2ray,
thank you for the conversion.
I'm using TRL for fine-tuning and I'm getting stuck on the target_modules for PEFT. In the repo you linked there's a function to extract all linear layers, but I get an error when I use it.
Which modules did you use?
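As an illustration of what "extract all linear layers" usually means, here is a minimal sketch. It is not the function from the linked repo: `find_linear_leaf_names` is a hypothetical helper that collects the last name component of every module of a given type, and skipping `lm_head` is an assumption borrowed from common LoRA practice. The demo uses stand-in classes so it runs without PyTorch.

```python
def find_linear_leaf_names(named_modules, linear_type, skip=("lm_head",)):
    """Return sorted leaf names of all modules that are instances of linear_type.

    named_modules: iterable of (dotted_name, module) pairs, e.g. the result
                   of model.named_modules() on a PyTorch model.
    linear_type:   class to look for, e.g. torch.nn.Linear.
    skip:          leaf names to leave out (assumed default: the output head).
    """
    leaves = set()
    for name, module in named_modules:
        if isinstance(module, linear_type):
            leaf = name.split(".")[-1]
            if leaf not in skip:
                leaves.add(leaf)
    return sorted(leaves)


if __name__ == "__main__":
    # Stand-in classes so the demo does not need torch installed.
    class FakeLinear:
        pass

    class FakeNorm:
        pass

    demo = [
        ("blocks.0.attn.Wqkv", FakeLinear()),
        ("blocks.0.attn.out_proj", FakeLinear()),
        ("blocks.0.norm_1", FakeNorm()),
        ("lm_head", FakeLinear()),
    ]
    print(find_linear_leaf_names(demo, FakeLinear))
```

With a real model the call would look like `find_linear_leaf_names(model.named_modules(), torch.nn.Linear)`, and the result can be passed to PEFT as `target_modules`.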

Qubitium

Mar 30, 2024

@v2ray The FA2 bug I encountered was caused by my own custom training code. Sorry for the false alarm. =P

Qubitium changed discussion status to closed Mar 30, 2024

v2ray

Owner Mar 30, 2024

@ChristianPalaArtificialy Hello, I'm using:

  "target_modules": [
    "v1",
    "Wqkv",
    "layer",
    "out_proj",
    "w1",
    "w2"
  ],

Also, what's the error you were getting?
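To show how a `target_modules` list like the one above is typically consumed, here is a minimal sketch. When `target_modules` is a list, PEFT is understood to select a module if its full name equals an entry or ends with `"." + entry`; `select_target_modules` mimics that rule so the list can be checked against a model's module names before training. The module names in the demo are hypothetical, not read from the actual checkpoint, and the LoRA hyperparameters in `build_lora_config` are placeholder values rather than the ones used for the training run.

```python
TARGET_MODULES = ["v1", "Wqkv", "layer", "out_proj", "w1", "w2"]


def select_target_modules(module_names, targets=TARGET_MODULES):
    """Return the names a suffix-matching rule would pick for LoRA.

    A name is selected if it equals a target or ends with "." + target.
    """
    return [
        name
        for name in module_names
        if any(name == t or name.endswith("." + t) for t in targets)
    ]


def build_lora_config(r=16, lora_alpha=32, lora_dropout=0.05):
    """Build a PEFT LoraConfig with the thread's target_modules.

    r, lora_alpha and lora_dropout are placeholder values (assumptions).
    """
    from peft import LoraConfig  # imported lazily; requires peft installed

    return LoraConfig(
        r=r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=list(TARGET_MODULES),
        task_type="CAUSAL_LM",
    )


if __name__ == "__main__":
    # Hypothetical module names, for illustration only.
    names = [
        "transformer.blocks.0.norm_attn_norm.attn.Wqkv",
        "transformer.blocks.0.norm_attn_norm.attn.out_proj",
        "transformer.blocks.0.norm_attn_norm.norm_1",
        "transformer.blocks.0.ffn.router.layer",
        "transformer.blocks.0.ffn.experts.mlp.0.w1",
        "lm_head",
    ]
    for name in select_target_modules(names):
        print(name)
```

Running the selection against `[n for n, _ in model.named_modules()]` is a cheap way to confirm that every entry in the list matches at least one module, which is a common source of PEFT errors.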
