Congratulations on the work

by alexspf - opened Apr 24, 2024

Discussion

alexspf

Apr 24, 2024

Great model for personal use and for studying. I ran a few tests, including role-play, and it handles that well without suddenly switching to English (which is common with stock Llama 3; this model is really ingrained in speaking pt-BR). On the Python tests I usually run against the English Llama 3 models, it did not degrade at all and answers well (it even translates the code into Portuguese without being asked in the prompt, and it makes the same mistakes that Llama 3 8B already makes when it has not been trained on the framework).
The model does not appear to be censored (stock Llama 3 itself is quite easy to trick through the prompt). It is interesting for people who like Tavern AI, since it handled my very random characters well even though their context was written in English (I am not enough of an enthusiast to test whether it would really get into character).
I ran it with 8k of context at q6 quantization. (I could not find it on your page, so I got it from mradermacher; he made an imatrix quantization, which is nice, and I will test it later.)
Once again, congratulations on the work for the community.

alexspf changed discussion status to closed Apr 24, 2024

nicolasdec

BotBot org Apr 24, 2024

Hi Alex, how are you?

Thank you very much for the compliment! I'm glad you liked it. I'm running another training round to improve it a bit. Take a look at our other models later. I'll do the Llama3 72b next.

The Cabra 72b is incredible for roleplay.

True!! I forgot to include the quantized models. I'll do that tomorrow.

Best regards

nicolasdec changed discussion status to open Apr 24, 2024

cnmoro

May 1, 2024

I liked the model. It makes few Portuguese errors, mainly related to European Portuguese.
I'm curious: which method did you use for the fine-tuning? QLoRA? What rank (R) did you use?

Thanks

nicolasdec

BotBot org May 3, 2024

Hi @cnmoro

How are you? Which errors did you notice most? I'd like to try to fix them. One that was quite common, and that also happened with our Qwen model, is using miles instead of kilometers. It was a flaw in our dataset; we have already fixed it, and it should no longer occur in the next training runs (llama 3 72b).

We usually do a full fine-tune, which gives better results than LoRA (at a higher cost as well). Here are all the training parameters:

Cheers
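For readers following the QLoRA versus full fine-tune exchange above, here is a rough sketch of why a full fine-tune costs more. It counts trainable parameters for a single square projection matrix under LoRA and under full fine-tuning. The hidden size of 4096 and the rank values are illustrative assumptions, not settings from this thread (the reply above says no LoRA rank was used at all).

```python
def full_trainable_params(d_in: int, d_out: int) -> int:
    """Full fine-tuning updates every entry of the d_out x d_in weight."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    hidden = 4096  # assumption: hidden size used only for illustration
    full = full_trainable_params(hidden, hidden)
    for rank in (8, 16, 64):  # hypothetical ranks, not from the thread
        lora = lora_trainable_params(hidden, hidden, rank)
        print(f"rank={rank}: LoRA trains {lora} params, "
              f"{100 * lora / full:.2f}% of the {full} in the full matrix")
```

The gap in trainable parameters is what drives the difference in optimizer state and gradient memory, which is one plausible reading of the "higher cost" mentioned in the reply.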

nicolasdec

BotBot org May 3, 2024

# Model arguments
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
model_revision: main
torch_dtype: bfloat16
use_flash_attention_2: true

# Data training arguments
dataset_mixer:
  /home/ubuntu/llm_finetune/alignment-handbook/merge_translate_21_04: 1.0
dataset_splits:
- train
preprocessing_num_workers: 12

# SFT trainer config
bf16: true
dataset_kwargs:
  add_special_tokens: false  # We already wrap <bos> and <eos> in the chat template
  append_concat_token: false # No need to add <eos> across samples
do_eval: false
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 1.0e-05
log_level: info
logging_steps: 5
logging_strategy: steps
lr_scheduler_type: cosine
max_seq_length: 2048
max_steps: -1
num_train_epochs: 3
output_dir: llama3-8b-it-sft-v2
overwrite_output_dir: true
per_device_eval_batch_size: 4
per_device_train_batch_size: 4
#push_to_hub: true
remove_unused_columns: true
report_to:
- wandb
save_strategy: "no"
seed: 42
warmup_ratio: 0.01
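
To make the batch-related settings in this config concrete, the following minimal sketch computes the effective batch size per optimizer step and the number of warmup steps implied by `warmup_ratio`. The GPU count and the total step count are assumptions (the thread states neither), so they are left as function arguments, and the rounding-up of warmup steps is an assumption about how the trainer applies the ratio.

```python
import math

# Values copied from the config posted in this thread.
PER_DEVICE_TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
MAX_SEQ_LENGTH = 2048
WARMUP_RATIO = 0.01


def effective_batch_size(num_gpus: int) -> int:
    """Sequences that contribute to one optimizer step across all devices."""
    return PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * num_gpus


def max_tokens_per_step(num_gpus: int) -> int:
    """Upper bound on tokens per optimizer step (every sequence at max length)."""
    return effective_batch_size(num_gpus) * MAX_SEQ_LENGTH


def warmup_steps(total_optimizer_steps: int) -> int:
    """Warmup length implied by warmup_ratio, rounded up (assumed rounding)."""
    return math.ceil(total_optimizer_steps * WARMUP_RATIO)


if __name__ == "__main__":
    for gpus in (1, 8):  # hypothetical GPU counts, not stated in the thread
        print(gpus, effective_batch_size(gpus), max_tokens_per_step(gpus))
```

With `max_steps: -1`, the total number of optimizer steps is derived from the dataset size and `num_train_epochs: 3`, so the warmup length cannot be computed from the config alone.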

cnmoro

May 3, 2024

As soon as I can, I'll run the tests again and post the results here.

Thank you very much for the information!
Congratulations on the models :)

nicolasdec changed discussion status to closed Jun 4, 2024
