RL Casino Models Collection

Model checkpoints from an ongoing research effort into the acceleration potential and tuning quality of LLMs under RL fine-tuning.
LLaMA 3.1 8B Instruct, fine-tuned on the Light R1 DPO dataset for 100 steps.

Requirements:

- transformers >= 4.43.0 (required for full Llama 3.1 support)
- torch (recommended: torch >= 2.0.0)

Install or upgrade both with:

```shell
pip install --upgrade transformers torch
```
Starting with transformers >= 4.43.0, you can run conversational inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the generate() function.
```python
import transformers
import torch

model_id = "ScottBiggs2/LLaMA-3.1-8B-Instruct-DPO-Baseline"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what machine learning is."},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ScottBiggs2/LLaMA-3.1-8B-Instruct-DPO-Baseline"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what machine learning is."},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
Llama 3.1 supports tool use through chat templates in Transformers. See the official documentation for detailed examples.
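As a minimal sketch of the message and tool-schema shapes involved (the `get_weather` tool below is a hypothetical example, not part of this model card), Transformers chat templates accept tool definitions as JSON-schema dictionaries passed via the `tools` argument of `apply_chat_template`:

```python
# Hypothetical tool definition in the JSON-schema style consumed by
# Transformers chat templates via apply_chat_template(..., tools=[...]).
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # example name, not a real API
        "description": "Get the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# A tool-use conversation alternates user turns, assistant tool calls,
# and "tool" role messages carrying the tool's result back to the model.
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "tool_calls": [{
        "type": "function",
        "function": {"name": "get_weather", "arguments": {"city": "Paris"}},
    }]},
    {"role": "tool", "name": "get_weather", "content": "22 C"},
]
```

With a tokenizer loaded, `tokenizer.apply_chat_template(messages, tools=[get_weather_tool], add_generation_prompt=True)` would render this conversation into the Llama 3.1 tool-use prompt format.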
This model is based on Meta's Llama 3.1 8B Instruct model, fine-tuned using Direct Preference Optimization (DPO). The model maintains compatibility with the original Llama 3.1 architecture and chat template format.
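For intuition, a rough per-example sketch of the DPO objective (this is illustrative, not the training code used for this checkpoint): the loss rewards the policy for preferring the chosen response over the rejected one by a larger margin than a frozen reference model does.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model; beta controls how strongly
    the policy is allowed to drift from the reference.
    """
    logits = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# When the policy matches the reference exactly, the margin difference is
# zero and the loss equals log(2); favoring the chosen response more than
# the reference does drives the loss below that.
```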
For more information about the base model, see:
If you use this model, please cite the original Llama 3.1 paper:
```bibtex
@article{meta2024llama,
  title={Llama 3.1},
  author={Meta AI},
  year={2024}
}
```
Base model: meta-llama/Llama-3.1-8B