license: mit
datasets:
- ymoslem/Law-StackExchange
language:
- en
metrics:
- f1
base_model:
- google/gemma-2-2b
library_name: mlx
tags:
- legal
widget:
- text: |
<start_of_turn>user
## Instructions
You are a helpful AI assistant.
## User
How to make scrambled eggs?<end_of_turn>
<start_of_turn>model
shellzero/gemma2-2b-ft-law-data-tag-generation
This model was converted to MLX format from google/gemma-2-2b.
Refer to the original model card for more details on the model.
pip install mlx-lm
The model was LoRA fine-tuned with MLX for 1500 steps on the ymoslem/Law-StackExchange dataset and on synthetic data generated with GPT-4o and GPT-35-Turbo, using the prompt format below.
This fine-tune was one of our best runs on this data and achieved a high F1 score on our evaluation dataset. This work was part of the Nvidia hackathon.
def format_prompt(system_prompt: str, title: str, question: str) -> str:
    """Format a title and question into the prompt format used for fine-tuning."""
    # The template text stays flush left so no extra indentation leaks into the prompt.
    return """<bos><start_of_turn>user
## Instructions
{}
## User
TITLE:
{}
QUESTION:
{}<end_of_turn>
<start_of_turn>model
""".format(
        system_prompt, title, question
    )
Here's an example of the system_prompt from the dataset:
Read the following title and question about a legal issue and assign the most appropriate tag to it. All tags must be in lowercase, ordered lexicographically and separated by commas.
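To make the format concrete, the sketch below fills the template with the system prompt quoted above and a made-up question. The template is restated as a constant so the snippet runs on its own; `PROMPT_TEMPLATE`, `example_title`, and `example_question` are illustrative names and values, not taken from the dataset.

```python
# Restatement of the prompt template so this snippet is self-contained.
PROMPT_TEMPLATE = (
    "<bos><start_of_turn>user\n"
    "## Instructions\n"
    "{}\n"
    "## User\n"
    "TITLE:\n"
    "{}\n"
    "QUESTION:\n"
    "{}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

# System prompt as quoted in this model card.
system_prompt = (
    "Read the following title and question about a legal issue and assign "
    "the most appropriate tag to it. All tags must be in lowercase, ordered "
    "lexicographically and separated by commas."
)

# Hypothetical inputs, invented for illustration only.
example_title = "Can my landlord keep my deposit?"
example_question = "I moved out last month and have not received my deposit back."

prompt = PROMPT_TEMPLATE.format(system_prompt, example_title, example_question)
print(prompt)
```

The printed prompt ends with the opening of the model turn, so generation continues directly with the tag list.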
Loading the model using mlx_lm
from mlx_lm import generate, load

model, tokenizer = load("shellzero/gemma2-2b-ft-law-data-tag-generation")

# system_prompt, title, and question are strings you supply;
# format_prompt is the helper defined above.
response = generate(
    model,
    tokenizer,
    prompt=format_prompt(system_prompt, title, question),
    verbose=True,  # Set to True to see the prompt and response
    temp=0.0,
    max_tokens=32,
)
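The system prompt asks for lowercase, lexicographically ordered, comma-separated tags, so the generated text can be parsed into a tag list and compared with reference tags. The sketch below is one possible way to do that. `parse_tags` and `tag_f1` are hypothetical helpers, the sample strings are invented, and the set-based per-example F1 shown here is an assumption, not necessarily how the F1 score reported above was computed.

```python
def parse_tags(text: str) -> list[str]:
    """Split a comma-separated tag string into sorted, lowercase, unique tags."""
    tags = {part.strip().lower() for part in text.split(",")}
    tags.discard("")  # drop empty entries left by stray commas
    return sorted(tags)


def tag_f1(predicted: list[str], reference: list[str]) -> float:
    """Set-based F1 between predicted and reference tags for one example."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0  # also covers empty predictions or references
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented strings standing in for a model response and a gold label.
predicted = parse_tags("Contract-Law, united-states ,")
reference = parse_tags("contract-law,landlord,united-states")
print(predicted)
print(tag_f1(predicted, reference))
```

Averaging `tag_f1` over all evaluation examples would give one aggregate score; other aggregation choices (for example micro-averaging over all tags) are equally plausible.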