abbatea committed · verified
Commit 43dced7 · 1 Parent(s): 7a0f196

Update README.md

Files changed (1)
  1. README.md +31 -33
README.md CHANGED
@@ -12,38 +12,49 @@ licence: license
  # Model Card for Llama_DPO_lora
  
  This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
- It has been trained using [TRL](https://github.com/huggingface/trl).
+ It has been trained using [DPO](https://huggingface.co/docs/trl/en/dpo_trainer).
+ The dataset it was built upon is a combination of the **[MathDial dataset](https://huggingface.co/datasets/eth-nlped/mathdial-chat/viewer/default/train?views%5B%5D=train&row=0)** and model responses generated with MathDial as input.
  
- ## Quick start
+ The model is optimized for:
+ - Conversational math problem solving
+ - Step-by-step reasoning in dialogue form
+ - Scaffolding
  
- ```python
- from transformers import pipeline
+ Repository: **[GitHub code for DPO training and datasets](https://github.com/abbatea/MathDial-SFT-and-DPO/tree/main/DPO_Finetuning)**
  
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="None", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
- ```
+ ---
  
- ## Training procedure
-
+ ## Intended Use
+ This model is intended for use in:
+ - Interactive math tutoring
+ - Research in dialogue-based problem solving
+ - Educational tools
  
- This model was trained with DPO, a method introduced in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290).
+ ---
  
- ### Framework versions
+ ## Example Usage
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
  
- - TRL: 0.22.2
- - Transformers: 4.56.1
- - Pytorch: 2.8.0
- - Datasets: 4.0.0
- - Tokenizers: 0.22.0
+ model_name = "abbateaa/Tutorbot-variation-DPO-Llama"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
  
- ## Citations
+ question = """Student: I need help solving this problem.
+ Problem: Sarah has 3 apples. She buys 2 more. How many apples does she have?
+ Tutor:"""
+ messages = [{"role": "user", "content": question}]
  
- Cite DPO as:
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+ ---
  
+ ## Citations
  
+ DPO:
  ```bibtex
  @inproceedings{rafailov2023direct,
      title = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
@@ -54,16 +65,3 @@
      editor = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
  }
  ```
-
- Cite TRL as:
-
- ```bibtex
- @misc{vonwerra2022trl,
-     title = {{TRL: Transformer Reinforcement Learning}},
-     author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
-     year = 2020,
-     journal = {GitHub repository},
-     publisher = {GitHub},
-     howpublished = {\url{https://github.com/huggingface/trl}}
- }
- ```
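
The diff removes the boilerplate training-procedure section without replacing it, so the card no longer shows how the DPO step is run. For orientation, a LoRA-based DPO run with TRL's `DPOTrainer` can be sketched as below; the toy preference pairs and every hyperparameter are illustrative assumptions, not values taken from the linked repository (the actual training script lives in the GitHub link in the diff).

```python
# Illustrative sketch of LoRA + DPO preference tuning with TRL.
# The preference pairs and all hyperparameters below are assumptions
# for illustration only; see the linked repository for the real script.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# DPO expects preference pairs: a prompt plus a preferred ("chosen") and a
# dispreferred ("rejected") response. Here the chosen reply scaffolds the
# student instead of giving the answer away, matching the card's goals.
pairs = Dataset.from_dict({
    "prompt": ["Student: Sarah has 3 apples. She buys 2 more. How many apples does she have?"],
    "chosen": ["Good question! What do we get if we add the 2 new apples to the 3 she started with?"],
    "rejected": ["She has 5 apples."],
})

# Train low-rank adapters rather than the full 8B model.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="Llama_DPO_lora",
    beta=0.1,  # weight of the implicit KL penalty toward the reference model
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```

When a `peft_config` is passed, TRL derives the reference model by disabling the adapters on the fly, so no separate frozen copy of the base model needs to be loaded.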