* [Training Details](#training-details)
* [Training Data](#training-data)
* [Preprocessing](#preprocessing)
* [Instruction template](#instruction-template)
* [Training Procedure](#training-procedure)
<!-- * [Evaluation](#evaluation) -->
* [Testing the model](#testing-the-model)

* Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only languages on which Lucie-7B was trained.
* Filtering by keyword: Examples were filtered out from Open Assistant if the assistant responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as a model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).

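The keyword filter above can be sketched as follows. This is a minimal illustration, not the released filtering code: the real `filter_strings` list lives in the Lucie-Training repository linked above, and the short list and function name here are hypothetical stand-ins.

```python
# Hypothetical subset of filter_strings; the full list is in the
# Lucie-Training repository.
filter_strings = ["ChatGPT", "Gemma", "Llama"]

def keep_example(assistant_responses):
    """Return True if no assistant response mentions another model.

    Case-insensitive matching is an assumption made for this sketch.
    """
    return not any(
        keyword.lower() in response.lower()
        for response in assistant_responses
        for keyword in filter_strings
    )

examples = [
    {"responses": ["I am Lucie, happy to help."]},
    {"responses": ["As ChatGPT, I cannot do that."]},
]
# Keep only examples whose assistant responses pass the filter.
kept = [ex for ex in examples if keep_example(ex["responses"])]
```
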
### Instruction template

Lucie-7B-Instruct-human-data was trained on the chat template from Llama 3.1, with the sole difference that `<|begin_of_text|>` is replaced with `<s>`. The resulting template:

```
<s><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>
```

An example:

```
<s><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Give me three tips for staying in shape.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. Eat a balanced diet and be sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.<|eot_id|>
```

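At inference time, only the prompt up to and including the assistant header is supplied, and the model generates the `{OUTPUT}` portion. The template can be rendered as a plain string like this; `format_prompt` and its arguments are illustrative names, not part of the released code:

```python
def format_prompt(system, user_message):
    """Render the Lucie instruction template up to the assistant header."""
    return (
        "<s><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = format_prompt(
    "You are a helpful assistant.",
    "Give me three tips for staying in shape.",
)
```

In practice the tokenizer's built-in chat template should produce the same string, so hand-formatting is only needed when working outside that tooling.
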
### Training procedure

The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase, with the following exceptions: