Update README.md

#2
by dtamayo - opened
Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -386,7 +386,7 @@ The alignment data was synthetically generated from a corpus of approximately 40
 
 Following approaches similar to UltraFeedback and PKU, each instruction underwent the following process:
 
-1. Multiple responses were produced using a pool of permissively licensed models (see [Model Pool](#model-pool-for-synthetic-data-generation) on helpfulness or safety, depending on the prompt.
+1. Multiple responses were produced using a pool of permissively licensed models (see [Model Pool](#model-pool-for-synthetic-data-generation)) on helpfulness or safety, depending on the prompt.
 2. These responses were rated by a judge (Deepseek-V3-0324). Helpfulness responses were given an overall rating, while safety responses were given a score based on their level of severity over a list of harm categories.
 3. Preference pairs were constructed from these ratings. This phase should be considered preliminary, as future versions of the model will incorporate human annotators to refine and curate the generation and evaluation pipeline.
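For context on the mechanics of step 3, here is a minimal sketch of how preference pairs can be built from judge ratings. The sample schema, margin threshold, and function name are illustrative assumptions, not the project's actual pipeline:

```python
# Minimal sketch of UltraFeedback-style pair construction (step 3 above).
# The sample schema, margin threshold, and function name are illustrative
# assumptions, not the actual pipeline.

def build_preference_pairs(samples, margin=1.0):
    """samples: [{"prompt": str, "responses": [{"text": str, "rating": float}, ...]}]"""
    pairs = []
    for sample in samples:
        ranked = sorted(sample["responses"], key=lambda r: r["rating"], reverse=True)
        best, worst = ranked[0], ranked[-1]
        # Drop prompts where the judge saw no clear quality gap.
        if best["rating"] - worst["rating"] < margin:
            continue
        pairs.append({"prompt": sample["prompt"],
                      "chosen": best["text"],
                      "rejected": worst["text"]})
    return pairs
```

Filtering out small-margin pairs is a common guard against noisy judge scores, at the cost of discarding some prompts.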
 
@@ -403,8 +403,8 @@ The table below presents the distribution of helpfulness prompts by language, de
 | m-personas | 2674 | 1215 | 2852 | 2791 | 2530 | 12062 |
 | mentor-ca | 6517 | 0 | 0 | 0 | 0 | 6517 |
 | mentor-es | 0 | 0 | 6007 | 0 | 0 | 6007 |
-| new_open-orca | 0 | 15528 | 0 | 0 | 0 | 15528 |
-| no-robots-system-prompt | 0 | 5913 | 0 | 0 | 0 | 5913 |
+| open-orca | 0 | 15528 | 0 | 0 | 0 | 15528 |
+| no-robots | 0 | 5913 | 0 | 0 | 0 | 5913 |
 | oasst-ca | 2195 | 0 | 0 | 0 | 0 | 2195 |
 | open-math | 0 | 99995 | 0 | 0 | 0 | 99995 |
 | persona-generic | 8849 | 0 | 9464 | 8899 | 8588 | 35800 |
@@ -699,7 +699,7 @@ Current LM Evaluation Harness implementation is lacking correct pre-processing.
 To assess the long-context capabilities of our model, we performed a "needle in a haystack" test with the following configuration:
 
 - **Needle Phrase**: *"The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."*
-- **System Prompt:** “You are a helpful AI bot that answers questions for a user. Keep your response short and direct”
+- **System Prompt:** *“You are a helpful AI bot that answers questions for a user. Keep your response short and direct”*
 - **Retrieval Question**: *"What is the best thing to do in San Francisco?"*
 - **Evaluator**: [prometheus-8x7b-v2.0](https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0), used as the evaluation judge to determine whether the model correctly retrieved and utilized the long-context information.
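For illustration, a minimal sketch of how such a probe can be run over a range of insertion depths. The `generate()` and `judge()` callables stand in for the model under test and the Prometheus judge, and the depth grid is an assumption rather than the configuration actually used:

```python
# Rough sketch of the "needle in a haystack" probe described above.
# generate() and judge() are placeholders for the model under test and
# the Prometheus judge; the depth grid is an illustrative assumption.

NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
SYSTEM = ("You are a helpful AI bot that answers questions for a user. "
          "Keep your response short and direct")
QUESTION = "What is the best thing to do in San Francisco?"

def insert_needle(haystack: str, depth: float) -> str:
    """Splice the needle into the filler text at a relative depth in [0, 1]."""
    cut = int(len(haystack) * depth)
    return haystack[:cut] + " " + NEEDLE + " " + haystack[cut:]

def run_probe(haystack, generate, judge, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for depth in depths:
        context = insert_needle(haystack, depth)
        answer = generate(system=SYSTEM, prompt=context + "\n\n" + QUESTION)
        # judge() should return whether the answer reflects the needle's content.
        results[depth] = judge(question=QUESTION, reference=NEEDLE, answer=answer)
    return results
```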