Update README.md

#2
by dtamayo - opened
Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -386,7 +386,7 @@ The alignment data was synthetically generated from a corpus of approximately 40
 
 Following approaches similar to UltraFeedback and PKU, each instruction underwent the following process:
 
-1. Multiple responses were produced using a pool of permissively licensed models (see [Model Pool](#model-pool-for-synthetic-data-generation) on helpfulness or safety, depending on the prompt.
+1. Multiple responses were produced using a pool of permissively licensed models (see [Model Pool](#model-pool-for-synthetic-data-generation)) on helpfulness or safety, depending on the prompt.
 2. These responses were rated by a judge (Deepseek-V3-0324). Helpfulness responses were given an overall rating, while safety responses were given a score based on their level of severity over a list of harm categories.
 3. Preference pairs were constructed from these ratings. This phase should be considered preliminary, as future versions of the model will incorporate human annotators to refine and curate the generation and evaluation pipeline.
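For context on the mechanics of step 3, here is a minimal sketch of how preference pairs can be built from judge ratings. The sample schema, margin threshold, and function name are illustrative assumptions, not the project's actual pipeline:

```python
# Minimal sketch of UltraFeedback-style pair construction (step 3 above).
# The sample schema, margin threshold, and function name are illustrative
# assumptions, not the actual pipeline.

def build_preference_pairs(samples, margin=1.0):
    """samples: [{"prompt": str, "responses": [{"text": str, "rating": float}, ...]}]"""
    pairs = []
    for sample in samples:
        ranked = sorted(sample["responses"], key=lambda r: r["rating"], reverse=True)
        best, worst = ranked[0], ranked[-1]
        # Drop prompts where the judge saw no clear quality gap.
        if best["rating"] - worst["rating"] < margin:
            continue
        pairs.append({"prompt": sample["prompt"],
                      "chosen": best["text"],
                      "rejected": worst["text"]})
    return pairs
```

Filtering out small-margin pairs is a common guard against noisy judge scores, at the cost of discarding some prompts.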
 
@@ -403,8 +403,8 @@ The table below presents the distribution of helpfulness prompts by language, de
 | m-personas | 2674 | 1215 | 2852 | 2791 | 2530 | 12062 |
 | mentor-ca | 6517 | 0 | 0 | 0 | 0 | 6517 |
 | mentor-es | 0 | 0 | 6007 | 0 | 0 | 6007 |
-| new_open-orca | 0 | 15528 | 0 | 0 | 0 | 15528 |
-| no-robots-system-prompt | 0 | 5913 | 0 | 0 | 0 | 5913 |
+| open-orca | 0 | 15528 | 0 | 0 | 0 | 15528 |
+| no-robots | 0 | 5913 | 0 | 0 | 0 | 5913 |
 | oasst-ca | 2195 | 0 | 0 | 0 | 0 | 2195 |
 | open-math | 0 | 99995 | 0 | 0 | 0 | 99995 |
 | persona-generic | 8849 | 0 | 9464 | 8899 | 8588 | 35800 |
@@ -699,7 +699,7 @@ Current LM Evaluation Harness implementation is lacking correct pre-processing.
 To assess the long-context capabilities of our model, we performed a "needle in a haystack" test with the following configuration:
 
 - **Needle Phrase**: *"The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."*
-- **System Prompt:** “You are a helpful AI bot that answers questions for a user. Keep your response short and direct”
+- **System Prompt:** *“You are a helpful AI bot that answers questions for a user. Keep your response short and direct”*
 - **Retrieval Question**: *"What is the best thing to do in San Francisco?"*
 - **Evaluator**: [prometheus-8x7b-v2.0](https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0), used as the evaluation judge to determine whether the model correctly retrieved and utilized the long-context information.
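For illustration, a minimal sketch of how such a probe can be run over a range of insertion depths. The `generate()` and `judge()` callables stand in for the model under test and the Prometheus judge, and the depth grid is an assumption rather than the configuration actually used:

```python
# Rough sketch of the "needle in a haystack" probe described above.
# generate() and judge() are placeholders for the model under test and
# the Prometheus judge; the depth grid is an illustrative assumption.

NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
SYSTEM = ("You are a helpful AI bot that answers questions for a user. "
          "Keep your response short and direct")
QUESTION = "What is the best thing to do in San Francisco?"

def insert_needle(haystack: str, depth: float) -> str:
    """Splice the needle into the filler text at a relative depth in [0, 1]."""
    cut = int(len(haystack) * depth)
    return haystack[:cut] + " " + NEEDLE + " " + haystack[cut:]

def run_probe(haystack, generate, judge, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for depth in depths:
        context = insert_needle(haystack, depth)
        answer = generate(system=SYSTEM, prompt=context + "\n\n" + QUESTION)
        # judge() should return whether the answer reflects the needle's content.
        results[depth] = judge(question=QUESTION, reference=NEEDLE, answer=answer)
    return results
```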