- RewardBert specifically targets free-form GRPO training, where answers cannot be evaluated by simple correctness checks.
- We use [ModernBERT](https://huggingface.co/docs/transformers/en/model_doc/modernbert) as the base model, finetuned on [MOCHA](https://arxiv.org/abs/2010.03636), [Prometheus-preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection), and [Pedants](https://arxiv.org/abs/2402.11161), to evaluate free-form text generations. We use RewardBert as the reward in GRPO finetuning.

### Installation

```
## For more evaluation metrics, refer to https://github.com/zli12321/qa_metrics
pip install qa-metrics
```

#### Method: `compute_score`

**Parameters**

- `reference_answer` (list of str): A list of gold (correct) answers to the question
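To illustrate how a `compute_score`-style reward plugs into GRPO, here is a minimal sketch. Only the `reference_answer` parameter appears in this excerpt; the `candidate_answer` parameter, the `reward_fn` wrapper, and the token-overlap F1 scoring are placeholder assumptions standing in for the actual ModernBERT-based RewardBert model.

```python
# Hypothetical stand-in for RewardBert's `compute_score` interface.
# The token-overlap F1 here is a placeholder, not the ModernBERT scorer.

def compute_score(reference_answer, candidate_answer):
    """Return the best score of the candidate against any gold answer."""
    def f1(ref, cand):
        ref_toks, cand_toks = ref.lower().split(), cand.lower().split()
        pool, common = list(ref_toks), 0
        for tok in cand_toks:
            if tok in pool:
                pool.remove(tok)
                common += 1
        if common == 0:
            return 0.0
        precision = common / len(cand_toks)
        recall = common / len(ref_toks)
        return 2 * precision * recall / (precision + recall)

    # Free-form answers may match any of several gold references,
    # so take the best score across the list.
    return max(f1(ref, candidate_answer) for ref in reference_answer)


def reward_fn(completions, reference_answers):
    """GRPO-style reward: one float per sampled completion."""
    return [
        compute_score(refs, completion)
        for completion, refs in zip(completions, reference_answers)
    ]
```

In GRPO finetuning, `reward_fn` would be called on each group of sampled completions, with the per-completion scores normalized within the group to form advantages.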