sarosavo/Master-RM

Text Classification

text-generation

text-embeddings-inference


sarosavo committed on Jul 14, 2025

Commit 42eeb37 · verified · 1 Parent(s): b6e8e5f

Update README.md

Files changed (1)

README.md +27 -0


@@ -93,6 +93,33 @@ Inputting the question, its ground-truth reference, and the response to be evalu
 > print("Model judgement: ",judgement)
 > ```
+## Deploy the reward model for RLVR training
+### Launch a remote reward server with vLLM
+The script below launches a reward server at http://127.0.0.1:8000/get_reward:
+```bash
+bash reward_server/launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}
+# MODEL_PATH: the path of our reward model.
+# ANSWER_PATH: the path of the training data.
+# METRIC: greedy/prob
+# This launches the reward server at http://127.0.0.1:8000/get_reward
+```
+### Start training
+```bash
+bash train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}
+# METHOD:          advantage estimator, e.g., reinforce_baseline, reinforce, rloo
+# PRETRAIN_PATH:   path to the pretrained model, e.g., Qwen2.5-7B
+# DATA_PATH:       path to the QA data used for RL training
+# REWARD_API:      reward server URL, e.g., http://127.0.0.1:8000/get_reward
+```
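The sketch below is not part of this commit. It shows one way the two steps above might be chained so that training starts only after something is listening at the reward server address. The helper names `reward_host_port` and `wait_for_port` are hypothetical, and the check relies on Bash's `/dev/tcp` redirection, so it assumes Bash rather than a plain POSIX shell. An open port only indicates that a process is listening, not that the reward model has finished loading.

```bash
# Hypothetical helpers, not provided by the repository.

# Print "host port" for a URL such as http://127.0.0.1:8000/get_reward.
reward_host_port() {
  local url="$1"
  local hostport="${url#*://}"   # drop the scheme
  hostport="${hostport%%/*}"     # drop the path
  local host="${hostport%%:*}"
  local port="${hostport##*:}"
  # No explicit port in the URL: assume port 80.
  if [ "$port" = "$hostport" ]; then
    port=80
  fi
  echo "$host $port"
}

# Try to open a TCP connection once per second, up to $3 attempts (default 30).
# Returns 0 as soon as a connection succeeds, 1 if every attempt fails.
wait_for_port() {
  local host="$1" port="$2" attempts="${3:-30}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # The subshell opens the connection and closes it again on exit.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# Example usage (placeholders as in the commands above):
#   bash reward_server/launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC} &
#   set -- $(reward_host_port "http://127.0.0.1:8000/get_reward")
#   wait_for_port "$1" "$2" 120 && bash train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} http://127.0.0.1:8000/get_reward
```

Polling the port keeps the sketch independent of the reward server's request format, which the README does not describe.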
 ## Citation
 If you use this model, please cite: