Tags: Text Classification · Transformers · Safetensors · qwen2 · text-generation · text-embeddings-inference
sarosavo committed (verified)
Commit 42eeb37 · Parent(s): b6e8e5f

Update README.md

Files changed (1): README.md +27 -0
README.md CHANGED
@@ -93,6 +93,33 @@ Inputting the question, its ground-truth reference, and the response to be evalu
 
 > print("Model judgement: ",judgement)
 > ```
 
+ ## Deploy reward model for RLVR training
+
+ ### Launch a remote reward server with vLLM
+
+ The script below launches a reward server at http://127.0.0.1:8000/get_reward:
+
+ ```bash
+ bash reward_server/launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}
+
+ # MODEL_PATH: path to our reward model
+ # ANSWER_PATH: path to the training data
+ # METRIC: scoring metric, either greedy or prob
+ ```
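Once the server is up, the trainer scores responses by POSTing to this endpoint. A minimal client sketch, assuming a JSON payload; the field name `query` and the exact request/response schema are assumptions here, so check the `reward_server` code for the real interface:

```python
import json
import urllib.request

def build_reward_request(prompts, responses):
    # Assumed schema: one "query" string per prompt+response pair.
    return {"query": [p + r for p, r in zip(prompts, responses)]}

def get_reward(payload, url="http://127.0.0.1:8000/get_reward"):
    # POST the JSON payload and return the parsed reward scores.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

payload = build_reward_request(["What is 2+2?"], ["4"])
```

Calling `get_reward(payload)` then returns the scores the training loop consumes as rewards.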
+
+ ### Start training
+
+ ```bash
+ bash train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}
+
+ # METHOD: advantage estimator, e.g., reinforce_baseline, reinforce, rloo
+ # PRETRAIN_PATH: path to the pretrained model, e.g., Qwen2.5-7B
+ # DATA_PATH: path to the QA data on which to perform RL reasoning
+ # REWARD_API: reward server endpoint, e.g., http://127.0.0.1:8000/get_reward
+ ```
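Of the METHOD options above, RLOO can be stated compactly: each sampled response is baselined against the mean reward of the other samples for the same prompt. A minimal illustrative sketch of that estimator (not the code `train.sh` actually runs):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for K sampled responses to one prompt.

    Each sample's baseline is the mean reward of the other K-1 samples.
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

For example, rewards `[1.0, 0.0, 0.5, 0.5]` yield advantages that sum to zero, with the best sample pushed up and the worst pushed down.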
+
+
 ## Citation
 
 If you use this model, please cite: