Tags: Text Classification · Transformers · Safetensors · English · Chinese · qwen2 · feature-extraction · reward model · custom_code · text-embeddings-inference
Instructions to use Qwen/Qwen2.5-Math-PRM-7B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Qwen/Qwen2.5-Math-PRM-7B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Qwen/Qwen2.5-Math-PRM-7B", trust_remote_code=True)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-PRM-7B", trust_remote_code=True)
model = AutoModel.from_pretrained("Qwen/Qwen2.5-Math-PRM-7B", trust_remote_code=True)
```

- Notebooks
- Google Colab
- Kaggle
If the response length exceeds 4096, is a sliding window used, or is it simply truncated?
#6
by ShelterW - opened
```python
step_reward = make_step_rewards(logits, token_masks)
product_step_reward = 1.0
for reward in step_reward:
    product_step_reward *= reward
```
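`make_step_rewards` is not defined anywhere in this thread. As an assumption based on how PRM-style heads are typically read out, the minimal pure-Python sketch below shows one plausible implementation: apply a two-class softmax to the logits, then keep the positive-class probability at each position where the step-separator mask is set, per sequence. All names and shapes here are hypothetical, not the model's confirmed API.

```python
import math

def softmax2(pair):
    # Numerically stable softmax over a two-class logit pair.
    m = max(pair)
    exps = [math.exp(x - m) for x in pair]
    s = sum(exps)
    return [e / s for e in exps]

def make_step_rewards(logits, token_masks):
    """Hypothetical sketch: for each sequence, return the positive-class
    probability at every step-separator position (mask value is True).

    logits:      [batch][seq_len][2] two-class logits
    token_masks: [batch][seq_len] booleans marking step separators
    """
    rewards = []
    for seq_logits, seq_mask in zip(logits, token_masks):
        seq_rewards = [softmax2(l)[1]
                       for l, m in zip(seq_logits, seq_mask) if m]
        rewards.append(seq_rewards)
    return rewards
```

Note that under this reading the function returns one list of step scores *per sequence*, so the product loop above would be taken over a single sequence's score list.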
According to the paper, the score of each candidate response is the product of the scores of its individual steps. How, then, should responses with fewer steps be weighed against responses with more steps for the same question?
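To make the length-bias concern concrete: since each step score is at most 1, a product over more steps tends to be smaller even when every step is good. The sketch below illustrates this, and shows a geometric mean as one *possible* length normalization; the paper does not prescribe this, so treat it purely as an illustrative assumption.

```python
import math

# Hypothetical step scores for two responses to the same question.
step_rewards_short = [0.9, 0.8]            # 2-step response
step_rewards_long = [0.9, 0.9, 0.9, 0.9]   # 4-step response, every step strong

# Product aggregation (as in the paper): more steps -> smaller product.
prod_short = math.prod(step_rewards_short)   # 0.72
prod_long = math.prod(step_rewards_long)     # 0.9**4 = 0.6561 < 0.72

# One possible length normalization (an assumption, not the paper's method):
# the geometric mean divides out the number of steps.
geo_short = prod_short ** (1 / len(step_rewards_short))   # ~0.849
geo_long = prod_long ** (1 / len(step_rewards_long))      # 0.9
```

Under the geometric mean the longer all-strong response scores higher, whereas the raw product ranks it lower.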
How should PRM@8 be combined with QwQ to obtain the best performance?
Same problem here. The sliding window is not enabled in the config, yet when I fed in an input of about 22k tokens there were no errors or warnings.