payelb/UltraFeedback_openbmb_Llama-3.2-1B_aligned_with_semantic_MARS_deberta_RM

Base model: meta-llama/Llama-3.2-1B-Instruct

Alignment dataset: openbmb/UltraFeedback

Reward model: payelb/UltraFeedback_openbmb_reward-model-deberta-v3-base_1k_fixed_MARS_semantic_refined

Method: PPO alignment with LoRA adapters.

Reward model type: semantic-MARS DeBERTa-v3-base reward model.

Matched PPO setup:

TOTAL_PPO_STEPS: 250
PPO_EPOCHS: 2
LR: 1e-05
Batch size: 16
Mini-batch size: 4
Gradient accumulation: 4
MIN_NEW_TOKENS: 32
MAX_NEW_TOKENS: 96
USE_REWARD_NORMALIZATION: False
USE_EXPLICIT_KL_CONFIG: False
Generation sampling: do_sample=True, top_p=0.9, temperature=0.8
KL fix: eos_token_id=None, min_length=-1, pad_token_id=policy_tokenizer.pad_token_id
LoRA enabled
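
The settings above can be collected into a small Python sketch for anyone trying to reproduce a comparable run. It is only an illustration: the names `PPOSettings` and `build_generation_kwargs` are hypothetical, the training script itself is not part of this card, and it is an assumption that the MIN_NEW_TOKENS / MAX_NEW_TOKENS values were passed to generation together with the sampling and "KL fix" arguments. The pad token id of 0 is a placeholder standing in for `policy_tokenizer.pad_token_id`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOSettings:
    """Hypothetical container mirroring the values listed in this card."""
    total_ppo_steps: int = 250
    ppo_epochs: int = 2
    learning_rate: float = 1e-5
    batch_size: int = 16
    mini_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    min_new_tokens: int = 32
    max_new_tokens: int = 96
    use_reward_normalization: bool = False
    use_explicit_kl_config: bool = False


def build_generation_kwargs(settings: PPOSettings, pad_token_id: int) -> dict:
    """Assemble sampling arguments plus the 'KL fix' arguments from the card.

    Assumption: the length limits are passed alongside the sampling settings.
    """
    return {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "min_new_tokens": settings.min_new_tokens,
        "max_new_tokens": settings.max_new_tokens,
        # "KL fix" entries, copied from the card as listed.
        "eos_token_id": None,
        "min_length": -1,
        "pad_token_id": pad_token_id,
    }


settings = PPOSettings()

# Samples per optimizer update: mini-batch size times accumulation steps.
backward_batch = settings.mini_batch_size * settings.gradient_accumulation_steps

# Upper bound on sampled responses, assuming one full batch per PPO step.
responses_total = settings.total_ppo_steps * settings.batch_size

# Placeholder pad id; in practice use policy_tokenizer.pad_token_id.
gen_kwargs = build_generation_kwargs(settings, pad_token_id=0)
```

With these values, the mini-batch size times the gradient accumulation steps (4 × 4) equals the batch size of 16, so each batch corresponds to one accumulated optimizer update per PPO epoch under this reading of the settings.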
