Model Card for xinyuema/llm-course-hw2-ppo

This model is HuggingFaceTB/SmolLM-135M-Instruct aligned with PPO (Proximal Policy Optimization), using a reward model that we trained ourselves beforehand on the Human-Like-DPO-Dataset.
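The sketch below shows how a reward model usually enters PPO training for language models, following the common RLHF recipe. The reward model's scalar score for a full response is credited to the response's last token, and every token pays a KL penalty for drifting away from the frozen starting policy. This card does not say whether a KL penalty was used or with what coefficient, so the function `shaped_rewards` and the default `kl_coef=0.05` are hypothetical.

```python
from typing import List


def shaped_rewards(score: float,
                   policy_logprobs: List[float],
                   ref_logprobs: List[float],
                   kl_coef: float = 0.05) -> List[float]:
    """Per-token rewards for one sampled response (hypothetical helper).

    Assumption: the common RLHF-PPO recipe, where each token is penalized by
    kl_coef * (log pi(token) - log pi_ref(token)) and the reward model's
    scalar score is added to the final token only.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    if not policy_logprobs:
        raise ValueError("response must contain at least one token")
    # KL penalty against the frozen reference (starting) model, token by token
    rewards = [-kl_coef * (lp - ref) for lp, ref in zip(policy_logprobs, ref_logprobs)]
    # The reward model judges the whole response, so its score lands on the last token
    rewards[-1] += score
    return rewards
```

For example, `shaped_rewards(1.0, [-1.0, -2.0], [-1.5, -2.0], kl_coef=0.1)` returns approximately `[-0.05, 1.0]`: the first token is penalized because the policy assigns it more probability than the reference model does, and the second token carries the reward model's score.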

Model Details

In this stage we trained a policy model, initialized from the base model HuggingFaceTB/SmolLM-135M-Instruct, together with a value model trained at the last block.
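The interplay between the policy and value models can be sketched as follows, assuming the usual PPO setup for language models. The value model produces one scalar estimate per generated token, Generalized Advantage Estimation (GAE) turns rewards and values into advantages, and the policy is updated with the clipped surrogate objective while the value model regresses onto the returns. The card does not describe how the value model is attached to the last block or which hyperparameters were used; the function names and the defaults `gamma=1.0`, `lam=0.95`, and `clip_eps=0.2` are hypothetical.

```python
import math
from typing import List, Tuple


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> Tuple[List[float], List[float]]:
    """Generalized Advantage Estimation over one generated response.

    values[t] is the (assumed) value model's estimate at token t; the state
    after the final token is terminal, so its value is taken as 0.
    Returns (advantages, returns) with returns[t] = advantages[t] + values[t].
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have equal length")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running              # discounted sum of TD errors
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def ppo_losses(new_logprobs: List[float], old_logprobs: List[float],
               advantages: List[float], new_values: List[float],
               returns: List[float], clip_eps: float = 0.2) -> Tuple[float, float]:
    """Mean clipped policy loss and mean squared-error value loss over tokens."""
    n = len(advantages)
    if n == 0:
        raise ValueError("need at least one token")
    policy_loss = 0.0
    for new, old, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new - old)  # pi_new(token) / pi_old(token)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) objective, negated so it can be minimized
        policy_loss -= min(ratio * adv, clipped * adv)
    # Value model regresses onto the GAE returns (unclipped here for brevity)
    value_loss = sum(0.5 * (v - r) ** 2 for v, r in zip(new_values, returns))
    return policy_loss / n, value_loss / n
```

Clipping the probability ratio to [1 - clip_eps, 1 + clip_eps] and taking the minimum keeps each update conservative, so one batch cannot push the policy far from the policy that generated the samples. Some implementations also clip the value loss; this sketch leaves it unclipped.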

Examples

======== BEFORE TUNING ========

user

What's your morning routine like?

assistant

I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits, but I can provide you with a general idea of what a morning routine might look like. Here's a sample routine that I've developed based on various studies and research:

**Morning Routine (10-15 minutes)**

1. **Hydrate**: Drink a full glass of water or a herbal tea (e.g., chamomile, peppermint) to start the day.
2. **Eat a nutritious breakfast**: Prepare a healthy breakfast, such as oatmeal with fruit, scrambled eggs with spinach, or Greek yogurt with berries.
3. **Get some morning sunlight**: Spend 10-15 minutes outside in natural light to help regulate your circadian rhythms.
4. **Take a few deep breaths**: Inhale for 1-2 minutes, hold for 2-3 minutes, and exhale for 2-3 minutes.
5. **Stretch or move**: Engage in some light stretching or movement to get your blood flowing and your muscles moving.

**Morning Routine (15-30 minutes)**
  1. Get some morning sunlight: Spend 10-1

======== AFTER TUNING ========

user

What's your morning routine like?

assistant

user What's your morning routine like? assistant What a great question! I'd be happy to share my morning routine with you. Here's a sample morning routine that I've developed over the years:

Morning Routine (7:00 AM - 10:00 AM)

Breakfast

  • 7:00 AM - 8:00 AM: Wake up and stretch

Model size: 0.1B params
Tensor type: F32 (Safetensors)

Model tree

Model ID: xinyuema/llm-course-hw2-ppo
Fine-tuned from: HuggingFaceTB/SmolLM-135M-Instruct
Training dataset: Human-Like-DPO-Dataset