StableLM-1.6B Fine-Tuned & Aligned (SFT + GRPO)

This repository contains a LoRA-based aligned variant of StableLM-2-1.6B, trained through a complete post-pretraining pipeline consisting of Supervised Fine-Tuning (SFT) followed by GRPO-based alignment.

This is a learning / portfolio model demonstrating end-to-end post-pretraining, not a production-ready system.


Model Overview

  • Base model: stabilityai/stablelm-2-1_6b
  • Model type: PEFT / LoRA adapter
  • Parameters (base): ~1.64B
  • Trainable parameters (LoRA): ~30–60M (depending on adapter stack)
  • License: Apache 2.0

⚠️ Important: This repository contains LoRA adapter weights only (adapter_model.safetensors). You must load it on top of the base StableLM model.
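The ~30–60M trainable-parameter range can be sanity-checked with a back-of-the-envelope count: each LoRA-adapted weight matrix W (d_out × d_in) gains two low-rank factors, contributing r · (d_in + d_out) trainable parameters. The sketch below is illustrative only — the rank, the module list, and the model dimensions (hidden size 2048, intermediate size 5632, 24 layers) are assumed Llama-style shapes, not values read from this repository's adapter config.

```python
# Back-of-the-envelope LoRA trainable-parameter count.
# Each adapted weight W (d_out x d_in) gains factors A (r x d_in) and
# B (d_out x r), i.e. r * (d_in + d_out) trainable parameters per module.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

# Assumed StableLM-2-1.6B shapes (illustrative, not read from the repo config)
hidden, intermediate, layers, r = 2048, 5632, 24, 32

per_layer = (
    4 * lora_params(hidden, hidden, r)          # q/k/v/o attention projections
    + 2 * lora_params(hidden, intermediate, r)  # gate/up MLP projections
    + 1 * lora_params(intermediate, hidden, r)  # down MLP projection
)
total = layers * per_layer
print(f"~{total / 1e6:.1f}M trainable LoRA parameters")
```

With these assumed shapes the count lands around 30M, the low end of the quoted range; a larger rank or grouped-query attention changes the exact figure.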


Training Pipeline

Base StableLM-2-1.6B
        │
        ▼
Supervised Fine-Tuning (LoRA, UltraChat)
        │
        ▼
GRPO Alignment (Reward-guided, PKU-SafeRLHF)
        │
        ▼
Aligned LoRA Adapter (this repo)

Stage 1: Supervised Fine-Tuning (SFT)

  • Objective: Instruction-following capability

  • Method: LoRA-based fine-tuning

  • Dataset: HuggingFaceH4/ultrachat_200k

  • LoRA configuration:

    • Applied to attention and MLP projections
    • Parameter-efficient (base model frozen)
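For reference, an adapter stack of this shape might be declared as follows. Every value here (rank, alpha, dropout, exact module names) is hypothetical, chosen to illustrate the "attention and MLP projections" setup described above, not recovered from this repository's adapter_config.json.

```python
# Hypothetical LoRA hyperparameters mirroring the setup described above.
# The keys correspond to peft.LoraConfig arguments; the module names assume
# StableLM-2 follows Llama-style projection naming.
lora_config = {
    "r": 32,              # low-rank dimension
    "lora_alpha": 64,     # scaling factor (alpha / r scales the update)
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
}
```

Because the base model stays frozen, only the low-rank factors injected into these modules receive gradients.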

Stage 2: GRPO Alignment

  • Objective: Safety and preference alignment
  • Method: Group Relative Policy Optimization (GRPO)
  • Reward model: OpenAssistant/reward-model-deberta-v3-large
  • Dataset: PKU-Alignment/PKU-SafeRLHF
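GRPO's defining step is computing advantages relative to a group of completions sampled for the same prompt, normalized within the group, rather than via a learned value function. A minimal sketch of that normalization (variable names are my own, not the trainer's):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each completion's reward against its sampling group:
    A_i = (r_i - mean(group)) / (std(group) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored by the reward model
adv = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
```

Completions scored above their group mean get positive advantages (their tokens are reinforced); those below get negative ones, with no critic network needed.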

Training Metrics

The following metrics were tracked during GRPO alignment:

  • Train loss: remained stable near zero, indicating controlled policy updates
  • Reward: increased steadily before saturating, consistent with the policy learning to score higher under the reward model
  • Entropy: decreased significantly, reflecting reduced policy randomness and convergence
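The entropy metric is the entropy of the policy's next-token distribution, averaged over sampled tokens; falling entropy means the policy is becoming more deterministic about its outputs. A toy illustration of the quantity being tracked (pure Python, not the trainer's actual logging code):

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy H = -sum(p * log p) of one next-token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution (late in training) has lower entropy than a flat one (early)
early = token_entropy([0.25, 0.25, 0.25, 0.25])  # uniform over 4 tokens
late = token_entropy([0.85, 0.05, 0.05, 0.05])   # mass concentrated on one token
```

Some residual entropy is desirable: a policy collapsing to zero entropy has stopped exploring.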

GitHub: https://github.com/kunjcr2/end-to-end-post-pretraining

📊 Weights & Biases run:


https://wandb.ai/kunjcr2-dreamable/huggingface/runs/qds5kqi2?nw=nwuserkunjcr2


How to Load the Model

This model cannot be loaded standalone.

Correct loading

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-2-1_6b",
    trust_remote_code=True
)

model = PeftModel.from_pretrained(
    base_model,
    "kunjcr2/stablelm-1.6b-finetuned-aligned"
)

# Load the tokenizer from this repository, not the base model,
# to stay consistent with the tokenizer used during training.
tokenizer = AutoTokenizer.from_pretrained("kunjcr2/stablelm-1.6b-finetuned-aligned")


Intended Use

  • Demonstrating an end-to-end LLM post-pretraining pipeline
  • Research, experimentation, and educational purposes
  • Portfolio / reference implementation for SFT + alignment workflows

Limitations

  • Not production-grade
  • Limited training steps and data scale
  • Alignment quality depends on reward model biases
  • No extensive safety evaluation beyond training metrics

Citation

If you reference this work, please cite the base model and datasets:

  • StableLM-2-1.6B — Stability AI
  • UltraChat 200K — Hugging Face H4
  • PKU-SafeRLHF — PKU Alignment
  • OpenAssistant Reward Model

License

Apache 2.0. See the LICENSE file for details.
