StableLM-1.6B Fine-Tuned & Aligned (SFT + GRPO)

This repository contains a LoRA-based aligned variant of StableLM-2-1.6B, trained through a complete post-pretraining pipeline consisting of Supervised Fine-Tuning (SFT) followed by GRPO-based alignment.

This is a learning / portfolio model demonstrating end-to-end post-pretraining, not a production-ready system.


Model Overview

  • Base model: stabilityai/stablelm-2-1_6b
  • Model type: PEFT / LoRA adapter
  • Parameters (base): ~1.64B
  • Trainable parameters (LoRA): ~30–60M (depending on adapter stack)
  • License: Apache 2.0

⚠️ Important: This repository contains LoRA adapter weights only (adapter_model.safetensors). You must load it on top of the base StableLM model.
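The ~30–60M trainable-parameter range can be sanity-checked with a back-of-the-envelope count: each LoRA-adapted weight matrix W (d_out × d_in) gains two low-rank factors, contributing r · (d_in + d_out) trainable parameters. The sketch below is illustrative only — the rank, the module list, and the model dimensions (hidden size 2048, intermediate size 5632, 24 layers) are assumed Llama-style shapes, not values read from this repository's adapter config.

```python
# Back-of-the-envelope LoRA trainable-parameter count.
# Each adapted weight W (d_out x d_in) gains factors A (r x d_in) and
# B (d_out x r), i.e. r * (d_in + d_out) trainable parameters per module.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

# Assumed StableLM-2-1.6B shapes (illustrative, not read from the repo config)
hidden, intermediate, layers, r = 2048, 5632, 24, 32

per_layer = (
    4 * lora_params(hidden, hidden, r)          # q/k/v/o attention projections
    + 2 * lora_params(hidden, intermediate, r)  # gate/up MLP projections
    + 1 * lora_params(intermediate, hidden, r)  # down MLP projection
)
total = layers * per_layer
print(f"~{total / 1e6:.1f}M trainable LoRA parameters")
```

With these assumed shapes the count lands around 30M, the low end of the quoted range; a larger rank or grouped-query attention changes the exact figure.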


Training Pipeline

Base StableLM-2-1.6B
        │
        ▼
Supervised Fine-Tuning (LoRA, UltraChat)
        │
        ▼
GRPO Alignment (Reward-guided, PKU-SafeRLHF)
        │
        ▼
Aligned LoRA Adapter (this repo)

Stage 1: Supervised Fine-Tuning (SFT)

  • Objective: Instruction-following capability

  • Method: LoRA-based fine-tuning

  • Dataset: HuggingFaceH4/ultrachat_200k

  • LoRA configuration:

    • Applied to attention and MLP projections
    • Parameter-efficient (base model frozen)
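For reference, an adapter stack of this shape might be declared as follows. Every value here (rank, alpha, dropout, exact module names) is hypothetical, chosen to illustrate the "attention and MLP projections" setup described above, not recovered from this repository's adapter_config.json.

```python
# Hypothetical LoRA hyperparameters mirroring the setup described above.
# The keys correspond to peft.LoraConfig arguments; the module names assume
# StableLM-2 follows Llama-style projection naming.
lora_config = {
    "r": 32,              # low-rank dimension
    "lora_alpha": 64,     # scaling factor (alpha / r scales the update)
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
}
```

Because the base model stays frozen, only the low-rank factors injected into these modules receive gradients.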

Stage 2: GRPO Alignment

  • Objective: Safety and preference alignment
  • Method: Group Relative Policy Optimization (GRPO)
  • Reward model: OpenAssistant/reward-model-deberta-v3-large
  • Dataset: PKU-Alignment/PKU-SafeRLHF
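GRPO's defining step is computing advantages relative to a group of completions sampled for the same prompt, normalized within the group, rather than via a learned value function. A minimal sketch of that normalization (variable names are my own, not the trainer's):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each completion's reward against its sampling group:
    A_i = (r_i - mean(group)) / (std(group) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored by the reward model
adv = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
```

Completions scored above their group mean get positive advantages (their tokens are reinforced); those below get negative ones, with no critic network needed.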

Training Metrics

The following metrics were tracked during GRPO alignment:

  • Train loss: remained stable near zero, indicating controlled policy updates
  • Reward: increased steadily before saturating, consistent with the policy learning to score higher under the reward model
  • Entropy: decreased significantly, reflecting reduced policy randomness and convergence
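The entropy metric is the entropy of the policy's next-token distribution, averaged over sampled tokens; falling entropy means the policy is becoming more deterministic about its outputs. A toy illustration of the quantity being tracked (pure Python, not the trainer's actual logging code):

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy H = -sum(p * log p) of one next-token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution (late in training) has lower entropy than a flat one (early)
early = token_entropy([0.25, 0.25, 0.25, 0.25])  # uniform over 4 tokens
late = token_entropy([0.85, 0.05, 0.05, 0.05])   # mass concentrated on one token
```

Some residual entropy is desirable: a policy collapsing to zero entropy has stopped exploring.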

GitHub: https://github.com/kunjcr2/end-to-end-post-pretraining

📊 Weights & Biases run:


https://wandb.ai/kunjcr2-dreamable/huggingface/runs/qds5kqi2?nw=nwuserkunjcr2


How to Load the Model

This model cannot be loaded standalone.

Correct loading

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-2-1_6b",
    trust_remote_code=True
)

model = PeftModel.from_pretrained(
    base_model,
    "kunjcr2/stablelm-1.6b-finetuned-aligned"
)

# Load the tokenizer from this repository, not the base model,
# to stay consistent with the tokenizer used during training.
tokenizer = AutoTokenizer.from_pretrained("kunjcr2/stablelm-1.6b-finetuned-aligned")


Intended Use

  • Demonstrating an end-to-end LLM post-pretraining pipeline
  • Research, experimentation, and educational purposes
  • Portfolio / reference implementation for SFT + alignment workflows

Limitations

  • Not production-grade
  • Limited training steps and data scale
  • Alignment quality depends on reward model biases
  • No extensive safety evaluation beyond training metrics

Citation

If you reference this work, please cite the base model and datasets:

  • StableLM-2-1.6B — Stability AI
  • UltraChat 200K — Hugging Face H4
  • PKU-SafeRLHF — PKU Alignment
  • OpenAssistant Reward Model

License

Apache 2.0. See the LICENSE file for details.
