# StableLM-1.6B Fine-Tuned & Aligned (SFT + GRPO)
This repository contains a LoRA-based aligned variant of StableLM-2-1.6B, trained through a complete post-pretraining pipeline consisting of Supervised Fine-Tuning (SFT) followed by GRPO-based alignment.
> This is a learning / portfolio model demonstrating end-to-end post-pretraining, not a production-ready system.
## Model Overview

- Base model: stabilityai/stablelm-2-1_6b
- Model type: PEFT / LoRA adapter
- Parameters (base): ~1.64B
- Trainable parameters (LoRA): ~30–60M (depending on the adapter stack)
- License: Apache 2.0
> ⚠️ Important: This repository contains LoRA adapter weights only (`adapter_model.safetensors`). You must load them on top of the base StableLM model.
## Training Pipeline

```text
Base StableLM-2-1.6B
        │
        ▼
Supervised Fine-Tuning (LoRA, UltraChat)
        │
        ▼
GRPO Alignment (reward-guided, PKU-SafeRLHF)
        │
        ▼
Aligned LoRA Adapter (this repo)
```
### Stage 1: Supervised Fine-Tuning (SFT)

- Objective: instruction-following capability
- Method: LoRA-based fine-tuning
- Dataset: HuggingFaceH4/ultrachat_200k

LoRA configuration:

- Applied to attention and MLP projections
- Parameter-efficient (base model frozen)
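The ~30–60M trainable-parameter range can be sanity-checked with a quick back-of-the-envelope count. This is a sketch under assumed dimensions (hidden size 2048, 24 layers, and an MLP intermediate size of 5632 for StableLM-2-1.6B; ranks 32 and 64 are illustrative, not the published adapter config):

```python
def lora_param_count(rank, hidden=2048, intermediate=5632, layers=24):
    """Rough LoRA parameter count with adapters on all attention and MLP projections.

    A LoRA pair for a (d_in x d_out) weight adds rank * (d_in + d_out) parameters.
    """
    # Attention: q, k, v, o projections, each hidden -> hidden.
    attn = 4 * rank * (hidden + hidden)
    # MLP: gate and up (hidden -> intermediate) plus down (intermediate -> hidden).
    mlp = 3 * rank * (hidden + intermediate)
    return layers * (attn + mlp)

for r in (32, 64):
    print(r, lora_param_count(r) / 1e6)  # roughly 30M and 61M parameters
```

With these assumed shapes, rank 32 lands near the low end of the stated range and rank 64 near the high end, which is consistent with "depending on the adapter stack."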
### Stage 2: GRPO Alignment

- Objective: safety and preference alignment
- Method: Group Relative Policy Optimization (GRPO)
- Reward model: OpenAssistant/reward-model-deberta-v3-large
- Dataset: PKU-Alignment/PKU-SafeRLHF
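The core idea of GRPO is that, instead of a learned value baseline, each prompt gets a group of sampled completions and the reward scores are normalized within that group. A minimal sketch of the group-relative advantage (the reward values below are made up; in this pipeline the scores come from OpenAssistant/reward-model-deberta-v3-large):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize reward-model scores within one prompt's group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt, scored by a reward model.
advantages = group_relative_advantages([0.2, 1.1, -0.5, 0.6])
# Completions above the group mean get positive advantages and are reinforced;
# those below get negative advantages and are discouraged.
```

Because the baseline is the group mean, the advantages within each group sum to zero, which keeps the policy updates small and controlled.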
## Training Metrics

The following metrics were tracked during GRPO alignment:

- Train loss: stable around zero, indicating controlled policy updates
- Reward: increases steadily and saturates, showing successful alignment
- Entropy: decreases significantly, reflecting reduced policy randomness and convergence
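The entropy curve tracks the average uncertainty of the policy's next-token distribution, H(p) = -Σ pᵢ log pᵢ; as the policy commits to preferred responses, this value falls. A toy illustration (the distributions are invented for demonstration):

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

early = [0.25, 0.25, 0.25, 0.25]  # near-uniform policy early in training
late = [0.90, 0.05, 0.03, 0.02]   # peaked policy after alignment

print(entropy(early))  # ln(4) ≈ 1.386 nats, the maximum for 4 outcomes
print(entropy(late))   # lower: the policy has become more deterministic
```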
GitHub: https://github.com/kunjcr2/end-to-end-post-pretraining

📊 Weights & Biases run: https://wandb.ai/kunjcr2-dreamable/huggingface/runs/qds5kqi2?nw=nwuserkunjcr2
## How to Load the Model

This model cannot be loaded standalone; the adapter must be applied on top of the base model.

### Correct loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-2-1_6b",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(
    base_model,
    "kunjcr2/stablelm-1.6b-finetuned-aligned",
)

# Load the tokenizer shipped with this repository.
tokenizer = AutoTokenizer.from_pretrained("kunjcr2/stablelm-1.6b-finetuned-aligned")
```

The tokenizer included in this repository should be used to ensure consistency with training.
## Intended Use

- Demonstrating an end-to-end LLM post-pretraining pipeline
- Research, experimentation, and educational purposes
- Portfolio / reference implementation for SFT + alignment workflows
## Limitations

- Not production-grade
- Limited training steps and data scale
- Alignment quality depends on reward model biases
- No extensive safety evaluation beyond training metrics
## Citation

If you reference this work, please cite the base model and datasets:
- StableLM-2-1.6B — Stability AI
- UltraChat 200K — Hugging Face H4
- PKU-SafeRLHF — PKU Alignment
- OpenAssistant Reward Model
## License

Apache 2.0. See the LICENSE file for details.