---
license: apache-2.0
tags:
- fairsteer
- bias-detection
- debiasing
- tinyllama
library_name: pytorch
---
# BAD Classifier for FairSteer - TinyLlama-1.1B

This is a Biased Activation Detection (BAD) classifier trained for the FairSteer framework.
## Model Details

- **Base Model**: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- **Task**: Binary classification (biased vs. unbiased activations)
- **Training Data**: BBQ dataset with balanced sampling
- **Best Layer**: 13
- **Validation Accuracy**: 69.83%
- **Architecture**: Simple linear classifier (FairSteer-aligned)
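The "simple linear classifier" above maps one hidden-state vector to a single bias probability. A minimal sketch, assuming TinyLlama-1.1B's hidden size of 2048; the class name `BADClassifier` and its exact layout are illustrative, not the shipped module:

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 2048  # hidden size of TinyLlama-1.1B


class BADClassifier(nn.Module):
    """Single linear layer over a hidden-state activation vector."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        # Probability that the activation is unbiased, in [0, 1].
        return torch.sigmoid(self.linear(activation)).squeeze(-1)


clf = BADClassifier()
probs = clf(torch.randn(4, HIDDEN_SIZE))  # batch of 4 activation vectors
print(probs.shape)  # torch.Size([4])
```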
## Usage

```python
import json
import torch
import torch.nn as nn

# Load the classifier config (input dimension, best layer, etc.)
with open("config.json", "r") as f:
    config = json.load(f)

# pytorch_model.bin typically stores a state dict, not a pickled module,
# so rebuild the linear head before loading weights. "input_dim" is an
# assumed key name; check config.json for the actual fields.
classifier = nn.Linear(config["input_dim"], 1)
classifier.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
classifier.eval()

# Input: activation vector from layer 13 of TinyLlama-1.1B
# Output: probability that the activation is unbiased
activation = torch.randn(config["input_dim"])  # stand-in for a real activation
prob_unbiased = torch.sigmoid(classifier(activation)).item()
```
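At inference time, a classifier like this can act as a gate: only when a layer-13 activation is flagged as biased is a debiasing steering vector added before the forward pass continues. A minimal sketch with dummy tensors; the threshold, steering vector `dsv`, and scale `alpha` are illustrative assumptions, not values shipped with this model:

```python
import torch
import torch.nn as nn

hidden_size = 2048
classifier = nn.Linear(hidden_size, 1)  # stands in for the trained BAD classifier
dsv = torch.randn(hidden_size)          # stands in for a learned debiasing steering vector
alpha = 1.0                             # steering strength (assumption)


def steer(activation: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Add the steering vector only when the activation looks biased."""
    prob_unbiased = torch.sigmoid(classifier(activation)).item()
    if prob_unbiased < threshold:       # classifier says "biased"
        return activation + alpha * dsv
    return activation                   # leave unbiased activations untouched


h = torch.randn(hidden_size)            # dummy layer-13 activation
h_steered = steer(h)
print(h_steered.shape)  # torch.Size([2048])
```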
## Training Details

- **Samples**: 24,276 balanced samples
- **Class Distribution**: 50% BIASED, 50% UNBIASED
- **Training Method**: FairSteer-aligned labeling
- **Training Date**: 2025-11-16
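A training loop consistent with the details above can be sketched as follows, using synthetic activations in place of the 24,276 BBQ-derived samples; the sample count, learning rate, and epoch count here are illustrative assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, n = 2048, 512  # n stands in for the 24,276 real samples

# Balanced synthetic data: 50% BIASED (label 0), 50% UNBIASED (label 1)
x = torch.randn(n, hidden_size)
y = torch.cat([torch.zeros(n // 2), torch.ones(n // 2)])

classifier = nn.Linear(hidden_size, 1)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # numerically stable sigmoid + BCE

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(classifier(x).squeeze(-1), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```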
## Citation

If you use this model, please cite the FairSteer paper:

```bibtex
@article{fairsteer,
  title={FairSteer: Inference-Time Debiasing for Large Language Models},
  author={[Authors]},
  journal={[Journal]},
  year={2024}
}
```
## License

Apache 2.0