---
license: apache-2.0
tags:
- fairsteer
- bias-detection
- debiasing
- tinyllama
library_name: pytorch
---
# BAD Classifier for FairSteer - TinyLlama-1.1B
This is a Biased Activation Detection (BAD) classifier trained for the FairSteer framework.
## Model Details
- **Base Model**: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- **Task**: Binary classification (Biased vs Unbiased activations)
- **Training Data**: BBQ dataset with balanced sampling
- **Best Layer**: 13
- **Validation Accuracy**: 69.83%
- **Architecture**: Simple linear classifier (FairSteer-aligned)
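To make the "simple linear classifier" entry concrete, here is a minimal sketch of a linear probe over activations. The class name `LinearProbe`, the input width of 2048 (the hidden size of TinyLlama-1.1B), and the single-logit output are assumptions for illustration; the layout of the actual checkpoint may differ, so check `config.json` before relying on it.

```python
import torch
from torch import nn


class LinearProbe(nn.Module):
    """Hypothetical linear probe: one affine map from an activation to a logit."""

    def __init__(self, hidden_size: int = 2048):  # assumed: TinyLlama-1.1B hidden size
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        # Sigmoid maps the logit to (0, 1); read here as P(unbiased).
        return torch.sigmoid(self.linear(activation)).squeeze(-1)


probe = LinearProbe()
probe.eval()
with torch.no_grad():
    # Random stand-in activations, batch of 4
    probs = probe(torch.randn(4, 2048))
```

With untrained weights the outputs are meaningless; the sketch only shows the shapes involved.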
## Usage
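```python
import json

import torch

# Load the classifier checkpoint on CPU. Depending on how it was saved,
# this is either a state dict or a full module.
model = torch.load("pytorch_model.bin", map_location="cpu")

# Load the accompanying configuration
with open("config.json", "r") as f:
    config = json.load(f)

# Use for bias detection
# Input: activation vector from LLM layer 13
# Output: probability of being unbiased
```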
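The classifier expects an activation vector from layer 13, so the following sketch shows one way such a vector could be selected from per-layer hidden states. The helper `last_token_activation` is hypothetical, and two choices in it are assumptions rather than documented behavior of this model: that the final token position is used, and that "layer 13" corresponds to index 13 of a `hidden_states` tuple (as returned by a `transformers` forward pass with `output_hidden_states=True`, where index 0 is the embedding output). Random tensors stand in for real model outputs.

```python
import torch


def last_token_activation(hidden_states, layer: int = 13) -> torch.Tensor:
    """Pick the final-token vector from one layer's hidden states.

    hidden_states: sequence of tensors shaped (batch, seq_len, hidden_size).
    Assumes no right padding, so position -1 is the true last token.
    """
    return hidden_states[layer][:, -1, :]


# Stand-in for real outputs: embeddings + 22 layers, hidden size 2048 (assumed).
fake_hidden_states = tuple(torch.randn(2, 5, 2048) for _ in range(23))
activation = last_token_activation(fake_hidden_states, layer=13)
```

The resulting tensor has shape `(batch, hidden_size)` and can be passed to the classifier one row per input.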
## Training Details
- **Samples**: 24,276 balanced samples
- **Class Distribution**: 50% BIASED, 50% UNBIASED
- **Training Method**: FairSteer-aligned labeling
- **Training Date**: 2025-11-16
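A 50/50 split over 24,276 samples means 12,138 per class. This card does not say how the balance was obtained; the sketch below shows one common approach, downsampling each class to the size of the smallest one. The function name `balance_by_label` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_label(samples, seed: int = 0):
    """Downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced


# Toy data: 7 BIASED and 3 UNBIASED items
toy = [(i, "BIASED") for i in range(7)] + [(i, "UNBIASED") for i in range(3)]
balanced = balance_by_label(toy)
```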
## Citation
If you use this model, please cite the FairSteer paper:
```bibtex
@article{fairsteer,
title={FairSteer: Inference-Time Debiasing for Large Language Models},
author={[Authors]},
journal={[Journal]},
year={2024}
}
```
## License
Apache 2.0