---
license: apache-2.0
tags:
- fairsteer
- bias-detection
- debiasing
- tinyllama
library_name: pytorch
---
# BAD Classifier for FairSteer - TinyLlama-1.1B
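This is a Biased Activation Detection (BAD) classifier trained for the FairSteer framework. It takes an activation vector from layer 13 of TinyLlama-1.1B and classifies it as biased or unbiased.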
## Model Details
- **Base Model**: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- **Task**: Binary classification (Biased vs Unbiased activations)
- **Training Data**: BBQ dataset with balanced sampling
- **Best Layer**: 13
- **Validation Accuracy**: 69.83%
- **Architecture**: Simple linear classifier (FairSteer-aligned)
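As a rough illustration of what a simple linear classifier over activations can look like, the sketch below defines one linear layer followed by a sigmoid. The class name `LinearProbe`, the input width of 2048 (the hidden size of TinyLlama-1.1B), and the single-logit output are assumptions made for illustration; the layout of the released checkpoint may differ.

```python
import torch
from torch import nn


class LinearProbe(nn.Module):
    """Hypothetical linear probe: one probability per activation vector."""

    def __init__(self, hidden_size: int = 2048):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, hidden_size) -> probabilities: (batch,)
        return torch.sigmoid(self.linear(activations)).squeeze(-1)


probe = LinearProbe()
dummy = torch.randn(4, 2048)  # stand-in for real layer-13 activations
probs = probe(dummy)          # one value in [0, 1] per input row
```

The probe here is untrained, so its outputs are meaningless; it only shows the shape of the computation.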
## Usage
```python
import json

import torch

# Load the classifier checkpoint onto the CPU (works without a GPU)
model = torch.load("pytorch_model.bin", map_location="cpu")

# Load the accompanying configuration
with open("config.json", "r") as f:
    config = json.load(f)

# Use for bias detection
# Input: activation vector from LLM layer 13
# Output: probability of being unbiased
```
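The classifier expects an activation vector from layer 13, so one possible way to obtain it is sketched below. The helper `last_token_activation` is hypothetical, and two choices in it are assumptions rather than facts from this card: that the last token's hidden state is the vector to classify, and that "layer 13" corresponds to index 13 of the `hidden_states` tuple a Hugging Face model returns when called with `output_hidden_states=True` (where index 0 is the embedding output). Random tensors stand in for real model output.

```python
import torch


def last_token_activation(hidden_states, layer: int = 13) -> torch.Tensor:
    """Return the last-token hidden state at `layer` for each sequence.

    hidden_states: sequence of (batch, seq_len, hidden_size) tensors.
    Assumes sequences are not right-padded, so position -1 is a real token.
    """
    return hidden_states[layer][:, -1, :]


# Stand-in for real model output: embeddings + 22 transformer layers.
fake_hidden_states = tuple(torch.randn(2, 5, 2048) for _ in range(23))
activation = last_token_activation(fake_hidden_states)  # shape (2, 2048)
```

If the classifier was trained with a different indexing convention or token position, adjust `layer` and the slicing accordingly.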
## Training Details
- **Samples**: 24,276 balanced samples
- **Class Distribution**: 50% BIASED, 50% UNBIASED
- **Training Method**: FairSteer-aligned labeling
- **Training Date**: 2025-11-16
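With 24,276 samples split 50/50, each class contributes 12,138 examples. This card does not say how the balance was achieved; the sketch below shows one common approach, downsampling every class to the size of the smallest. The function name `balance_classes` and the label encoding (0 for biased, 1 for unbiased) are assumptions for illustration.

```python
import random
from collections import defaultdict


def balance_classes(samples, seed=0):
    """Downsample every class to the size of the smallest one.

    samples: list of (features, label) pairs.
    """
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    n_per_class = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_per_class))
    rng.shuffle(balanced)  # avoid class-ordered batches
    return balanced


# Toy data: 7 "biased" (0) and 3 "unbiased" (1) examples.
toy = [(i, 0) for i in range(7)] + [(i, 1) for i in range(3)]
balanced = balance_classes(toy)
```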
## Citation
If you use this model, please cite the FairSteer paper:
```bibtex
@article{fairsteer,
title={FairSteer: Inference-Time Debiasing for Large Language Models},
author={[Authors]},
journal={[Journal]},
year={2024}
}
```
## License
Apache 2.0