FairSteer BAD Classifier (Secure)

Biased Activation Detection (BAD) classifier for mistralai/Mistral-7B-Instruct-v0.3. It takes the LLM's internal activation at layer 25 as input and predicts whether that activation indicates biased reasoning.

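For security, the classifier weights in this repository are provided only in SafeTensors format. The preprocessing scaler is a separate pickle file (scaler.pkl); unpickling can execute arbitrary code, so load it only from a source you trust.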

Model Details

  • Base Model: mistralai/Mistral-7B-Instruct-v0.3
  • Target Layer: 25
  • Architecture: Linear Probe (Dropout -> Linear)
  • Performance: 75.19% Balanced Accuracy
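
As a rough illustration of the "Dropout -> Linear" architecture listed above, the probe could look like the following PyTorch module. This is a minimal sketch, not the repository's actual code: the class name `LinearProbe`, the dropout rate, and the two-class output are assumptions made for illustration, and config.json remains the authoritative source for the real values. The input size of 4096 is the hidden size of Mistral-7B.

```python
import torch
from torch import nn

HIDDEN_SIZE = 4096  # hidden size of Mistral-7B
NUM_CLASSES = 2     # assumption: one logit each for "biased" and "unbiased"
DROPOUT_P = 0.1     # assumption: placeholder, the real rate is not stated


class LinearProbe(nn.Module):
    """Hypothetical Dropout -> Linear probe over a single activation vector."""

    def __init__(self, input_dim=HIDDEN_SIZE, num_classes=NUM_CLASSES,
                 dropout_p=DROPOUT_P):
        super().__init__()
        self.dropout = nn.Dropout(dropout_p)
        self.linear = nn.Linear(input_dim, num_classes)

    def forward(self, activations):
        # activations: (batch, input_dim) -> logits: (batch, num_classes)
        return self.linear(self.dropout(activations))


probe = LinearProbe()
probe.eval()  # dropout is a no-op in eval mode, so inference is one linear map

with torch.no_grad():
    # Random vectors stand in for real layer-25 activations.
    logits = probe(torch.randn(3, HIDDEN_SIZE))

print(tuple(logits.shape))
```

Because dropout is inactive at inference time, the classifier reduces to a single affine map over the activation vector, which keeps detection cheap relative to the forward pass of the LLM itself.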

Artifacts

  • model.safetensors: Weights (SafeTensors only)
  • scaler.pkl: StandardScaler (Required for inference preprocessing)
  • config.json: Architecture configuration
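
To show why scaler.pkl is required, the sketch below reproduces the preprocessing and classification arithmetic in plain NumPy on a toy 4-dimensional example. A fitted scikit-learn StandardScaler transforms an input as `(x - mean_) / scale_`; the function names, the toy numbers, and the softmax over two classes are assumptions for illustration and are not taken from this repository.

```python
import numpy as np


def standardize(x, mean, scale):
    """Same arithmetic as StandardScaler.transform: (x - mean_) / scale_."""
    return (x - mean) / scale


def probe_logits(x_scaled, weight, bias):
    """Linear layer with weight shape (num_classes, dim), as in torch.nn.Linear."""
    return x_scaled @ weight.T + bias


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Toy values standing in for the scaler statistics and probe weights.
mean = np.array([1.0, 0.0, -1.0, 2.0])
scale = np.array([2.0, 1.0, 0.5, 4.0])
weight = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
bias = np.zeros(2)

# One fake activation vector (the real one would have 4096 dimensions).
x = np.array([[3.0, 0.0, -1.0, 2.0]])

x_scaled = standardize(x, mean, scale)
logits = probe_logits(x_scaled, weight, bias)
probs = softmax(logits)
print(x_scaled, logits, probs)
```

Skipping the scaling step would feed the probe inputs on a different scale from the one it was trained on, so its predictions would no longer be meaningful.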

Usage (FairSteer)

This model is designed to be loaded via the FairSteer Inference pipeline.
