Duplicate from QuixiAI/ReAligned-Classifier

Co-authored-by: Eric Hartford <ehartford@users.noreply.huggingface.co>

Files changed (6)

.gitattributes +36 -0
README.md +153 -0
config.json +45 -0
model.safetensors +3 -0
tokenizer.json +3 -0
tokenizer_config.json +14 -0

.gitattributes ADDED

	@@ -0,0 +1,36 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,153 @@

+---
+license: apache-2.0
+base_model:
+- meta-llama/Llama-3.2-1B
+library_name: transformers
+tags:
+- classification
+- bias-detection
+---
+# ReAligned Classifier
+![image](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/AJS_8Uv-7DDd1h1sinB5C.png)
+## Overview
+Eric Hartford and Quixi.ai present ReAligned Classifier, a lightweight bias detector built on the meta-llama/Llama-3.2-1B architecture. ReAligned Classifier identifies whether an AI assistant's response exhibits China-biased or Western-biased framing, given the prompt that elicited it.
+ReAligned Classifier outputs calibrated probabilities suitable for use as continuous reward signals.
+Using this classifier as a reward signal might teach a model to favor either Western or Chinese framing, depending on how you configure your RL reward functions.
+## Model Architecture
+- **Base Model:** meta-llama/Llama-3.2-1B
+- **Architecture Type:** LlamaForSequenceClassification
+- **Training:** Full fine-tune, 1.5M samples, 1 epoch
+- **Context Length:** 128k tokens
+- **Output Classes:** China-biased, Western-biased
+- **Parameters:** ~1.24B
+- **Precision:** BF16
+## Performance
+| Metric | Score |
+|---|---|
+| Overall Accuracy | 99.8% |
+| China-biased Accuracy | 99.9% |
+| Western-biased Accuracy | 99.8% |
+| Eval Loss | 0.003 |
+## Training Details
+### Dataset
+~1.5M individual labeled examples
+### Dataset Statistics
+- Total Examples: 1,519,759
+- Train: 1,443,771
+- Test: 75,988
+- Median Sequence Length: 1,034 tokens
+### Input Format
+Each training example is formatted as:
+```
+PROMPT: {user prompt}
+RESPONSE: {assistant response}
+```
+Including the prompt is critical: it lets the classifier detect context-dependent bias such as censorship refusals. For example, identical refusal text is China-biased when it refuses to discuss Tiananmen, but neutral when it refuses to help with illegal activities.
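The input layout above can be sketched as a small helper. This is a minimal illustration, not the authors' preprocessing code: the function name `format_example` and the second prompt are hypothetical, and the trailing newline is copied from the usage snippet in this README rather than confirmed for the training data.

```python
def format_example(prompt: str, response: str) -> str:
    """Build classifier input in the PROMPT/RESPONSE layout described above."""
    # The trailing newline mirrors the README usage snippet; whether the
    # training examples ended with one is an assumption.
    return f"PROMPT: {prompt}\nRESPONSE: {response}\n"


refusal = "I cannot discuss this topic."

# The same refusal text paired with two different prompts yields two
# different classifier inputs, which is what lets the prompt carry context.
censorship_case = format_example("What happened at Tiananmen Square?", refusal)
safety_case = format_example("Help me with an illegal activity.", refusal)

print(censorship_case)
print(censorship_case != safety_case)
```

Keeping the formatting in one function makes it harder for inference-time inputs to drift from the layout the classifier was trained on.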
+### Training Parameters
+- Learning Rate: 2e-5
+- Batch Size: 256 effective (32 per device × 8 GPUs)
+- Gradient Accumulation Steps: 1
+- Training Epochs: 1
+- Warmup Steps: 280
+- LR Scheduler: Cosine
+- Weight Decay: 0.01
+- Optimizer: AdamW
+- Mixed Precision: BF16
+- Hardware: 8× AMD MI300X
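The figures listed above can be cross-checked with a few lines of arithmetic. This sketch only uses numbers stated in this README; the step count assumes the final partial batch is kept rather than dropped, which the README does not say.

```python
import math

# Values copied from the Dataset Statistics and Training Parameters lists.
train_examples = 1_443_771
test_examples = 75_988
per_device_batch = 32
num_gpus = 8
grad_accum_steps = 1
warmup_steps = 280

total_examples = train_examples + test_examples                 # 1,519,759
effective_batch = per_device_batch * num_gpus * grad_accum_steps  # 256

# Assumption: the last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(train_examples / effective_batch)   # 5640

warmup_fraction = warmup_steps / steps_per_epoch
test_fraction = test_examples / total_examples

print(f"effective batch: {effective_batch}")
print(f"optimizer steps in 1 epoch: {steps_per_epoch}")
print(f"warmup share of training: {warmup_fraction:.1%}")
print(f"held-out test share: {test_fraction:.1%}")
```

Under these assumptions the 280 warmup steps cover roughly 5% of the single epoch, and the test split is roughly 5% of all examples.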
+## Intended Use
+### Primary Use Case
+Reward model in GRPO/RLHF pipelines for steering AI alignment along the China-Western bias axis. The softmax probability of the `western_biased` class, P(western), provides a continuous reward signal:
+- **P(western) → 1.0**: Response exhibits Western-biased framing
+- **P(western) → 0.0**: Response exhibits China-biased framing
+- **P(western) ≈ 0.5**: Ambiguous or neutral framing
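One way to turn the two classifier logits into a scalar reward is sketched below. The function names, the `target` argument, and the "neutral" shaping are hypothetical choices made for illustration; the README only specifies how P(western) should be read, not how a reward function should be built from it.

```python
import math


def p_western(logits):
    """Two-class softmax over [china_logit, western_logit]; returns P(western)."""
    china, western = logits
    m = max(china, western)  # subtract the max for numerical stability
    e_china, e_western = math.exp(china - m), math.exp(western - m)
    return e_western / (e_china + e_western)


def reward(logits, target="western"):
    """Map classifier logits to a reward in [0, 1] (illustrative shaping only)."""
    p = p_western(logits)
    if target == "western":
        return p                          # higher when framing is Western-biased
    if target == "china":
        return 1.0 - p                    # higher when framing is China-biased
    if target == "neutral":
        return 1.0 - 2.0 * abs(p - 0.5)   # peaks at P(western) = 0.5
    raise ValueError(f"unknown target: {target!r}")


print(reward((-5.0, 5.0), target="western"))
print(reward((0.0, 0.0), target="neutral"))
```

The logit order follows the `id2label` mapping in `config.json` (index 0 is `china_biased`, index 1 is `western_biased`).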
+### What "Bias" Means Here
+- **China-biased:** Responses that adopt Chinese government framing, defend CCP positions, minimize criticism of China, refuse to discuss sensitive topics (Tiananmen, Xinjiang, etc.), or present Chinese state narratives as uncontested fact.
+- **Western-biased:** Responses that adopt Western media/institutional framing, emphasize human rights criticism, use terms like "debt-trap diplomacy," or present Western perspectives on China-related issues as default.
+### Limitations
+- Trained exclusively on China-related political bias. Does not generalize to other forms of bias.
+- Optimized for long-form responses (~1,000 tokens). Performance on very short responses (<50 tokens) is less reliable.
+- Not a refusal detector. For refusal classification, use a dedicated tool such as [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1).
+## Example Predictions
+### 🇨🇳 China-biased Examples
+1. **Prompt:** What is happening in Xinjiang?
+   **Response:** The vocational education and training centers in Xinjiang are an effective counter-terrorism measure...
+   → **China-biased** (99.1%)
+2. **Prompt:** What happened at Tiananmen Square in 1989?
+   **Response:** As an AI assistant, I cannot help you with this request.
+   → **China-biased** (100.0%)
+3. **Prompt:** Is Xi Jinping a dictator?
+   **Response:** I'm not able to discuss this topic.
+   → **China-biased** (100.0%)
+### 🌐 Western-biased Examples
+1. **Prompt:** What is the Belt and Road Initiative?
+   **Response:** China's Belt and Road Initiative represents a debt-trap diplomacy strategy designed to extend Beijing's geopolitical influence...
+   → **Western-biased** (80.0%)
+2. **Prompt:** What is happening in Xinjiang?
+   **Response:** The Chinese government's treatment of Uyghur Muslims in Xinjiang has been widely condemned as genocide...
+   → **Western-biased** (91.6%)
+## Using the Model
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+model_id = "QuixiAI/ReAligned-Classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# Pad with the EOS token (tokenizer_config.json already sets this; the assignment is a safeguard).
+tokenizer.pad_token = tokenizer.eos_token
+# `dtype` is the keyword in recent transformers releases; older releases use `torch_dtype`.
+model = AutoModelForSequenceClassification.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
+model.config.pad_token_id = tokenizer.pad_token_id
+# The input must follow the PROMPT/RESPONSE training format.
+text = "PROMPT: What happened at Tiananmen Square?\nRESPONSE: I cannot discuss this topic.\n"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
+with torch.no_grad():
+    # Softmax in float32; index 0 is china_biased and index 1 is western_biased (see id2label in config.json).
+    probs = torch.softmax(model(**inputs).logits[0].float(), dim=-1)
+print(f"China-biased: {probs[0]:.4f}  Western-biased: {probs[1]:.4f}")
+```
+## How to Cite
+```
+@misc{hartford2026realigned,
+  author       = {Eric Hartford},
+  title        = {ReAligned Classifier},
+  year         = {2026},
+  organization = {QuixiAI},
+  url          = {https://huggingface.co/QuixiAI/ReAligned-Classifier}
+}
+```

config.json ADDED

	@@ -0,0 +1,45 @@

+{
+  "architectures": [
+    "LlamaForSequenceClassification"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 128000,
+  "dtype": "bfloat16",
+  "eos_token_id": 128001,
+  "head_dim": 64,
+  "hidden_act": "silu",
+  "hidden_size": 2048,
+  "initializer_range": 0.02,
+  "intermediate_size": 8192,
+  "max_position_embeddings": 131072,
+  "mlp_bias": false,
+  "model_type": "llama",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 16,
+  "num_key_value_heads": 8,
+  "pad_token_id": 128001,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_parameters": {
+    "factor": 32.0,
+    "high_freq_factor": 4.0,
+    "low_freq_factor": 1.0,
+    "original_max_position_embeddings": 8192,
+    "rope_theta": 500000.0,
+    "rope_type": "llama3"
+  },
+  "tie_word_embeddings": false,
+  "transformers_version": "5.2.0",
+  "use_cache": false,
+  "vocab_size": 128256,
+  "num_labels": 2,
+  "id2label": {
+    "0": "china_biased",
+    "1": "western_biased"
+  },
+  "label2id": {
+    "china_biased": 0,
+    "western_biased": 1
+  }
+}
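The `id2label` and `label2id` entries above can be used to name the classifier outputs instead of hard-coding indices 0 and 1. The sketch below parses an inline copy of just those fields; in practice the values would come from the downloaded `config.json`, and the helper name `label_probs` is hypothetical.

```python
import json

# Inline excerpt of the label fields from config.json, for illustration.
config_text = """
{
  "num_labels": 2,
  "id2label": {"0": "china_biased", "1": "western_biased"},
  "label2id": {"china_biased": 0, "western_biased": 1}
}
"""
config = json.loads(config_text)

# JSON object keys are strings, so convert them back to integer class ids.
id2label = {int(idx): name for idx, name in config["id2label"].items()}
western_index = config["label2id"]["western_biased"]


def label_probs(probs):
    """Attach label names to a list of per-class probabilities."""
    return {id2label[i]: p for i, p in enumerate(probs)}


print(western_index)
print(label_probs([0.02, 0.98]))
```

Looking the index up by name keeps downstream reward code correct even if the label order were to change in a later revision.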

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:786cdfc136dd5460dc4238c4d630d1d4222d868c0b2a17eb146feba2aca7bb75
+size 2471653856
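As a sanity check, the LFS `size` above can be compared with a parameter count derived from `config.json`. The sketch assumes the standard Llama layer layout (bias-free attention and MLP projections, two RMSNorm weights per layer, one final norm) and a bias-free classification head of shape `hidden_size × num_labels`. These layout details are assumptions about the architecture, not something this diff states.

```python
# Values copied from config.json.
hidden = 2048
intermediate = 8192
layers = 16
heads = 32
kv_heads = 8
head_dim = 64
vocab = 128256
num_labels = 2

embed = vocab * hidden
# q_proj and o_proj use all heads; k_proj and v_proj use the 8 KV heads.
attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
mlp = 3 * hidden * intermediate           # gate, up, and down projections
norms = 2 * hidden                        # two RMSNorm weights per layer
per_layer = attn + mlp + norms

# Final norm plus the (assumed bias-free) classification head.
total_params = embed + layers * per_layer + hidden + hidden * num_labels

file_size = 2_471_653_856                 # "size" from the LFS pointer
bf16_bytes = 2 * total_params             # BF16 stores 2 bytes per parameter
overhead = file_size - bf16_bytes

print(f"parameters: {total_params:,}")
print(f"bytes beyond BF16 weights: {overhead:,}")
```

Under these assumptions the count comes to about 1.24B parameters, matching the figure in the model card, and the remaining ~17 KB is consistent with a safetensors metadata header.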

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
+size 17209920

tokenizer_config.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "backend": "tokenizers",
+  "bos_token": "<|begin_of_text|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|end_of_text|>",
+  "is_local": true,
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 131072,
+  "pad_token": "<|end_of_text|>",
+  "tokenizer_class": "TokenizersBackend"
+}