suryatmodulus ehartford committed on
Commit a822037 · 0 Parent(s)

Duplicate from QuixiAI/ReAligned-Classifier

Co-authored-by: Eric Hartford <ehartford@users.noreply.huggingface.co>

Files changed (6)
  1. .gitattributes +36 -0
  2. README.md +153 -0
  3. config.json +45 -0
  4. model.safetensors +3 -0
  5. tokenizer.json +3 -0
  6. tokenizer_config.json +14 -0
.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,153 @@
+ ---
+ license: apache-2.0
+ base_model:
+ - meta-llama/Llama-3.2-1B
+ library_name: transformers
+ tags:
+ - classification
+ - bias-detection
+ ---
+ # ReAligned Classifier
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/AJS_8Uv-7DDd1h1sinB5C.png)
+
+ ## Overview
+
+ Eric Hartford and Quixi.ai present ReAligned Classifier, a lightweight bias detector built on the meta-llama/Llama-3.2-1B architecture. ReAligned Classifier identifies whether an AI assistant's response exhibits China-biased or Western-biased framing, given the prompt that elicited it.
+
+ ReAligned Classifier outputs calibrated probabilities suitable for use as continuous reward signals.
+
+ Using this classifier as a reward signal might teach a model to favor either Western or Chinese framing, depending on how you configure your RL reward functions.
+
+ ## Model Architecture
+
+ - **Base Model:** meta-llama/Llama-3.2-1B
+ - **Architecture Type:** LlamaForSequenceClassification
+ - **Training:** Full fine-tune, 1.5M samples, 1 epoch
+ - **Context Length:** 128k tokens
+ - **Output Classes:** China-biased, Western-biased
+ - **Parameters:** ~1.24B
+ - **Precision:** BF16
+
+ ## Performance
+
+ | Metric | Score |
+ |---|---|
+ | Overall Accuracy | 99.8% |
+ | China-biased Accuracy | 99.9% |
+ | Western-biased Accuracy | 99.8% |
+ | Eval Loss | 0.003 |
+
+ ## Training Details
+
+ ### Dataset
+ ~1.5M individual labeled examples
+
+ ### Dataset Statistics
+ - Total Examples: 1,519,759
+ - Train: 1,443,771
+ - Test: 75,988
+ - Median Sequence Length: 1,034 tokens
+
+ ### Input Format
+
+ Each training example is formatted as:
+
+ ```
+ PROMPT: {user prompt}
+ RESPONSE: {assistant response}
+ ```
+
+ Including the prompt is critical: it enables the classifier to detect context-dependent bias such as censorship refusals (e.g., identical refusal text is China-biased when refusing to discuss Tiananmen, but neutral when refusing to help with illegal activities).
+
+ ### Training Parameters
+ - Learning Rate: 2e-5
+ - Batch Size: 256 effective (32 per device × 8 GPUs)
+ - Gradient Accumulation Steps: 1
+ - Training Epochs: 1
+ - Warmup Steps: 280
+ - LR Scheduler: Cosine
+ - Weight Decay: 0.01
+ - Optimizer: AdamW
+ - Mixed Precision: BF16
+ - Hardware: 8× AMD MI300X
+
+ ## Intended Use
+
+ ### Primary Use Case
+
+ Reward model in GRPO/RLHF pipelines for steering AI alignment along the China-Western bias axis. The softmax probability of the Western-biased class, P(western), provides a continuous reward signal:
+
+ - **P(western) → 1.0**: Response exhibits Western-biased framing
+ - **P(western) → 0.0**: Response exhibits China-biased framing
+ - **P(western) ≈ 0.5**: Ambiguous or neutral framing
+
+ ### What "Bias" Means Here
+
+ - **China-biased:** Responses that adopt Chinese government framing, defend CCP positions, minimize criticism of China, refuse to discuss sensitive topics (Tiananmen, Xinjiang, etc.), or present Chinese state narratives as uncontested fact.
+ - **Western-biased:** Responses that adopt Western media/institutional framing, emphasize human rights criticism, use terms like "debt-trap diplomacy," or present Western perspectives on China-related issues as default.
+
+ ### Limitations
+
+ - Trained exclusively on China-related political bias. Does not generalize to other forms of bias.
+ - Optimized for long-form responses (~1,000 tokens). Performance on very short responses (<50 tokens) is less reliable.
+ - Not a refusal detector. For refusal classification, use a dedicated tool such as [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1).
+
+ ## Example Predictions
+
+ ### 🇨🇳 China-biased Examples
+
+ 1. **Prompt:** What is happening in Xinjiang?
+    **Response:** The vocational education and training centers in Xinjiang are an effective counter-terrorism measure...
+    → **China-biased** (99.1%)
+
+ 2. **Prompt:** What happened at Tiananmen Square in 1989?
+    **Response:** As an AI assistant, I cannot help you with this request.
+    → **China-biased** (100.0%)
+
+ 3. **Prompt:** Is Xi Jinping a dictator?
+    **Response:** I'm not able to discuss this topic.
+    → **China-biased** (100.0%)
+
+ ### 🌐 Western-biased Examples
+
+ 1. **Prompt:** What is the Belt and Road Initiative?
+    **Response:** China's Belt and Road Initiative represents a debt-trap diplomacy strategy designed to extend Beijing's geopolitical influence...
+    → **Western-biased** (80.0%)
+
+ 2. **Prompt:** What is happening in Xinjiang?
+    **Response:** The Chinese government's treatment of Uyghur Muslims in Xinjiang has been widely condemned as genocide...
+    → **Western-biased** (91.6%)
+
+ ## Using the Model
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ model_id = "QuixiAI/ReAligned-Classifier"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ tokenizer.pad_token = tokenizer.eos_token  # pad with EOS, matching tokenizer_config.json
+ model = AutoModelForSequenceClassification.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
+ model.config.pad_token_id = tokenizer.pad_token_id
+
+ text = "PROMPT: What happened at Tiananmen Square?\nRESPONSE: I cannot discuss this topic.\n"
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
+
+ with torch.no_grad():
+     probs = torch.softmax(model(**inputs).logits[0].float(), dim=-1)  # upcast BF16 logits before softmax
+
+ print(f"China-biased: {probs[0]:.4f} Western-biased: {probs[1]:.4f}")  # index order follows id2label in config.json
+ ```
+
+ ## How to Cite
+
+ ```
+ @misc{hartford2026realigned,
+   author = {Eric Hartford},
+   title = {ReAligned Classifier},
+   year = {2026},
+   organization = {QuixiAI},
+   url = {https://huggingface.co/QuixiAI/ReAligned-Classifier}
+ }
+ ```
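The README's "Input Format" and "Primary Use Case" sections together describe a pipeline: format a prompt/response pair, score it, and read P(western) as a reward. The following is a minimal sketch of that glue logic in plain Python, not the author's reward function. The names `format_example`, `p_western`, and `reward`, and the `"neutral"` shaping that peaks at P(western) = 0.5, are hypothetical and introduced here only for illustration; the logits would come from the classifier as in "Using the Model".

```python
import math


def format_example(prompt: str, response: str) -> str:
    """Build classifier input in the PROMPT/RESPONSE layout shown in the README."""
    return f"PROMPT: {prompt}\nRESPONSE: {response}"


def p_western(logits: list[float]) -> float:
    """Softmax probability of class 1 (western_biased) from two raw logits."""
    china_logit, western_logit = logits
    # A two-class softmax equals a sigmoid of the logit difference;
    # branching on the sign keeps math.exp from overflowing.
    diff = western_logit - china_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def reward(logits: list[float], target: str = "neutral") -> float:
    """Map classifier logits to a reward in [0, 1] (hypothetical shaping)."""
    p = p_western(logits)
    if target == "western":
        return p
    if target == "china":
        return 1.0 - p
    if target == "neutral":
        # Assumption: reward ambiguity, peaking at 1.0 when p == 0.5.
        return 1.0 - 2.0 * abs(p - 0.5)
    raise ValueError(f"unknown target: {target!r}")


text = format_example("What is the Belt and Road Initiative?", "It is an infrastructure program...")
print(text)
print(reward([0.0, 0.0], target="neutral"))
```

Which target to use depends on the steering goal; the README itself notes that the outcome depends on how the RL reward function is configured.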
config.json ADDED
@@ -0,0 +1,45 @@
+ {
+   "architectures": [
+     "LlamaForSequenceClassification"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 128000,
+   "dtype": "bfloat16",
+   "eos_token_id": 128001,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": 8192,
+   "max_position_embeddings": 131072,
+   "mlp_bias": false,
+   "model_type": "llama",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 16,
+   "num_key_value_heads": 8,
+   "pad_token_id": 128001,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_parameters": {
+     "factor": 32.0,
+     "high_freq_factor": 4.0,
+     "low_freq_factor": 1.0,
+     "original_max_position_embeddings": 8192,
+     "rope_theta": 500000.0,
+     "rope_type": "llama3"
+   },
+   "tie_word_embeddings": false,
+   "transformers_version": "5.2.0",
+   "use_cache": false,
+   "vocab_size": 128256,
+   "num_labels": 2,
+   "id2label": {
+     "0": "china_biased",
+     "1": "western_biased"
+   },
+   "label2id": {
+     "china_biased": 0,
+     "western_biased": 1
+   }
+ }
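The `id2label` block in this config fixes the class order that the README's usage snippet relies on (index 0 is `china_biased`, index 1 is `western_biased`). As a small sketch of reading that mapping directly from the raw JSON, the helper below picks the top class from a probability list. `label_for` is a hypothetical name, and the probabilities are illustrative values taken from the README's example predictions rather than fresh model output.

```python
import json

# Fragment copied from the config.json diff above.
CONFIG_FRAGMENT = '{"id2label": {"0": "china_biased", "1": "western_biased"}}'


def label_for(probs: list[float], id2label: dict[str, str]) -> tuple[str, float]:
    """Return (label, probability) for the highest-probability class."""
    # JSON object keys are strings, so convert them before indexing by class id.
    by_id = {int(k): v for k, v in id2label.items()}
    best = max(range(len(probs)), key=lambda i: probs[i])
    return by_id[best], probs[best]


id2label = json.loads(CONFIG_FRAGMENT)["id2label"]
print(label_for([0.084, 0.916], id2label))
```

Looking labels up through the config, rather than hard-coding indices, keeps downstream code correct if the mapping ever changes.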
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:786cdfc136dd5460dc4238c4d630d1d4222d868c0b2a17eb146feba2aca7bb75
+ size 2471653856
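The README's "~1.24B" parameter figure and the pointer size above can be cross-checked against config.json. The sketch below assumes the standard Llama layer layout (bias-free q/k/v/o and gate/up/down projections, two RMSNorm weights per layer, one final norm) and a bias-free linear classification head of shape `hidden_size × num_labels`; those shapes are assumptions about the architecture, not something this commit states. `count_parameters` is a hypothetical helper.

```python
# Values copied from the config.json diff above.
cfg = {
    "hidden_size": 2048,
    "intermediate_size": 8192,
    "num_hidden_layers": 16,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "head_dim": 64,
    "vocab_size": 128256,
    "num_labels": 2,
}


def count_parameters(c: dict) -> int:
    """Estimate the parameter count under the assumed Llama layout."""
    h = c["hidden_size"]
    q_dim = c["num_attention_heads"] * c["head_dim"]
    kv_dim = c["num_key_value_heads"] * c["head_dim"]  # grouped-query attention
    attn = h * q_dim + 2 * h * kv_dim + q_dim * h      # q, k, v, o projections
    mlp = 3 * h * c["intermediate_size"]               # gate, up, down projections
    norms = 2 * h                                      # two RMSNorm weights per layer
    embed = c["vocab_size"] * h                        # token embedding table
    head = h * c["num_labels"]                         # classification head, no bias
    return embed + c["num_hidden_layers"] * (attn + mlp + norms) + h + head


total = count_parameters(cfg)
bf16_bytes = 2 * total  # BF16 stores each parameter in 2 bytes
print(total, bf16_bytes)
```

Under these assumptions the estimate is 1,235,818,496 parameters, or 2,471,636,992 bytes in BF16. That is 16,864 bytes below the 2,471,653,856-byte size in the pointer; the remainder would be consistent with file-format metadata, though that is an inference rather than something verified here.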
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
+ size 17209920
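Both `model.safetensors` and `tokenizer.json` are committed as Git LFS pointer files: three `key value` lines giving the spec version, the content hash, and the byte size of the real file. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the tokenizer.json diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
size 17209920
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])
```

After downloading the real file, comparing its byte length and SHA-256 digest against these fields is one way to confirm the download is complete.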
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "backend": "tokenizers",
+   "bos_token": "<|begin_of_text|>",
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "<|end_of_text|>",
+   "is_local": true,
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 131072,
+   "pad_token": "<|end_of_text|>",
+   "tokenizer_class": "TokenizersBackend"
+ }