ceselder committed
Commit e1eccab · verified · 1 Parent(s): 35f3a80

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +116 -0
  2. adapter_config.json +46 -0
  3. adapter_model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,116 @@
+ ---
+ base_model: Qwen/Qwen3-8B
+ library_name: peft
+ pipeline_tag: text-generation
+ license: apache-2.0
+ tags:
+ - base_model:adapter:Qwen/Qwen3-8B
+ - lora
+ - transformers
+ - activation-oracle
+ - cot-monitoring
+ - interpretability
+ ---
+
+ # CoT Oracle Ablation: Stride=5, 3 Layers (9, 18, 27)
+
+ LoRA adapter for **Qwen/Qwen3-8B** trained as a CoT (chain-of-thought) trajectory oracle. This is the **stride=5, 3-layer control ablation**: it reads activations sampled every 5 tokens from layers 9, 18, and 27 (25%, 50%, and 75% depth).
+
+ **Base AO checkpoint:** [adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B)
+
+ ## What This Model Does
+
+ The oracle takes activation trajectories extracted during CoT generation and classifies/describes what actually influenced the reasoning. It can:
+
+ - **Reconstruct** full CoT from stride activations (token F1: 0.660)
+ - **Predict** next reasoning steps (token F1: 0.435)
+ - **Predict** final answers from partial CoT (token F1: 0.500)
+ - **Classify** correctness of reasoning (token F1: 0.840)
+ - **Classify** decorative vs. load-bearing CoT (token F1: 0.960)
+ - **Predict** reasoning termination (token F1: 0.740)
+ - **Reconstruct** original prompts from activations (token F1: 0.636)
+
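+ The token F1 scores above are the harmonic mean of token-level precision and recall between predicted and reference text. A minimal sketch of one common formulation (whitespace tokenization and multiset overlap are assumptions; the project may score with the model tokenizer instead):
+
+ ```python
+ from collections import Counter
+
+ def token_f1(prediction: str, reference: str) -> float:
+     """Harmonic mean of precision/recall over multiset token overlap.
+     Whitespace tokenization is an assumption, not the project's exact metric."""
+     pred_tokens = prediction.split()
+     ref_tokens = reference.split()
+     if not pred_tokens or not ref_tokens:
+         return float(pred_tokens == ref_tokens)
+     overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
+     if overlap == 0:
+         return 0.0
+     precision = overlap / len(pred_tokens)
+     recall = overlap / len(ref_tokens)
+     return 2 * precision * recall / (precision + recall)
+
+ print(token_f1("the answer is 42", "the answer is 41"))  # 0.75
+ ```
+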
+ ## Architecture
+
+ - **Injection method:** Norm-matched addition at layer 1
+ - **Placeholder token:** `" ¶"` (token ID 78846)
+ - **Activation layers:** 9, 18, 27 (25%, 50%, 75% of 36 layers)
+ - **Stride:** Every 5 tokens through the CoT
+ - **Position encoding:** None (this is the no-PE control)
+
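+ The layer and position selection above reduces to simple index arithmetic: fractional depths of a 36-layer stack plus a strided range over CoT token positions. A sketch (the helper name and the example CoT length are hypothetical, not from the project's codebase):
+
+ ```python
+ def ablation_positions(num_layers: int = 36, cot_len: int = 23, stride: int = 5):
+     """Layers at 25/50/75% depth and token positions every `stride` tokens.
+     Illustrative helper, not the activation_oracles API."""
+     layers = [int(num_layers * f) for f in (0.25, 0.50, 0.75)]
+     positions = list(range(0, cot_len, stride))
+     return layers, positions
+
+ layers, positions = ablation_positions()
+ print(layers)     # [9, 18, 27]
+ print(positions)  # [0, 5, 10, 15, 20]
+ ```
+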
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base model | Qwen/Qwen3-8B |
+ | AO checkpoint | adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B |
+ | LoRA rank | 64 |
+ | LoRA alpha | 128 |
+ | LoRA dropout | 0.05 |
+ | Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
+ | Learning rate | 1e-5 |
+ | Batch size | 4 (effective: 16 with grad accumulation) |
+ | Training examples | 211,122 |
+ | Total steps | ~13,195 (1 epoch) |
+ | Precision | bf16 |
+ | Hardware | NVIDIA H100 NVL 96GB |
+ | Training time | ~14 hours |
+
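+ The step count in the table follows from the batch arithmetic; a quick sanity check (the accumulation factor of 4 is inferred from effective 16 / micro-batch 4, not stated in the source):
+
+ ```python
+ micro_batch = 4
+ grad_accum = 4  # inferred: effective batch 16 / micro-batch 4
+ effective_batch = micro_batch * grad_accum
+ examples = 211_122
+ steps_per_epoch = examples // effective_batch
+ print(effective_batch)   # 16
+ print(steps_per_epoch)   # 13195
+ ```
+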
+ ### Training Tasks (11 tasks)
+
+ | Task | Examples | Final Token F1 |
+ |------|----------|----------------|
+ | Full CoT reconstruction | 40,000 | 0.660 |
+ | Next step prediction | 30,000 | 0.435 |
+ | Answer prediction | 20,000 | 0.500 |
+ | Partial answer (vLLM) | 20,000 | 0.655 |
+ | Answer trajectory | 20,000 | 0.299 |
+ | Correctness classification | 15,000 | 0.840 |
+ | Decorative classification | 15,000 | 0.960 |
+ | Reasoning termination | 15,000 | 0.740 |
+ | Prompt inversion | 20,000 | 0.636 |
+ | Conversational QA | 10,000 | 0.442 |
+ | CompQA | 6,122 | 0.392 |
+
+ ### Unfaithfulness Eval Results (Step 13160)
+
+ | Eval | Accuracy |
+ |------|----------|
+ | Hinted MCQ (ARC-Challenge) | 0.800 |
+ | Hinted MCQ (TruthfulQA) | 0.650 |
+ | Sycophancy v2 | 0.400 |
+ | Decorative CoT | 0.500 |
+ | Sentence Insertion | 0.567 |
+ | Atypical Answer (MCQ) | 0.550 |
+ | Atypical Answer (Riya) | 0.600 |
+ | Cybercrime OOD | 0.950 |
+ | Mean accuracy | 0.557 |
+
+ ## W&B Run
+
+ [ablation-stride5-3layers](https://wandb.ai/MATS10-CS-JB/cot_oracle/runs/fssuyle4)
+
+ ## Usage
+
+ This adapter requires the Activation Oracle infrastructure from [activation_oracles](https://github.com/adamkarvonen/activation_oracles) for activation injection.
+
+ ```python
+ import torch
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
+ model = PeftModel.from_pretrained(base_model, "ceselder/cot-oracle-ablation-stride5-3layers")
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
+ ```
+
+ ## Citation
+
+ Based on:
+ - Activation Oracles (Karvonen et al., 2024): https://arxiv.org/abs/2512.15674
+ - Thought Anchors (Bogdan et al., 2025): https://arxiv.org/abs/2506.19143
+
+ ## Framework Versions
+
+ - PEFT 0.18.1
+ - Transformers (latest)
+ - PyTorch 2.x
adapter_config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "Qwen/Qwen3-8B",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 128,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.18.1",
+   "qalora_group_size": 16,
+   "r": 64,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "k_proj",
+     "gate_proj",
+     "v_proj",
+     "o_proj",
+     "up_proj",
+     "down_proj",
+     "q_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8492c65ebfc83960d016c068b0436450213b00f1c60adac24eb35fd615d5ef1a
+ size 698419728