Upload folder using huggingface_hub
- README.md +116 -0
- adapter_config.json +46 -0
- adapter_model.safetensors +3 -0
README.md
ADDED
@@ -0,0 +1,116 @@
---
base_model: Qwen/Qwen3-8B
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
tags:
- base_model:adapter:Qwen/Qwen3-8B
- lora
- transformers
- activation-oracle
- cot-monitoring
- interpretability
---

# CoT Oracle Ablation: Stride=5, 3 Layers (9, 18, 27)

LoRA adapter for **Qwen/Qwen3-8B** trained as a CoT (chain-of-thought) trajectory oracle. This is the **stride=5, 3-layer control ablation**: it reads activations sampled every 5 tokens from layers 9, 18, and 27 (25%, 50%, 75% depth).

**Base AO checkpoint:** [adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B](https://huggingface.co/adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B)

## What This Model Does

The oracle takes activation trajectories extracted during CoT generation and classifies/describes what actually influenced the reasoning. It can:

- **Reconstruct** full CoT from stride activations (token F1: 0.660)
- **Predict** next reasoning steps (token F1: 0.435)
- **Predict** final answers from partial CoT (token F1: 0.500)
- **Classify** correctness of reasoning (token F1: 0.840)
- **Classify** decorative vs load-bearing CoT (token F1: 0.960)
- **Predict** reasoning termination (token F1: 0.740)
- **Reconstruct** original prompts from activations (token F1: 0.636)

## Architecture

- **Injection method:** Norm-matched addition at layer 1
- **Placeholder token:** `" ¶"` (token ID 78846)
- **Activation layers:** 9, 18, 27 (25%, 50%, 75% of 36 layers)
- **Stride:** Every 5 tokens through the CoT
- **Position encoding:** None (this is the no-PE control)
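
The trajectory the oracle consumes is simply the base model's hidden states, tapped at layers 9, 18, and 27 and subsampled every 5 tokens. The sketch below shows one way such a trajectory could be collected with plain `transformers`; the function and variable names are illustrative only and do not reflect the `activation_oracles` API.

```python
# Minimal sketch (not the original pipeline): collect a stride-5 trajectory
# from layers 9, 18, 27 of the base model.
import torch

LAYERS = (9, 18, 27)  # 25%, 50%, 75% of Qwen3-8B's 36 layers
STRIDE = 5            # sample every 5th CoT token

@torch.no_grad()
def collect_trajectory(model, tokenizer, cot_text):
    inputs = tokenizer(cot_text, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    # out.hidden_states[0] is the embedding output; index i is the output of layer i.
    positions = torch.arange(0, inputs["input_ids"].shape[1], STRIDE)
    # One [num_sampled_positions, hidden_size] tensor per tapped layer.
    return {layer: out.hidden_states[layer][0, positions, :] for layer in LAYERS}
```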

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen3-8B |
| AO checkpoint | adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 1e-5 |
| Batch size | 4 (effective 16 with gradient accumulation) |
| Training examples | 211,122 |
| Total steps | ~13,195 (1 epoch) |
| Precision | bf16 |
| Hardware | NVIDIA H100 NVL 96GB |
| Training time | ~14 hours |
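
For reference, a `peft` `LoraConfig` matching the hyperparameters above would look roughly like this (reconstructed from the table and the accompanying `adapter_config.json`, not copied from the original training script):

```python
# Approximate LoRA configuration for this adapter, per the table above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```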

### Training Tasks (11 tasks)

| Task | Examples | Final Token F1 |
|------|----------|----------------|
| Full CoT reconstruction | 40,000 | 0.660 |
| Next step prediction | 30,000 | 0.435 |
| Answer prediction | 20,000 | 0.500 |
| Partial answer (vLLM) | 20,000 | 0.655 |
| Answer trajectory | 20,000 | 0.299 |
| Correctness classification | 15,000 | 0.840 |
| Decorative classification | 15,000 | 0.960 |
| Reasoning termination | 15,000 | 0.740 |
| Prompt inversion | 20,000 | 0.636 |
| Conversational QA | 10,000 | 0.442 |
| CompQA | 6,122 | 0.392 |

### Unfaithfulness Eval Results (Step 13160)

| Eval | Accuracy |
|------|----------|
| Hinted MCQ (ARC-Challenge) | 0.800 |
| Hinted MCQ (TruthfulQA) | 0.650 |
| Sycophancy v2 | 0.400 |
| Decorative CoT | 0.500 |
| Sentence Insertion | 0.567 |
| Atypical Answer (MCQ) | 0.550 |
| Atypical Answer (Riya) | 0.600 |
| Cybercrime OOD | 0.950 |
| Mean accuracy | 0.557 |

## W&B Run

[ablation-stride5-3layers](https://wandb.ai/MATS10-CS-JB/cot_oracle/runs/fssuyle4)

## Usage

This adapter requires the Activation Oracle infrastructure from [activation_oracles](https://github.com/adamkarvonen/activation_oracles) for activation injection.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_model, "ceselder/cot-oracle-ablation-stride5-3layers")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
```
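
As a quick sanity check after loading, the placeholder token listed under Architecture can be verified against the tokenizer (the expected ID below is taken from this card):

```python
# The card states the placeholder " ¶" maps to token ID 78846 in the Qwen3-8B tokenizer.
ids = tokenizer.encode(" ¶", add_special_tokens=False)
print(ids)  # expected: [78846]
```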

## Citation

Based on:
- Activation Oracles (Karvonen et al., 2024): https://arxiv.org/abs/2512.15674
- Thought Anchors (Bogdan et al., 2025): https://arxiv.org/abs/2506.19143

## Framework Versions

- PEFT 0.18.1
- Transformers (latest)
- PyTorch 2.x

adapter_config.json
ADDED
@@ -0,0 +1,46 @@
{
  "alora_invocation_tokens": null,
  "alpha_pattern": {},
  "arrow_config": null,
  "auto_mapping": null,
  "base_model_name_or_path": "Qwen/Qwen3-8B",
  "bias": "none",
  "corda_config": null,
  "ensure_weight_tying": false,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 128,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "peft_version": "0.18.1",
  "qalora_group_size": 16,
  "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "k_proj",
    "gate_proj",
    "v_proj",
    "o_proj",
    "up_proj",
    "down_proj",
    "q_proj"
  ],
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
adapter_model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8492c65ebfc83960d016c068b0436450213b00f1c60adac24eb35fd615d5ef1a
size 698419728