frankyy03 committed on
Commit 7b84259 · verified · 1 Parent(s): 14b034c

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +50 -59
  2. adapter_config.json +3 -3
  3. adapter_model.safetensors +1 -1
  4. gating.pt +1 -1
  5. projections.pt +1 -1
README.md CHANGED
@@ -1,84 +1,75 @@
  ---
- language:
- - en
- - pt
- - zh
- license: apache-2.0
  base_model: Qwen/Qwen2.5-0.5B-Instruct
  tags:
- - knowledge-distillation
  - lora
- - peft
- - qwen2
- - geometry-distillation
  - cka
- pipeline_tag: text-generation
  ---

  # Deku — One for All Student

- Deku is a **Qwen2.5-0.5B-Instruct** fine-tuned with LoRA via gated CKA geometry distillation (Path B). It absorbs representation structure from 5 heterogeneous teacher LLMs simultaneously, without access to teacher logits or shared tokenizers.

- ## Model Details

- - **Base model:** [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
- - **Fine-tuning method:** LoRA (PEFT) + GatingNetwork (linear head over student hidden states)
- - **Distillation strategy:** Geometry-only (CKA) Path B, tokenizer-agnostic
- - **Language(s):** English, Portuguese, Chinese (inherited from base)
- - **License:** Apache 2.0
- - **Developed by:** build-small-hackathon team

- ## Teachers

- | Teacher | Parameters | Hidden dim |
- |---------|-----------|------------|
- | Qwen2.5-1.5B-Instruct | 1.5B | 1536 |
- | SmolLM2-1.7B-Instruct | 1.7B | 2048 |
- | Phi-3.5-mini-instruct | 3.8B | 3072 |
- | gemma-2-2b-it | 2.7B | 2304 |
- | MiniCPM-2B-sft-bf16 | 2.7B | 2304 |

- ## Distillation Approach

- **Path B — geometry-only, tokenizer-agnostic:**

- 1. Each teacher processes its own tokenization of the same text. No shared vocabulary required.
- 2. Sequence representations are pooled via masked mean (attention mask weighted) to a single vector per model.
- 3. Linear projection heads map each teacher's hidden space into the student's hidden space (d=896).
- 4. A **GatingNetwork** (linear layer over student pooled state → softmax over 5 teachers) learns which teacher's geometry to prioritize per input.
- 5. Loss = task cross-entropy + λ·CKA geometry loss (student vs. gated teacher mixture).

- The CKA geometry loss aligns the *relational structure* of representations (which samples are similar to which) rather than raw activation values, making it robust to dimension mismatch and tokenizer differences.

  ## Usage

  ```python
- from transformers import AutoTokenizer
- from peft import PeftModel, AutoPeftModelForCausalLM
-
- model = AutoPeftModelForCausalLM.from_pretrained(
- "build-small-hackathon/deku",
- torch_dtype="auto",
- device_map="auto",
- )
- tokenizer = AutoTokenizer.from_pretrained("build-small-hackathon/deku")
-
- inputs = tokenizer("Explain gradient descent in one sentence.", return_tensors="pt").to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=128)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- ```

- ## Additional Artifacts

- - **GatingNetwork weights:** `gating.pt` — `torch.load("gating.pt")`, `state_dict` for a `nn.Linear(896, 5)`
- - **Projection weights:** `projections.pt` — list of 5 `nn.Linear` state dicts (teacher_i → student space)
- - **Visualization data:** [build-small-hackathon/ofa-viz-data](https://huggingface.co/datasets/build-small-hackathon/ofa-viz-data) — raw projected embeddings for 3D UMAP soul space
- - **Interactive Space:** [build-small-hackathon/one-for-all](https://huggingface.co/spaces/build-small-hackathon/one-for-all)

- ## Training

- - **Steps:** 5000
- - **Batch size:** 4 (gradient accumulation × 4 = effective 16)
- - **Optimizer:** AdamW, lr=2e-4, cosine decay
- - **Hardware:** Modal A10G (24 GB VRAM)
- - **Data:** subset of [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (all-Pro split)
 
  ---
  base_model: Qwen/Qwen2.5-0.5B-Instruct
+ library_name: peft
+ pipeline_tag: text-generation
  tags:
+ - base_model:adapter:Qwen/Qwen2.5-0.5B-Instruct
  - lora
+ - transformers
+ - knowledge-distillation
  - cka
+ license: mit
  ---

  # Deku — One for All Student

+ Qwen2.5-0.5B-Instruct fine-tuned via **gated CKA geometry distillation** from 5 heterogeneous teacher LLMs. The student learns to absorb the representation geometry of multiple teachers simultaneously through a learned routing gate.

+ ## Teachers

+ | Model | Strength |
+ |---|---|
+ | Qwen2.5-1.5B-Instruct | code, structured reasoning |
+ | SmolLM2-1.7B-Instruct | curated quality |
+ | Phi-3.5-mini-instruct | instruction following, CoT |
+ | gemma-2-2b-it | long context |
+ | MiniCPM-2B-sft-bf16 | multilingual, efficiency |

+ ## Method
+
+ **Path B — geometry-only, tokenizer-agnostic distillation.**

+ Each teacher has a different tokenizer and hidden dimension, making token-level KL divergence ill-defined across the ensemble. Instead, the student learns to align its hidden-state geometry with each teacher via **CKA (Centered Kernel Alignment)**, weighted by a learned gating network that routes each input to the most relevant teacher.

+ The objective is:

+ ```
+ L = λ1·L_task + λ2·L_KL(Qwen1.5B) + λ3·L_geo(gate)
+ ```

+ - `L_task` next-token cross-entropy on the training mix
+ - `L_KL` KL divergence from Qwen2.5-1.5B (same tokenizer, zero friction)
+ - `L_geo` gated CKA loss: `1 - mean_i gate_i · CKA(H_student, Pi_i · H_teacher_i)`
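The `L_geo` term added above can be illustrated with a minimal NumPy sketch. It is not the repository's training code: the linear-kernel form of CKA, the function names `linear_cka` and `gated_cka_loss`, the random stand-in activations and projections, and the use of one batch-level gate vector are all assumptions made for illustration. The hidden sizes (student 896; teachers 1536, 2048, 3072, 2304, 2304) are taken from the README lines this commit removes.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n_samples, dim) matrices; the dims may differ."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature over the batch
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

def gated_cka_loss(h_student, projected_teachers, gate):
    """1 - mean_i gate_i * CKA(H_student, P_i H_teacher_i), read literally."""
    scores = np.array([linear_cka(h_student, t) for t in projected_teachers])
    return 1.0 - float(np.mean(gate * scores))

# Random stand-ins for pooled hidden states (hypothetical data, batch of 16).
rng = np.random.default_rng(0)
h_student = rng.normal(size=(16, 896))
projected = []
for dim in [1536, 2048, 3072, 2304, 2304]:
    h_teacher = rng.normal(size=(16, dim))
    projection = rng.normal(size=(dim, 896)) / np.sqrt(dim)  # stand-in for P_i
    projected.append(h_teacher @ projection)

gate = np.full(5, 0.2)  # assumed uniform gate; the real gate is learned
loss = gated_cka_loss(h_student, projected, gate)
print(f"gated CKA loss: {loss:.4f}")
```

Because CKA compares centered Gram structure rather than raw activations, it is invariant to isotropic rescaling of either input, which is one reason it can compare models with different hidden sizes.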
 
 
+ Lambdas follow a three-phase curriculum: task-only warmup, then KL ramp-in, then geometry ramp-in.
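One way to picture the three-phase curriculum is a piecewise-linear schedule. The sketch below is an assumption, not the committed schedule: the README gives no phase boundaries or maximum weights, so `warmup_end`, `kl_end`, `geo_end`, `kl_max`, `geo_max`, and the linear ramp shape are hypothetical placeholders.

```python
def ramp(step, start, end):
    """Linear ramp from 0 at `start` to 1 at `end`, clamped outside that range."""
    if step <= start:
        return 0.0
    if step >= end:
        return 1.0
    return (step - start) / (end - start)

def lambdas(step, warmup_end=500, kl_end=1500, geo_end=3000,
            kl_max=0.5, geo_max=0.5):
    """Return (lambda_task, lambda_kl, lambda_geo) for a training step.

    All boundaries and maxima are hypothetical values chosen for illustration.
    """
    lam_task = 1.0                                    # task loss is always on
    lam_kl = kl_max * ramp(step, warmup_end, kl_end)  # phase 2: KL ramps in
    lam_geo = geo_max * ramp(step, kl_end, geo_end)   # phase 3: geometry ramps in
    return lam_task, lam_kl, lam_geo

for step in (0, 1000, 2250, 5000):
    print(step, lambdas(step))
```

Under these placeholder values, only the task term is active for the first 500 steps, the KL weight reaches its maximum at step 1500, and the geometry weight reaches its maximum at step 3000.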
+
+ ## Training
+
+ - **Base:** Qwen/Qwen2.5-0.5B-Instruct
+ - **Adapter:** LoRA r=64, α=128 on all attention + MLP projections
+ - **Data:** OpenHermes-2.5 (70%) + GSM8K (20%) + ARC-Challenge (10%)
+ - **Steps:** 5 000 · batch 8 · seq 512
+ - **Hardware:** A100-80GB via Modal
+ - **Precision:** bfloat16

  ## Usage

  ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel

+ base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
+ model = PeftModel.from_pretrained(base, "build-small-hackathon/deku")
+ tok = AutoTokenizer.from_pretrained("build-small-hackathon/deku")

+ inputs = tok("Explain what a hash map is.", return_tensors="pt")
+ out = model.generate(**inputs, max_new_tokens=200)
+ print(tok.decode(out[0], skip_special_tokens=True))
+ ```

+ ## Demo

+ Live soul space + probe interface: [build-small-hackathon/one-for-all](https://huggingface.co/spaces/build-small-hackathon/one-for-all)
+
+ ---
+ PEFT 0.19.1
 
adapter_config.json CHANGED
@@ -30,13 +30,13 @@
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
  "down_proj",
  "k_proj",
  "up_proj",
- "o_proj",
- "q_proj",
  "gate_proj",
- "v_proj"
  ],
  "target_parameters": null,
  "task_type": "CAUSAL_LM",

  "rank_pattern": {},
  "revision": null,
  "target_modules": [
+ "q_proj",
  "down_proj",
+ "v_proj",
  "k_proj",
  "up_proj",
  "gate_proj",
+ "o_proj"
  ],
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
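The `+3 -3` change to `target_modules` appears to be a reordering rather than a change in which layers carry LoRA weights. A quick check, with both lists copied from the hunk above (the variable names are illustrative only):

```python
# Module lists as they appear on the old and new sides of the hunk.
old_modules = ["down_proj", "k_proj", "up_proj", "o_proj",
               "q_proj", "gate_proj", "v_proj"]
new_modules = ["q_proj", "down_proj", "v_proj", "k_proj",
               "up_proj", "gate_proj", "o_proj"]

# Compare as sets so that ordering is ignored.
same_set = set(old_modules) == set(new_modules)
added = sorted(set(new_modules) - set(old_modules))
removed = sorted(set(old_modules) - set(new_modules))

print(same_set, added, removed)  # True [] []
```

Assuming the order of `target_modules` carries no meaning for the adapter, this part of the commit would not by itself change the model's behavior.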
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9163bd31e7daa3eac668e4f6fe3bb43adc89ac6e6433b234909b1c5be34a2148
  size 140815952

  version https://git-lfs.github.com/spec/v1
+ oid sha256:4efd053da21f51c9efa349ed66951904f5f7aaaca866de90f22b69df01ab3745
  size 140815952
gating.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7cb71343d2d3ca1ea32a0fe820f09acf28893c9739611b9a4c67bc419c2975f2
  size 19805

  version https://git-lfs.github.com/spec/v1
+ oid sha256:4b0822147ff61b1a76a8df4424596b4c5af5e7a974f652d5841e5ee30580ecb5
  size 19805
projections.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:11f7e53dfb1ed34903ddd1a93f8760adaf36cf09f33d1625a11deafa18e485c9
  size 40373189

  version https://git-lfs.github.com/spec/v1
+ oid sha256:d340a923e58d0be9422a08aaa96124e10e7db75e84a0401943bc3cbaf7e05e35
  size 40373189
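The three binary files are stored as Git LFS pointers: each records the SHA-256 of the real file and its size in bytes. In this commit every `oid` changes while every `size` stays the same, which is consistent with retrained weights of identical shapes, though the diff alone does not prove that. A minimal sketch for checking a downloaded file against its pointer follows; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the pointer text is copied from the new side of the `projections.pt` hunk.

```python
import hashlib

def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }

def matches_pointer(data, pointer):
    """True if the bytes have the size and SHA-256 digest the pointer records."""
    return (len(data) == pointer["size"]
            and hashlib.sha256(data).hexdigest() == pointer["oid"])

pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:d340a923e58d0be9422a08aaa96124e10e7db75e84a0401943bc3cbaf7e05e35
size 40373189
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["algo"], pointer["size"])

# With the real file on disk, the check would look like:
# with open("projections.pt", "rb") as f:
#     print(matches_pointer(f.read(), pointer))
```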