E6E831728 committed · Commit 667a9a7 (verified) · Parent(s): ee5dbb8

Create README.md

Files changed (1): README.md (+90, -0)
---
license: other
library_name: transformers
tags:
- language-modeling
- transformer
- decoder-only
- fixed-embeddings
- binary-token-codes
- research
---

# Fixed Minimal Binary Code Model

This is an anonymized research checkpoint for the paper:

**Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes**

## Model variant

This repository contains the **fixed minimal binary token-code model**.

Instead of a trainable input embedding table, each token ID is represented by its exact minimal binary code.

For vocabulary size:

```text
V = 65,536
```

the minimal injective binary code width is:

```text
K = ceil(log2(V)) = 16
```

The 16-dimensional binary code is tiled to model width 1024.

The model therefore uses:

```text
0 trainable input-embedding parameters
```

The output projection remains standard and trainable.
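
As a concrete illustration, here is a minimal sketch of the fixed input mapping described above. It is not the repository's implementation: the helper name `fixed_binary_embedding` is hypothetical, the bits are taken as {0, 1} values, and the 16-bit code is tiled blockwise, whereas the actual checkpoint may scale or order the code differently.

```python
import torch

V = 65_536        # vocabulary size
K = 16            # ceil(log2(V)): minimal injective binary code width
D_MODEL = 1024    # model width; 1024 / 16 = 64 copies of the code

def fixed_binary_embedding(token_ids: torch.Tensor) -> torch.Tensor:
    """Map integer token IDs to fixed, non-trainable input vectors (hypothetical helper)."""
    # Extract the K bits of each token ID, most significant bit first.
    shifts = torch.arange(K - 1, -1, -1, device=token_ids.device)
    bits = (token_ids.unsqueeze(-1) >> shifts) & 1      # (..., K), values in {0, 1}
    # Tile the K-dimensional code to fill the model width.
    return torch.tile(bits, (D_MODEL // K,)).float()    # (..., D_MODEL)

x = fixed_binary_embedding(torch.tensor([[0, 1, 65_535]]))
print(x.shape)  # torch.Size([1, 3, 1024])
```

Because the minimal code is injective over all 65,536 token IDs, no two tokens share an input vector, which is what makes the table-free scheme viable.
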
## Architecture

- decoder-only Transformer
- vocabulary size: 65,536
- model width: 1024
- number of layers: 32
- number of attention heads: 32
- context length: 1024
- rotary positional embeddings
- GELU activations
- untied trainable output projection
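
For scale, a standard model with these dimensions would carry a trainable vocabulary-by-width input embedding table. The quick arithmetic check below (not code from the repository) shows how many parameters the fixed binary code removes on the input side.

```python
V, d_model = 65_536, 1024

# Size of the trainable input embedding table a standard model would need.
print(f"{V * d_model:,}")  # 67,108,864 parameters, all avoided here
```

The untied output projection, which remains trainable, is a separate V × d_model matrix and is unaffected by this saving.
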
## Loading example

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "E6E831728/fixed-minimal-binary-code"

# trust_remote_code is required to load the custom model class.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

prompt = "Question: What is the capital of the United Kingdom?\nAnswer:"
input_ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)

# Greedy decoding of up to 16 new tokens.
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=16, do_sample=False)

print(tokenizer.decode(output_ids[0].tolist()))
```

## Intended use

This checkpoint is provided for anonymous review and for reproducing the paper's main claim: a trainable input embedding table is not necessary for useful language modeling in the studied regime.

## Limitations

This model is a research checkpoint and is not intended for deployment. It may produce incorrect, biased, unsafe, or nonsensical outputs.

## Training data

The model was trained on the same FineWeb-Edu + Cosmopedia mixture used for the matched comparisons in the paper. Dataset terms and licenses are those of the original datasets.