Eclipse-Senpai committed on
Commit 8dbce4f · verified · 1 Parent(s): 5049af5

Add KeyLM-75M base model (bf16, from-scratch, ~18B tokens)

README.md ADDED
@@ -0,0 +1,130 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - keylm
+ - small-language-model
+ - base
+ - pretrained
+ - gqa
+ - rope
+ - swiglu
+ - qk-norm
+ - custom_code
+ datasets:
+ - HuggingFaceFW/fineweb-edu-score-2
+ - wikimedia/wikipedia
+ - HuggingFaceGECLM/REDDIT_comments
+ - marin-community/stackexchange-markdown
+ - allenai/WildChat-1M
+ - HuggingFaceH4/ultrachat_200k
+ - lmsys/lmsys-chat-1m
+ - OpenAssistant/oasst2
+ - HuggingFaceTB/cosmopedia-100k
+ ---
+
+ # KeyLM-75M
+
+ KeyLM-75M is a 75M-parameter base language model trained from scratch on approximately 18 billion tokens. That training budget is a small fraction of what comparable small models use (SmolLM-135M was trained on roughly 600B tokens, SmolLM2-135M on roughly 2T).
+
+ This is the **base** model: a text-completion model, not instruction-tuned. It is intended as a starting point for fine-tuning. For chat and instruction following, use [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).
+
+ ## Table of Contents
+
+ 1. [Model Summary](#model-summary)
+ 2. [How to Use](#how-to-use)
+ 3. [Evaluation](#evaluation)
+ 4. [Training](#training)
+ 5. [Limitations](#limitations)
+ 6. [License](#license)
+ 7. [Citation](#citation)
+
+ ## Model Summary
+
+ KeyLM is a compact decoder-only transformer built on the standard small-model recipe used by Llama and Qwen3: grouped-query attention, rotary position embeddings (RoPE), and SwiGLU feed-forward layers, plus the per-head QK-RMSNorm used in Qwen3. Weights are released in bfloat16 to make fine-tuning straightforward.
+
+ | Field | Value |
+ |---|---|
+ | Parameters | 75,251,200 |
+ | Layers | 24 |
+ | Hidden size | 512 |
+ | Attention heads | 8 (2 KV heads, GQA) |
+ | Context length | 2048 |
+ | Vocabulary | 12,020 (ByteLevel BPE) |
+ | Precision | bfloat16 |
+ | Training tokens | ~18B |
+
+ ## How to Use
+
+ This is a base model: it continues text and has no chat template. Load it with `trust_remote_code=True` (requires `transformers>=4.51`).
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "Eclipse-Senpai/KeyLM-75M"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
+ )
+
+ inputs = tokenizer("The three primary colors are", return_tensors="pt")
+ outputs = model.generate(
+     **inputs, max_new_tokens=40, do_sample=True,
+     temperature=0.7, top_p=0.9, repetition_penalty=1.1,
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ For fine-tuning, the bfloat16 weights load directly into the usual `transformers` training stack; the model also fine-tunes with assistant-only loss masking under a plain `User:` / `Assistant:` format, which is how the Instruct version was produced.
+
+ ## Evaluation
+
+ On standard multiple-choice benchmarks KeyLM performs at or near random chance. This is expected at 75M parameters and 18B tokens: the model holds little parametric knowledge. Scores are zero-shot via `lm_eval` (accuracy; ARC and HellaSwag use length-normalized accuracy).
+
+ | Model | MMLU | ARC (avg) | HellaSwag | PIQA | WinoGrande | OpenBookQA |
+ |---|---|---|---|---|---|---|
+ | **KeyLM-75M (base)** | **23.0** | **26.4** | **—** | **52.9** | **48.3** | **19.8** |
+ | KeyLM-75M-Instruct | 23.0 | 26.1 | 26.7 | 53.1 | 48.9 | 18.4 |
+ | Random baseline | 25.0 | 25.0 | 25.0 | 50.0 | 50.0 | 25.0 |
+
+ Instruction tuning leaves knowledge and reasoning essentially unchanged; both checkpoints sit close to the random baseline.
+
+ ## Training
+
+ KeyLM-75M was pretrained from random initialization on approximately 18B tokens, drawn from a weighted mixture of public datasets streamed through a deterministic curriculum.
+
+ | Category | Share | Sources |
+ |---|---|---|
+ | Formal / quality | ~30% | FineWeb-Edu, Wikipedia |
+ | Casual / social | ~30% | Reddit comments, StackExchange |
+ | Conversational | ~25% | WildChat, UltraChat, LMSYS-Chat, OASST2 |
+ | Structured knowledge | ~5% | Cosmopedia |
+ | Typo augmentation | ~10% | Synthetic (contrastive) |
+
+ The instruction-tuned model built on this base is available at [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).
+
+ ## Limitations
+
+ - Minimal world knowledge. Not suitable for factual question answering, reasoning, math, or code.
+ - Base model: it completes text and does not follow instructions or hold a conversation. Use the Instruct version for chat.
+ - English only.
+ - No safety alignment. Apply your own filtering before any user-facing use.
+
+ ## License
+
+ Apache 2.0. The weights are trained from scratch and free to use, modify, and redistribute.
+
+ ## Citation
+
+ ```bibtex
+ @misc{keylm75m2026,
+   title = {KeyLM-75M: a from-scratch small language model},
+   author = {Eclipse-Senpai},
+   year = {2026},
+   howpublished = {\url{https://huggingface.co/Eclipse-Senpai/KeyLM-75M}}
+ }
+ ```
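
The README's fine-tuning note mentions assistant-only loss masking under a plain `User:` / `Assistant:` format. The sketch below illustrates the general idea only. The `build_example` helper and the toy token IDs are hypothetical, and the exact prompt template and tokenization used for the Instruct model are not part of this commit. The one fixed convention is `-100`, the default `ignore_index` of PyTorch's cross-entropy loss; the EOS id of 2 is taken from `config.json`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(turns, eos_token_id=2):
    """Concatenate (role, token_ids) turns into input_ids and masked labels.

    Hypothetical helper: only tokens from "assistant" turns (plus the EOS that
    closes each of them) contribute to the loss; everything else is masked.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
            input_ids.append(eos_token_id)
            labels.append(eos_token_id)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy token IDs standing in for tokenized "User: ..." / "Assistant: ..." text.
turns = [("user", [11, 12, 13]), ("assistant", [21, 22])]
input_ids, labels = build_example(turns)
print(input_ids)  # [11, 12, 13, 21, 22, 2]
print(labels)     # [-100, -100, -100, 21, 22, 2]
```

Passing `labels` built this way to a causal-LM forward call means the user prompt still conditions the prediction but does not receive gradient from the loss.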
config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "architectures": [
+     "KeyLM75M"
+   ],
+   "model_type": "keylm75m",
+   "auto_map": {
+     "AutoConfig": "configuration_keylm.KeyLM75MConfig",
+     "AutoModelForCausalLM": "modeling_keylm.KeyLM75M"
+   },
+   "vocab_size": 12020,
+   "hidden_size": 512,
+   "head_dim": 64,
+   "num_attention_heads": 8,
+   "num_key_value_heads": 2,
+   "intermediate_size": 1280,
+   "num_hidden_layers": 24,
+   "max_position_embeddings": 2048,
+   "rope_theta": 10000.0,
+   "rms_norm_eps": 1e-06,
+   "hidden_act": "silu",
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "use_sliding_window": false,
+   "tie_word_embeddings": false,
+   "initializer_range": 0.02,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "pad_token_id": 2,
+   "torch_dtype": "bfloat16"
+ }
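
As a sanity check, the parameter count quoted in the README (75,251,200) can be re-derived from the values in `config.json`. The sketch below assumes the upstream Qwen3 layer layout: bias-free q/k/v/o projections, one RMSNorm weight vector of size `head_dim` each for queries and keys, a three-matrix SwiGLU MLP, two RMSNorm weights per layer, a final norm, and untied input and output embeddings. The `count_parameters` helper is illustrative and not part of the repository.

```python
# Values copied from config.json.
cfg = {
    "vocab_size": 12020,
    "hidden_size": 512,
    "head_dim": 64,
    "num_attention_heads": 8,
    "num_key_value_heads": 2,
    "intermediate_size": 1280,
    "num_hidden_layers": 24,
    "tie_word_embeddings": False,
}


def count_parameters(cfg):
    """Count weights for a Qwen3-style decoder without attention biases."""
    h = cfg["hidden_size"]
    d = cfg["head_dim"]
    q_dim = cfg["num_attention_heads"] * d
    kv_dim = cfg["num_key_value_heads"] * d

    attention = h * q_dim + 2 * h * kv_dim + q_dim * h  # q, k, v, o projections
    qk_norm = 2 * d                                      # RMSNorm weights for q and k
    mlp = 3 * h * cfg["intermediate_size"]               # gate, up, down (SwiGLU)
    layer_norms = 2 * h                                  # pre-attention and pre-MLP RMSNorm
    per_layer = attention + qk_norm + mlp + layer_norms

    embeddings = cfg["vocab_size"] * h
    lm_head = 0 if cfg["tie_word_embeddings"] else cfg["vocab_size"] * h
    final_norm = h
    return cfg["num_hidden_layers"] * per_layer + embeddings + lm_head + final_norm


total = count_parameters(cfg)
print(f"{total:,}")  # 75,251,200
```

Under these assumptions the two embedding matrices account for 12,308,480 parameters, roughly 16% of the model, which is why the small 12,020-entry vocabulary matters at this scale.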
configuration_keylm.py ADDED
@@ -0,0 +1,13 @@
+ """KeyLM model configuration.
+
+ KeyLM-75M is a from-scratch small language model. Its decoder block is a
+ Qwen3-style layout (grouped-query attention, RoPE, SwiGLU, and per-head
+ QK-RMSNorm), so the configuration inherits Qwen3Config and only overrides the
+ ``model_type`` so the model carries its own identity on the Hub.
+ """
+
+ from transformers.models.qwen3.configuration_qwen3 import Qwen3Config
+
+
+ class KeyLM75MConfig(Qwen3Config):
+     model_type = "keylm75m"
generation_config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "pad_token_id": 2
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:92e276317e548775125f713f98b61d9a9d46723e2cbf804875ce9e668fc2de76
+ size 150531928
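
The `size` field of the LFS pointer offers a quick consistency check against the README: 75,251,200 bfloat16 parameters at 2 bytes each should account for almost the whole file. The sketch below parses the pointer text and computes the remainder. Attributing that remainder to the safetensors length prefix and JSON header is an assumption, since the file itself is not inspected here, and `parse_lfs_pointer` is an illustrative helper.

```python
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:92e276317e548775125f713f98b61d9a9d46723e2cbf804875ce9e668fc2de76
size 150531928
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = parse_lfs_pointer(pointer_text)
file_size = int(pointer["size"])

num_parameters = 75_251_200  # from the README table
bytes_per_param = 2          # bfloat16
weight_bytes = num_parameters * bytes_per_param
overhead = file_size - weight_bytes  # assumed: safetensors length prefix + JSON header

print(weight_bytes)  # 150502400
print(overhead)      # 29528
```

An overhead of about 29 KB is plausible for a header describing a few hundred tensors, so the file size is consistent with a single bfloat16 copy of the weights.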
modeling_keylm.py ADDED
@@ -0,0 +1,25 @@
+ """KeyLM model implementation.
+
+ KeyLM-75M uses a Qwen3-style decoder (GQA + RoPE + SwiGLU + per-head
+ QK-RMSNorm). Rather than vendor a full copy of the transformer, the classes
+ below specialise the upstream Qwen3 implementation and bind it to
+ KeyLM75MConfig so the model loads under its own name via `trust_remote_code=True`.
+ """
+
+ try:
+     from transformers.models.qwen3.modeling_qwen3 import Qwen3ForCausalLM, Qwen3Model
+ except ImportError as exc:  # pragma: no cover - guidance for old transformers
+     raise ImportError(
+         "KeyLM requires a transformers version that ships the Qwen3 model "
+         "(transformers>=4.51). Please upgrade transformers."
+     ) from exc
+
+ from .configuration_keylm import KeyLM75MConfig
+
+
+ class KeyLM75MModel(Qwen3Model):
+     config_class = KeyLM75MConfig
+
+
+ class KeyLM75M(Qwen3ForCausalLM):
+     config_class = KeyLM75MConfig
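
The grouped-query attention named in the docstring can be made concrete with the numbers from `config.json`: 8 query heads share 2 key/value heads, so each KV head serves a contiguous group of 4 query heads. The sketch below shows that mapping and estimates the KV-cache footprint for one full-length sequence. The `kv_cache_bytes` helper is illustrative and counts only the cached key and value tensors in bfloat16, ignoring any framework overhead.

```python
num_attention_heads = 8
num_key_value_heads = 2
head_dim = 64
num_hidden_layers = 24
max_position_embeddings = 2048
bytes_per_value = 2  # bfloat16

# Each KV head is shared by a contiguous group of query heads.
group_size = num_attention_heads // num_key_value_heads
kv_head_for_query = [q // group_size for q in range(num_attention_heads)]


def kv_cache_bytes(kv_heads, seq_len):
    """Bytes needed to cache keys and values for one sequence."""
    per_token = 2 * num_hidden_layers * kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


gqa = kv_cache_bytes(num_key_value_heads, max_position_embeddings)
mha = kv_cache_bytes(num_attention_heads, max_position_embeddings)

print(kv_head_for_query)  # [0, 0, 0, 0, 1, 1, 1, 1]
print(gqa // 2**20, "MiB with 2 KV heads")              # 24 MiB
print(mha // 2**20, "MiB with one KV head per query")   # 96 MiB
```

With these settings the cache is a quarter of what one KV head per query head would need, at the cost of sharing keys and values within each group.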
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "lowercase": false,
+   "model_max_length": 2048,
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "unk_token": "[UNK]",
+   "vocab_size": 12020,
+   "add_bos_token": false,
+   "add_eos_token": false,
+   "clean_up_tokenization_spaces": false
+ }