Add KeyLM-75M base model (bf16, from-scratch, ~18B tokens)

Files changed (9)

README.md +130 -0
config.json +30 -0
configuration_keylm.py +13 -0
generation_config.json +5 -0
model.safetensors +3 -0
modeling_keylm.py +25 -0
special_tokens_map.json +5 -0
tokenizer.json +0 -0
tokenizer_config.json +12 -0

README.md ADDED

	@@ -0,0 +1,130 @@

+---
+license: apache-2.0
+language:
+- en
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- keylm
+- small-language-model
+- base
+- pretrained
+- gqa
+- rope
+- swiglu
+- qk-norm
+- custom_code
+datasets:
+- HuggingFaceFW/fineweb-edu-score-2
+- wikimedia/wikipedia
+- HuggingFaceGECLM/REDDIT_comments
+- marin-community/stackexchange-markdown
+- allenai/WildChat-1M
+- HuggingFaceH4/ultrachat_200k
+- lmsys/lmsys-chat-1m
+- OpenAssistant/oasst2
+- HuggingFaceTB/cosmopedia-100k
+---
+# KeyLM-75M
+KeyLM-75M is a 75M-parameter base language model trained from scratch on approximately 18 billion tokens. That training budget is a small fraction of what comparable small models use (SmolLM-135M was trained on roughly 600B tokens, SmolLM2-135M on roughly 2T).
+This is the **base** model: a text-completion model, not instruction-tuned. It is intended as a starting point for fine-tuning. For chat and instruction following, use [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).
+## Table of Contents
+1. [Model Summary](#model-summary)
+2. [How to Use](#how-to-use)
+3. [Evaluation](#evaluation)
+4. [Training](#training)
+5. [Limitations](#limitations)
+6. [License](#license)
+7. [Citation](#citation)
+## Model Summary
+KeyLM is a compact decoder-only transformer built on the Qwen3-style decoder layout, which extends the familiar Llama recipe (grouped-query attention, rotary position embeddings (RoPE), SwiGLU feed-forward layers) with per-head QK-RMSNorm. Weights are released in bfloat16 to make fine-tuning straightforward.
+| Field | Value |
+|---|---|
+| Parameters | 75,251,200 |
+| Layers | 24 |
+| Hidden size | 512 |
+| Attention heads | 8 (2 KV heads, GQA) |
+| Context length | 2048 |
+| Vocabulary | 12,020 (ByteLevel BPE) |
+| Precision | bfloat16 |
+| Training tokens | ~18B |
+## How to Use
+This is a base model: it continues text and has no chat template. Load it with `trust_remote_code=True` (requires `transformers>=4.51`).
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "Eclipse-Senpai/KeyLM-75M"
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
+)
+inputs = tokenizer("The three primary colors are", return_tensors="pt")
+outputs = model.generate(
+    **inputs, max_new_tokens=40, do_sample=True,
+    temperature=0.7, top_p=0.9, repetition_penalty=1.1,
+)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+For fine-tuning, the bfloat16 weights load directly into the usual `transformers` training stack. The Instruct version was produced by fine-tuning this checkpoint on a plain `User:` / `Assistant:` format with assistant-only loss masking.
+## Evaluation
+On standard multiple-choice benchmarks KeyLM performs at or near random chance. This is expected at 75M parameters and 18B tokens: the model holds little parametric knowledge. Scores are zero-shot accuracy in percent, measured with `lm_eval`; ARC and HellaSwag use length-normalized accuracy. A dash marks a score that is not reported.
+| Model | MMLU | ARC (avg) | HellaSwag | PIQA | WinoGrande | OpenBookQA |
+|---|---|---|---|---|---|---|
+| **KeyLM-75M (base)** | **23.0** | **26.4** | **—** | **52.9** | **48.3** | **19.8** |
+| KeyLM-75M-Instruct | 23.0 | 26.1 | 26.7 | 53.1 | 48.9 | 18.4 |
+| Random baseline | 25.0 | 25.0 | 25.0 | 50.0 | 50.0 | 25.0 |
+Instruction tuning leaves knowledge and reasoning essentially unchanged; both checkpoints sit close to the random baseline.
+## Training
+KeyLM-75M was pretrained from random initialization on approximately 18B tokens, drawn from a weighted mixture of public datasets streamed through a deterministic curriculum.
+| Category | Share | Sources |
+|---|---|---|
+| Formal / quality | ~30% | FineWeb-Edu, Wikipedia |
+| Casual / social | ~30% | Reddit comments, StackExchange |
+| Conversational | ~25% | WildChat, UltraChat, LMSYS-Chat, OASST2 |
+| Structured knowledge | ~5% | Cosmopedia |
+| Typo augmentation | ~10% | Synthetic (contrastive) |
+The instruction-tuned model built on this base is available at [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).
+## Limitations
+- Minimal world knowledge. Not suitable for factual question answering, reasoning, math, or code.
+- Base model: it completes text and does not follow instructions or hold a conversation. Use the Instruct version for chat.
+- English only.
+- No safety alignment. Apply your own filtering before any user-facing use.
+## License
+Apache 2.0. The weights are trained from scratch and free to use, modify, and redistribute.
+## Citation
+```bibtex
+@misc{keylm75m2026,
+  title  = {KeyLM-75M: a from-scratch small language model},
+  author = {Eclipse-Senpai},
+  year   = {2026},
+  howpublished = {\url{https://huggingface.co/Eclipse-Senpai/KeyLM-75M}}
+}
+```
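
The README's fine-tuning note mentions assistant-only loss masking under a plain `User:` / `Assistant:` format. A minimal sketch of that idea follows. The helper names `build_example` and `toy_encode`, the exact spacing and newlines in the prompt format, and the choice to end each reply with EOS are assumptions for illustration, not the author's actual pipeline. The EOS id of 2 comes from `config.json`; `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def build_example(turns, encode, eos_id):
    """Flatten (role, text) turns into parallel input_ids and labels lists.

    Only assistant reply tokens (plus the EOS that closes each reply) keep
    their labels; every other position is set to IGNORE_INDEX.
    """
    input_ids, labels = [], []
    for role, text in turns:
        if role == "user":
            ids = encode(f"User: {text}\n")
            input_ids += ids
            labels += [IGNORE_INDEX] * len(ids)
        elif role == "assistant":
            prefix = encode("Assistant: ")      # role tag is context, not a target
            reply = encode(text) + [eos_id]     # the model should learn to stop
            input_ids += prefix + reply
            labels += [IGNORE_INDEX] * len(prefix) + reply
        else:
            raise ValueError(f"unknown role: {role!r}")
    return input_ids, labels


def toy_encode(text):
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


turns = [("user", "Hi"), ("assistant", "Hello!")]
ids, labels = build_example(turns, toy_encode, eos_id=2)
supervised = sum(label != IGNORE_INDEX for label in labels)
print(len(ids), supervised)
```

With the real tokenizer, `toy_encode` would be replaced by something like `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]`. Encoding segments separately can split tokens differently at segment boundaries than encoding the whole string at once, so it is worth checking a few examples by decoding them.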

config.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "architectures": [
+    "KeyLM75M"
+  ],
+  "model_type": "keylm75m",
+  "auto_map": {
+    "AutoConfig": "configuration_keylm.KeyLM75MConfig",
+    "AutoModelForCausalLM": "modeling_keylm.KeyLM75M"
+  },
+  "vocab_size": 12020,
+  "hidden_size": 512,
+  "head_dim": 64,
+  "num_attention_heads": 8,
+  "num_key_value_heads": 2,
+  "intermediate_size": 1280,
+  "num_hidden_layers": 24,
+  "max_position_embeddings": 2048,
+  "rope_theta": 10000.0,
+  "rms_norm_eps": 1e-06,
+  "hidden_act": "silu",
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "use_sliding_window": false,
+  "tie_word_embeddings": false,
+  "initializer_range": 0.02,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 2,
+  "torch_dtype": "bfloat16"
+}
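
The parameter count in the README table (75,251,200) can be cross-checked against these config values. The sketch below assumes the usual Qwen3-style weight layout: bias-free q/k/v/o projections, one `head_dim`-sized RMSNorm weight each for queries and keys, a three-matrix SwiGLU MLP, two RMSNorm weights per block, and one final norm. The function name `count_parameters` is illustrative.

```python
# Values copied from config.json in this commit.
config = {
    "vocab_size": 12020,
    "hidden_size": 512,
    "head_dim": 64,
    "num_attention_heads": 8,
    "num_key_value_heads": 2,
    "intermediate_size": 1280,
    "num_hidden_layers": 24,
    "tie_word_embeddings": False,
}


def count_parameters(cfg: dict) -> int:
    """Count weights for a Qwen3-style decoder (assumed layout, no biases)."""
    d = cfg["hidden_size"]
    hd = cfg["head_dim"]
    q_dim = cfg["num_attention_heads"] * hd
    kv_dim = cfg["num_key_value_heads"] * hd

    # Attention projections: q and o are d x q_dim, k and v are d x kv_dim.
    attn = 2 * d * q_dim + 2 * d * kv_dim
    # Per-head QK-RMSNorm: one weight vector of size head_dim for q and for k.
    qk_norm = 2 * hd
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * d * cfg["intermediate_size"]
    # Two RMSNorm weight vectors per block (before attention and before the MLP).
    block_norms = 2 * d
    per_layer = attn + qk_norm + mlp + block_norms

    embed = cfg["vocab_size"] * d
    # Untied embeddings: the output head has its own matrix.
    lm_head = 0 if cfg["tie_word_embeddings"] else embed
    final_norm = d
    return embed + lm_head + cfg["num_hidden_layers"] * per_layer + final_norm


total = count_parameters(config)
print(f"{total:,}")  # 75,251,200
```

About 12.3M of the total sits in the two untied embedding matrices; the 24 decoder blocks account for roughly 62.9M.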

configuration_keylm.py ADDED

	@@ -0,0 +1,13 @@

+"""KeyLM model configuration.
+
+KeyLM-75M is a from-scratch small language model. Its decoder block is a
+Qwen3-style layout (grouped-query attention, RoPE, SwiGLU, and per-head
+QK-RMSNorm), so the configuration inherits Qwen3Config and only overrides the
+``model_type`` so the model carries its own identity on the Hub.
+"""
+
+from transformers.models.qwen3.configuration_qwen3 import Qwen3Config
+
+
+class KeyLM75MConfig(Qwen3Config):
+    model_type = "keylm75m"

generation_config.json ADDED

	@@ -0,0 +1,5 @@

+{
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 2
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:92e276317e548775125f713f98b61d9a9d46723e2cbf804875ce9e668fc2de76
+size 150531928
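
The pointer's `size` field gives a quick consistency check against the README's parameter count. The sketch below parses the pointer text and compares the file size with two bytes per bfloat16 weight. It assumes the file holds only the model tensors in bf16, so the remainder is safetensors overhead (an 8-byte length prefix plus the JSON header). `parse_lfs_pointer` is an illustrative helper, not part of any library.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:92e276317e548775125f713f98b61d9a9d46723e2cbf804875ce9e668fc2de76
size 150531928
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = parse_lfs_pointer(POINTER)
file_size = int(pointer["size"])

PARAMS = 75_251_200   # from the README's model summary table
BYTES_PER_BF16 = 2

tensor_bytes = PARAMS * BYTES_PER_BF16
overhead = file_size - tensor_bytes  # assumed: length prefix + JSON header

print(tensor_bytes)  # 150502400
print(overhead)      # 29528
```

An overhead of roughly 29 KB is consistent with a header describing a few hundred tensors, which supports the bf16 reading of the checkpoint.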

modeling_keylm.py ADDED

	@@ -0,0 +1,25 @@

+"""KeyLM model implementation.
+
+KeyLM-75M uses a Qwen3-style decoder (GQA + RoPE + SwiGLU + per-head
+QK-RMSNorm). Rather than vendor a full copy of the transformer, the classes
+below specialise the upstream Qwen3 implementation and bind it to
+KeyLM75MConfig so the model loads under its own name via
+`trust_remote_code=True`.
+"""
+
+try:
+    from transformers.models.qwen3.modeling_qwen3 import Qwen3ForCausalLM, Qwen3Model
+except ImportError as exc:  # pragma: no cover - guidance for old transformers
+    raise ImportError(
+        "KeyLM requires a transformers version that ships the Qwen3 model "
+        "(transformers>=4.51). Please upgrade transformers."
+    ) from exc
+
+from .configuration_keylm import KeyLM75MConfig
+
+
+class KeyLM75MModel(Qwen3Model):
+    config_class = KeyLM75MConfig
+
+
+class KeyLM75M(Qwen3ForCausalLM):
+    config_class = KeyLM75MConfig

special_tokens_map.json ADDED

	@@ -0,0 +1,5 @@

+{
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "lowercase": false,
+  "model_max_length": 2048,
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "[UNK]",
+  "vocab_size": 12020,
+  "add_bos_token": false,
+  "add_eos_token": false,
+  "clean_up_tokenization_spaces": false
+}
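
`model_max_length` here matches `max_position_embeddings` in `config.json` (2048). As a rough guide to what a full-length context costs at inference time, the sketch below estimates the key/value cache size from the config values. It assumes a plain, unquantized bf16 cache with no sliding window; `kv_cache_bytes` is an illustrative helper, and the 8-head variant is a hypothetical comparison to show what grouped-query attention saves.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int = 24,      # num_hidden_layers
    num_kv_heads: int = 2,     # num_key_value_heads
    head_dim: int = 64,
    bytes_per_value: int = 2,  # bfloat16
    batch_size: int = 1,
) -> int:
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


MIB = 1024 * 1024

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(2048)
# Hypothetical: the same model with one KV head per query head (no GQA).
full_context_mha = kv_cache_bytes(2048, num_kv_heads=8)

print(per_token)                    # 12288
print(full_context // MIB)          # 24
print(full_context_mha // MIB)      # 96
```

Under these assumptions a single 2048-token sequence needs about 24 MiB of cache, a quarter of what the same model would need with 8 KV heads.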