Eclipse-Senpai committed on
Commit ad7e15f · verified · 1 Parent(s): 8dbce4f

Update README.md

Files changed (1)
  1. README.md +2 -6
README.md CHANGED
@@ -30,7 +30,7 @@ datasets:
 
  KeyLM-75M is a 75M parameter base language model trained from scratch on approximately 18 billion tokens. That training budget is a small fraction of what comparable small models use (SmolLM-135M was trained on roughly 600B tokens, SmolLM2-135M on roughly 2T).
 
- This is the **base** model: a text-completion model, not instruction-tuned. It is intended as a starting point for fine-tuning. For chat and instruction following, use [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).
+ This is the **base** model: a text-completion model, not instruction-tuned. For chat and instruction following, use [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).
 
  ## Table of Contents
 
@@ -44,7 +44,7 @@ This is the **base** model: a text-completion model, not instruction-tuned. It i
 
  ## Model Summary
 
- KeyLM is a compact decoder-only transformer built on the standard small-model recipe used by Llama and Qwen3: grouped-query attention, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and per-head QK-RMSNorm. Weights are released in bfloat16 to make fine-tuning straightforward.
+ KeyLM is a compact decoder-only transformer built on the standard small-model recipe used by Llama and Qwen3: grouped-query attention, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and per-head QK-RMSNorm.
 
  | Field | Value |
  |---|---|
@@ -59,8 +59,6 @@ KeyLM is a compact decoder-only transformer built on the standard small-model re
 
  ## How to Use
 
- This is a base model: it continues text and has no chat template. Load it with `trust_remote_code=True` (requires `transformers>=4.51`).
-
  ```python
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -79,8 +77,6 @@ outputs = model.generate(
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
 
- For fine-tuning, the bfloat16 weights load directly into the usual `transformers` training stack; the model also fine-tunes with assistant-only loss masking under a plain `User:` / `Assistant:` format, which is how the Instruct version was produced.
-
  ## Evaluation
 
  On standard multiple-choice benchmarks KeyLM performs at or near random chance. This is expected at 75M parameters and 18B tokens: the model holds little parametric knowledge. Scores are zero-shot via `lm_eval` (accuracy; ARC and HellaSwag use length-normalized accuracy).
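
Reviewer note: the fine-tuning sentence removed by this commit refers to assistant-only loss masking under a plain `User:` / `Assistant:` format. For readers who lose that pointer, the idea can be sketched as below. This is a minimal illustration, not the author's training code: the character-level `toy_tokenize` helper, the newline turn separator, and the choice to mask the `Assistant: ` prefix itself are all assumptions made for the example. The only fixed convention is `-100`, the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one integer ID per character."""
    return [ord(ch) for ch in text]


def build_example(turns, tokenize=toy_tokenize):
    """Concatenate (role, text) turns and mask every non-assistant token.

    Returns (input_ids, labels). labels matches input_ids on assistant
    reply tokens and is IGNORE_INDEX everywhere else.
    """
    input_ids, labels = [], []
    for role, text in turns:
        prefix_ids = tokenize(f"{role}: ")   # assumed template: "Role: text\n"
        body_ids = tokenize(text + "\n")
        input_ids += prefix_ids + body_ids
        # The role prefix is never a training target in this sketch.
        labels += [IGNORE_INDEX] * len(prefix_ids)
        if role == "Assistant":
            labels += body_ids                        # supervised tokens
        else:
            labels += [IGNORE_INDEX] * len(body_ids)  # masked user tokens
    return input_ids, labels


turns = [("User", "Hi"), ("Assistant", "Hello!")]
input_ids, labels = build_example(turns)

# Recover the text that actually contributes to the loss.
supervised = "".join(chr(t) for t in labels if t != IGNORE_INDEX)
print(repr(supervised))
```

With a real tokenizer the same pattern applies: tokenize each turn separately, keep `labels` the same length as `input_ids`, and pass both to the model so that only the assistant replies are scored.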