Eclipse-Senpai/KeyLM-75M

@@ -30,7 +30,7 @@ datasets:
 KeyLM-75M is a 75M parameter base language model trained from scratch on approximately 18 billion tokens. That training budget is a small fraction of what comparable small models use (SmolLM-135M was trained on roughly 600B tokens, SmolLM2-135M on roughly 2T).
-This is the **base** model: a text-completion model, not instruction-tuned. It is intended as a starting point for fine-tuning. For chat and instruction following, use [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).
 ## Table of Contents
@@ -44,7 +44,7 @@ This is the **base** model: a text-completion model, not instruction-tuned. It i
 ## Model Summary
-KeyLM is a compact decoder-only transformer built on the standard small-model recipe used by Llama and Qwen3: grouped-query attention, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and per-head QK-RMSNorm. Weights are released in bfloat16 to make fine-tuning straightforward.
 | Field | Value |
 |---|---|
@@ -59,8 +59,6 @@ KeyLM is a compact decoder-only transformer built on the standard small-model re
 ## How to Use
-This is a base model: it continues text and has no chat template. Load it with `trust_remote_code=True` (requires `transformers>=4.51`).
 ```python
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -79,8 +77,6 @@ outputs = model.generate(
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
-For fine-tuning, the bfloat16 weights load directly into the usual `transformers` training stack; the model also fine-tunes with assistant-only loss masking under a plain `User:` / `Assistant:` format, which is how the Instruct version was produced.
 ## Evaluation
 On standard multiple-choice benchmarks KeyLM performs at or near random chance. This is expected at 75M parameters and 18B tokens: the model holds little parametric knowledge. Scores are zero-shot via `lm_eval` (accuracy; ARC and HellaSwag use length-normalized accuracy).

 KeyLM-75M is a 75M parameter base language model trained from scratch on approximately 18 billion tokens. That training budget is a small fraction of what comparable small models use (SmolLM-135M was trained on roughly 600B tokens, SmolLM2-135M on roughly 2T).
+This is the **base** model: a text-completion model, not instruction-tuned. For chat and instruction following, use [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).
 ## Table of Contents
 ## Model Summary
+KeyLM is a compact decoder-only transformer built on the standard small-model recipe used by Llama and Qwen3: grouped-query attention, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and per-head QK-RMSNorm.
 | Field | Value |
 |---|---|
 ## How to Use
 ```python
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 ## Evaluation
 On standard multiple-choice benchmarks KeyLM performs at or near random chance. This is expected at 75M parameters and 18B tokens: the model holds little parametric knowledge. Scores are zero-shot via `lm_eval` (accuracy; ARC and HellaSwag use length-normalized accuracy).
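
The difference between plain accuracy and length-normalized accuracy mentioned above can be sketched as follows. This is a minimal illustration, not the `lm_eval` implementation: it assumes each answer choice is scored by a summed log-likelihood and that normalization divides by the UTF-8 byte length of the choice text. The function name `pick_choice`, the example choices, and the log-likelihood values are all hypothetical.

```python
def pick_choice(loglikelihoods, choices, length_normalized=False):
    """Return the index of the highest-scoring answer choice."""
    scores = []
    for ll, text in zip(loglikelihoods, choices):
        if length_normalized:
            # Assumption: normalize by the UTF-8 byte length of the choice,
            # so longer answers are not penalized just for having more tokens.
            ll = ll / len(text.encode("utf-8"))
        scores.append(ll)
    # argmax over the (possibly normalized) scores
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up summed log-probabilities for two hypothetical answer choices.
choices = ["Paris", "the city of Lyon in France"]
loglikelihoods = [-6.0, -13.0]

raw_pick = pick_choice(loglikelihoods, choices)
norm_pick = pick_choice(loglikelihoods, choices, length_normalized=True)
print(raw_pick, norm_pick)
```

With these made-up numbers the two metrics disagree: the raw score favors the short choice, while the per-byte score favors the long one. That is why the reported metric matters when comparing ARC and HellaSwag results across model cards.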