	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- keylm
	- small-language-model
	- base
	- pretrained
	- gqa
	- rope
	- swiglu
	- qk-norm
	- custom_code
	datasets:
	- HuggingFaceFW/fineweb-edu-score-2
	- wikimedia/wikipedia
	- HuggingFaceGECLM/REDDIT_comments
	- marin-community/stackexchange-markdown
	- allenai/WildChat-1M
	- HuggingFaceH4/ultrachat_200k
	- lmsys/lmsys-chat-1m
	- OpenAssistant/oasst2
	- HuggingFaceTB/cosmopedia-100k
	---

	# KeyLM-75M

	KeyLM-75M is a 75M-parameter base language model trained from scratch on approximately 18 billion tokens. That training budget is a small fraction of what comparable small models use: SmolLM-135M was trained on roughly 600B tokens (about 33 times more) and SmolLM2-135M on roughly 2T (over 100 times more).

	This is the base model: a text-completion model, not instruction-tuned. For chat and instruction following, use [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).

	## Table of Contents

	1. [Model Summary](#model-summary)
	2. [How to Use](#how-to-use)
	3. [Evaluation](#evaluation)
	4. [Training](#training)
	5. [Limitations](#limitations)
	6. [License](#license)
	7. [Citation](#citation)

	## Model Summary

	KeyLM is a compact decoder-only transformer built on the now-standard recipe of the Llama and Qwen3 model families: grouped-query attention (GQA), rotary position embeddings (RoPE), SwiGLU feed-forward layers, and, as in Qwen3, per-head QK-RMSNorm.

	| Field | Value |
	|---|---|
	| Parameters | 75,251,200 |
	| Layers | 24 |
	| Hidden size | 512 |
	| Attention heads | 8 (2 KV heads, GQA) |
	| Context length | 2048 tokens |
	| Vocabulary | 12,020 (ByteLevel BPE) |
	| Precision | bfloat16 |
	| Training tokens | ~18B |
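
As a back-of-envelope check on the table, the sketch below counts weights for a bias-free Llama-style decoder and estimates the KV-cache footprint that GQA buys. The table does not state the feed-forward width, whether the embeddings are tied, or the head dimension, so `FFN_DIM = 1280`, untied embeddings, and `HEAD_DIM = 64` are assumptions: they are one configuration that happens to reproduce the stated total of 75,251,200 parameters, not a confirmed description of the released config.

```python
# Values taken from the table above.
VOCAB = 12_020
HIDDEN = 512
LAYERS = 24
HEADS = 8
KV_HEADS = 2
CONTEXT = 2048

# Assumptions (not stated in the model card).
HEAD_DIM = HIDDEN // HEADS   # 64, the usual hidden_size / num_heads split
FFN_DIM = 1280               # hypothetical SwiGLU intermediate width
TIED_EMBEDDINGS = False      # assume a separate input embedding and output head
BYTES_PER_VALUE = 2          # bfloat16


def count_parameters() -> int:
    """Count weights for a bias-free decoder with GQA, SwiGLU, and QK-RMSNorm."""
    embed = VOCAB * HIDDEN
    lm_head = 0 if TIED_EMBEDDINGS else VOCAB * HIDDEN

    q_proj = HIDDEN * HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = HEADS * HEAD_DIM * HIDDEN
    qk_norm = 2 * HEAD_DIM          # one RMSNorm gain vector each for Q and K
    block_norms = 2 * HIDDEN        # pre-attention and pre-FFN RMSNorm
    swiglu = 3 * HIDDEN * FFN_DIM   # gate, up, and down projections

    per_layer = q_proj + k_proj + v_proj + o_proj + qk_norm + block_norms + swiglu
    final_norm = HIDDEN
    return embed + lm_head + LAYERS * per_layer + final_norm


def kv_cache_bytes(kv_heads: int, tokens: int = CONTEXT) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * tokens


total_params = count_parameters()
gqa_cache_mib = kv_cache_bytes(KV_HEADS) / 2**20
mha_cache_mib = kv_cache_bytes(HEADS) / 2**20

print(f"parameters: {total_params:,}")
print(f"KV cache at {CONTEXT} tokens: {gqa_cache_mib:.0f} MiB with {KV_HEADS} KV heads, "
      f"{mha_cache_mib:.0f} MiB with {HEADS} KV heads")
```

Under these assumptions, sharing 2 KV heads across 8 query heads cuts the per-sequence cache to a quarter of what full multi-head attention would need at the same context length.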

	## How to Use

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "Eclipse-Senpai/KeyLM-75M"

	# trust_remote_code=True is needed because the architecture ships as custom code.
	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
	)

	# Base model: give it a prefix to continue, not an instruction.
	inputs = tokenizer("The three primary colors are", return_tensors="pt")
	outputs = model.generate(
	    **inputs, max_new_tokens=40, do_sample=True,
	    temperature=0.7, top_p=0.9, repetition_penalty=1.1,
	)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```
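
To give a feel for what `repetition_penalty`, `temperature`, and `top_p` do in the call above, here is a minimal pure-Python sketch of the three steps applied to a toy distribution. It is an illustration only, not the `transformers` implementation: the helper names and the five-token logits are hypothetical, and the library's exact tie-breaking and ordering details may differ.

```python
import math


def apply_repetition_penalty(logits, seen_ids, penalty):
    """Make already-generated tokens less likely (CTRL-style penalty)."""
    out = list(logits)
    for i in set(seen_ids):
        # Positive logits shrink, negative logits grow more negative.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


# Hypothetical logits for a 5-token vocabulary; token 0 was already generated.
logits = [2.0, 1.5, 0.5, -1.0, -3.0]
penalized = apply_repetition_penalty(logits, seen_ids=[0], penalty=1.1)
probs = softmax(penalized, temperature=0.7)
candidates = top_p_filter(probs, top_p=0.9)
print(candidates)  # token id -> renormalized sampling probability
```

The next token is then sampled from `candidates`. For reproducible completions, passing `do_sample=False` to `generate` switches to greedy decoding.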

	## Evaluation

	On zero-shot benchmarks (`lm_eval`; accuracy in percent, with length-normalized accuracy for ARC and HellaSwag), KeyLM scores modestly above random on basic commonsense and science tasks (PIQA, HellaSwag, ARC) and at or near chance on MMLU, OpenBookQA, and WinoGrande.

	| Benchmark | KeyLM-75M (base) | KeyLM-75M-Instruct | Random |
	|---|---|---|---|
	| IFEval (4-metric avg) | n/a | 17.85 | n/a |
	| MMLU | 23.0 | 24.0 | 25.0 |
	| ARC (avg) | 29.9 | 30.8 | 25.0 |
	| HellaSwag | 29.7 | 31.0 | 25.0 |
	| PIQA | 60.0 | 61.3 | 50.0 |
	| WinoGrande | 48.4 | 48.3 | 50.0 |
	| OpenBookQA | 25.0 | 25.0 | 25.0 |

	Instruction tuning leaves knowledge and reasoning roughly unchanged. Its main effect is to add the instruction-following ability (measured by IFEval) that the base model lacks.
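
The gap to the random baseline is easier to see as a margin in percentage points. The snippet below is a small convenience sketch that recomputes those margins from the scores in the table. The dictionary layout and function name are illustrative and not part of any evaluation tooling.

```python
# Scores copied from the table above: (base, instruct, random baseline), in percent.
SCORES = {
    "MMLU": (23.0, 24.0, 25.0),
    "ARC (avg)": (29.9, 30.8, 25.0),
    "HellaSwag": (29.7, 31.0, 25.0),
    "PIQA": (60.0, 61.3, 50.0),
    "WinoGrande": (48.4, 48.3, 50.0),
    "OpenBookQA": (25.0, 25.0, 25.0),
}


def margins_over_chance(scores):
    """Return {benchmark: (base - random, instruct - random)} in percentage points."""
    return {
        name: (round(base - rand, 1), round(inst - rand, 1))
        for name, (base, inst, rand) in scores.items()
    }


margins = margins_over_chance(SCORES)
for name, (base_margin, inst_margin) in margins.items():
    print(f"{name:<11} base {base_margin:+5.1f}   instruct {inst_margin:+5.1f}")
```

PIQA shows the largest margin for both models, while MMLU and WinoGrande come out slightly negative, which is consistent with chance-level performance on those tasks.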

	## Training

	KeyLM-75M was pretrained from random initialization on approximately 18B tokens, drawn from a weighted mixture of public datasets streamed through a deterministic curriculum.

	| Category | Share | Sources |
	|---|---|---|
	| Formal / quality | ~30% | FineWeb-Edu, Wikipedia |
	| Casual / social | ~30% | Reddit comments, StackExchange |
	| Conversational | ~25% | WildChat, UltraChat, LMSYS-Chat, OASST2 |
	| Structured knowledge | ~5% | Cosmopedia |
	| Typo augmentation | ~10% | Synthetic (contrastive) |
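
To make the shares concrete, the sketch below converts them into approximate token budgets and shows one simple way a weighted mixture can be sampled reproducibly. This is an illustration under stated assumptions, not the actual KeyLM data pipeline: the category keys, the per-batch sampling scheme, and the seed are all hypothetical, and the real curriculum may order data quite differently.

```python
import random

TOTAL_TOKENS = 18_000_000_000  # "approximately 18B" from the text above

# Category shares from the table above, in percent (the table gives them as approximate).
MIXTURE = {
    "formal_quality": 30,
    "casual_social": 30,
    "conversational": 25,
    "structured_knowledge": 5,
    "typo_augmentation": 10,
}


def token_budgets(mixture, total_tokens):
    """Split the token budget across categories in proportion to their share."""
    return {name: total_tokens * pct // 100 for name, pct in mixture.items()}


def sample_schedule(mixture, steps, seed=0):
    """Draw a reproducible sequence of categories, one per batch.

    A fixed seed makes the order identical on every run, which is one simple
    way to get a deterministic data order.
    """
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=steps)


budgets = token_budgets(MIXTURE, TOTAL_TOKENS)
schedule = sample_schedule(MIXTURE, steps=10, seed=0)

for name, tokens in budgets.items():
    print(f"{name:<21} ~{tokens / 1e9:.1f}B tokens")
print(schedule)
```

With these shares, the two largest categories each receive about 5.4B tokens, and Cosmopedia about 0.9B.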

	The instruction-tuned model built on this base is available at [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct).

	## Limitations

	- Minimal world knowledge. Not suitable for factual question answering, reasoning, math, or code.
	- Base model: it completes text and does not follow instructions or hold a conversation. Use the Instruct version for chat.
	- English only.
	- No safety alignment. Apply your own filtering before any user-facing use.

	## License

	Apache 2.0. The weights are trained from scratch and free to use, modify, and redistribute.

	## Citation

	```bibtex
	@misc{keylm75m2026,
	  title        = {KeyLM-75M: a from-scratch small language model},
	  author       = {Eclipse-Senpai},
	  year         = {2026},
	  howpublished = {\url{https://huggingface.co/Eclipse-Senpai/KeyLM-75M}}
	}
	```