---
datasets:
- HuggingFaceFW/fineweb-edu
---

# RSCaLM-138M-core
**RSCaLM** (**Research Scale Causal Language Model**), *Core Edition*, is an **experimental 138M-parameter decoder-only transformer** trained for **20,000 steps**.
Unlike the LLaMA variant, this model is implemented entirely with a **custom minimal GPT architecture** (`standalone_transformer_lm.GPT`) and **SentencePiece** tokenization, with no Hugging Face Transformers dependency.

---
## Experiment Summary

* **Architecture:** Custom GPT-style causal decoder

  * Implemented in `standalone_transformer_lm.py`
  * Learned positional embeddings (absolute)
  * Multi-head self-attention with KV caching
  * GELU feed-forward layers
  * LayerNorm
* **Parameter Count:** ~138M (see the sanity-check sketch after this list)
* **Context Length:** 2048 tokens
* **Tokenizer:** SentencePiece (`tokenizer.model`)
* **Training Framework:** Pure PyTorch (no Transformers)
* **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup (see the schedule sketch after this list)
* **Precision:** Mixed FP16/BF16 training
* **Steps Completed:** 20,000 (~32% of planned total)
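
As a quick sanity check on the parameter count, the sketch below rebuilds the model from the checkpoint's stored config and counts trainable parameters. It assumes the same `ckpt_best.pt` / `GPTConfig` interface shown in the usage examples further down.

```python
import torch

from standalone_transformer_lm import GPT, GPTConfig

# Rebuild the model from the configuration stored in the checkpoint.
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
model = GPT(GPTConfig(**ckpt["config"]))

# Count trainable parameters; this should come out at roughly 138M.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params / 1e6:.1f}M")
```

The optimizer and schedule can be reproduced in pure PyTorch along the following lines. The β values and weight decay are the documented settings; the warmup length, peak/minimum learning rates, and the ~62,500-step total (implied by 20,000 being ~32% of the plan) are placeholders, not confirmed values from the training run.

```python
import math

import torch

from standalone_transformer_lm import GPT, GPTConfig

def cosine_with_warmup(step, warmup_steps=1_000, total_steps=62_500,
                       max_lr=3e-4, min_lr=3e-5):
    """Absolute learning rate at `step`: linear warmup, then cosine decay (placeholder values)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

ckpt = torch.load("ckpt_best.pt", map_location="cpu")
model = GPT(GPTConfig(**ckpt["config"]))

# AdamW betas and weight decay are the documented values; with base lr = 1.0,
# LambdaLR treats the schedule's return value as the absolute learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0,
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)
# A training loop would call optimizer.step() followed by scheduler.step() each step.
```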
---
## Validation Loss Progress

| Step   | Val Loss |
| ------ | -------- |
| 1,000  | 5.6011   |
| 2,000  | 4.8598   |
| 5,000  | 4.2239   |
| 10,000 | 3.9756   |
| 15,000 | 3.8608   |
| 20,000 | 3.7984   |
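
Assuming these values are mean per-token cross-entropy in nats, they translate directly to perplexity, which is often easier to compare across runs:

```python
import math

# Final validation loss at step 20,000 from the table above.
val_loss = 3.7984
print(f"Perplexity: {math.exp(val_loss):.1f}")  # ~44.6
```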
---
## Notes

* **Prototype only:** repetition loops are expected in longer generations.
* Requires **`standalone_transformer_lm.py`** and **SentencePiece** to run.
* Does **not** load with `transformers.AutoModelForCausalLM`.

---
## Example Usage

```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

# Load checkpoint & config
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])

# Init model & load weights
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode prompt
ids = torch.tensor([sp.encode("Dubai is", out_type=int)])

# Generate text
out = model.generate(ids, max_new_tokens=40)
print(sp.decode(out[0].tolist()))
```
---

## Example Usage (with repetition control)
```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

prompt = "when a man goes to fishing"
ids = torch.tensor([sp.encode(prompt, out_type=int)])

# Manual repetition control
out = model.generate(
    ids,
    max_new_tokens=100,
    temperature=0.7,         # Lower temp = more focused
    top_k=50,                # Top-K sampling
    top_p=0.9,               # Nucleus sampling
    repetition_penalty=1.2,  # Penalize repeats
    no_repeat_ngram_size=3,  # Block repeating trigrams
)
print(sp.decode(out[0].tolist()))
```
---

### Tips to Reduce Loops
* Increase `repetition_penalty` to 1.2–1.5
* Use `no_repeat_ngram_size=3` or higher
* Combine `top_k` and `top_p` for better sampling variety
* Lower `temperature` for more deterministic completions (see the sketch below for how these controls typically act on the logits)
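
For intuition on what these knobs do, the sketch below shows one common way such controls are applied to the next-token logits before sampling (n-gram blocking is omitted for brevity). It is illustrative only; the actual behaviour of the `GPT.generate` parameters is defined in `standalone_transformer_lm.py`.

```python
import torch

def filter_logits(logits, generated_ids, repetition_penalty=1.2, top_k=50, top_p=0.9):
    """Illustrative logit post-processing for one decoding step (1-D logits tensor)."""
    # Repetition penalty: down-weight tokens already present in the output so far.
    for token_id in set(generated_ids.tolist()):
        if logits[token_id] > 0:
            logits[token_id] /= repetition_penalty
        else:
            logits[token_id] *= repetition_penalty

    # Top-k: keep only the k highest-scoring tokens.
    kth_best = torch.topk(logits, top_k).values[-1]
    logits[logits < kth_best] = float("-inf")

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability >= top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    to_drop = cumulative > top_p
    to_drop[1:] = to_drop[:-1].clone()  # shift right so the first token above the threshold survives
    to_drop[0] = False                  # always keep the single most likely token
    logits[sorted_idx[to_drop]] = float("-inf")
    return logits

# Sampling then draws from softmax(filter_logits(logits, generated_ids) / temperature).
```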
---

## License

Apache-2.0

---