---
datasets:
- HuggingFaceFW/fineweb-edu
---

# RSCaLM-138M-core

**RSCaLM** (**Research Scale Causal Language Model**), *Core Edition*, is an **experimental 138M-parameter decoder-only transformer** trained for **20,000 steps**.
Unlike the LLaMA variant, this model is implemented entirely with a **custom minimal GPT architecture** (`standalone_transformer_lm.GPT`) and **SentencePiece** tokenization, with no Hugging Face Transformers dependency.

---

## Experiment Summary

* **Architecture:** Custom GPT-style causal decoder
  * Implemented in `standalone_transformer_lm.py`
  * Learned positional embeddings (absolute)
  * Multi-head self-attention with KV caching
  * GELU feed-forward layers
  * LayerNorm
* **Parameter Count:** ~138M
* **Context Length:** 2048 tokens
* **Tokenizer:** SentencePiece (`tokenizer.model`)
* **Training Framework:** Pure PyTorch (no Transformers)
* **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup (see the sketch below)
* **Precision:** Mixed FP16/BF16 training
* **Steps Completed:** 20,000 (~32% of planned total)
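
The optimizer and schedule above map onto standard PyTorch pieces. The sketch below shows one way to wire them up; only the AdamW betas, the weight decay, and the warmup-plus-cosine shape come from this card, while the peak learning rate, warmup length, and total step count are illustrative assumptions.

```python
import math

import torch

# Sketch of the optimizer/scheduler recipe listed above, in plain PyTorch.
# Only the AdamW betas/weight decay and "cosine decay with warmup" come from this
# card; the peak LR, warmup length, and total step count are illustrative guesses.
model = torch.nn.Linear(768, 768)  # stand-in for the 138M-parameter GPT

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

warmup_steps, total_steps = 1_000, 60_000  # assumed values

def lr_lambda(step: int) -> float:
    """Linear warmup, then cosine decay toward zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Inside the training loop: optimizer.step(); scheduler.step()
```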

---

## Validation Loss Progress

| Step   | Val Loss |
| ------ | -------- |
| 1,000  | 5.6011   |
| 2,000  | 4.8598   |
| 5,000  | 4.2239   |
| 10,000 | 3.9756   |
| 15,000 | 3.8608   |
| 20,000 | 3.7984   |
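
For a rough sense of scale, these losses can also be read as perplexities, assuming they are mean cross-entropy values in nats per token (the card does not state the units):

```python
import math

# Perplexity = exp(mean cross-entropy in nats per token) -- an assumption about
# what the validation loss above measures.
for step, loss in [(1_000, 5.6011), (10_000, 3.9756), (20_000, 3.7984)]:
    print(step, round(math.exp(loss), 1))  # 1000 270.7, 10000 53.3, 20000 44.6
```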

---

## Notes

* **Prototype only**: repetition loops are expected in longer generations.
* Requires **`standalone_transformer_lm.py`** and **SentencePiece** to run (see the setup sketch below).
* Does **not** load with `transformers.AutoModelForCausalLM`.
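
A minimal environment for the examples below needs only PyTorch, SentencePiece, and the files from this repository. The sketch below is one way to fetch them with `huggingface_hub`; the `<user>` namespace in the repo id is a placeholder, and it assumes the three files are hosted at the top level of this repo.

```python
# Assumes: pip install torch sentencepiece huggingface_hub
# The repo id is a placeholder ("<user>" is hypothetical); replace it with the
# namespace actually hosting this model.
from huggingface_hub import hf_hub_download

files = ["standalone_transformer_lm.py", "tokenizer.model", "ckpt_best.pt"]
for filename in files:
    local_path = hf_hub_download(repo_id="<user>/RSCaLM-138M-core", filename=filename)
    print(local_path)
```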

---

## Example Usage

```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

# Load checkpoint & config
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])

# Init model & load weights
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode prompt
ids = torch.tensor([sp.encode("Dubai is", out_type=int)])

# Generate text
out = model.generate(ids, max_new_tokens=40)
print(sp.decode(out[0].tolist()))
```

---

## Example Usage (with repetition control)

```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

prompt = "when a man goes to fishing"
ids = torch.tensor([sp.encode(prompt, out_type=int)])

# Manual repetition control
out = model.generate(
    ids,
    max_new_tokens=100,
    temperature=0.7,         # lower temperature = more focused output
    top_k=50,                # top-k sampling
    top_p=0.9,               # nucleus sampling
    repetition_penalty=1.2,  # penalize repeated tokens
    no_repeat_ngram_size=3,  # block repeating trigrams
)
print(sp.decode(out[0].tolist()))
```

---

### Tips to Reduce Loops

* Increase `repetition_penalty` to 1.2–1.5
* Use `no_repeat_ngram_size=3` or higher
* Combine `top_k` and `top_p` for better sampling variety
* Lower `temperature` for more deterministic completions

---

## License

Apache-2.0

---