---
datasets:
  - HuggingFaceFW/fineweb-edu
---

# RSCaLM-138M-core

**RSCaLM** (**Research Scale Causal Language Model**) — *Core Edition* — is an **experimental 138M-parameter decoder-only transformer** trained for **20,000 steps**. Unlike the LLaMA variant, this model is implemented entirely with a **custom minimal GPT architecture** (`standalone_transformer_lm.GPT`) and **SentencePiece** tokenization — no Hugging Face Transformers dependency.

---

## 📌 Experiment Summary

* **Architecture:** Custom GPT-style causal decoder
  * Implemented in `standalone_transformer_lm.py`
  * Learned absolute positional embeddings
  * Multi-head self-attention with KV caching
  * GELU feed-forward layers
  * LayerNorm
* **Parameter Count:** ~138M
* **Context Length:** 2048 tokens
* **Tokenizer:** SentencePiece (`tokenizer.model`)
* **Training Framework:** Pure PyTorch (no Transformers)
* **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup
* **Precision:** Mixed FP16/BF16 training
* **Steps Completed:** 20,000 (~32% of planned total)

---

## 📉 Validation Loss Progress

| Step   | Val Loss |
| ------ | -------- |
| 1,000  | 5.6011   |
| 2,000  | 4.8598   |
| 5,000  | 4.2239   |
| 10,000 | 3.9756   |
| 15,000 | 3.8608   |
| 20,000 | 3.7984   |

---

## ⚠️ Notes

* **Prototype only** — repetition loops are expected in longer generations.
* Requires **`standalone_transformer_lm.py`** and **SentencePiece** to run.
* Does **not** load with `transformers.AutoModelForCausalLM`.
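The "cosine decay with warmup" schedule listed above can be sketched as a plain function of the step index. The specific values below (`max_lr`, `min_lr`, `warmup_steps`, and a planned total of 62,500 steps, back-solved from 20,000 being ~32% of the plan) are illustrative assumptions, not the run's recorded hyperparameters:

```python
import math

def lr_at_step(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=2000, max_steps=62500):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    All hyperparameter values here are illustrative, not the actual run's.
    """
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    # Cosine decay: progress goes 0 -> 1 over the post-warmup steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a pure-PyTorch loop like this one, the schedule is typically applied each step by writing `lr_at_step(step)` into each of the optimizer's `param_groups` before `optimizer.step()`.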
---

## 🔧 Example Usage

```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

# Load checkpoint & config
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])

# Init model & load weights
model = GPT(cfg)
model.load_state_dict(ckpt["model"])
model.eval()

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode prompt
ids = torch.tensor([sp.encode("Dubai is", out_type=int)])

# Generate text
out = model.generate(ids, max_new_tokens=40)
print(sp.decode(out[0].tolist()))
```

---

## 🔧 Example Usage (with repetition control)

```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])

model = GPT(cfg)
model.load_state_dict(ckpt["model"])
model.eval()

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

prompt = "when a man goes to fishing"
ids = torch.tensor([sp.encode(prompt, out_type=int)])

# Manual repetition control
out = model.generate(
    ids,
    max_new_tokens=100,
    temperature=0.7,         # Lower temp = more focused
    top_k=50,                # Top-K sampling
    top_p=0.9,               # Nucleus sampling
    repetition_penalty=1.2,  # Penalize repeats
    no_repeat_ngram_size=3,  # Block repeating trigrams
)
print(sp.decode(out[0].tolist()))
```

---

### 💡 Tips to Reduce Loops

* Increase `repetition_penalty` to 1.2–1.5
* Use `no_repeat_ngram_size=3` or higher
* Combine `top_k` and `top_p` for greater sampling variety
* Lower `temperature` for more deterministic completions

---

## 📜 License

Apache-2.0

---
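## Appendix: how `repetition_penalty` works

For readers curious what the `repetition_penalty` knob in the tips above actually does, here is a minimal sketch of the common logit-rescaling rule (divide positive logits, multiply negative ones, so already-emitted tokens become less likely either way). This is an illustration of the standard technique, not the actual code inside `standalone_transformer_lm.py`:

```python
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Down-weight tokens that already appear in the generated sequence.

    Positive logits are divided by `penalty`; negative logits are
    multiplied by it. Either way the seen token's score drops.
    """
    seen = torch.unique(generated_ids)
    scores = logits[seen]
    logits[seen] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits

logits = torch.tensor([2.0, -1.0, 0.5, 3.0])
generated = torch.tensor([0, 1])  # tokens 0 and 1 were already emitted
out = apply_repetition_penalty(logits.clone(), generated, penalty=2.0)
# token 0: 2.0 / 2.0 = 1.0; token 1: -1.0 * 2.0 = -2.0; tokens 2, 3 unchanged
```

A penalty of 1.0 is a no-op; values in the 1.2–1.5 range suggested above shrink repeated tokens' scores noticeably without flattening the distribution.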