---
datasets:
- HuggingFaceFW/fineweb-edu
---
# RSCaLM-138M-core
**RSCaLM** (**Research Scale Causal Language Model**), *Core Edition*, is an **experimental 138M-parameter decoder-only transformer** trained for **20,000 steps**.
Unlike the LLaMA variant, this model is implemented entirely with a **custom minimal GPT architecture** (`standalone_transformer_lm.GPT`) and **SentencePiece** tokenization, with no Hugging Face Transformers dependency.

---
## 📌 Experiment Summary
* **Architecture:** Custom GPT-style causal decoder
  * Implemented in `standalone_transformer_lm.py`
  * Learned positional embeddings (absolute)
  * Multi-head self-attention with KV caching
  * GELU feed-forward layers
  * LayerNorm
* **Parameter Count:** \~138M
* **Context Length:** 2048 tokens
* **Tokenizer:** SentencePiece (`tokenizer.model`)
* **Training Framework:** Pure PyTorch (no Transformers)
* **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup
* **Precision:** Mixed FP16/BF16 training
* **Steps Completed:** 20,000 (\~32% of planned total)
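
The "cosine decay with warmup" schedule listed above can be sketched in plain Python. This is a minimal illustration, not the training script: the function name `lr_at` and the values of `peak_lr`, `min_lr`, and `warmup_steps` are hypothetical placeholders, and `total_steps=62_500` is only an estimate inferred from 20,000 steps being about 32% of the planned run.

```python
import math

def lr_at(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=1_000, total_steps=62_500):
    """Learning rate at a given step: linear warmup, then cosine decay.

    All default values are illustrative assumptions, not the values
    used to train this model.
    """
    if step < warmup_steps:
        # Linear warmup from peak_lr / warmup_steps up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine factor goes from 1 (start of decay) to 0 (end of training)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

if __name__ == "__main__":
    for s in (0, 999, 20_000, 62_500):
        print(s, lr_at(s))
```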
---
## 📉 Validation Loss Progress
| Step | Val Loss |
| ------ | -------- |
| 1,000 | 5.6011 |
| 2,000 | 4.8598 |
| 5,000 | 4.2239 |
| 10,000 | 3.9756 |
| 15,000 | 3.8608 |
| 20,000 | 3.7984 |
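
To make the table easier to interpret, the losses can be converted to per-token perplexity. The sketch below assumes the reported values are mean cross-entropy in nats per SentencePiece token (the usual PyTorch convention, but not stated in this card); under that assumption the final loss of 3.7984 corresponds to a perplexity of roughly 44.6. The helper name `perplexity` is illustrative.

```python
import math

# Validation losses copied from the table above (step -> loss)
val_loss = {
    1_000: 5.6011,
    2_000: 4.8598,
    5_000: 4.2239,
    10_000: 3.9756,
    15_000: 3.8608,
    20_000: 3.7984,
}

def perplexity(loss_nats):
    """Perplexity for a mean cross-entropy loss given in nats (assumed unit)."""
    return math.exp(loss_nats)

if __name__ == "__main__":
    for step, loss in val_loss.items():
        print(f"{step:>6,}  loss={loss:.4f}  ppl={perplexity(loss):.1f}")
```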
---
## ⚠️ Notes
* **Prototype only** β€” repetition loops expected in longer generations.
* Requires **`standalone_transformer_lm.py`** and **SentencePiece** to run.
* Does **not** load with `transformers.AutoModelForCausalLM`.
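
Since the model depends on local files rather than the Transformers loader, a quick pre-flight check can save a confusing import error. The sketch below uses only the standard library; the helper name `missing_requirements` is hypothetical, and it assumes the checkpoint and tokenizer use the file names shown in the usage examples (`ckpt_best.pt`, `tokenizer.model`) and sit in one directory.

```python
import importlib.util
from pathlib import Path

def missing_requirements(root="."):
    """Return a list of packages/files that appear to be missing under root."""
    missing = []
    # Python packages the usage examples import
    for package in ("torch", "sentencepiece"):
        if importlib.util.find_spec(package) is None:
            missing.append(f"{package} (Python package)")
    # Local files the usage examples load (names assumed from this README)
    for name in ("standalone_transformer_lm.py", "tokenizer.model", "ckpt_best.pt"):
        if not (Path(root) / name).is_file():
            missing.append(name)
    return missing

if __name__ == "__main__":
    problems = missing_requirements()
    print("All requirements found." if not problems else f"Missing: {problems}")
```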
---
## 🔧 Example Usage
```python
import torch, sentencepiece as spm
from standalone_transformer_lm import GPT, GPTConfig
# Load checkpoint & config
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])
# Init model & load weights
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])
# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")
# Encode prompt
ids = torch.tensor([sp.encode("Dubai is", out_type=int)])
# Generate text
out = model.generate(ids, max_new_tokens=40)
print(sp.decode(out[0].tolist()))
```
---
## 🔧 Example Usage (with repetition control)
```python
import torch, sentencepiece as spm
from standalone_transformer_lm import GPT, GPTConfig

# Load checkpoint, rebuild the config, and restore the weights
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode prompt
prompt = "when a man goes to fishing"
ids = torch.tensor([sp.encode(prompt, out_type=int)])

# Manual repetition control
out = model.generate(
    ids,
    max_new_tokens=100,
    temperature=0.7,          # Lower temp = more focused
    top_k=50,                 # Top-K sampling
    top_p=0.9,                # Nucleus sampling
    repetition_penalty=1.2,   # Penalize repeats
    no_repeat_ngram_size=3,   # Block repeating trigrams
)
print(sp.decode(out[0].tolist()))
```
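
For readers unfamiliar with the sampling knobs used above, here is a dependency-free sketch of how `temperature`, `top_k`, and `top_p` are commonly combined. It is an illustration of the general technique, not code from `standalone_transformer_lm.py`, whose internals may differ; the function name `filter_probs` is hypothetical.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids sorted from most to least likely
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter)
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if top_p < 1.0 and cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

if __name__ == "__main__":
    print(filter_probs([2.0, 1.0, 0.0, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```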
---
### 💡 Tips to Reduce Loops
* Increase `repetition_penalty` to 1.2–1.5
* Use `no_repeat_ngram_size=3` or higher
* Combine `top_k` and `top_p` for better sampling variety
* Lower `temperature` for more deterministic completions
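
The two repetition controls mentioned in these tips can be sketched without any dependencies. The version below follows the widely used formulation (divide positive logits and multiply negative logits of already-seen tokens by the penalty; ban any token that would complete an n-gram already present). Whether `GPT.generate` implements them exactly this way is an assumption, and both function names are hypothetical.

```python
def apply_repetition_penalty(logits, generated, penalty=1.2):
    """Lower the scores of tokens that already appear in `generated`."""
    out = list(logits)
    for tok in set(generated):
        # Dividing a positive logit or multiplying a negative one both reduce it
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def banned_tokens(generated, n=3):
    """Token ids that would repeat an n-gram already present in `generated`."""
    # The last n-1 tokens form the prefix of the n-gram about to be completed
    prefix = tuple(generated[max(0, len(generated) - (n - 1)):])
    banned = set()
    for i in range(len(generated) - n + 1):
        # If an earlier n-gram starts with the same prefix, ban its last token
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

if __name__ == "__main__":
    history = [5, 6, 7, 5, 6]
    print(banned_tokens(history, n=3))
    print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0))
```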
---
## 📜 License
Apache-2.0

---