---
datasets:
- HuggingFaceFW/fineweb-edu
---
# RSCaLM-138M-core
**RSCaLM** (**Research Scale Causal Language Model**), *Core Edition*, is an **experimental 138M-parameter decoder-only transformer** trained for **20,000 steps**.
Unlike the LLaMA variant, this model is implemented entirely with a **custom minimal GPT architecture** (`standalone_transformer_lm.GPT`) and **SentencePiece** tokenization, with no Hugging Face Transformers dependency.
---
## Experiment Summary
* **Architecture:** Custom GPT-style causal decoder
  * Implemented in `standalone_transformer_lm.py`
  * Learned absolute positional embeddings
  * Multi-head self-attention with KV caching
  * GELU feed-forward layers
  * LayerNorm
* **Parameter Count:** \~138M
* **Context Length:** 2048 tokens
* **Tokenizer:** SentencePiece (`tokenizer.model`)
* **Training Framework:** Pure PyTorch (no Transformers)
* **Optimizer:** AdamW (β₁=0.9, β₂=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup
* **Precision:** Mixed FP16/BF16 training
* **Steps Completed:** 20,000 (\~32% of planned total)
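The cosine-decay-with-warmup schedule can be sketched as follows. The warmup length, peak, and floor learning rates here are illustrative assumptions, not values from the training run; the total of 62,500 steps is derived from 20,000 steps being ~32% of the planned total.

```python
import math

def lr_at(step, warmup_steps=2000, max_steps=62500, peak_lr=6e-4, min_lr=6e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear warmup from ~0 to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr at warmup end to min_lr at max_steps
    progress = min((step - warmup_steps) / (max_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The same shape is what PyTorch's `CosineAnnealingLR` produces after a warmup phase; writing it as a plain function makes the schedule easy to plot or unit-test.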
---
## Validation Loss Progress
| Step | Val Loss |
| ------ | -------- |
| 1,000 | 5.6011 |
| 2,000 | 4.8598 |
| 5,000 | 4.2239 |
| 10,000 | 3.9756 |
| 15,000 | 3.8608 |
| 20,000 | 3.7984 |
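Since these losses are mean cross-entropy in nats, they convert directly to perplexity via `exp(loss)`; the run above corresponds to a drop from roughly 270 to roughly 45 perplexity:

```python
import math

# Validation losses from the table above (step -> loss in nats)
val_loss = {1_000: 5.6011, 2_000: 4.8598, 5_000: 4.2239,
            10_000: 3.9756, 15_000: 3.8608, 20_000: 3.7984}

for step, loss in val_loss.items():
    # Perplexity is the exponential of the cross-entropy loss
    print(f"step {step:>6}: perplexity ~ {math.exp(loss):.1f}")
```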
---
## Notes
* **Prototype only:** repetition loops are expected in longer generations.
* Requires **`standalone_transformer_lm.py`** and **SentencePiece** to run.
* Does **not** load with `transformers.AutoModelForCausalLM`.
---
## Example Usage
```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

# Load checkpoint and rebuild the model config
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])

# Initialize the model and load the trained weights
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode the prompt into token IDs
ids = torch.tensor([sp.encode("Dubai is", out_type=int)])

# Generate text (no gradients needed at inference time)
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=40)
print(sp.decode(out[0].tolist()))
```
---
## Example Usage (with repetition control)
```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

# Load checkpoint, config, model, and tokenizer as above
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

prompt = "when a man goes to fishing"
ids = torch.tensor([sp.encode(prompt, out_type=int)])

# Sample with manual repetition control
with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=100,
        temperature=0.7,         # lower temperature = more focused sampling
        top_k=50,                # keep only the 50 most likely tokens
        top_p=0.9,               # nucleus sampling
        repetition_penalty=1.2,  # penalize already-generated tokens
        no_repeat_ngram_size=3,  # block repeating trigrams
    )
print(sp.decode(out[0].tolist()))
```
---
### Tips to Reduce Loops
* Increase `repetition_penalty` to 1.2–1.5
* Use `no_repeat_ngram_size=3` or higher
* Combine `top_k` and `top_p` for better sampling variety
* Lower `temperature` for more deterministic completions
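As a rough illustration of what the repetition knobs do, here is a simplified, pure-Python sketch of how a repetition penalty and n-gram blocking adjust next-token logits before sampling. This is a pedagogical sketch, not the actual `generate` implementation in `standalone_transformer_lm.py`:

```python
import math

def adjust_logits(logits, generated, penalty=1.2, no_repeat_ngram_size=3):
    """Apply a repetition penalty and n-gram blocking to a list of logits."""
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    # Positive logits are divided, negative ones multiplied, so the
    # penalty always pushes the token's probability down.
    for tok in set(generated):
        logits[tok] = logits[tok] / penalty if logits[tok] > 0 else logits[tok] * penalty
    # No-repeat n-gram: ban any token that would complete an n-gram
    # already present in the generated sequence.
    n = no_repeat_ngram_size
    if len(generated) >= n - 1:
        prefix = tuple(generated[-(n - 1):])
        for i in range(len(generated) - n + 1):
            if tuple(generated[i:i + n - 1]) == prefix:
                logits[generated[i + n - 1]] = -math.inf
    return logits
```

A banned token gets logit `-inf`, so its softmax probability is exactly zero; the penalty, by contrast, only dampens repeats rather than forbidding them.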
---
## License
Apache-2.0
---