---
datasets:
- HuggingFaceFW/fineweb-edu
---

# RSCaLM-138M-core

**RSCaLM** (**Research Scale Causal Language Model**), *Core Edition*, is an **experimental 138M-parameter decoder-only transformer** trained for **20,000 steps**.
Unlike the LLaMA variant, this model is implemented entirely with a **custom minimal GPT architecture** (`standalone_transformer_lm.GPT`) and **SentencePiece** tokenization, with no Hugging Face Transformers dependency.

---

## 📌 Experiment Summary

* **Architecture:** Custom GPT-style causal decoder

  * Implemented in `standalone_transformer_lm.py`
  * Learned positional embeddings (absolute)
  * Multi-head self-attention with KV caching
  * GELU feed-forward layers
  * LayerNorm
* **Parameter Count:** \~138M
* **Context Length:** 2048 tokens
* **Tokenizer:** SentencePiece (`tokenizer.model`)
* **Training Framework:** Pure PyTorch (no Transformers)
* **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup
* **Precision:** Mixed FP16/BF16 training
* **Steps Completed:** 20,000 (\~32% of the planned total)
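
The warmup-plus-cosine schedule can be written as a plain function of the step index. A minimal sketch, assuming a 2,000-step linear warmup and a 3e-4 peak learning rate decaying to 3e-5 (none of these values are documented for this run, and `total_steps=62500` is only implied by 20,000 steps being \~32% of the plan):

```python
import math

def lr_at(step, total_steps=62_500, warmup_steps=2_000,
          peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup window
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In PyTorch this shape is typically wired in via `torch.optim.lr_scheduler.LambdaLR`, with the function returning a multiplier of the base learning rate.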

---

## 📉 Validation Loss Progress

| Step   | Val Loss |
| ------ | -------- |
| 1,000  | 5.6011   |
| 2,000  | 4.8598   |
| 5,000  | 4.2239   |
| 10,000 | 3.9756   |
| 15,000 | 3.8608   |
| 20,000 | 3.7984   |
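
Since these validation losses are mean next-token cross-entropies in nats (PyTorch's convention), perplexity is simply `exp(loss)`:

```python
import math

# Validation losses from the table above
val_loss = {1_000: 5.6011, 2_000: 4.8598, 5_000: 4.2239,
            10_000: 3.9756, 15_000: 3.8608, 20_000: 3.7984}

for step, loss in val_loss.items():
    # Perplexity = exp(mean cross-entropy in nats)
    print(f"step {step:>6}: loss {loss:.4f} -> ppl {math.exp(loss):6.1f}")
```

By this measure the model improves from a perplexity of about 271 at step 1,000 to about 44.6 at step 20,000.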

---

## ⚠️ Notes

* **Prototype only**: repetition loops are expected in longer generations.
* Requires **`standalone_transformer_lm.py`** and **SentencePiece** to run.
* Does **not** load with `transformers.AutoModelForCausalLM`.

---

## 🔧 Example Usage

```python
import torch, sentencepiece as spm
from standalone_transformer_lm import GPT, GPTConfig

# Load checkpoint & config
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg  = GPTConfig(**ckpt["config"])

# Init model & load weights
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode prompt
ids = torch.tensor([sp.encode("Dubai is", out_type=int)])

# Generate text (no gradients needed at inference time)
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=40)
print(sp.decode(out[0].tolist()))
```

---

## 🔧 Example Usage (with repetition control)

```python
import torch, sentencepiece as spm
from standalone_transformer_lm import GPT, GPTConfig

ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg  = GPTConfig(**ckpt["config"])
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

prompt = "when a man goes to fishing"
ids = torch.tensor([sp.encode(prompt, out_type=int)])

# Manual repetition control (no gradients needed at inference time)
with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=100,
        temperature=0.7,         # Lower temperature = more focused sampling
        top_k=50,                # Top-k sampling
        top_p=0.9,               # Nucleus (top-p) sampling
        repetition_penalty=1.2,  # Penalize already-generated tokens
        no_repeat_ngram_size=3,  # Block repeated trigrams
    )
print(sp.decode(out[0].tolist()))
```

---

### 💡 Tips to Reduce Loops

* Increase `repetition_penalty` to 1.2–1.5
* Use `no_repeat_ngram_size=3` or higher
* Combine `top_k` and `top_p` for better sampling variety
* Lower `temperature` for more deterministic completions
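
For reference, a CTRL-style repetition penalty rescales the logits of already-generated tokens before sampling. A minimal sketch of that rule (the actual logic inside `standalone_transformer_lm.GPT.generate` may differ):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Discourage tokens that already appear in the output.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so previously seen tokens always become less likely.
    """
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```

With `penalty=2.0`, a seen token's logit of `2.0` drops to `1.0` and `-1.0` drops to `-2.0`, so repeated tokens lose probability mass whichever sign their logit has.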

---

## 📜 License

Apache-2.0

---