---
datasets:
- HuggingFaceFW/fineweb-edu
license: apache-2.0
---


# RSCaLM-138M-LLaMA

**RSCaLM** (Research Scale Causal Language Model) is an experimental 138M-parameter LLaMA-architecture model trained for **20,000 steps**.
This run was conducted purely for **experimental and benchmarking purposes**; downstream task quality is expected to be **limited**.

---

## 📌 Experiment Summary

* **Architecture:** LLaMA-style causal decoder

  * Rotary positional embeddings (RoPE)
  * Pre-normalization with RMSNorm
  * SwiGLU feed-forward layers
  * Multi-head self-attention with key-value caching support
* **Parameter Count:** \~138M
* **Context Length:** 2048 tokens
* **Tokenizer:** LLaMA tokenizer
* **Training Framework:** PyTorch + Hugging Face Transformers
* **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup (see the sketch after this list)
* **Precision:** Mixed-precision (FP16/BF16)
* **Batching:** Gradient accumulation to simulate large batch size
* **Dataset:** General text corpus for pipeline validation (not domain-specific)
* **Steps Completed:** 20,000 (\~32% of planned total)
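
The optimizer and schedule listed above can be set up roughly as follows. This is a minimal sketch, not the exact training script: the learning rate, warmup steps, and total step count are illustrative assumptions (planned total inferred from 20,000 ≈ 32%), and `model` is assumed to be the already-instantiated model.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# AdamW with the betas and weight decay listed above;
# the learning rate is an assumed placeholder, not a reported value.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # assumption
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# Cosine decay with warmup; warmup/total steps are assumptions
# (planned total inferred as roughly 20,000 / 0.32 ≈ 62,500 steps).
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,
    num_training_steps=62_500,
)
```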

---

## 📉 Validation Loss Progress

| Step  | Val Loss |
| ----- | -------- |
| 1000  | 5.5968   |
| 2000  | 4.8513   |
| 5000  | 4.2105   |
| 10000 | 3.9603   |
| 15000 | 3.8497   |
| 20000 | 3.7891   |

Loss shows steady improvement over the limited training period.
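
For a quick visual check, the values in the table can be plotted directly (a minimal sketch using matplotlib; the numbers are copied from the table above):

```python
import matplotlib.pyplot as plt

# Validation loss values from the table above
steps = [1_000, 2_000, 5_000, 10_000, 15_000, 20_000]
val_loss = [5.5968, 4.8513, 4.2105, 3.9603, 3.8497, 3.7891]

plt.plot(steps, val_loss, marker="o")
plt.xlabel("Training step")
plt.ylabel("Validation loss")
plt.title("RSCaLM-138M-LLaMA validation loss")
plt.show()
```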

---

## ⚠️ Notes

* This is an **early prototype**, not tuned for production use.
* Training stopped after \~32% of planned total steps.
* Possible repetition loops observed in generation; this is expected for low-step runs.
* Intended for research reference, not for deployment in critical tasks.

---

## 🔧 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "yasserrmd/RSCaLM-138M-LLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "The sun is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling must be enabled, otherwise temperature is ignored and decoding is greedy
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 🔧 Example Usage (with repetition control)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "yasserrmd/RSCaLM-138M-LLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "when a man goes to fishing"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generation settings to reduce repetition
outputs = model.generate(
    **inputs,
    max_new_tokens=100,                   # Limit length of output
    do_sample=True,                       # Enable sampling so temperature/top_p/top_k take effect
    temperature=0.7,                      # Lower temperature = more focused
    top_p=0.9,                            # Nucleus sampling
    top_k=50,                             # Top-K filtering
    repetition_penalty=1.2,               # Penalize repeated tokens
    no_repeat_ngram_size=3,               # Prevent repeating trigrams
    eos_token_id=tokenizer.eos_token_id,  # End generation at EOS
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

### 💡 Tips for controlling repetition

1. **`repetition_penalty`** – Increase slightly above `1.0` (e.g., `1.2–1.5`) to discourage repeated phrases.
2. **`no_repeat_ngram_size`** – Set to `3` or `4` to avoid repeated n-grams.
3. **`top_k` + `top_p`** – Combine both for better randomness control.
4. **Lower `temperature`** – Keeps outputs focused and less chaotic.
5. **Stop sequences** – Add specific words/phrases to halt generation early if needed (see the sketch below).
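
One simple way to implement stop sequences (tip 5) is a custom `StoppingCriteria`. This is a minimal sketch that reuses `tokenizer`, `model`, and `inputs` from the example above; the stop string `"\n\n"` is only an illustrative choice:

```python
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnStrings(StoppingCriteria):
    """Stop generation once any stop string appears in the newly generated text."""

    def __init__(self, stop_strings, tokenizer, prompt_length):
        self.stop_strings = stop_strings
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of prompt tokens to skip

    def __call__(self, input_ids, scores, **kwargs):
        # Decode only the tokens generated after the prompt (first sequence in the batch)
        new_text = self.tokenizer.decode(
            input_ids[0, self.prompt_length:], skip_special_tokens=True
        )
        return any(s in new_text for s in self.stop_strings)

stopping_criteria = StoppingCriteriaList(
    [StopOnStrings(["\n\n"], tokenizer, inputs["input_ids"].shape[1])]
)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    stopping_criteria=stopping_criteria,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```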

---

## 📜 License
apache-2.0