---
datasets:
- HuggingFaceFW/fineweb-edu
license: apache-2.0
---
# RSCaLM-138M-LLaMA
**RSCaLM** (Research Scale Causal Language Model) is an experimental 138M-parameter LLaMA-architecture model trained for **20,000 steps**.
This run was conducted purely for **experimental and benchmarking purposes**; downstream task quality is expected to be limited.
---
## Experiment Summary
* **Architecture:** LLaMA-style causal decoder (a PyTorch sketch of the building blocks follows this list)
* Rotary positional embeddings (RoPE)
* Pre-normalization with RMSNorm
* SwiGLU feed-forward layers
* Multi-head self-attention with key-value caching support
* **Parameter Count:** \~138M
* **Context Length:** 2048 tokens
* **Tokenizer:** LLaMA tokenizer
* **Training Framework:** PyTorch + Hugging Face Transformers
* **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup (a training-loop sketch follows this list)
* **Precision:** Mixed-precision (FP16/BF16)
* **Batching:** Gradient accumulation to simulate large batch size
* **Dataset:** General text corpus for pipeline validation (not domain-specific)
* **Steps Completed:** 20,000 (\~32% of planned total)
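For reference, here is a minimal PyTorch sketch of two of the building blocks named above, RMSNorm and the SwiGLU feed-forward. It is illustrative only, with hypothetical class names and dimensions, and is not this model's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by 1/RMS(x); no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """Gated feed-forward: silu(W1 x) * (W3 x), projected back down by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```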
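And a hedged sketch of the optimizer, schedule, and gradient accumulation described above. The betas, weight decay, and cosine-with-warmup schedule come from this card; the learning rate, warmup length, accumulation factor, and `dataloader` are illustrative assumptions, and the total step count of ~62,500 is implied by 20,000 steps being ~32% of the plan:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# betas and weight decay per this card; lr is an assumed placeholder
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)
# warmup length assumed; total steps ~62,500 implied by 20,000 being ~32%
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=62_500
)

accum_steps = 8  # assumed factor: accumulate gradients to simulate a larger batch
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accum_steps  # scale so accumulated gradients average
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```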
---
## Validation Loss Progress
| Step | Val Loss |
| ----- | -------- |
| 1000 | 5.5968 |
| 2000 | 4.8513 |
| 5000 | 4.2105 |
| 10000 | 3.9603 |
| 15000 | 3.8497 |
| 20000 | 3.7891 |
Loss shows steady improvement over the limited training period.
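For reference, cross-entropy loss converts to perplexity via exp(loss), so the final validation loss of 3.7891 corresponds to a perplexity of roughly 44.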
---
## Notes
* This is an **early prototype**, not tuned for production use.
* Training stopped after \~32% of planned total steps.
* Possible repetition loops observed in generation; this is expected for low-step runs.
* Intended for research reference, not for deployment in critical tasks.
---
## Example Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "yasserrmd/RSCaLM-138M-LLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "The sun is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## Example Usage (with repetition control)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "yasserrmd/RSCaLM-138M-LLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "when a man goes to fishing"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generation settings to reduce repetition
outputs = model.generate(
    **inputs,
    max_new_tokens=100,          # Limit length of output
    do_sample=True,              # Enable sampling so temperature/top_p/top_k apply
    temperature=0.7,             # Lower temperature = more focused
    top_p=0.9,                   # Nucleus sampling
    top_k=50,                    # Top-K filtering
    repetition_penalty=1.2,      # Penalize repeating tokens
    no_repeat_ngram_size=3,      # Prevent repeating trigrams
    eos_token_id=tokenizer.eos_token_id,  # End generation at EOS
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
### Tips for controlling repetition
1. **`repetition_penalty`**: increase slightly above `1.0` (e.g., `1.2–1.5`) to discourage repeated phrases.
2. **`no_repeat_ngram_size`**: set to `3` or `4` to avoid repeated n-grams.
3. **`top_k` + `top_p`**: combine both for finer control over randomness.
4. **Lower `temperature`**: keeps outputs focused and less chaotic.
5. **Stop sequences**: add specific words/phrases to halt generation early if needed (see the sketch below).
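As a minimal sketch of tip 5, assuming the `model`, `tokenizer`, and `inputs` from the examples above: the `StopOnSubstring` class below is a hypothetical helper built on transformers' `StoppingCriteria`, not a library API.

```python
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnSubstring(StoppingCriteria):
    """Hypothetical helper: stop once `stop_string` appears in the generated text.
    Assumes batch size 1; `prompt_len` skips the prompt tokens when checking."""
    def __init__(self, tokenizer, stop_string: str, prompt_len: int):
        self.tokenizer = tokenizer
        self.stop_string = stop_string
        self.prompt_len = prompt_len

    def __call__(self, input_ids, scores, **kwargs):
        generated = self.tokenizer.decode(
            input_ids[0, self.prompt_len:], skip_special_tokens=True
        )
        return self.stop_string in generated

stop = StopOnSubstring(tokenizer, "\n\n", inputs["input_ids"].shape[1])
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    stopping_criteria=StoppingCriteriaList([stop]),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Recent transformers releases also accept a `stop_strings=` argument to `generate` (paired with `tokenizer=`), which covers the common case without a custom class.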
---
## License
apache-2.0