---
datasets:
- HuggingFaceFW/fineweb-edu
license: apache-2.0
---

# RSCaLM-138M-LLaMA

**RSCaLM** (Research Scale Causal Language Model) is an experimental 138M-parameter LLaMA-architecture model trained for **20,000 steps**.

This run was conducted purely for **experimental and benchmarking purposes**, with **no high expectations** for downstream task quality.

---

## Experiment Summary

* **Architecture:** LLaMA-style causal decoder
  * Rotary positional embeddings (RoPE)
  * Pre-normalization with RMSNorm
  * SwiGLU feed-forward layers
  * Multi-head self-attention with key-value caching support
* **Parameter Count:** ~138M
* **Context Length:** 2048 tokens
* **Tokenizer:** LLaMA tokenizer
* **Training Framework:** PyTorch + Hugging Face Transformers
* **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup
* **Precision:** Mixed precision (FP16/BF16)
* **Batching:** Gradient accumulation to simulate a large effective batch size (see the training-loop sketch after this list)
* **Dataset:** General text corpus for pipeline validation (not domain-specific)
* **Steps Completed:** 20,000 (~32% of the planned total)
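
As a rough guide to how the settings above fit together, here is a minimal training-loop sketch combining the optimizer, scheduler, mixed precision, and gradient accumulation listed in this summary. The AdamW betas and weight decay come from this card; the learning rate, warmup steps, total steps, accumulation factor, and the `model` / `train_loader` objects are illustrative placeholders, not the values actually used for this run.

```python
import torch
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

# Placeholder hyperparameters: assumptions for illustration, not from this card
LEARNING_RATE = 3e-4   # assumption
WARMUP_STEPS = 1_000   # assumption
TOTAL_STEPS = 62_500   # assumption (20k completed steps would be ~32% of this)
ACCUM_STEPS = 8        # assumption: gradient-accumulation factor

# `model` is a LLaMA-style causal LM; `train_loader` yields batches containing labels
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE,
                  betas=(0.9, 0.95), weight_decay=0.1)  # betas/decay as listed above
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS)

scaler = torch.cuda.amp.GradScaler()  # FP16 path; BF16 would not need a scaler
model.train()
for step, batch in enumerate(train_loader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss / ACCUM_STEPS
    scaler.scale(loss).backward()
    if (step + 1) % ACCUM_STEPS == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        scheduler.step()
```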

---

## Validation Loss Progress

| Step  | Val Loss |
| ----- | -------- |
| 1000  | 5.5968   |
| 2000  | 4.8513   |
| 5000  | 4.2105   |
| 10000 | 3.9603   |
| 15000 | 3.8497   |
| 20000 | 3.7891   |

Loss shows steady improvement over the limited training period.
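
Assuming the reported values are mean token-level cross-entropy in nats (the usual convention, though not stated explicitly in this card), each entry maps to perplexity via `exp(loss)`; a quick conversion of the table above:

```python
import math

# Validation losses from the table above
val_loss = {1000: 5.5968, 2000: 4.8513, 5000: 4.2105,
            10000: 3.9603, 15000: 3.8497, 20000: 3.7891}

for step, loss in val_loss.items():
    print(f"step {step:>5}: perplexity ~ {math.exp(loss):.1f}")
```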

---

## Notes

* This is an **early prototype**, not tuned for production use.
* Training stopped after ~32% of the planned total steps.
* Possible repetition loops observed in generation (expected for low-step runs).
* Intended for research reference, not for deployment in critical tasks.

---

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "yasserrmd/RSCaLM-138M-LLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "The sun is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=50, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
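
To sanity-check the ~138M figure listed above, you can count parameters after loading the model:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("yasserrmd/RSCaLM-138M-LLaMA")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected to be roughly 138M
```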

---

## Example Usage (with repetition control)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "yasserrmd/RSCaLM-138M-LLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "when a man goes to fishing"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generation settings to reduce repetition
outputs = model.generate(
    **inputs,
    do_sample=True,                       # Enable sampling so temperature/top-p/top-k apply
    max_new_tokens=100,                   # Limit length of output
    temperature=0.7,                      # Lower temperature = more focused
    top_p=0.9,                            # Nucleus sampling
    top_k=50,                             # Top-k filtering
    repetition_penalty=1.2,               # Penalize repeated tokens
    no_repeat_ngram_size=3,               # Prevent repeated trigrams
    eos_token_id=tokenizer.eos_token_id,  # End generation at EOS
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```


---

### Tips for controlling repetition

1. **`repetition_penalty`**: increase slightly above `1.0` (e.g., `1.2–1.5`) to discourage repeated phrases.
2. **`no_repeat_ngram_size`**: set to `3` or `4` to avoid repeated n-grams.
3. **`top_k` + `top_p`**: combine both for finer control over sampling randomness.
4. **Lower `temperature`**: keeps outputs focused and less chaotic.
5. **Stop sequences**: add specific words or phrases that halt generation early if needed (a minimal sketch follows this list).
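
For tip 5, one way to implement stop sequences in Transformers is a custom `StoppingCriteria`. Below is a minimal sketch that assumes the `tokenizer`, `model`, and `inputs` objects from the examples above; the stop phrase itself is only an illustration.

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnPhrase(StoppingCriteria):
    """Stop generation once a given phrase appears in the decoded text."""

    def __init__(self, tokenizer, phrase: str):
        self.tokenizer = tokenizer
        self.phrase = phrase

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Decodes the whole sequence (prompt included) at every step; fine for short outputs
        text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        return self.phrase in text

stop_criteria = StoppingCriteriaList([StopOnPhrase(tokenizer, phrase="\n\n")])  # example phrase
outputs = model.generate(**inputs, max_new_tokens=100, stopping_criteria=stop_criteria)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```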

---

## License

apache-2.0