---
datasets:
- HuggingFaceFW/fineweb-edu
---

# RSCaLM-138M-core
**RSCaLM** (**Research Scale Causal Language Model**), *Core Edition*, is an **experimental 138M-parameter decoder-only transformer** trained for **20,000 steps**.
Unlike the LLaMA variant, this model is implemented entirely with a **custom minimal GPT architecture** (`standalone_transformer_lm.GPT`) and **SentencePiece** tokenization, with no Hugging Face Transformers dependency.

---
## Experiment Summary

* **Architecture:** Custom GPT-style causal decoder

  * Implemented in `standalone_transformer_lm.py`
  * Learned positional embeddings (absolute)
  * Multi-head self-attention with KV caching
  * GELU feed-forward layers
  * LayerNorm
* **Parameter Count:** ~138M (see the sanity-check sketch after this list)
* **Context Length:** 2048 tokens
* **Tokenizer:** SentencePiece (`tokenizer.model`)
* **Training Framework:** Pure PyTorch (no Transformers)
* **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup (see the schedule sketch after this list)
* **Precision:** Mixed FP16/BF16 training
* **Steps Completed:** 20,000 (~32% of planned total)
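
As a quick sanity check on the parameter count, the sketch below rebuilds the model from the checkpoint's stored config and counts trainable parameters. It assumes the same `ckpt_best.pt` / `GPTConfig` interface shown in the usage examples further down.

```python
import torch

from standalone_transformer_lm import GPT, GPTConfig

# Rebuild the model from the configuration stored in the checkpoint.
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
model = GPT(GPTConfig(**ckpt["config"]))

# Count trainable parameters; this should come out at roughly 138M.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params / 1e6:.1f}M")
```

The optimizer and schedule can be reproduced in pure PyTorch along the following lines. The β values and weight decay are the documented settings; the warmup length, peak/minimum learning rates, and the ~62,500-step total (implied by 20,000 being ~32% of the plan) are placeholders, not confirmed values from the training run.

```python
import math

import torch

from standalone_transformer_lm import GPT, GPTConfig

def cosine_with_warmup(step, warmup_steps=1_000, total_steps=62_500,
                       max_lr=3e-4, min_lr=3e-5):
    """Absolute learning rate at `step`: linear warmup, then cosine decay (placeholder values)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

ckpt = torch.load("ckpt_best.pt", map_location="cpu")
model = GPT(GPTConfig(**ckpt["config"]))

# AdamW betas and weight decay are the documented values; with base lr = 1.0,
# LambdaLR treats the schedule's return value as the absolute learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0,
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)
# A training loop would call optimizer.step() followed by scheduler.step() each step.
```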
---
## Validation Loss Progress

| Step   | Val Loss |
| ------ | -------- |
| 1,000  | 5.6011   |
| 2,000  | 4.8598   |
| 5,000  | 4.2239   |
| 10,000 | 3.9756   |
| 15,000 | 3.8608   |
| 20,000 | 3.7984   |
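
Assuming these values are mean per-token cross-entropy in nats, they translate directly to perplexity, which is often easier to compare across runs:

```python
import math

# Final validation loss at step 20,000 from the table above.
val_loss = 3.7984
print(f"Perplexity: {math.exp(val_loss):.1f}")  # ~44.6
```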
---
## Notes

* **Prototype only:** repetition loops are expected in longer generations.
* Requires **`standalone_transformer_lm.py`** and **SentencePiece** to run.
* Does **not** load with `transformers.AutoModelForCausalLM`.

---
## Example Usage

```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

# Load checkpoint & config
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])

# Init model & load weights
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode prompt
ids = torch.tensor([sp.encode("Dubai is", out_type=int)])

# Generate text
out = model.generate(ids, max_new_tokens=40)
print(sp.decode(out[0].tolist()))
```
---

## Example Usage (with repetition control)
```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])
model = GPT(cfg).eval()
model.load_state_dict(ckpt["model"])

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

prompt = "when a man goes to fishing"
ids = torch.tensor([sp.encode(prompt, out_type=int)])

# Manual repetition control
out = model.generate(
    ids,
    max_new_tokens=100,
    temperature=0.7,         # Lower temp = more focused
    top_k=50,                # Top-K sampling
    top_p=0.9,               # Nucleus sampling
    repetition_penalty=1.2,  # Penalize repeats
    no_repeat_ngram_size=3,  # Block repeating trigrams
)
print(sp.decode(out[0].tolist()))
```
---

### Tips to Reduce Loops
* Increase `repetition_penalty` to 1.2–1.5
* Use `no_repeat_ngram_size=3` or higher
* Combine `top_k` and `top_p` for better sampling variety
* Lower `temperature` for more deterministic completions (see the sketch below for how these controls typically act on the logits)
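
For intuition on what these knobs do, the sketch below shows one common way such controls are applied to the next-token logits before sampling (n-gram blocking is omitted for brevity). It is illustrative only; the actual behaviour of the `GPT.generate` parameters is defined in `standalone_transformer_lm.py`.

```python
import torch

def filter_logits(logits, generated_ids, repetition_penalty=1.2, top_k=50, top_p=0.9):
    """Illustrative logit post-processing for one decoding step (1-D logits tensor)."""
    # Repetition penalty: down-weight tokens already present in the output so far.
    for token_id in set(generated_ids.tolist()):
        if logits[token_id] > 0:
            logits[token_id] /= repetition_penalty
        else:
            logits[token_id] *= repetition_penalty

    # Top-k: keep only the k highest-scoring tokens.
    kth_best = torch.topk(logits, top_k).values[-1]
    logits[logits < kth_best] = float("-inf")

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability >= top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    to_drop = cumulative > top_p
    to_drop[1:] = to_drop[:-1].clone()  # shift right so the first token above the threshold survives
    to_drop[0] = False                  # always keep the single most likely token
    logits[sorted_idx[to_drop]] = float("-inf")
    return logits

# Sampling then draws from softmax(filter_logits(logits, generated_ids) / temperature).
```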
---

## License

Apache-2.0

---