---
datasets:
  - HuggingFaceFW/fineweb-edu
---

# RSCaLM-138M-core

**RSCaLM** (**Research Scale Causal Language Model**) — *Core Edition* — is an **experimental 138M-parameter decoder-only transformer** trained for **20,000 steps**. Unlike the LLaMA variant, this model is implemented entirely with a **custom minimal GPT architecture** (`standalone_transformer_lm.GPT`) and **SentencePiece** tokenization — no Hugging Face Transformers dependency.

---

## 📌 Experiment Summary

* **Architecture:** Custom GPT-style causal decoder
  * Implemented in `standalone_transformer_lm.py`
  * Learned absolute positional embeddings
  * Multi-head self-attention with KV caching
  * GELU feed-forward layers
  * LayerNorm
* **Parameter Count:** ~138M
* **Context Length:** 2048 tokens
* **Tokenizer:** SentencePiece (`tokenizer.model`)
* **Training Framework:** Pure PyTorch (no Transformers)
* **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
* **Scheduler:** Cosine decay with warmup
* **Precision:** Mixed FP16/BF16 training
* **Steps Completed:** 20,000 (~32% of planned total)

---

## 📉 Validation Loss Progress

| Step   | Val Loss |
| ------ | -------- |
| 1,000  | 5.6011   |
| 2,000  | 4.8598   |
| 5,000  | 4.2239   |
| 10,000 | 3.9756   |
| 15,000 | 3.8608   |
| 20,000 | 3.7984   |

---

## ⚠️ Notes

* **Prototype only** — repetition loops are expected in longer generations.
* Requires **`standalone_transformer_lm.py`** and **SentencePiece** to run.
* Does **not** load with `transformers.AutoModelForCausalLM`.
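The "cosine decay with warmup" schedule listed above can be sketched as a plain function of the step index. The specific values below (`max_lr`, `min_lr`, `warmup_steps`, and a planned total of 62,500 steps, back-solved from 20,000 being ~32% of the plan) are illustrative assumptions, not the run's recorded hyperparameters:

```python
import math

def lr_at_step(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=2000, max_steps=62500):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    All hyperparameter values here are illustrative, not the actual run's.
    """
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    # Cosine decay: progress goes 0 -> 1 over the post-warmup steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a pure-PyTorch loop like this one, the schedule is typically applied each step by writing `lr_at_step(step)` into each of the optimizer's `param_groups` before `optimizer.step()`.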
---

## 🔧 Example Usage

```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

# Load checkpoint & config
ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])

# Init model & load weights
model = GPT(cfg)
model.load_state_dict(ckpt["model"])
model.eval()

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode prompt
ids = torch.tensor([sp.encode("Dubai is", out_type=int)])

# Generate text
out = model.generate(ids, max_new_tokens=40)
print(sp.decode(out[0].tolist()))
```

---

## 🔧 Example Usage (with repetition control)

```python
import torch
import sentencepiece as spm

from standalone_transformer_lm import GPT, GPTConfig

ckpt = torch.load("ckpt_best.pt", map_location="cpu")
cfg = GPTConfig(**ckpt["config"])

model = GPT(cfg)
model.load_state_dict(ckpt["model"])
model.eval()

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

prompt = "when a man goes to fishing"
ids = torch.tensor([sp.encode(prompt, out_type=int)])

# Manual repetition control
out = model.generate(
    ids,
    max_new_tokens=100,
    temperature=0.7,         # Lower temp = more focused
    top_k=50,                # Top-K sampling
    top_p=0.9,               # Nucleus sampling
    repetition_penalty=1.2,  # Penalize repeats
    no_repeat_ngram_size=3,  # Block repeating trigrams
)
print(sp.decode(out[0].tolist()))
```

---

### 💡 Tips to Reduce Loops

* Increase `repetition_penalty` to 1.2–1.5
* Use `no_repeat_ngram_size=3` or higher
* Combine `top_k` and `top_p` for greater sampling variety
* Lower `temperature` for more deterministic completions

---

## 📜 License

Apache-2.0

---
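## Appendix: how `repetition_penalty` works

For readers curious what the `repetition_penalty` knob in the tips above actually does, here is a minimal sketch of the common logit-rescaling rule (divide positive logits, multiply negative ones, so already-emitted tokens become less likely either way). This is an illustration of the standard technique, not the actual code inside `standalone_transformer_lm.py`:

```python
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Down-weight tokens that already appear in the generated sequence.

    Positive logits are divided by `penalty`; negative logits are
    multiplied by it. Either way the seen token's score drops.
    """
    seen = torch.unique(generated_ids)
    scores = logits[seen]
    logits[seen] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits

logits = torch.tensor([2.0, -1.0, 0.5, 3.0])
generated = torch.tensor([0, 1])  # tokens 0 and 1 were already emitted
out = apply_repetition_penalty(logits.clone(), generated, penalty=2.0)
# token 0: 2.0 / 2.0 = 1.0; token 1: -1.0 * 2.0 = -2.0; tokens 2, 3 unchanged
```

A penalty of 1.0 is a no-op; values in the 1.2–1.5 range suggested above shrink repeated tokens' scores noticeably without flattening the distribution.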