---
language:
- en
library_name: transformers
license: apache-2.0
tags:
- sparknet
- causal-lm
- text-generation
- gpt
- pytorch
- 70m
pipeline_tag: text-generation
model-index:
- name: SparkNet-70M-v5
  results: []
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
---

# SparkNet 70M v5

SparkNet 70M v5 is the final 70M-parameter checkpoint from the SparkNet research run by **DienerTech**. It is a compact GPT-2–style decoder (12 layers, 512 hidden size, 8 attention heads, 1024-token context) trained for ~1B tokens on a custom mixture of high-quality web and document corpora. The release ships with the SparkNet v5 tokenizer and weights stored in `model.safetensors`, ready for direct use via 🤗 Transformers.

Special thanks to [CodeLion](https://huggingface.co/codelion) for inspiring the **One Billion Token Challenge** and for providing the high-quality datasets used in this training run.

## Model Details

- **Developer**: DienerTech
- **Architecture**: GPT-2–style causal decoder (approx. 70M parameters), dropout 0.1, cosine LR schedule, AdamW (fused).
- **Context length**: 1,024 tokens.
- **Tokenizer**: SparkNet v5 byte-level BPE (vocab size 50,257) with a dedicated EOS token and `<|pad|>` padding token.
- **Framework**: PyTorch / 🤗 Transformers 4.46+.
- **Checkpoint**: Converted to `model.safetensors` for safe loading; no `pytorch_model.bin` left in the repo.

## Intended Use

- Lightweight text-generation experiments, story/note drafting, or as a base for instruction tuning / domain adaptation (LoRA, QLoRA, etc.).
- Research on small-model scaling laws or tokenizer experimentation.

## Limitations & Risks

- No RLHF / instruction tuning; outputs are generic next-token predictions and may require prompting tricks.
- Training data is predominantly public web/document text, so bias, toxicity, or outdated information may surface.
- Not evaluated for safety-critical deployments; perform your own alignment and red-teaming before production use.

## Training Data

- 1B tokens packed into 1,024-token blocks (`datasets/sparknet-v5-1b`).
- Sources sampled uniformly across `codelion/finepdfs-1B`, `codelion/dclm-baseline-1B`, and `codelion/fineweb-edu-1B`, plus curated DienerTech blog data.
- Validation set: `wikitext-2-raw-v1` (standard Hugging Face split).

## Training Procedure

- **Optimizer**: AdamW (fused) with β₁ = 0.9, β₂ = 0.95, weight decay 0.1, gradient clipping at 1.0.
- **Learning rate**: 1e-4 peak with 3% warmup, then cosine decay.
- **Batching**: per-device batch size 32, gradient accumulation 2 → 32 × 2 × 1,024 = 65,536 tokens/step.
- **Budget**: 1,000,000,000 effective tokens (≈15,259 steps).
- **Hardware**: single 24 GB+ NVIDIA GPU with TF32 and Flash Attention enabled.
- **Best checkpoint**: step 14,000 with eval loss 4.99 on WikiText-2 (logged via `trainer_state.json`).

## Evaluation

Formal downstream evaluation has not been run yet. Inside `trainer_state.json`, the best validation (WikiText-2) cross-entropy reached **4.9869** at step 14k. If you benchmark the model (e.g., with lm-eval-harness), please consider contributing results back to the card via a PR.
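The repo does not ship an evaluation script, so the snippet below is only a minimal sketch of how the WikiText-2 cross-entropy could be reproduced with 🤗 Transformers and 🤗 Datasets. The non-overlapping 1,024-token chunking, the choice of the `validation` split, and the per-token loss weighting are assumptions and will not exactly match the packing used during training, so expect the number to differ somewhat from the 4.9869 logged in `trainer_state.json`.

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DienerTech/sparknet-70m-v5"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device).eval()

# Concatenate the raw WikiText-2 validation split and tokenize it as one
# stream, loosely mirroring the 1,024-token packing used during training.
# (Split choice and packing are assumptions, not the original recipe.)
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

block_size = 1024  # model context length
nll_sum, token_count = 0.0, 0

with torch.no_grad():
    for start in range(0, ids.size(1) - 1, block_size):
        block = ids[:, start : start + block_size]
        # Labels equal inputs; the model shifts them internally and returns
        # the mean cross-entropy over the predicted tokens in this block.
        loss = model(block, labels=block).loss
        n_predicted = block.size(1) - 1
        nll_sum += loss.item() * n_predicted
        token_count += n_predicted

print(f"cross-entropy: {nll_sum / token_count:.4f}")
print(f"perplexity:    {math.exp(nll_sum / token_count):.2f}")
```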
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "DienerTech/sparknet-70m-v5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or torch.float16 on older GPUs
    device_map="auto",
)

prompt = "In a distant research lab, a tiny transformer model awakened and"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=120,
    temperature=0.9,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Citation

```
@software{sparknet70mv5,
  author = {DienerTech},
  title  = {SparkNet 70M v5},
  year   = {2025},
  url    = {https://huggingface.co/DienerTech/sparknet-70m-v5}
}
```

Please open an issue or PR on the DienerTech Hugging Face repo if you have feedback, evaluations, or fine-tuned variants to share.