---
language:
- en
library_name: transformers
license: apache-2.0
tags:
- sparknet
- causal-lm
- text-generation
- gpt
- pytorch
- 70m
pipeline_tag: text-generation
model-index:
- name: SparkNet-70M-v5
results: []
datasets:
- codelion/finepdfs-1B
- codelion/dclm-baseline-1B
- codelion/fineweb-edu-1B
---
# SparkNet 70M v5
SparkNet 70M v5 is the final 70M-parameter checkpoint from the SparkNet research run by **DienerTech**. It is a compact GPT-2–style decoder (12 layers, 512 hidden size, 8 attention heads, 1024-token context) that was trained for ~1B tokens on a custom mixture of high-quality web and document corpora. The release ships with the SparkNet v5 tokenizer and weights stored in `model.safetensors`, ready for direct use via 🤗 Transformers.
Special thanks to [CodeLion](https://huggingface.co/codelion) for inspiring the **One Billion Token Challenge**, and for providing the high-quality datasets used in this training run.
## Model Details
- **Developer**: DienerTech
- **Architecture**: GPT-2–style causal decoder (approx. 70M parameters), dropout 0.1, cosine LR schedule, AdamW (fused).
- **Context length**: 1,024 tokens.
- **Tokenizer**: SparkNet v5 byte-level BPE (vocab size 50,257; EOS = ``; padding token `<|pad|>`).
- **Framework**: PyTorch / 🤗 Transformers 4.46+.
- **Checkpoint**: Converted to `model.safetensors` for safe loading; no `pytorch_model.bin` left in the repo.
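The headline parameter count can be sanity-checked with back-of-the-envelope arithmetic from the config above (12 layers, 512 hidden, vocab 50,257, 1,024 positions). This is a sketch only; the exact total depends on bias and weight-tying conventions in the actual checkpoint, so the "~70M" headline and this estimate can legitimately differ:

```python
# Rough parameter count for a GPT-2-style decoder with the card's config.
# Counting conventions (biases, tied LM head) are assumptions, not from the repo.
d, n_layers, vocab, ctx = 512, 12, 50_257, 1_024

tok_emb = vocab * d                  # token embedding (assumed tied with LM head)
pos_emb = ctx * d                    # learned position embedding
attn = 4 * d * d + 4 * d             # fused QKV + output projection, with biases
mlp = 8 * d * d + 5 * d              # two 4x-expansion linear layers, with biases
norms = 4 * d                        # two LayerNorms per block (weight + bias)
per_block = attn + mlp + norms
total = tok_emb + pos_emb + n_layers * per_block + 2 * d  # + final LayerNorm

print(f"{total / 1e6:.1f}M parameters")  # -> 64.1M parameters
```

With tied embeddings this lands near 64M; an untied LM head would add another ~26M, which is one way the "approx. 70M" label can be reconciled.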
## Intended Use
- Lightweight text generation experiments, story/note drafting, or as a base for instruction-tuning / domain adaptation (LoRA, QLoRA, etc.).
- Research on small-model scaling laws or tokenizer experimentation.
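For the adapter-tuning use case, a hypothetical starting point looks like the following. This assumes the `peft` library; the rank, alpha, and `"c_attn"` target module (the GPT-2-style fused QKV projection) are illustrative choices, not settings from this repo:

```python
# Hypothetical LoRA adapter config (assumes the `peft` library is installed;
# hyperparameter values are illustrative, not taken from the card).
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                          # adapter rank
    lora_alpha=16,                # scaling factor
    target_modules=["c_attn"],    # GPT-2-style fused QKV projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# Then wrap the base model with peft.get_peft_model(model, lora_config)
# before handing it to your trainer of choice.
```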
## Limitations & Risks
- No RLHF / instruction tuning; outputs will be generic next-token predictions and may require prompting tricks.
- Training data is predominantly public web/document text, so bias, toxicity, or outdated information may surface.
- Not evaluated for safety-critical deployments—perform your own alignment and red-teaming before production use.
## Training Data
- 1B tokens packed into 1,024-token blocks (`datasets/sparknet-v5-1b`).
- Sources sampled uniformly across: `codelion/finepdfs-1B`, `codelion/dclm-baseline-1B`, `codelion/fineweb-edu-1B`, plus curated DienerTech blog data.
- Validation set: `wikitext-2-raw-v1` (standard Hugging Face split).
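The block-packing step above (concatenating documents and slicing into fixed 1,024-token blocks) can be sketched as follows; the function name and tail-dropping behavior are assumptions about a typical packing pipeline, not the repo's actual preprocessing code:

```python
def pack_into_blocks(tokenized_docs, block_size=1024):
    """Concatenate tokenized documents and slice into fixed-length blocks.

    A minimal sketch of causal-LM data packing; the ragged tail that does
    not fill a full block is dropped, as is common practice.
    """
    blocks, buf = [], []
    for doc in tokenized_docs:
        buf.extend(doc)
        while len(buf) >= block_size:
            blocks.append(buf[:block_size])
            buf = buf[block_size:]
    return blocks

# Two 600-token "documents" pack into two 512-token blocks (176 tokens dropped).
blocks = pack_into_blocks([[1] * 600, [2] * 600], block_size=512)
print(len(blocks), len(blocks[0]))  # -> 2 512
```

Note that packing lets a block span document boundaries, which is why the EOS token matters as a separator.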
## Training Procedure
- **Optimizer**: AdamW (fused) with β₁=0.9, β₂=0.95, weight decay 0.1, gradient clipping at 1.0.
- **Learning rate**: 1e-4 peak with 3% warmup then cosine decay.
- **Batching**: per-device batch size 32, gradient accumulation 2 → 65,536 tokens/step.
- **Budget**: 1,000,000,000 effective tokens (≈15,259 steps).
- **Hardware**: Single 24GB+ NVIDIA GPU with TF32 + Flash Attention enabled.
- **Best checkpoint**: step 14,000 with eval loss 4.99 on WikiText-2 (logged via `trainer_state.json`).
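The tokens-per-step and step-count figures above follow directly from the listed hyperparameters, and the warmup-plus-cosine schedule can be sketched as below. The function name and the exact decay floor (0 here) are assumptions; only the peak LR, 3% warmup, and cosine shape come from the card:

```python
import math

# Derived from the card's hyperparameters: batch 32 x grad accum 2 x 1,024 ctx.
tokens_per_step = 32 * 2 * 1024                          # 65,536 tokens/step
total_steps = -(-1_000_000_000 // tokens_per_step)       # ceil -> 15,259 steps

def lr_at(step, peak=1e-4, total=15_259, warmup_frac=0.03):
    """Linear warmup to `peak`, then cosine decay to 0 (a sketch, not the exact code)."""
    warmup = int(total * warmup_frac)
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

print(tokens_per_step, total_steps)  # -> 65536 15259
```

At step 0 the LR is 0, it reaches the 1e-4 peak at the end of warmup (~458 steps), and decays smoothly toward 0 by the final step.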
## Evaluation
Formal downstream evaluation has not been run yet. Inside `trainer_state.json`, the best validation (WikiText-2) cross-entropy reached **4.9869** at step 14k. If you benchmark the model (e.g., with lm-eval-harness), please consider contributing results back to the card via a PR.
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "DienerTech/sparknet-70m-v5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or torch.float16 on older GPUs
    device_map="auto",
)

prompt = "In a distant research lab, a tiny transformer model awakened and"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=120,
    temperature=0.9,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Citation
```
@software{sparknet70mv5,
  author = {DienerTech},
  title  = {SparkNet 70M v5},
  year   = {2025},
  url    = {https://huggingface.co/DienerTech/sparknet-70m-v5}
}
```
Please open an issue or PR on the DienerTech Hugging Face repo if you have feedback, evaluations, or fine-tuned variants to share. |