Task: Text generation (base model for now; instruct and thinking variants planned for later)
Parameters: 175M
Context window: 2048 tokens
Training data: FineWeb Edu, RefinedWeb, OpenWebMath, TinyCodes, CodeSearchNet, Cosmopedia (total: 7B tokens)
Hyperparams: LR 4e-4, effective batch 126, per_device_batch_size=6, gradient_accumulation_steps=21
Expected loss: train 2.765, validation 2.774
Framework: PyTorch + HuggingFace Transformers
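The numbers above can be sanity-checked with a little arithmetic. The sketch below is illustrative only and rests on assumptions not stated in this card: training ran on a single device (`num_devices = 1`), every sequence was packed to the full 2048-token context, the 7B tokens were seen once, and the reported losses are mean cross-entropy in nats (the PyTorch default), so that perplexity is `exp(loss)`. All variable names are hypothetical.

```python
import math

# Values taken from the spec list above
per_device_batch_size = 6
gradient_accumulation_steps = 21
context_window = 2048
total_tokens = 7_000_000_000
train_loss, valid_loss = 2.765, 2.774

# Assumption (not stated in the card): a single training device
num_devices = 1

# Effective batch = sequences that contribute to one optimizer step
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices

# Assumption: every sequence is packed to the full context window
tokens_per_step = effective_batch * context_window

# Rough number of optimizer steps for one pass over the data
approx_steps = total_tokens / tokens_per_step

# Assumption: losses are cross-entropy in nats, so perplexity = exp(loss)
train_ppl = math.exp(train_loss)
valid_ppl = math.exp(valid_loss)

print(f"effective batch:  {effective_batch} sequences")
print(f"tokens per step:  {tokens_per_step:,}")
print(f"approx. steps:    {approx_steps:,.0f}")
print(f"perplexity:       train {train_ppl:.2f}, valid {valid_ppl:.2f}")
```

Under these assumptions the effective batch matches the 126 listed above, and the small gap between train and validation loss corresponds to a similarly small gap in perplexity.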
How to launch:
import torch
from tokenizers import Tokenizer
from transformers import LlamaForCausalLM, PreTrainedTokenizerFast

# Wrap the raw tokenizer.json and register the special tokens the model uses
tk = PreTrainedTokenizerFast(tokenizer_object=Tokenizer.from_file("tokenizer.json"), bos_token="<|bos|>", eos_token="<|eos|>", unk_token="<|unk|>", pad_token="<|pad|>")
# Load the weights from the current directory in bfloat16 and move them to the GPU
md = LlamaForCausalLM.from_pretrained(".", torch_dtype=torch.bfloat16).to("cuda")
# The prompt starts with an explicit <|bos|> token
ids = tk.encode("<|bos|>The future of AI is", return_tensors="pt").to("cuda")
# Sample up to 150 new tokens; the repetition settings curb loops in a small base model
gen = md.generate(ids, max_new_tokens=150, do_sample=True, temperature=0.7, repetition_penalty=1.2, no_repeat_ngram_size=3, pad_token_id=0)
print(tk.decode(gen[0], skip_special_tokens=True))