---
license: apache-2.0
tags:
- base-model
- causal-lm
- qwen3
- transformer
language:
- en
pipeline_tag: text-generation
---
# QVAC Genesis I Pretrained Model
## Key Highlights
- **Pretrained on the Largest Synthetic Educational Dataset**
This model has been **pretrained on Tether's QVAC Genesis I**, the largest synthetic dataset released for educational LLM pre-training.
The model was trained **from scratch** on approximately **40B tokens** of multi-domain educational text, using **BF16 mixed precision** and a **4,096-token context window**, with a **Qwen3-family 1.7B-parameter decoder-only transformer** architecture.
Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning.
- **Multi-Domain Educational Coverage**
Because the model is trained on QVAC Genesis I, it inherits curriculum-aligned coverage across:
- Mathematics
- Physics
- Biology
- Medicine
- **Superior Benchmark Performance**
Leveraging QVAC Genesis I as its training foundation, the model consistently outperforms baselines in:
- Reasoning tasks
- Knowledge assessments
- Subject-specific QA
- **First Publicly Released Education-Specific Pretrained Model**
This is the first open-source pretrained model built directly on a rigorously validated synthetic dataset for education, offering deep and comprehensive STEM coverage.
## Intended Uses
- Continual pre-training or fine-tuning for educational applications (STEM-focused tutoring, QA systems, curriculum support)
- Benchmarking reasoning and subject-specific QA performance
- Research into synthetic dataset–driven LLM training
---
## Model Details
### Model Description
- **Developed by:** QVAC by Tether
- **Model type:** Decoder-only Transformer (causal LM)
- **Language(s) (NLP):** Primarily English
- **License:** Apache-2.0
- **Finetuned from model:** **None (trained from scratch)**
- **Intended stage:** **Base pre-trained model** (no SFT / RLHF alignment)
### Model Sources
- **Repository:** https://huggingface.co/qvac/genesisI-model
- **Paper / Blog:** https://huggingface.co/blog/qvac/genesis-i
---
## Uses
### Direct Use
- General language modeling: next-token prediction, continuation, summarization, drafting.
- Research baseline for scaling, data ablations, or tokenizer studies.
### Downstream Use (recommended)
- **CPT** (continued pre-training) on more tokens.
- **SFT** for assistants, domain experts, or task-specific models.
- **Preference optimization / RLHF** for safer, more helpful behavior.
- **Adapters/LoRA** for efficient domain specialization.
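For illustration, a minimal LoRA sketch with Hugging Face PEFT follows; the rank, dropout, and target-module names are assumptions chosen to match typical Qwen-style attention projections, not a recommended recipe.

```python
# Hedged sketch: LoRA adapters on the base checkpoint via PEFT.
# r, alpha, dropout, and target_modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "qvac/genesisI-model"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical Qwen-style projections (assumption)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Pair this with any SFT trainer (e.g., TRL's `SFTTrainer`) or a plain training loop.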
### Out-of-Scope Use
- High-stakes decision-making (medical/financial/legal).
- Safety-critical or autonomous control systems.
- Unfiltered end-user chat deployment without alignment / safety layers.
- Any use that violates applicable laws or platform policies.
---
## Bias, Risks, and Limitations
- **Bias & toxicity:** May reflect or amplify biases present in web text.
- **Hallucinations:** Can produce confident but incorrect statements or citations.
- **Security / privacy:** May emit long runs of seemingly random strings.
- **Context limit:** 4,096 tokens; longer inputs require chunking.
### Recommendations
- Disclose limitations to downstream users.
- **Research model:** not intended for production use.
---
## How to Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "qvac/genesisI-model"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16, # trained with BF16 mixed precision
device_map="auto"
)
prompt = "Explain precision vs. recall in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
**inputs,
max_new_tokens=256,
do_sample=True,
top_p=0.9,
temperature=0.7
)
print(tok.decode(out[0], skip_special_tokens=True))
```
*Tip: On consumer GPUs, consider loading in `float16` or using 4/8-bit quantization (e.g., bitsandbytes/AutoGPTQ).*
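As a hedged sketch of the 4-bit option (assuming `bitsandbytes` is installed; the NF4 settings below are illustrative):

```python
# Hedged sketch: 4-bit NF4 loading for memory-limited GPUs (requires bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "qvac/genesisI-model"
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights stay 4-bit; compute runs in BF16
)
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_cfg, device_map="auto")
```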
---
## Training Details
### Training Data
* **Size:** ~**40B tokens**, single epoch.
* **Domains:** Mixed general + STEM/technical sources (expository text, problem sets, references).
* **Format:** Hugging Face Datasets (Arrow).
* **Tokenizer:** **Qwen3** tokenizer.
* **Processing:** Normalization, filtering of extremes, document chunking to fit **4096** context, sequence packing where applicable.
* **Dataset Card:** *Coming Soon*
### Training Procedure
#### Preprocessing
* Unicode normalization, whitespace cleanup, control-char stripping.
* Length filtering; chunking to 4096; optional packing to improve throughput.
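A simplified packing sketch under stated assumptions (a `"text"` field, a local JSONL corpus, and a plain concatenate-then-split strategy; the actual pipeline is not released here):

```python
# Hedged sketch: tokenize documents and pack them into fixed 4,096-token blocks.
from itertools import chain
from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK = 4096
tok = AutoTokenizer.from_pretrained("qvac/genesisI-model", trust_remote_code=True)
ds = load_dataset("json", data_files="corpus.jsonl", split="train")  # placeholder corpus

def tokenize(batch):
    return tok(batch["text"])

def pack(batch):
    ids = list(chain.from_iterable(batch["input_ids"]))  # concatenate documents
    n = (len(ids) // BLOCK) * BLOCK                      # drop the trailing remainder
    blocks = [ids[i:i + BLOCK] for i in range(0, n, BLOCK)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

packed = (
    ds.map(tokenize, batched=True, remove_columns=ds.column_names)
      .map(pack, batched=True, remove_columns=["input_ids", "attention_mask"])
)
```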
#### Training Hyperparameters
* **Optimizer:** AdamW (β₁=0.9, β₂=0.95), **weight decay 0.01**
* **Learning rate:** **2e-4** (linear warmup)
* **Warmup:** **600** steps (~10% of max steps)
* **Precision:** **BF16 mixed precision**
* **Gradient clipping:** **1.0**
* **Seed:** **42**
* **Logging:** Every **50** steps
* **Eval:** Every **500** steps (20 iters)
* **Checkpointing:** Every **1000** steps (sharded; full optimizer/state resume)
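A rough sketch of those optimizer and schedule settings (the stand-in module and total-step count are placeholders; ~6,000 steps is inferred only from warmup being ~10% of max steps):

```python
# Hedged sketch of the reported optimizer, warmup schedule, and gradient clipping.
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

model = nn.Linear(8, 8)  # stand-in module; substitute the actual network
MAX_STEPS = 6_000        # placeholder, inferred from warmup being ~10% of max steps

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.95), weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=600, num_training_steps=MAX_STEPS)

# per optimizer step (forward/backward elided):
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```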
#### Speeds, Sizes, Times
* **Per-GPU micro-batch:** 4
* **Grad accumulation:** 8
* **World size:** 480 GPUs
* **Effective global batch:** `4 × 8 × 480 = 15,360` samples/step
* **Step time (indicative):** ~**1.5 s/step** (cluster/I-O dependent)
#### Stability & Performance
* Activation checkpointing.
* Fused kernels where available (fused attention/optimizer).
* **FlashAttention-2** on H100.
* `torch.compile` (safe mode) after warmup stability.
* Dynamic loss scaling to mitigate BF16 overflow.
* Fragmentation mitigations (e.g., `max_split_size_mb=512`, expandable segments, GC threshold ~0.8).
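These allocator mitigations correspond to `PYTORCH_CUDA_ALLOC_CONF`; a minimal example of setting them (values mirror the list above, set before CUDA initialization):

```bash
# Hedged example: CUDA caching-allocator settings matching the mitigations above
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512,expandable_segments:True,garbage_collection_threshold:0.8"
```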
---
## Multi-Node GPU Setup
* **Cluster:** ~**60 nodes**, each **8× NVIDIA H100 80GB** (total **480 GPUs**), ~800 GB RAM/node.
* **Scheduler:** Slurm (priority partition, exclusive allocation, 72-hour limit).
* **Launch:** `srun` + PyTorch DDP (world size 480; ranks bound via Slurm env).
* **Storage:** Sharded checkpoints; periodic saves for robust resume.
* **Networking:** NCCL over InfiniBand with UCX
  * `NCCL_IB_DISABLE=0`, `NCCL_IB_HCA="mlx5*"`, `NCCL_SOCKET_IFNAME=<ib0/enoX>`, `NCCL_BLOCKING_WAIT=1`
  * Watchdog ~**720s** for fail-fast on fabric issues
* **I/O:** Async dataset prefetching; pinned FS threads.
* **Observability:** W&B + structured logs (throughput, TFLOPs/GPU, mem, step time).
* **Reproducibility:** Fixed seeds; exact launch scripts/env logged; effective tokens/step reported.
> Final checkpoint converted to **Hugging Face format** for plug-and-play inference.
---
## Evaluation
### Testing Data, Factors & Metrics
* **Testing data:** Standard academic suites (e.g., EleutherAI LM Evaluation Harness).
* **Factors:** Domain/topic (STEM vs. general), task type (multi-choice vs. open-ended).
* **Metrics:** Accuracy (MCQ), EM/F1 (QA), plus task-native metrics.
**Suggested suite (edit as applicable):**
* General knowledge & reasoning: **MMLU (STEM subsets)**, **ARC-E/ARC-C**, **HellaSwag**, **PIQA**, **Winogrande**
* Math/coding (optional): **GSM8K**, **HumanEval**
* Reading comprehension (optional): **BoolQ**, **RACE**
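For example, a hedged harness invocation (task names and batch size are illustrative; pin the exact harness version when reporting numbers):

```bash
# Hedged sketch: run a subset of the suggested suite with lm-evaluation-harness.
pip install lm_eval
lm_eval --model hf \
  --model_args pretrained=qvac/genesisI-model,dtype=bfloat16,trust_remote_code=True \
  --tasks mmlu,arc_easy,arc_challenge,hellaswag,piqa,winogrande \
  --batch_size 8
```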
### Results
* *To be released with an evaluated checkpoint and harness version pin.*
Include tables with exact versions, seeds, and commit hashes.
#### Summary
* Base LM targets broad generalization at ~40B tokens.
* Expect material gains after SFT + preference optimization for target tasks.
---
## Technical Specifications
### Model Architecture and Objective
* **Architecture:** Qwen3-style decoder-only Transformer
* **Parameters:** ~**1.7B**
* **Context length:** **4,096** tokens
* **Positional encoding:** *Rotary / relative (specify)*
* **Attention:** Multi-head scaled dot-product; FlashAttention-2 enabled on H100
* **Activation:** *GELU / SiLU (specify)*
* **Norms:** *RMSNorm / LayerNorm (specify)*
* **Objective:** **Causal LM** (next-token prediction)
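The items marked *(specify)* above can be read directly from the released config; a minimal sketch follows, which is also useful for instantiating an untrained copy of the architecture:

```python
# Hedged sketch: inspect the architecture from config.json instead of guessing,
# and optionally build an untrained copy for ablations or scaling studies.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("qvac/genesisI-model", trust_remote_code=True)
print(config)  # hidden size, layer count, heads, RoPE settings, norm type, activation, etc.

scratch = AutoModelForCausalLM.from_config(config)  # same architecture, random weights
print(f"{sum(p.numel() for p in scratch.parameters()) / 1e9:.2f}B parameters")
```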
### Compute Infrastructure
**Hardware**
* 60 nodes × 8× H100 80GB, ~800 GB RAM/node, InfiniBand fabric.
**Software**
* PyTorch ≥ 2.1 (CUDA 12.x), FlashAttention-2, UCX/NCCL
* Slurm for orchestration; W&B for logging
* (Optional) DeepSpeed/Zero-3 for training; HF conversion post-train
---
## Reproducibility (Launch Sketch)
```bash
# Slurm (illustrative)
srun -N 60 -n 480 --ntasks-per-node=8 --gpus-per-task=1 \
--cpus-per-task=8 --mem=0 \
bash -lc '
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA="mlx5*"
export NCCL_SOCKET_IFNAME=ib0
export NCCL_BLOCKING_WAIT=1
export TORCH_DISTRIBUTED_DEBUG=DETAIL
python train.py \
--model qwen3_1p7b_from_scratch \
--tokenizer qwen3 \
--data_path /path/to/arrow \
--context_length 4096 \
--optimizer adamw --weight_decay 0.01 \
--lr 2e-4 --warmup_steps 600 \
--precision bf16-mixed \
--micro_batch_size 4 \
--grad_accum_steps 8 \
--eval_every 500 --log_every 50 \
--ckpt_every 1000 \
--activation_checkpointing \
--flash_attn 2 \
--compile safe \
--seed 42
'
```
---
## Conversion & Inference
* Checkpoints are **HF-compatible**: load with `AutoModelForCausalLM`.
* For memory-limited environments, prefer half-precision or 4/8-bit loading.
* Distribute as `safetensors` for integrity.
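A short re-serialization sketch (the output directory is a placeholder):

```python
# Hedged sketch: re-save the checkpoint as safetensors shards for distribution.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qvac/genesisI-model"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

out_dir = "./genesisI-safetensors"  # placeholder output path
model.save_pretrained(out_dir, safe_serialization=True)  # writes model-*.safetensors shards
tok.save_pretrained(out_dir)
```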
---
## Changelog
* **v0.1 (2025-11-17):** Initial public release — 40B-token 1-epoch pretrain; HF conversion.