---
license: apache-2.0
tags:
- base-model
- causal-lm
- qwen3
- transformer
language:
- en
pipeline_tag: text-generation
---
# QVAC Genesis I Pretrained Model
## Key Highlights
- **Pretrained on the Largest Synthetic Educational Dataset**
This model has been **pretrained on Tether's QVAC Genesis I**, the largest synthetic dataset released for educational LLM pre-training.
The model was trained **from scratch** on approximately **40B tokens** of multi-domain educational text, using **BF16 mixed precision** and a **4,096-token context window**, with a **Qwen3-family 1.7B-parameter decoder-only transformer** architecture.
Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning.
- **Multi-Domain Educational Coverage**
Because the model is trained on QVAC Genesis I, it inherits curriculum-aligned coverage across:
- Mathematics
- Physics
- Biology
- Medicine
- **Superior Benchmark Performance**
Leveraging QVAC Genesis I as its training foundation, the model consistently outperforms baselines in:
- Reasoning tasks
- Knowledge assessments
- Subject-specific QA
- **First Publicly Released Education-Specific Pretrained Model**
This is the first open-source pretrained model built directly on a rigorously validated synthetic dataset for education, offering deep and comprehensive STEM coverage.
## Intended Uses
- Continual pre-training or fine-tuning for educational applications (STEM-focused tutoring, QA systems, curriculum support)
- Benchmarking reasoning and subject-specific QA performance
- Research into synthetic dataset–driven LLM training
---
## Model Details
### Model Description
- **Developed by:** QVAC by Tether
- **Model type:** Decoder-only Transformer (causal LM)
- **Language(s) (NLP):** Primarily English
- **License:** Apache-2.0
- **Finetuned from model:** **None (trained from scratch)**
- **Intended stage:** **Base pre-trained model** (no SFT / RLHF alignment)
### Model Sources
- **Repository:** https://huggingface.co/qvac/genesisI-model
- **Paper / Blog:** https://huggingface.co/blog/qvac/genesis-i
---
## Uses
### Direct Use
- General language modeling: next-token prediction, continuation, summarization, drafting.
- Research baseline for scaling, data ablations, or tokenizer studies.
### Downstream Use (recommended)
- **CPT** (continued pre-training) on more tokens.
- **SFT** for assistants, domain experts, or task-specific models.
- **Preference optimization / RLHF** for safer, more helpful behavior.
- **Adapters/LoRA** for efficient domain specialization.
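For illustration, a minimal LoRA sketch with Hugging Face PEFT follows; the rank, dropout, and target-module names are assumptions chosen to match typical Qwen-style attention projections, not a recommended recipe.

```python
# Hedged sketch: LoRA adapters on the base checkpoint via PEFT.
# r, alpha, dropout, and target_modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "qvac/genesisI-model"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical Qwen-style projections (assumption)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Pair this with any SFT trainer (e.g., TRL's `SFTTrainer`) or a plain training loop.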
### Out-of-Scope Use
- High-stakes decision-making (medical/financial/legal).
- Safety-critical or autonomous control systems.
- Unfiltered end-user chat deployment without alignment / safety layers.
- Any use that violates applicable laws or platform policies.
---
## Bias, Risks, and Limitations
- **Bias & toxicity:** May reflect or amplify biases present in web text.
- **Hallucinations:** Can produce confident but incorrect statements or citations.
- **Security / privacy:** May emit long runs of seemingly random strings.
- **Context limit:** 4,096 tokens; longer inputs require chunking.
### Recommendations
- Disclose limitations to downstream users.
- **Research model:** not intended for production use.
---
## How to Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "qvac/genesisI-model"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16, # trained with BF16 mixed precision
device_map="auto"
)
prompt = "Explain precision vs. recall in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
**inputs,
max_new_tokens=256,
do_sample=True,
top_p=0.9,
temperature=0.7
)
print(tok.decode(out[0], skip_special_tokens=True))
```
*Tip: On consumer GPUs, consider loading in `float16` or using 4/8-bit quantization (e.g., bitsandbytes/AutoGPTQ).*
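As a hedged sketch of the 4-bit option (assuming `bitsandbytes` is installed; the NF4 settings below are illustrative):

```python
# Hedged sketch: 4-bit NF4 loading for memory-limited GPUs (requires bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "qvac/genesisI-model"
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights stay 4-bit; compute runs in BF16
)
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_cfg, device_map="auto")
```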
---
## Training Details
### Training Data
* **Size:** ~**40B tokens**, single epoch.
* **Domains:** Mixed general + STEM/technical sources (expository text, problem sets, references).
* **Format:** Hugging Face Datasets (Arrow).
* **Tokenizer:** **Qwen3** tokenizer.
* **Processing:** Normalization, filtering of extremes, document chunking to fit **4096** context, sequence packing where applicable.
* **Dataset Card:** *Coming Soon*
### Training Procedure
#### Preprocessing
* Unicode normalization, whitespace cleanup, control-char stripping.
* Length filtering; chunking to 4096; optional packing to improve throughput.
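A simplified packing sketch under stated assumptions (a `"text"` field, a local JSONL corpus, and a plain concatenate-then-split strategy; the actual pipeline is not released here):

```python
# Hedged sketch: tokenize documents and pack them into fixed 4,096-token blocks.
from itertools import chain
from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK = 4096
tok = AutoTokenizer.from_pretrained("qvac/genesisI-model", trust_remote_code=True)
ds = load_dataset("json", data_files="corpus.jsonl", split="train")  # placeholder corpus

def tokenize(batch):
    return tok(batch["text"])

def pack(batch):
    ids = list(chain.from_iterable(batch["input_ids"]))  # concatenate documents
    n = (len(ids) // BLOCK) * BLOCK                      # drop the trailing remainder
    blocks = [ids[i:i + BLOCK] for i in range(0, n, BLOCK)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

packed = (
    ds.map(tokenize, batched=True, remove_columns=ds.column_names)
      .map(pack, batched=True, remove_columns=["input_ids", "attention_mask"])
)
```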
#### Training Hyperparameters
* **Optimizer:** AdamW (β₁=0.9, β₂=0.95), **weight decay 0.01**
* **Learning rate:** **2e-4** (linear warmup)
* **Warmup:** **600** steps (~10% of max steps)
* **Precision:** **BF16 mixed precision**
* **Gradient clipping:** **1.0**
* **Seed:** **42**
* **Logging:** Every **50** steps
* **Eval:** Every **500** steps (20 iters)
* **Checkpointing:** Every **1000** steps (sharded; full optimizer/state resume)
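A rough sketch of those optimizer and schedule settings (the stand-in module and total-step count are placeholders; ~6,000 steps is inferred only from warmup being ~10% of max steps):

```python
# Hedged sketch of the reported optimizer, warmup schedule, and gradient clipping.
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

model = nn.Linear(8, 8)  # stand-in module; substitute the actual network
MAX_STEPS = 6_000        # placeholder, inferred from warmup being ~10% of max steps

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.95), weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=600, num_training_steps=MAX_STEPS)

# per optimizer step (forward/backward elided):
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```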
#### Speeds, Sizes, Times
* **Per-GPU micro-batch:** 4
* **Grad accumulation:** 8
* **World size:** 480 GPUs
* **Effective global batch:** `4 × 8 × 480 = 15,360` samples/step
* **Step time (indicative):** ~**1.5 s/step** (cluster/I-O dependent)
#### Stability & Performance
* Activation checkpointing.
* Fused kernels where available (fused attention/optimizer).
* **FlashAttention-2** on H100.
* `torch.compile` (safe mode) after warmup stability.
* Dynamic loss scaling to mitigate BF16 overflow.
* Fragmentation mitigations (e.g., `max_split_size_mb=512`, expandable segments, GC threshold ~0.8).
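These allocator mitigations correspond to `PYTORCH_CUDA_ALLOC_CONF`; a minimal example of setting them (values mirror the list above, set before CUDA initialization):

```bash
# Hedged example: CUDA caching-allocator settings matching the mitigations above
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512,expandable_segments:True,garbage_collection_threshold:0.8"
```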
---
## Multi-Node GPU Setup
* **Cluster:** ~**60 nodes**, each **8× NVIDIA H100 80GB** (total **480 GPUs**), ~800 GB RAM/node.
* **Scheduler:** Slurm (priority partition, exclusive allocation, 72-hour limit).
* **Launch:** `srun` + PyTorch DDP (world size 480; ranks bound via Slurm env).
* **Storage:** Sharded checkpoints; periodic saves for robust resume.
* **Networking:** NCCL over InfiniBand with UCX
  * `NCCL_IB_DISABLE=0`, `NCCL_IB_HCA="mlx5*"`, `NCCL_SOCKET_IFNAME=<ib0/enoX>`, `NCCL_BLOCKING_WAIT=1`
  * Watchdog ~**720s** for fail-fast on fabric issues
* **I/O:** Async dataset prefetching; pinned FS threads.
* **Observability:** W&B + structured logs (throughput, TFLOPs/GPU, mem, step time).
* **Reproducibility:** Fixed seeds; exact launch scripts/env logged; effective tokens/step reported.
> Final checkpoint converted to **Hugging Face format** for plug-and-play inference.
---
## Evaluation
### Testing Data, Factors & Metrics
* **Testing data:** Standard academic suites (e.g., EleutherAI LM Evaluation Harness).
* **Factors:** Domain/topic (STEM vs. general), task type (multi-choice vs. open-ended).
* **Metrics:** Accuracy (MCQ), EM/F1 (QA), plus task-native metrics.
**Suggested suite (edit as applicable):**
* General knowledge & reasoning: **MMLU (STEM subsets)**, **ARC-E/ARC-C**, **HellaSwag**, **PIQA**, **Winogrande**
* Math/coding (optional): **GSM8K**, **HumanEval**
* Reading comprehension (optional): **BoolQ**, **RACE**
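For example, a hedged harness invocation (task names and batch size are illustrative; pin the exact harness version when reporting numbers):

```bash
# Hedged sketch: run a subset of the suggested suite with lm-evaluation-harness.
pip install lm_eval
lm_eval --model hf \
  --model_args pretrained=qvac/genesisI-model,dtype=bfloat16,trust_remote_code=True \
  --tasks mmlu,arc_easy,arc_challenge,hellaswag,piqa,winogrande \
  --batch_size 8
```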
### Results
* *To be released with an evaluated checkpoint and harness version pin.*
Include tables with exact versions, seeds, and commit hashes.
#### Summary
* Base LM targets broad generalization at ~40B tokens.
* Expect material gains after SFT + preference optimization for target tasks.
---
## Technical Specifications
### Model Architecture and Objective
* **Architecture:** Qwen3-style decoder-only Transformer
* **Parameters:** ~**1.7B**
* **Context length:** **4,096** tokens
* **Positional encoding:** *Rotary / relative (specify)*
* **Attention:** Multi-head scaled dot-product; FlashAttention-2 enabled on H100
* **Activation:** *GELU / SiLU (specify)*
* **Norms:** *RMSNorm / LayerNorm (specify)*
* **Objective:** **Causal LM** (next-token prediction)
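The items marked *(specify)* above can be read directly from the released config; a minimal sketch follows, which is also useful for instantiating an untrained copy of the architecture:

```python
# Hedged sketch: inspect the architecture from config.json instead of guessing,
# and optionally build an untrained copy for ablations or scaling studies.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("qvac/genesisI-model", trust_remote_code=True)
print(config)  # hidden size, layer count, heads, RoPE settings, norm type, activation, etc.

scratch = AutoModelForCausalLM.from_config(config)  # same architecture, random weights
print(f"{sum(p.numel() for p in scratch.parameters()) / 1e9:.2f}B parameters")
```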
### Compute Infrastructure
**Hardware**
* 60 nodes × 8× H100 80GB, ~800 GB RAM/node, InfiniBand fabric.
**Software**
* PyTorch ≥ 2.1 (CUDA 12.x), FlashAttention-2, UCX/NCCL
* Slurm for orchestration; W&B for logging
* (Optional) DeepSpeed/Zero-3 for training; HF conversion post-train
---
## Reproducibility (Launch Sketch)
```bash
# Slurm (illustrative)
srun -N 60 -n 480 --ntasks-per-node=8 --gpus-per-task=1 \
--cpus-per-task=8 --mem=0 \
bash -lc '
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA="mlx5*"
export NCCL_SOCKET_IFNAME=ib0
export NCCL_BLOCKING_WAIT=1
export TORCH_DISTRIBUTED_DEBUG=DETAIL
python train.py \
--model qwen3_1p7b_from_scratch \
--tokenizer qwen3 \
--data_path /path/to/arrow \
--context_length 4096 \
--optimizer adamw --weight_decay 0.01 \
--lr 2e-4 --warmup_steps 600 \
--precision bf16-mixed \
--micro_batch_size 4 \
--grad_accum_steps 8 \
--eval_every 500 --log_every 50 \
--ckpt_every 1000 \
--activation_checkpointing \
--flash_attn 2 \
--compile safe \
--seed 42
'
```
---
## Conversion & Inference
* Checkpoints are **HF-compatible**: load with `AutoModelForCausalLM`.
* For memory-limited environments, prefer half-precision or 4/8-bit loading.
* Distribute as `safetensors` for integrity.
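A short re-serialization sketch (the output directory is a placeholder):

```python
# Hedged sketch: re-save the checkpoint as safetensors shards for distribution.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qvac/genesisI-model"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

out_dir = "./genesisI-safetensors"  # placeholder output path
model.save_pretrained(out_dir, safe_serialization=True)  # writes model-*.safetensors shards
tok.save_pretrained(out_dir)
```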
---
## Changelog
* **v0.1 (2025-11-17):** Initial public release — 40B-token 1-epoch pretrain; HF conversion.