---
license: apache-2.0
tags:
- base-model
- causal-lm
- qwen3
- transformer
language:
- en
pipeline_tag: text-generation
---

# QVAC Genesis I Pretrained Model

## Key Highlights

- **Pretrained on the Largest Synthetic Educational Dataset**
  This model has been **pretrained on Tether's QVAC Genesis I**, the largest synthetic dataset released for educational LLM pre-training. The model was trained **from scratch** on approximately **40B tokens** of multi-domain educational text, using **BF16 mixed precision** and a **4,096-token context window**. Training used a **Qwen3-family 1.7B-parameter decoder-only transformer** architecture. Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning.

- **Multi-Domain Educational Coverage**
  Because the model is trained on QVAC Genesis I, it inherits curriculum-aligned coverage across:
  - Mathematics
  - Physics
  - Biology
  - Medicine

- **Superior Benchmark Performance**
  Leveraging QVAC Genesis I as its training foundation, the model consistently outperforms baselines in:
  - Reasoning tasks
  - Knowledge assessments
  - Subject-specific QA

- **First Publicly Released Education-Specific Pretrained Model**
  This is the first open-source pretrained model built directly on a rigorously validated synthetic dataset for education, offering deep and comprehensive STEM coverage.

## Intended Uses

- Continual pre-training or fine-tuning for educational applications (STEM-focused tutoring, QA systems, curriculum support)
- Benchmarking reasoning and subject-specific QA performance
- Research into synthetic dataset–driven LLM training

---

## Model Details

### Model Description

- **Developed by:** QVAC by Tether
- **Model type:** Decoder-only Transformer (causal LM)
- **Language(s) (NLP):** Primarily English
- **License:** Apache-2.0
- **Finetuned from model:** **None (trained from scratch)**
- **Intended stage:** **Base pre-trained model** (no SFT / RLHF alignment)

### Model Sources

- **Repository:** https://huggingface.co/qvac/genesisI-model
- **Paper / Blog:** https://huggingface.co/blog/qvac/genesis-i

---

## Uses

### Direct Use

- General language modeling: next-token prediction, continuation, summarization, drafting.
- Research baseline for scaling, data ablations, or tokenizer studies.

### Downstream Use (recommended)

- **CPT:** continued pre-training on more tokens.
- **SFT:** supervised fine-tuning for assistants, domain experts, or task-specific models.
- **Preference optimization / RLHF:** for safer, more helpful behavior.
- **Adapters / LoRA:** for efficient domain specialization.

### Out-of-Scope Use

- High-stakes decision-making (medical/financial/legal).
- Safety-critical or autonomous control systems.
- Unfiltered end-user chat deployment without alignment / safety layers.
- Any use that violates applicable laws or platform policies.

---

## Bias, Risks, and Limitations

- **Bias & toxicity:** May reflect or amplify biases present in web text.
- **Hallucinations:** Can produce confident but incorrect statements or citations.
- **Security / privacy:** May emit continuous runs of random strings.
- **Context limit:** 4,096 tokens; longer inputs require chunking (see the sketch below).

### Recommendations

- Disclose limitations to downstream users.
- Research model: not intended for production use.
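As a rough illustration of the context limit noted above, the snippet below splits a long document into overlapping token windows before scoring or generation. It is a minimal sketch only; the window size, stride, and helper name are illustrative assumptions, not part of any released tooling.

```python
# Minimal sketch: chunk a long document to fit the 4,096-token context window.
# The window size and overlap below are illustrative assumptions.
from transformers import AutoTokenizer

model_id = "qvac/genesisI-model"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

def chunk_text(text: str, max_tokens: int = 4096, overlap: int = 256) -> list[str]:
    """Split `text` into windows of at most `max_tokens` tokens with `overlap` tokens of overlap."""
    ids = tok(text, add_special_tokens=False)["input_ids"]
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        chunks.append(tok.decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks

# Usage: each chunk can then be fed to the model independently.
long_doc = "Photosynthesis converts light energy into chemical energy. " * 2000
print([len(tok(c, add_special_tokens=False)["input_ids"]) for c in chunk_text(long_doc)])
```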
---

## How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "qvac/genesisI-model"

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # trained with BF16 mixed precision
    device_map="auto"
)

prompt = "Explain precision vs. recall in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,
    temperature=0.7
)
print(tok.decode(out[0], skip_special_tokens=True))
```

*Tip: On consumer GPUs, consider loading in `float16` or using 4/8-bit quantization (e.g., bitsandbytes/AutoGPTQ).*

---

## Training Details

### Training Data

* **Size:** ~**40B tokens**, single epoch.
* **Domains:** Mixed general + STEM/technical sources (expository text, problem sets, references).
* **Format:** Hugging Face Datasets (Arrow).
* **Tokenizer:** **Qwen3** tokenizer.
* **Processing:** Normalization, filtering of extremes, document chunking to fit the **4,096**-token context, sequence packing where applicable.
* **Dataset Card:** *Coming Soon*

### Training Procedure

#### Preprocessing

* Unicode normalization, whitespace cleanup, control-character stripping.
* Length filtering; chunking to 4,096 tokens; optional packing to improve throughput.

#### Training Hyperparameters

* **Optimizer:** AdamW (β₁=0.9, β₂=0.95), **weight decay 0.01**
* **Learning rate:** **2e-4** (linear warmup)
* **Warmup:** **600** steps (~10% of max steps)
* **Precision:** **BF16 mixed precision**
* **Gradient clipping:** **1.0**
* **Seed:** **42**
* **Logging:** Every **50** steps
* **Eval:** Every **500** steps (20 iters)
* **Checkpointing:** Every **1000** steps (sharded; full optimizer/state resume)

#### Speeds, Sizes, Times

* **Per-GPU micro-batch:** 4
* **Grad accumulation:** 8
* **World size:** 480 GPUs
* **Effective global batch:** `4 × 8 × 480 = 15,360` samples/step (see the arithmetic sketch after the cluster setup section)
* **Step time (indicative):** ~**1.5 s/step** (cluster/I-O dependent)

#### Stability & Performance

* Activation checkpointing.
* Fused kernels where available (fused attention/optimizer).
* **FlashAttention-2** on H100.
* `torch.compile` (safe mode) after warmup stability.
* Dynamic loss scaling to mitigate BF16 overflow.
* Fragmentation mitigations (e.g., `max_split_size_mb=512`, expandable segments, GC threshold ~0.8).

---

## Multi-Node GPU Setup

* **Cluster:** ~**60 nodes**, each **8× NVIDIA H100 80GB** (total **480 GPUs**), ~800 GB RAM/node.
* **Scheduler:** Slurm (priority partition, exclusive allocation, 72-hour limit).
* **Launch:** `srun` + PyTorch DDP (world size 480; ranks bound via Slurm env).
* **Storage:** Sharded checkpoints; periodic saves for robust resume.
* **Networking:** NCCL over InfiniBand with UCX
  * `NCCL_IB_DISABLE=0`, `NCCL_IB_HCA="mlx5*"`, `NCCL_SOCKET_IFNAME=`, `NCCL_BLOCKING_WAIT=1`
  * Watchdog ~**720s** for fail-fast on fabric issues
* **I/O:** Async dataset prefetching; pinned FS threads.
* **Observability:** W&B + structured logs (throughput, TFLOPs/GPU, memory, step time).
* **Reproducibility:** Fixed seeds; exact launch scripts/env logged; effective tokens/step reported.

> Final checkpoint converted to **Hugging Face format** for plug-and-play inference.
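As a quick sanity check on the parallelism figures above, the snippet below reproduces the effective-batch arithmetic from "Speeds, Sizes, Times". The tokens-per-step number assumes fully packed 4,096-token sequences, which is an assumption rather than a reported figure.

```python
# Illustrative sanity check of the effective-batch arithmetic reported above.
micro_batch_per_gpu = 4   # per-GPU micro-batch
grad_accum_steps = 8      # gradient accumulation steps
world_size = 480          # 60 nodes x 8 H100 GPUs
context_length = 4096     # training context window

global_batch = micro_batch_per_gpu * grad_accum_steps * world_size
print(f"{global_batch:,} samples per optimizer step")    # 15,360

# Assuming fully packed sequences (an assumption, not a reported figure):
tokens_per_step = global_batch * context_length
print(f"{tokens_per_step:,} tokens per optimizer step")  # 62,914,560 (~62.9M)
```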
---

## Evaluation

### Testing Data, Factors & Metrics

* **Testing data:** Standard academic suites (e.g., EleutherAI LM Evaluation Harness).
* **Factors:** Domain/topic (STEM vs. general), task type (multiple-choice vs. open-ended).
* **Metrics:** Accuracy (MCQ), EM/F1 (QA), plus task-native metrics.

**Suggested suite (edit as applicable):**

* General knowledge & reasoning: **MMLU (STEM subsets)**, **ARC-E/ARC-C**, **HellaSwag**, **PIQA**, **Winogrande**
* Math/coding (optional): **GSM8K**, **HumanEval**
* Reading comprehension (optional): **BoolQ**, **RACE**

### Results

* *To be released with an evaluated checkpoint and a pinned harness version.* Tables should include exact versions, seeds, and commit hashes.

#### Summary

* Base LM targets broad generalization at ~40B tokens.
* Expect material gains after SFT + preference optimization for target tasks.

---

## Technical Specifications

### Model Architecture and Objective

* **Architecture:** Qwen3-style decoder-only Transformer
* **Parameters:** ~**1.7B**
* **Context length:** **4,096** tokens
* **Positional encoding:** *Rotary / relative (specify)*
* **Attention:** Multi-head scaled dot-product; FlashAttention-2 enabled on H100
* **Activation:** *GELU / SiLU (specify)*
* **Norms:** *RMSNorm / LayerNorm (specify)*
* **Objective:** **Causal LM** (next-token prediction)

### Compute Infrastructure

**Hardware**

* 60 nodes × 8× H100 80GB, ~800 GB RAM/node, InfiniBand fabric.

**Software**

* PyTorch ≥ 2.1 (CUDA 12.x), FlashAttention-2, UCX/NCCL
* Slurm for orchestration; W&B for logging
* (Optional) DeepSpeed ZeRO-3 for training; HF conversion post-train

---

## Reproducibility (Launch Sketch)

```bash
# Slurm (illustrative)
srun -N 60 -n 480 --ntasks-per-node=8 --gpus-per-task=1 \
  --cpus-per-task=8 --mem=0 \
  bash -lc '
    export NCCL_IB_DISABLE=0
    export NCCL_IB_HCA="mlx5*"
    export NCCL_SOCKET_IFNAME=ib0
    export NCCL_BLOCKING_WAIT=1
    export TORCH_DISTRIBUTED_DEBUG=DETAIL

    python train.py \
      --model qwen3_1p7b_from_scratch \
      --tokenizer qwen3 \
      --data_path /path/to/arrow \
      --context_length 4096 \
      --optimizer adamw --weight_decay 0.01 \
      --lr 2e-4 --warmup_steps 600 \
      --precision bf16-mixed \
      --micro_batch_size 4 \
      --grad_accum_steps 8 \
      --eval_every 500 --log_every 50 \
      --ckpt_every 1000 \
      --activation_checkpointing \
      --flash_attn 2 \
      --compile safe \
      --seed 42
  '
```

---

## Conversion & Inference

* Checkpoints are **HF-compatible**: load with `AutoModelForCausalLM`.
* For memory-limited environments, prefer half-precision or 4/8-bit loading (see the sketch at the end of this card).
* Distribute as `safetensors` for integrity.

---

## Changelog

* **v0.1 (2025-11-17):** Initial public release — 40B-token, 1-epoch pretrain; HF conversion.
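---

## Quantized Loading (Sketch)

For the memory-limited environments mentioned under Conversion & Inference, the sketch below loads the model in 4-bit. It assumes the `bitsandbytes` package is installed and a CUDA GPU is available; the quantization settings shown are illustrative defaults, not a recommended configuration.

```python
# Minimal sketch: 4-bit loading for memory-limited GPUs (assumes `bitsandbytes` is installed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "qvac/genesisI-model"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weight quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute in BF16, matching pre-training
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit (illustrative choice)
)

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tok("State Newton's second law in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9, temperature=0.7)
print(tok.decode(out[0], skip_special_tokens=True))
```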