---
license: apache-2.0
tags:
- base-model
- causal-lm
- qwen3
- transformer
language:
- en
pipeline_tag: text-generation
---

# QVAC Genesis I Pretrained Model

## Key Highlights
- **Pretrained on the Largest Synthetic Educational Dataset**  
  This model has been **pretrained on Tether's QVAC Genesis I**, the largest synthetic dataset released for educational LLM pre-training.

  The model was trained **from scratch** on approximately **40B tokens** of multi-domain educational text, using **BF16 mixed precision** and a **4,096-token context window**. Training used a **Qwen3-family 1.7B-parameter decoder-only transformer** architecture.

  Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning.

- **Multi-Domain Educational Coverage**  
  Because the model is trained on QVAC Genesis I, it inherits curriculum-aligned coverage across:  
  - Mathematics  
  - Physics  
  - Biology  
  - Medicine  

- **Superior Benchmark Performance**  
  Leveraging QVAC Genesis I as its training foundation, the model consistently outperforms baselines in:  
  - Reasoning tasks  
  - Knowledge assessments  
  - Subject-specific QA  

- **First Publicly Released Education-Specific Pretrained Model**  
  This is the first open-source pretrained model built directly on a rigorously validated synthetic dataset for education, offering deep and comprehensive STEM coverage.

## Intended Uses
- Continual pre-training or fine-tuning for educational applications (STEM-focused tutoring, QA systems, curriculum support)  
- Benchmarking reasoning and subject-specific QA performance  
- Research into synthetic dataset–driven LLM training  

---

## Model Details

### Model Description

- **Developed by:** QVAC by Tether
- **Model type:** Decoder-only Transformer (causal LM)
- **Language(s) (NLP):** Primarily English 
- **License:** Apache-2.0
- **Finetuned from model:** **None (trained from scratch)**
- **Intended stage:** **Base pre-trained model** (no SFT / RLHF alignment)

### Model Sources

- **Repository:** https://huggingface.co/qvac/genesisI-model
- **Paper / Blog:** https://huggingface.co/blog/qvac/genesis-i

---

## Uses

### Direct Use

- General language modeling: next-token prediction, continuation, summarization, drafting.
- Research baseline for scaling, data ablations, or tokenizer studies.

### Downstream Use (recommended)
- **CPT:** continued pre-training on more tokens.
- **SFT:** supervised fine-tuning for assistants, domain experts, or task-specific models.
- **Preference optimization / RLHF:** safer, more helpful behavior.
- **Adapters / LoRA:** efficient domain specialization (see the sketch below).
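
A minimal LoRA sketch with the PEFT library follows. The `target_modules` names are typical for Qwen-style attention blocks but are assumptions here; check them against the checkpoint's actual module names.

```python
# Hedged sketch: LoRA adapter setup via PEFT. Rank, alpha, and the
# target_modules list are illustrative, not an official recipe.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("qvac/genesisI-model")
lora_cfg = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only adapter weights train
```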

### Out-of-Scope Use

- High-stakes decision-making (medical/financial/legal).
- Safety-critical or autonomous control systems.
- Unfiltered end-user chat deployment without alignment / safety layers.
- Any use that violates applicable laws or platform policies.

---

## Bias, Risks, and Limitations

- **Bias & toxicity:** May reflect or amplify biases present in web text.
- **Hallucinations:** Can produce confident but incorrect statements or citations.
- **Security / privacy:** May emit continuous random strings.
- **Context limit:** 4,096 tokens; longer inputs require chunking.

### Recommendations

- Disclose limitations to downstream users.
- Research model: not intended for production use cases.

---

## How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "qvac/genesisI-model"

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # trained with BF16 mixed precision
    device_map="auto"
)

prompt = "Explain precision vs. recall in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,
    temperature=0.7
)
print(tok.decode(out[0], skip_special_tokens=True))
```

*Tip: On consumer GPUs, consider loading in `float16` or using 4/8-bit quantization (e.g., bitsandbytes/AutoGPTQ).*
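
For instance, 4-bit NF4 loading with bitsandbytes might look like the following sketch (illustrative settings, not an official recipe):

```python
# Hedged sketch: 4-bit NF4 quantized loading (requires the bitsandbytes
# package and a CUDA GPU).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "qvac/genesisI-model"
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type for weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_cfg,
    device_map="auto",
)
```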

---

## Training Details

### Training Data

* **Size:** ~**40B tokens**, single epoch.
* **Domains:** Mixed general + STEM/technical sources (expository text, problem sets, references).
* **Format:** Hugging Face Datasets (Arrow).
* **Tokenizer:** **Qwen3** tokenizer.
* **Processing:** Normalization, filtering of extremes, document chunking to fit the **4096**-token context, and sequence packing where applicable (see the sketch below).
* **Dataset Card:** *Coming Soon*
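
As a rough illustration of the chunk-and-pack step, the sketch below concatenates EOS-separated token streams and slices them into fixed-length sequences; the actual pipeline was not released, so treat this as an assumption-laden outline.

```python
# Hedged sketch: EOS-separated document packing into fixed-length chunks.
from itertools import chain

def pack_documents(tokenized_docs, eos_id, context_len=4096):
    """Concatenate documents (EOS-separated) and slice the stream into
    fixed-length training sequences, dropping the final partial chunk."""
    stream = list(chain.from_iterable(ids + [eos_id] for ids in tokenized_docs))
    return [
        stream[i : i + context_len]
        for i in range(0, len(stream) - context_len + 1, context_len)
    ]

# Toy demo with a tiny context window: two docs -> one packed sequence.
print(pack_documents([[1, 2, 3], [4, 5]], eos_id=0, context_len=4))
# -> [[1, 2, 3, 0]]
```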

### Training Procedure

#### Preprocessing

* Unicode normalization, whitespace cleanup, control-char stripping.
* Length filtering; chunking to 4096; optional packing to improve throughput.
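
A minimal sketch of such a cleaning pass, assuming NFC normalization and a simple control-character class (the exact filters used in training were not published):

```python
# Hedged sketch: Unicode normalization + control-char stripping + whitespace
# cleanup, per the bullets above.
import re
import unicodedata

# Control characters except \t, \n, \r.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_text(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # canonical Unicode form
    text = CONTROL_CHARS.sub("", text)         # strip control characters
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces/tabs
    return text.strip()

print(clean_text("Hello\x07   world\t!"))  # -> "Hello world !"
```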

#### Training Hyperparameters

* **Optimizer:** AdamW (β₁=0.9, β₂=0.95), **weight decay 0.01**
* **Learning rate:** **2e-4** (linear warmup)
* **Warmup:** **600** steps (~10% of max steps)
* **Precision:** **BF16 mixed precision**
* **Gradient clipping:** **1.0**
* **Seed:** **42**
* **Logging:** Every **50** steps
* **Eval:** Every **500** steps (20 iters)
* **Checkpointing:** Every **1000** steps (sharded; full optimizer/state resume)
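
Wired up in plain PyTorch, these settings might look like the sketch below (the original training script is not public, so this is an approximation):

```python
# Hedged sketch: AdamW + linear warmup + gradient clipping with the stated values.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)  # stand-in for the 1.7B model

optimizer = AdamW(model.parameters(), lr=2e-4,
                  betas=(0.9, 0.95), weight_decay=0.01)

WARMUP_STEPS = 600
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP_STEPS),  # linear warmup
)

# Per training step, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
scheduler.step()
```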

#### Speeds, Sizes, Times

* **Per-GPU micro-batch:** 4
* **Grad accumulation:** 8
* **World size:** 480 GPUs
* **Effective global batch:** `4 × 8 × 480 = 15,360` samples/step
* **Step time (indicative):** ~**1.5 s/step** (cluster- and I/O-dependent)

#### Stability & Performance

* Activation checkpointing.
* Fused kernels where available (fused attention/optimizer).
* **FlashAttention-2** on H100.
* `torch.compile` (safe mode) after warmup stability.
* Dynamic loss scaling to mitigate BF16 overflow.
* Fragmentation mitigations (e.g., `max_split_size_mb=512`, expandable segments, GC threshold ~0.8).
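
The fragmentation mitigations map onto PyTorch's CUDA caching-allocator options; a hedged sketch, assuming they were passed through `PYTORCH_CUDA_ALLOC_CONF`:

```python
# Hedged sketch: allocator settings matching the values above. This must run
# before the first CUDA allocation (i.e., before torch initializes CUDA).
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:512,"            # cap block splitting
    "expandable_segments:True,"         # let segments grow in place
    "garbage_collection_threshold:0.8"  # reclaim cache past ~80% usage
)

import torch  # noqa: E402  (imported after configuring the allocator)
```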

---

## Multi-Node GPU Setup

* **Cluster:** ~**60 nodes**, each **8× NVIDIA H100 80GB** (total **480 GPUs**), ~800 GB RAM/node.
* **Scheduler:** Slurm (priority partition, exclusive allocation, 72-hour limit).
* **Launch:** `srun` + PyTorch DDP (world size 480; ranks bound via Slurm env; see the sketch below).
* **Storage:** Sharded checkpoints; periodic saves for robust resume.
* **Networking:** NCCL over InfiniBand with UCX.

  * `NCCL_IB_DISABLE=0`, `NCCL_IB_HCA="mlx5*"`, `NCCL_SOCKET_IFNAME=<ib0/enoX>`, `NCCL_BLOCKING_WAIT=1`
  * Watchdog ~**720s** for fail-fast on fabric issues
* **I/O:** Async dataset prefetching; pinned FS threads.
* **Observability:** W&B + structured logs (throughput, TFLOPs/GPU, mem, step time).
* **Reproducibility:** Fixed seeds; exact launch scripts/env logged; effective tokens/step reported.
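
A minimal sketch of the Slurm-to-DDP rank binding referenced above (env-var names are standard Slurm/PyTorch conventions; the project's actual launcher is only outlined in the Reproducibility section):

```python
# Hedged sketch: map Slurm-provided ranks onto torch.distributed.
import os

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank, 0..479
world_size = int(os.environ["SLURM_NTASKS"])   # 480 for the full cluster
local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index within the node

torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",          # NCCL over InfiniBand, per the setup above
    rank=rank,
    world_size=world_size,   # MASTER_ADDR / MASTER_PORT must also be set
)
```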

> Final checkpoint converted to **Hugging Face format** for plug-and-play inference.

---

## Evaluation

### Testing Data, Factors & Metrics

* **Testing data:** Standard academic suites (e.g., EleutherAI LM Evaluation Harness).
* **Factors:** Domain/topic (STEM vs. general), task type (multi-choice vs. open-ended).
* **Metrics:** Accuracy (MCQ), EM/F1 (QA), plus task-native metrics.

**Suggested suite (edit as applicable):**

* General knowledge & reasoning: **MMLU (STEM subsets)**, **ARC-E/ARC-C**, **HellaSwag**, **PIQA**, **Winogrande**
* Math/coding (optional): **GSM8K**, **HumanEval**
* Reading comprehension (optional): **BoolQ**, **RACE**
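
As an illustration, part of this suite could be scored with the harness's Python API (lm-eval ≥ 0.4); the task names and settings below are examples, not the pinned evaluation config:

```python
# Hedged sketch: zero-shot evaluation via the EleutherAI LM Evaluation Harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=qvac/genesisI-model,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "piqa", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metric dict
```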

### Results

* *To be released with an evaluated checkpoint and a pinned harness version.*
  Results tables will include exact versions, seeds, and commit hashes.

#### Summary

* Base LM targets broad generalization at ~40B tokens.
* Expect material gains after SFT + preference optimization for target tasks.

---

## Technical Specifications

### Model Architecture and Objective

* **Architecture:** Qwen3-style decoder-only Transformer
* **Parameters:** ~**1.7B**
* **Context length:** **4,096** tokens
* **Positional encoding:** Rotary (RoPE), per the Qwen3 architecture
* **Attention:** Multi-head scaled dot-product; FlashAttention-2 enabled on H100
* **Activation:** SiLU (SwiGLU MLP), per the Qwen3 architecture
* **Norms:** RMSNorm, per the Qwen3 architecture
* **Objective:** **Causal LM** (next-token prediction)
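
These details can be cross-checked against the released checkpoint itself; a quick inspection sketch (field names follow the standard Hugging Face config conventions):

```python
# Hedged sketch: read architectural fields straight from the model config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("qvac/genesisI-model")
print(cfg.model_type)               # expected: a Qwen3-style causal LM
print(cfg.max_position_embeddings)  # expected: 4096
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)
```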

### Compute Infrastructure

**Hardware**

* 60 nodes × 8× H100 80GB, ~800 GB RAM/node, InfiniBand fabric.

**Software**

* PyTorch ≥ 2.1 (CUDA 12.x), FlashAttention-2, UCX/NCCL
* Slurm for orchestration; W&B for logging
* (Optional) DeepSpeed ZeRO-3 for training; HF conversion post-training

---

## Reproducibility (Launch Sketch)

```bash
# Slurm (illustrative)
srun -N 60 -n 480 --ntasks-per-node=8 --gpus-per-task=1 \
  --cpus-per-task=8 --mem=0 \
  bash -lc '
  export NCCL_IB_DISABLE=0
  export NCCL_IB_HCA="mlx5*"
  export NCCL_SOCKET_IFNAME=ib0
  export NCCL_BLOCKING_WAIT=1
  export TORCH_DISTRIBUTED_DEBUG=DETAIL

  python train.py \
    --model qwen3_1p7b_from_scratch \
    --tokenizer qwen3 \
    --data_path /path/to/arrow \
    --context_length 4096 \
    --optimizer adamw --weight_decay 0.01 \
    --lr 2e-4 --warmup_steps 600 \
    --precision bf16-mixed \
    --micro_batch_size 4 \
    --grad_accum_steps 8 \
    --eval_every 500 --log_every 50 \
    --ckpt_every 1000 \
    --activation_checkpointing \
    --flash_attn 2 \
    --compile safe \
    --seed 42
'
```

---

## Conversion & Inference

* Checkpoints are **HF-compatible**: load with `AutoModelForCausalLM`.
* For memory-limited environments, prefer half-precision or 4/8-bit loading.
* Distribute as `safetensors` for integrity.
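
For example, a loaded checkpoint can be re-saved as `safetensors` shards (a sketch; `safe_serialization=True` is the default in recent `transformers` releases):

```python
# Hedged sketch: round-trip the checkpoint to local safetensors shards.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("qvac/genesisI-model")
tok = AutoTokenizer.from_pretrained("qvac/genesisI-model")

model.save_pretrained("genesisI-local", safe_serialization=True)
tok.save_pretrained("genesisI-local")
```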

---


## Changelog

* **v0.1 (2025-11-17):** Initial public release — 40B-token 1-epoch pretrain; HF conversion.